Gabriela Cohen Freue: Associate Professor at Department of Statistics, UBC Faculty of Science

Associate Professor

Faculty of Science

Affiliations to Research Centres, Institutes & Clusters

Data Science Institute

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Identifying predictive gene expression signatures of sepsis severity (2022)

Sepsis is a common and very heterogenous syndrome defined as the life-threatening organ dysfunction caused by an aberrant host response to infection. In the earliest stages, sepsis diagnoses are often missed due to non-specific symptomatology resulting in a rapid progression to severe sepsis. Gene expression signatures that measure host immune responses have been shown to provide more sensitive prognostic tools than existing clinical criteria, permitting early prediction of high-risk patients. We recruited, from six global cohorts, 266 suspected sepsis patients in the emergency room and 82 suspected pulmonary sepsis patients in the intensive care unit with varying disease severity, and 44 healthy controls. Most recently, I analyzed 135 patients with Covid-19 disease that showed immune responses overlapping with sepsis. From this, I identified candidate gene expression signatures reflecting endotypes and severity markers, using the transcriptomics method, RNA-Seq, and statistical and computational methods.

I determined that early sepsis patients could be stratified into five endotypes defined by distinct pathobiological mechanisms, including unique gene expression differences and accurate, predictive gene expression pairs. Two of the five endotypes were associated with a higher tendency towards severe sepsis and mortality, two demonstrated much lower severity, and one was relatively benign. Diverse molecular responses were also observed independently of endotypes; thus, concomitant cross-cutting severity signatures that directly predicted sepsis-induced organ dysfunction and mortality were identified, in addition to dysregulated and co-expressed module genes. The endotype signatures were often consistent with cellular shifts in neutrophil numbers and function, whereas dysregulated molecular responses like cellular reprogramming and hyperinflammation reflected prognoses. These signatures were assessed in other conditions (e.g., pancreatitis, appendicitis, myocardial infarction), which indicated the signatures captured mechanisms specific to early sepsis/sepsis. A compendium of dysregulated genes and signatures in sepsis was curated from the literature, confirming that these signatures involved well-characterized genes.

This study demonstrated that signatures relevant to the development of life-threatening sepsis can be observed as early as first entry into the ER. These signatures will enable the development of diagnostics and targeted therapeutics, and importantly, when used early in the sepsis disease course, could prevent rapid patient deterioration, mortality, and poor long-term outcomes.

Robust estimation and variable selection in high-dimensional linear regression models (2020)

Linear regression models are commonly used statistical models for predicting a response from a set of predictors. Technological advances allow for simultaneous collection of many predictors, but often only a small number of these is relevant for prediction. Identifying this set of predictors in high-dimensional linear regression models with emphasis on accurate prediction is thus a common goal of quantitative data analyses. While a large number of predictors promises to capture as much information as possible, it bears a risk of containing contaminated values. If not handled properly, contamination can affect statistical analyses and lead to spurious scientific discoveries, jeopardizing the generalizability of findings. In this dissertation I propose robust regularized estimators for sparse linear regression with reliable prediction and variable selection performance in the presence of contamination in the response and one or more predictors. I present theoretical and extensive empirical results underscoring that the penalized elastic net S-estimator is robust towards aberrant contamination and leads to better predictions for heavy-tailed error distributions than competing estimators. Especially in these more challenging scenarios, competing robust methods reliant on an auxiliary estimate of the residual scale are more affected by contamination due to the high finite-sample bias introduced by regularization. For improved variable selection I propose the adaptive penalized elastic net S-estimator. I show this estimator identifies the truly irrelevant predictors with high probability as sample size increases and estimates the parameters of the truly relevant predictors as accurately as if these relevant predictors were known in advance. For practical applications robustness of variable selection is essential. This is highlighted by a case study for identifying proteins to predict stenosis of heart vessels, a sign of complication after cardiac transplantation. High robustness comes at the price of more taxing computations. I present optimized algorithms and heuristics for feasible computation of the estimates in a wide range of applications. With the software made publicly available, the proposed estimators are viable alternatives to non-robust methods, supporting discovery of generalizable scientific results.

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

A new data driven framework for simulating mendelian randomization data (2023)

Mendelian randomization (MR) is a causal inference method that allows biostatisticians to leverage DNA measurements to study causal effects with only observed data. Recent advancements, including two-sample summary-level Mendelian randomization (TSSLMR) and the IEU OpenGWAS database as a data source, have lowered the barrier for conducting MR studies and opened the opportunity to mine causal effects. In the first part of the thesis, I show that there is a mismatch between the characteristics of modern TSSLMR data and how articles that propose popular TSSLMR models conduct their simulations. Next, I propose my solution: a data-driven simulation framework for MR data that aims to be realistic, interpretable and easy to use thanks to a complementary R package implementation. As for the results, I show that models perform far better in literature-based simulations than in more realistic simulations based on my proposed framework. Lastly, I warn that the mismatch between simulated and real data, along with the obtained results, may lead researchers to have overoptimistic expectations about model performance in real applications.

Enhancing the robustness of instrumental variable estimation with potentially invalid instruments and its application to Mendelian randomization (2023)

In epidemiology and medicine, identifying the causal relationship between an exposure and an outcome is crucial to gain valuable insights into disease mechanisms and to improve patient care. However, a common problem when attempting to extract the causal relationship between an exposure and an outcome in observational studies is the presence of unmeasured confounding. To address this issue, instrumental variable (IV) estimation methods have been proposed to capture the relationship between the exposure and the outcome that is unaffected by the confounding variables. In particular, Mendelian Randomization (MR) is a statistical methodology that uses genetic variants as instruments. However, the validity of these genetic variants as instruments is often questionable due to the presence of pleiotropy and linkage disequilibrium. In practice, it is often difficult to ascertain the validity of the instruments as it requires complete knowledge of the involved genes' function. Furthermore, exposure and outcome data are often contaminated by outlying observations. In this thesis, we propose a novel two-step robust and penalized IV estimator and an algorithm to compute it, called the Robustified Some Valid Some Invalid Instrumental Variable Estimator (rsisVIVE), based on the sisVIVE method of Kang et al. (2016). The rsisVIVE estimates the causal effect of an exposure on an outcome using observational data in the presence of invalid instruments while tolerating large proportions of outlying observations. Simulation results show that the rsisVIVE more accurately estimates the causal parameter than the sisVIVE when instruments are weak and when there are no outlying observations. The rsisVIVE also outperforms competitor IV estimators in all cases when there are large proportions of outlying observations.

Penalized competing risks analysis using casebase sampling (2023)

In biomedical studies, quantifying the association of prognostic genes/markers on the time-to-event is crucial for predicting a patient’s risk of disease based on their specific covariate profile. Modelling competing risks is essential in such studies, as patients may be susceptible to multiple mutually exclusive events, such as death from alternative causes. Existing methods for competing risks analyses often yield coefficient estimates that lack interpretability, as they cannot be associated with the event rate. Moreover, the high dimensionality of genomic data, where the number of variables exceeds the number of subjects, presents a significant challenge. In this work, we propose a novel approach that involves fitting an elastic-net penalized multinomial model using the case-base sampling framework to model competing risks survival data. Furthermore, we develop a two-step method, known as the de-biased case-base, to enhance the prediction performance of the risk of disease. Through a comprehensive simulation study that emulates biomedical data, we show that the case-base method is competent in terms of variable selection and survival prediction, particularly in scenarios such as non-proportional hazards. We additionally showcase the flexibility of this approach in providing smooth-in-time incidence curves, which improve the accuracy of patient risk estimation.

Regularized relative risk regression : a non-GLM approach with emphasis on large p, small N simulations (2023)

In clinical research, the determination of the association's strength between two events is paramount. This may involve probing the relationship between a risk factor and a health outcome, or evaluating the link between a treatment and its efficacy. The Odds Ratios (OR) and Relative Risks (RR) stand out as the predominant measures for such evaluations. While logistic regression is commonly employed for OR modeling, and Poisson regression for RR, each has its set of limitations in practical applications. In light of these limitations, Richardson et al. (2017) introduced a novel non-GLM binary regression approach for direct RR estimation using a log odds-product nuisance model. This technique elegantly sidesteps the intertwined dependence of RR on baseline risk. However, this method encountered challenges in high-dimensional and sparse model estimation (p > N). To address these issues, this study introduces a novel estimator founded on the binary regression model, which is further refined with an algorithm using Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) to solve the optimization problem. This algorithm encourages sparsity in the solution and enables variable selection, thereby improving the utility for high-dimensional and sparse models. This thesis examines the properties of the estimator through simulation studies and discusses the potential for future enhancements and applications. The presented work represents a step forward in creating alternative methodologies for estimating relative risks in diverse data landscapes.
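
As a rough illustration of how FISTA encourages sparsity, the sketch below applies it to an ordinary lasso least-squares problem. This is an assumption made for illustration only: it is not the relative risk objective or the estimator developed in the thesis, and the names `soft_threshold` and `fista_lasso` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: shrinks entries toward zero
    # and sets small ones exactly to zero, which produces sparsity.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    # Minimizes 0.5 * ||A x - b||^2 + lam * ||x||_1 (illustrative objective,
    # NOT the thesis's relative risk model).
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)           # gradient of the smooth part at y
        x = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum step
        x_prev, t = x, t_next
    return x_prev
```

With an identity design matrix the solution is simply the soft-thresholded response, so coefficients whose magnitude is below `lam` are dropped from the model.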

Robust sparse covariance-regularized regression for high-dimensional data with Casewise and Cellwise outliers (2023)

Modern biomedical datasets, such as those found in genomic and proteomic studies, often involve a large number of predictor variables relative to the number of observations, pointing to the need for statistical methods specifically designed to handle high-dimensional data. In particular, for a regression task, regularized methods are needed to select a sparse model, that is, one that uses only a subset of the large number of features available to predict a response. The presence of outliers in the data further complicates this task. Many existing robust and sparse regression methods are computationally expensive when the dimensionality of the data is high. Furthermore, most of these previously developed methods were developed under the assumption that outliers occur casewise, which is not always a realistic assumption in high-dimensional settings. We propose a sparse and robust regression method for high-dimensional data that is based on regularized precision matrix estimation. Our method can handle both casewise and cellwise outliers in low- and high-dimensional settings. Through simulation studies, we also compare our method to existing sparse and robust methods by evaluating computational efficiency, prediction performance, and variable selection capabilities.

Temporal adjusted prediction for predicting Indian reserve populations in Canada (2018)

In order to predict the population of Indian reserves in Canada for the 2016 Census, we can construct a suitable model using data from the Indian Register and past censuses. Linear mixed effects models are a popular method for predicting values of responses on longitudinal data. However, linear mixed effects models require repeated measures in order to fit a model. Alternative methods such as linear regression only require data from a single time point in order to fit a model, but they do not directly account for within-individual correlation when predicting. Since we are predicting the responses of the same set of individuals, we can expect responses at the next time point to be strongly correlated with past responses for an individual.

We introduce a new method of prediction, temporal adjusted prediction (TAP), that addresses the issue of within-individual correlation in predictions and only requires data from a single time point to estimate model parameters. Predictions are based on the last recorded response of an individual and adjusted based on changes to the values of their covariates and estimated regression coefficients that relate the response and the covariates. Predictions are made using a random intercept model rather than a linear regression model. It is shown that if the random intercept accounts for a larger proportion of the random variation in the data than the random error term, then temporal adjusted prediction achieves a lower mean squared prediction error than linear regression.

TAP performs better than linear regression when predicting on the same set of individuals at different time points. It also shows similar prediction performance compared to linear mixed effects models estimated with maximum likelihood estimation despite only requiring data from one time point in order to fit a model.
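
One possible reading of the adjustment described in this abstract is sketched below: the prediction is the last recorded response plus the change in covariates multiplied by the estimated coefficients. The abstract does not state the formula, so this form, the function name `tap_predict`, and the assumption that `beta_hat` has already been estimated from a single time point are all illustrative.

```python
import numpy as np

def tap_predict(y_last, x_last, x_next, beta_hat):
    # Assumed form: y_hat = y_last + (x_next - x_last) @ beta_hat.
    # y_last: last recorded response per individual, shape (n,)
    # x_last, x_next: covariates at the last and the target time point, shape (n, p)
    # beta_hat: previously estimated regression coefficients, shape (p,)
    y_last = np.asarray(y_last, dtype=float)
    delta_x = np.asarray(x_next, dtype=float) - np.asarray(x_last, dtype=float)
    return y_last + delta_x @ np.asarray(beta_hat, dtype=float)
```

Under this reading, an individual whose covariates do not change keeps their last recorded response as the prediction, which is how the individual-specific level carries over between time points.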

Instrumental Variables Selection: A Comparison between Regularization and Post-Regularization Methods (2015)

Instrumental variables are commonly used in statistics, econometrics, and epidemiology to obtain consistent parameter estimates in regression models when some of the predictors are correlated with the error term. However, the properties of these estimators are sensitive to the choice of valid instruments. Since in many applications valid instruments come in a bigger set that also includes weak and possibly irrelevant instruments, the researcher needs to select a smaller subset of variables that are relevant and strongly correlated with the predictors in the model. This thesis reviews part of the instrumental variables literature, examines the problems caused by having many potential instruments, and uses different variable selection methods in order to identify the relevant instruments. Specifically, the performance of different techniques is compared by looking at the number of relevant variables correctly detected and at the root mean square error of the estimated regression coefficients. Simulation studies are conducted to evaluate the performance of the described methods.
