Gabriela Cohen Freue
Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - May 2021)
Linear regression models are commonly used statistical models for predicting a response from a set of predictors. Technological advances allow for simultaneous collection of many predictors, but often only a small number of these are relevant for prediction. Identifying this set of predictors in high-dimensional linear regression models with emphasis on accurate prediction is thus a common goal of quantitative data analyses. While a large number of predictors promises to capture as much information as possible, it bears a risk of containing contaminated values. If not handled properly, contamination can affect statistical analyses and lead to spurious scientific discoveries, jeopardizing the generalizability of findings. In this dissertation I propose robust regularized estimators for sparse linear regression with reliable prediction and variable selection performance in the presence of contamination in the response and one or more predictors. I present theoretical and extensive empirical results underscoring that the penalized elastic net S-estimator is robust towards aberrant contamination and leads to better predictions for heavy-tailed error distributions than competing estimators. Especially in these more challenging scenarios, competing robust methods reliant on an auxiliary estimate of the residual scale are more affected by contamination due to the high finite-sample bias introduced by regularization. For improved variable selection I propose the adaptive penalized elastic net S-estimator. I show this estimator identifies the truly irrelevant predictors with high probability as sample size increases and estimates the parameters of the truly relevant predictors as accurately as if these relevant predictors were known in advance. For practical applications robustness of variable selection is essential. This is highlighted by a case study for identifying proteins to predict stenosis of heart vessels, a sign of complication after cardiac transplantation. High robustness comes at the price of more taxing computations. I present optimized algorithms and heuristics for feasible computation of the estimates in a wide range of applications. With the software made publicly available, the proposed estimators are viable alternatives to non-robust methods, supporting discovery of generalizable scientific results.
Master's Student Supervision (2010 - 2020)
In order to predict the population of Indian reserves in Canada for the 2016 Census, we can construct a suitable model using data from the Indian Register and past censuses. Linear mixed effects models are a popular method for predicting responses in longitudinal data. However, linear mixed effects models require repeated measures in order to fit a model. Alternative methods such as linear regression only require data from a single time point in order to fit a model, but they do not directly account for within-individual correlation when predicting. Since we are predicting the responses of the same set of individuals, we can expect responses at the next time point to be strongly correlated with past responses for an individual. We introduce a new method of prediction, temporal adjusted prediction (TAP), that addresses the issue of within-individual correlation in predictions and only requires data from a single time point to estimate model parameters. Predictions are based on the last recorded response of an individual and adjusted based on changes to the values of their covariates and estimated regression coefficients that relate the response and the covariates. Predictions are made using a random intercept model rather than a linear regression model. It is shown that if the random intercept accounts for a larger proportion of the random variation in the data than the random error term, then temporal adjusted prediction achieves a lower mean squared prediction error than linear regression. TAP performs better than linear regression when predicting on the same set of individuals at different time points. It also shows similar prediction performance compared to linear mixed effects models estimated with maximum likelihood estimation despite only requiring data from one time point in order to fit a model.
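The idea described in this abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the thesis's implementation: it assumes a random intercept model y_it = a + x_it'b + u_i + e_it, estimates the coefficients by ordinary least squares from a single time point, and forms the adjusted prediction as the last observed response plus the change in covariates times the estimated coefficients. All function names, parameter values, and the simulated data are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) from a single time point."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_lr(coef, X_new):
    """Plain linear regression prediction from the new covariates only."""
    return coef[0] + X_new @ coef[1:]


def predict_tap(coef, y_last, X_last, X_new):
    """Adjusted prediction: last response plus covariate change times slopes."""
    return y_last + (X_new - X_last) @ coef[1:]


def simulate(n=2000, sigma_b=2.0, sigma_e=0.5, seed=0):
    """Simulate two time points and return (MSE of regression, MSE of TAP sketch)."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.5, -0.7])           # hypothetical true slopes
    b = rng.normal(0.0, sigma_b, n)        # random intercept per individual
    X1 = rng.normal(size=(n, 2))           # covariates at time 1
    X2 = X1 + rng.normal(scale=0.5, size=(n, 2))  # covariates drift by time 2
    y1 = 3.0 + X1 @ beta + b + rng.normal(0.0, sigma_e, n)
    y2 = 3.0 + X2 @ beta + b + rng.normal(0.0, sigma_e, n)

    coef = fit_ols(X1, y1)                 # fitted on time 1 only
    mse_lr = np.mean((y2 - predict_lr(coef, X2)) ** 2)
    mse_tap = np.mean((y2 - predict_tap(coef, y1, X1, X2)) ** 2)
    return mse_lr, mse_tap


if __name__ == "__main__":
    print("large random intercept:", simulate(sigma_b=2.0, sigma_e=0.5))
    print("small random intercept:", simulate(sigma_b=0.2, sigma_e=1.0))
```

In this sketch the adjusted prediction error has variance close to twice the error variance, while the regression prediction error has variance close to the sum of the random intercept and error variances, so the adjusted prediction should win only when the random intercept variance exceeds the error variance, which agrees with the condition stated in the abstract.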
Instrumental variables are commonly used in statistics, econometrics, and epidemiology to obtain consistent parameter estimates in regression models when some of the predictors are correlated with the error term. However, the properties of these estimators are sensitive to the choice of valid instruments. Since in many applications valid instruments come in a larger set that also includes weak and possibly irrelevant instruments, the researcher needs to select a smaller subset of variables that are relevant and strongly correlated with the predictors in the model. This thesis reviews part of the instrumental variables literature, examines the problems caused by having many potential instruments, and uses different variable selection methods to identify the relevant instruments. Specifically, the performance of different techniques is compared by looking at the number of relevant variables correctly detected and at the root mean square error of the estimated regression coefficients. Simulation studies are conducted to evaluate the performance of the described methods.
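As a rough illustration of the setting described in this abstract, the sketch below simulates a predictor that is correlated with the error term and compares ordinary least squares with standard two-stage least squares. It does not reproduce the selection methods studied in the thesis; the data-generating values, the instrument set (two relevant and three irrelevant instruments), and the function names are all assumptions made for the example.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def two_stage_least_squares(y, x, Z):
    """Standard 2SLS for one endogenous predictor x and instrument matrix Z."""
    Zc = np.column_stack([np.ones(len(y)), Z])
    gamma, *_ = np.linalg.lstsq(Zc, x, rcond=None)   # stage 1: x on instruments
    x_hat = Zc @ gamma                               # fitted (exogenous) part of x
    return ols(x_hat, y)                             # stage 2: y on fitted x


def simulate_iv(n=5000, seed=1):
    """Return (OLS slope, 2SLS slope) when the true slope is 2.0."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 5))          # columns 2..4 are irrelevant instruments
    u = rng.normal(size=n)               # unobserved confounder
    x = 1.0 * Z[:, 0] + 0.8 * Z[:, 1] + u + rng.normal(size=n)
    y = 2.0 * x + 2.0 * u + rng.normal(size=n)   # error term contains u
    return ols(x, y)[1], two_stage_least_squares(y, x, Z)[1]


if __name__ == "__main__":
    ols_slope, tsls_slope = simulate_iv()
    print("OLS slope: ", ols_slope)
    print("2SLS slope:", tsls_slope)
```

Because the confounder enters both the predictor and the response, the least-squares slope is pulled away from the true value, while the two-stage estimate should stay close to it in a sample of this size. With many weak or irrelevant instruments and smaller samples this advantage can erode, which is the motivation for instrument selection given in the abstract.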