Paul Gustafson

Faculty of Science

Research Classification

Statistics

Research Interests

meta-analysis

Parametric and Non-Parametric Inference

Theoretical Statistics

Pharmacoepidemiology

Bayesian statistical methods

Biostatistics and Epidemiology

Causal inference

Evidence synthesis

Partial Identification

Relevant Thesis-Based Degree Programs

View all programs

Recruitment

Looking to recruit:

Master's students, Doctoral students, Postdoctoral Fellows

Desired start dates:

Any time / year round

Complete these steps before you reach out to a faculty member!

Focus your search

Make a good impression

ADVICE AND INSIGHTS FROM UBC FACULTY ON REACHING OUT TO SUPERVISORS

These videos contain some general advice from faculty across UBC on finding and reaching out to a potential thesis supervisor.

If you have reviewed some of this faculty member's publications, understand their research interests and have reviewed the admission requirements, you may submit a contact request to this supervisor.

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Bayesian causal inference for discrete data (2024)

Causal inference provides a framework for estimating how a response changes when a given cause of interest changes. When all data are discrete we can use saturated nonparametric models to avoid unnecessary assumptions in our causal inference modelling, where we specify unique parameters for all possible combinations of treatments and confounders when estimating an outcome. Bayesian methods allow us to incorporate prior information into these saturated models, making them usable beyond simple settings with low dimensional confounders. In this thesis we propose two new nonparametric Bayes methods for causal inference based on saturated modelling. The first method combines a parametric model with a nonparametric saturated outcome model to estimate treatment effects in observational studies with longitudinal data. By conceptually splitting the data, we can combine these models while maintaining a conjugate framework, allowing us to avoid the use of Markov chain Monte Carlo methods. Approximations using the central limit theorem and random sampling allows our method to be scaled to high-dimensional confounders. The second method uses prior restrictions of the parameter space of a saturated model to partially identify causal effect estimates in scenarios with nonignorable missing outcome data. We focus on two common restrictions, instrumental variables and the direction of missing data bias, and investigate how these restrictions narrow the identification region for parameters of interest. Additionally, we propose a rejection sampling algorithm that allows us to quantify the evidence for these assumptions in the data. Saturated models require discrete data so continuous data must be discretized to use these methods, which can introduce residual confounding. We conclude by proposing a new soft-thresholding technique to discretize continuous confounders in the context of frequentist linear regression. We show using a triangular distribution weighting function can reduce the bias induced by discretization, while maintaining the interpretability benefits typically associated with discrete variables.

View record

Bayesian models for hierarchical clustering of network data (2023)

Network data represent relational information between interacting entities. They can be described by graphs, where vertices denote the entities and edges mark the interactions. Clustering is a common approach for uncovering hidden structure in network data. The objective is to find related groups of vertices, through their shared edges. Hierarchical clustering identifies groups of vertices across multiple scales, where a dendrogram represents the full hierarchy of clusters. Often, Bayesian models for hierarchical clustering of network data aim to infer the posterior distribution over dendrograms.The Hierarchical Random Graph (HRG) is likely the most popular Bayesian approach to hierarchical clustering of network data. Due to simplifications made in its inference scheme, we identify some potentially undesirable model behaviour. Mathematically, we show that this behaviour presents in two ways: symmetry of the likelihood for graphs and their complements, and non-uniformity of the prior. The latter is exposed by finding an equivalent interpretation of the HRG as a proper Bayesian model, with normalized likelihood. We show that the amount of non-uniformity is exacerbated as the size of the network data increases. In rectifying the issues with the HRG, we propose a general class of models for hierarchical clustering of network data. This class is characterized by a sampling construction, defining a generative process for simple graphs, on fixed vertex sets. It permits a wide range of probabilistic models, via the choice of distribution over edge counts between clusters. We present four Bayesian models from this class, and derive their respective properties, like expected edge density and independence. For three of these models, we derive a closed-form expression for the marginalized posterior distribution over dendrograms, to isolate the problem of inferring a hierarchical clustering. We implement these models in a probabilistic programming language that leverages state-of-the-art approximate inference methods. As our class of models use a uniform prior over dendrograms, we construct an algorithm for sampling from this prior. Finally, the empirical performance of our models is demonstrated on examples of real network data.

View record

Bayesian adjustments for disease misclassification in epidemiological studies of health administrative databases, with applications to multiple sclerosis research (2019)

With disease information routinely established from diagnostic codes or prescriptions in health administrative databases, the topic of outcome misclassification is gaining importance in epidemiological research. Motivated by a Canada-wide observational study into the prodromal phase of multiple sclerosis (MS), this thesis considers the setting of a matched exposure-disease association study where the disease is measured with error.We initially focus on the special case of a pair-matched case-control study. Assuming non-differential misclassification of study participants, we give a closed-form expression for asymptotic biases in odds ratios arising under naive analyses of misclassified data, and propose a Bayesian model to correct association estimates for misclassification bias. For identifiability, the model relies on information from a validation cohort of correctly classified case-control pairs, and also requires prior knowledge about the predictive values of the classifier. In a simulation study, the model shows improved point and interval estimates relative to the naive analysis, but is also found to be overly restrictive in a real data application.In light of these concerns, we propose a generalized model for misclassified data that extends to the case of differential misclassification and allows for a variable number of controls per case. Instead of prior information about the classification process, the model relies on individual-level estimates of each participant's true disease status, which were obtained from a counting process mixture model of MS-specific healthcare utilization in our motivating example.Lastly, we consider the problem of assessing the non-differential misclassification assumption in situations where the exposure is suspected to impact the classification accuracy of cases and controls, but information on the true disease status is unavailable. Motivated by the non-identified nature of the problem, we consider a Bayesian analysis and examine the utility of Bayes factors to provide evidence against the null hypothesis of non-differential misclassification. Simulation studies show that for a range of realistic misclassification scenarios, and under mildly informative prior distributions, posterior distributions of the exposure effect on classification accuracy exhibit sufficient updating to detect differential misclassification with moderate to strong evidence.

View record

If journals embraced conditional equivalence testing, would research be better? (2019)

The Gene-Environment Independence Assumption in the Analysis of Case-Control Data (2017)

In this thesis, we consider the problem of exploiting the gene-environment independence assumption in a case-control study inferring the joint effect of genotype and environmental exposure on disease risk. We first take a detour and develop the constrained maximum likelihood estimation theory for parameters arising from a partially identified model, where some parameters of the model may only be identified through constraints imposed by additional assumptions. We show that, under certain conditions, the constrained maximum likelihood estimator exists and locally maximizes the likelihood function subject to constraints. Moreover, we study the asymptotic distribution of the estimator and propose a numerical algorithm for estimating parameters. Next, we use the frequentist approach to analyze case-control data under the gene-environment independence assumption. By transforming the problem into a constrained maximum likelihood estimation problem, we are able to derive the asymptotic distribution of the estimator in a closed form. We then show that exploiting the gene-environment independence assumption indeed improves estimation efficiency. Also, we propose an easy-to-implement numerical algorithm for finding estimates in practice. Furthermore, we approach the problem in a Bayesian framework. By introducing a different parameterization of the underlying model for case-control data, we are able to define a prior structure reflecting the gene-environment independence assumption and develop an efficient numerical algorithm for the computation of the posterior distribution. The proposed Bayesian method is further generalized to address the concern about the validity of the gene-environment independence assumption. Finally, we consider a special variant of the standard case-control design, the case-only design, and study the analysis of case-only data under the gene-environment independence assumption and the rare disease assumption. We show that the Bayesian method for analyzing case-control data is readily applicable for the analysis of case-only data, allowing the flexibility of incorporating different prior beliefs on disease prevalence.

View record

Causal Inference Approaches for Dealing with Time-Dependent Confounding in Longitudinal Studies, with Applications to Multiple Sclerosis Research (2015)

Marginal structural Cox models (MSCMs) have gained popularity in analyzing longitudinal data in the presence of 'time-dependent confounding', primarily in the context of HIV/AIDS and related conditions. This thesis is motivated by issues arising in connection with dealing with time-dependent confounding while assessing the effects of beta-interferon drug exposure on disease progression in relapsing-remitting multiple sclerosis (MS) patients in the real-world clinical practice setting. In the context of this chronic, yet fluctuating disease, MSCMs were used to adjust for the time-varying confounders, such as MS relapses, as well as baseline characteristics, through the use of inverse probability weighting (IPW). Using a large cohort of 1,697 relapsing-remitting MS patients in British Columbia, Canada (1995-2008), no strong association between beta-interferon exposure and the hazard of disability progression was found (hazard ratio 1.36, 95% confidence interval 0.95, 1.94). We also investigated whether it is possible to improve the MSCM weight estimation techniques by using statistical learning methods, such as bagging, boosting and support vector machines. Statistical learning methods require fewer assumptions and have been found to estimate propensity scores with better covariate balance. As propensity scores and IPWs in MSCM are functionally related, we also studied the usefulness of statistical learning methods via a series of simulation studies. The IPWs estimated from the boosting approach were associated with less bias and better coverage compared to the IPWs estimated from the conventional logistic regression approach. Additionally, two alternative approaches, prescription time-distribution matching (PTDM) and the sequential Cox approach, proposed in the literature to deal with immortal time bias and time-dependent confounding respectively, were compared via a series of simulations. The PTDM approach was found to be not as effective as the Cox model (with treatment considered as a time-dependent exposure) in minimizing immortal time bias. The sequential Cox approach was, however, found to be an effective method to minimize immortal time bias, but not as effective as a MSCM, in the presence of time-dependent confounding. These methods were used to re-analyze the MS dataset to show their applicability. The findings from the simulation studies were also used to guide the data analyses.

View record

Model and Inference Issues Related to Exposure-Disease Relationships (2014)

The goal of my thesis is to make contributions on some statistical issues related to epidemiological investigations of exposure-disease relationships. Firstly, when the exposure data contain missing values and measurement errors, we build a Bayesian hierarchical model for relating disease to a potentially harmful exposure while accommodating these flaws. The traditional imputation method, called the group-based exposure assessment method, uses the group exposure mean to impute the individual exposure in that group, where the group indicator indicates that the exposure levels tend to vary more across groups and less within groups. We compare our method with the traditional method through simulation studies, a real data application, and theoretical calculation. We focus on cohort studies where a logistic disease model is appropriate and where group exposure means can be treated as fixed effects. The results show a variety of advantages of the fully Bayesian approach, and provide recommendations on situations where the traditional method may not be suitable to use. Secondly, we investigate a number of issues surrounding inference and the shape of the exposure-disease relationship. Presuming that the relationship can be expressed in terms of regression coefficients and a shape parameter, we investigate how well the shape can be inferred in settings which might typify epidemiologic investigations and risk assessment. We also consider a suitable definition of the average effect of exposure, and investigate how precisely this can be inferred. We also examine the extent to which exposure measurement error distorts inference about the shape of the exposure-disease relationship. All these investigations require a family of exposure-disease relationships indexed by a shape parameter. For this purpose, we employ a family based on the Box-Cox transformation.Thirdly, matching is commonly used to reduce confounding due to lack of randomization in the experimental design. However, ignoring measurement errors in matching variables will introduce systematically biased matching results. Therefore, we recommend to fit a trajectory model to the observed covariate and then use the estimated true values from the model to do the matching. In this way, we can improve the quality of matching in most cases.

View record

Bayesian methods for alleviating identification issues with applications in health and insurance areas (2013)

In areas such as health and insurance, there can be data limitations that may cause an identification problem in statistical modeling. Ignoring the issues may result in bias in statistical inference. Bayesian methods have been proven to be useful in alleviating identification issues by incorporating prior knowledge. In health areas, the existence of hard-to-reach populations in survey sampling will cause a bias in population estimates of disease prevalence, medical expenditures and health care utilizations. For the three types of measures, we propose four Bayesian models based on binomial, gamma, zero-inflated Poisson and zero-inflated negative binomial distributions. Large-sample limits of the posterior mean and standard deviation are obtained for population estimators. By extensive simulation studies, we demonstrate that the posteriors are converging to their large-sample limits in a manner comparable to that of an identified model. Under the regression context, the existence of hard-to-reach populations will cause a bias in assessing risk factors such as smoking. For the corresponding regression models, we obtain theoretical results on the limiting posteriors. Case studies are conducted on several well-known survey datasets. Our work confirms that sensible results can be obtained using Bayesian inference, despite the nonidentifiability caused by hard-to-reach populations.In insurance, there are specific issues such as misrepresentation on risk factors that may result in biased estimates of insurance premiums. In particular, for a binary risk factor, the misclassification occurs only in one direction. We propose three insurance prediction models based on Poisson, gamma and Bernoulli distributions to account for the effect. By theoretical studies on the form of posterior distributions and method of moment estimators, we confirm that model identification depends on the distribution of the response. Furthermore, we propose a binary model with the misclassified variable used as a response. Through simulation studies for the four models, we demonstrate that acknowledging the misclassification improves the accuracy in parameter estimation. For road collision modeling, measurement errors in annual traffic volumes may cause an attenuation effect in regression coefficients. We propose two Bayesian models, and theoretically confirm that the gamma models are identified. Simulation studies are conducted for finite sample scenarios.

View record

Modeling dependencies in multivariate DAta (2013)

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Integrating representative and non-representative survey data for efficient inference (2024)

The consequences of prior misspecification in Bayesian adjustment for confounders (2023)

Quantifying the utility of personalized treatment decision rules: extending and comparing two metrics for summarizing the heterogeneity of treatment effects (2021)

Incorporating partial adherence into the principal stratification analysis framework (2019)

Approximation of the formal Bayesian model comparison using the extended conditional predictive ordinate criterion (2017)

Poisson Process Infinite Relational Model: A Bayesian nonparametric model for transactional data (2016)

Extensions to the Multiplier Method for Inferring Population Size (2014)

Costs and Benefits of Environmental Data in Investigations of Gene-Disease Associations (2012)

Topics on the Effect of Non-Differential Exposure Misclassification (2012)

Time-varying exposure subject to misclassification: bias characterization and adjustment (2010)

Measurement error occurs frequently in observational studies investigating the relationship between exposure variables and a clinical outcome. Error-prone observations on the explanatory variable may lead to biased estimation and loss of power in detecting the impact of an exposure variable. When the exposure variable is time-varying, the impact of misclassification is complicated and significant. This increases uncertainty in assessing the consequences of ignoring measurement error associated with observed data, and brings difficulties to adjustment for misclassification.In this study we considered situations in which the exposure is time-varying and nondifferential misclassification occurs independently over time. We determined how misclassification biases the exposure outcome relationship through probabilistic arguments and then characterized the effect of misclassification as the model parameters vary. We show that misclassification of time-varying exposure measurements has a complicated effect when estimating the exposure-disease relationship. In particular the bias toward the null seen in the static case is not observed.After misclassification had been characterized we developed a means to adjust for misclassification by recreating, with greatest likelihood, the exposure path of each subject. Our adjustment uses hidden Markov chain theory to quickly and efficiently reduce the number of misclassified states and reduce the effect of misclassification on estimating the disease-exposure relationship.The method we propose makes use of only the observed misclassified exposure data and no validation data needs to be obtained. This is achieved by estimated switching probabilities and misclassification probabilities from the observed data. When these estimates are obtained the effect of misclassification can be determined through the characterization of the effect of misclassification presented previously. We can also directly adjust for misclassification by recreating the most likely exposure path using the Viterbi algorithm.The methods developed in this dissertation allow the effect of misclassification, on estimating the exposure-disease relationship, to be determined. It accounts for misclassification by reducing the number of misclassified states and allows the exposure-disease relationship to be estimated significantly more accurately. It does this without the use of validation data and is easy to implement in existing statistical software.

View record