Paul Gustafson: Professor at Department of Statistics, UBC Faculty of Science

Professor

Faculty of Science

Research Classification

Statistics

Research Interests

meta-analysis

Parametric and Non-Parametric Inference

Theoretical Statistics

Pharmacoepidemiology

Bayesian statistical methods

Biostatistics and Epidemiology

Causal inference

Evidence synthesis

Partial Identification

Recruitment

Looking to recruit:

Master's students

Doctoral students

Postdoctoral Fellows

Desired start dates: Any time / year round

Complete these steps before you reach out to a faculty member!

Check requirements

Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.

Focus your search

Identify specific faculty members who are conducting research in your specific area of interest.
Establish that your research interests align with the faculty member’s research interests.
- Read up on the faculty members in the program and the research being conducted in the department.
- Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.

Make a good impression

Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
- Do not send non-specific, mass emails to everyone in the department hoping for a match.
- Address the faculty members by name. Your contact should be genuine rather than generic.
Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Be sure to craft compelling answers to these questions.
Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
Demonstrate that you are familiar with their research:
- Convey the specific ways you are a good fit for the program.
- Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
Be enthusiastic, but don’t overdo it.

Attend an information session

G+PS regularly provides virtual sessions that focus on admission requirements and procedures and tips on how to improve your application.

Supervision Enquiry

If you have reviewed some of this faculty member's publications, understand their research interests and have reviewed the admission requirements, you may submit a supervision enquiry.

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Bayesian causal inference for discrete data (2024)

Causal inference provides a framework for estimating how a response changes when a given cause of interest changes. When all data are discrete we can use saturated nonparametric models to avoid unnecessary assumptions in our causal inference modelling, where we specify unique parameters for all possible combinations of treatments and confounders when estimating an outcome. Bayesian methods allow us to incorporate prior information into these saturated models, making them usable beyond simple settings with low dimensional confounders. In this thesis we propose two new nonparametric Bayes methods for causal inference based on saturated modelling. The first method combines a parametric model with a nonparametric saturated outcome model to estimate treatment effects in observational studies with longitudinal data. By conceptually splitting the data, we can combine these models while maintaining a conjugate framework, allowing us to avoid the use of Markov chain Monte Carlo methods. Approximations using the central limit theorem and random sampling allow our method to be scaled to high-dimensional confounders. The second method uses prior restrictions of the parameter space of a saturated model to partially identify causal effect estimates in scenarios with nonignorable missing outcome data. We focus on two common restrictions, instrumental variables and the direction of missing data bias, and investigate how these restrictions narrow the identification region for parameters of interest. Additionally, we propose a rejection sampling algorithm that allows us to quantify the evidence for these assumptions in the data. Saturated models require discrete data so continuous data must be discretized to use these methods, which can introduce residual confounding. We conclude by proposing a new soft-thresholding technique to discretize continuous confounders in the context of frequentist linear regression.
We show that using a triangular distribution weighting function can reduce the bias induced by discretization, while maintaining the interpretability benefits typically associated with discrete variables.
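
As a rough illustration of the saturated, conjugate modelling idea described in this abstract (and not the thesis's actual method), the Python sketch below gives every stratum-by-treatment cell its own Beta prior on the outcome probability, updates it in closed form, and averages the stratum-level risk differences. The function name, the Beta(1, 1) default prior, and the use of empirical stratum frequencies as fixed weights are all assumptions made for this example.

```python
from collections import Counter


def posterior_mean_ate(records, prior_a=1.0, prior_b=1.0):
    """Posterior-mean average treatment effect under a saturated model.

    records: iterable of (stratum, treatment, outcome) tuples, where
    treatment and outcome are 0 or 1 and stratum is any hashable label
    for a combination of discrete confounders.
    """
    n = Counter()          # (stratum, treatment) -> number of subjects
    y = Counter()          # (stratum, treatment) -> number with outcome 1
    stratum_n = Counter()  # stratum -> number of subjects
    for stratum, treatment, outcome in records:
        n[(stratum, treatment)] += 1
        y[(stratum, treatment)] += outcome
        stratum_n[stratum] += 1

    total = sum(stratum_n.values())
    ate = 0.0
    for stratum, count in stratum_n.items():
        # Conjugate update: Beta(a, b) prior plus binomial data gives a
        # Beta(a + successes, b + failures) posterior for each cell.
        p1 = (prior_a + y[(stratum, 1)]) / (prior_a + prior_b + n[(stratum, 1)])
        p0 = (prior_a + y[(stratum, 0)]) / (prior_a + prior_b + n[(stratum, 0)])
        # Standardize over the empirical stratum distribution
        # (a simplification: the weights are treated as known).
        ate += (count / total) * (p1 - p0)
    return ate
```

Because each cell is updated in closed form, no Markov chain Monte Carlo is needed. An empty cell simply falls back to its prior mean, which is how prior information keeps the saturated model usable when data are sparse.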

Bayesian models for hierarchical clustering of network data (2023)

Network data represent relational information between interacting entities. They can be described by graphs, where vertices denote the entities and edges mark the interactions. Clustering is a common approach for uncovering hidden structure in network data. The objective is to find related groups of vertices, through their shared edges. Hierarchical clustering identifies groups of vertices across multiple scales, where a dendrogram represents the full hierarchy of clusters. Often, Bayesian models for hierarchical clustering of network data aim to infer the posterior distribution over dendrograms. The Hierarchical Random Graph (HRG) is likely the most popular Bayesian approach to hierarchical clustering of network data. Due to simplifications made in its inference scheme, we identify some potentially undesirable model behaviour. Mathematically, we show that this behaviour presents in two ways: symmetry of the likelihood for graphs and their complements, and non-uniformity of the prior. The latter is exposed by finding an equivalent interpretation of the HRG as a proper Bayesian model, with normalized likelihood. We show that the amount of non-uniformity is exacerbated as the size of the network data increases. In rectifying the issues with the HRG, we propose a general class of models for hierarchical clustering of network data. This class is characterized by a sampling construction, defining a generative process for simple graphs, on fixed vertex sets. It permits a wide range of probabilistic models, via the choice of distribution over edge counts between clusters. We present four Bayesian models from this class, and derive their respective properties, like expected edge density and independence. For three of these models, we derive a closed-form expression for the marginalized posterior distribution over dendrograms, to isolate the problem of inferring a hierarchical clustering.
We implement these models in a probabilistic programming language that leverages state-of-the-art approximate inference methods. As our class of models uses a uniform prior over dendrograms, we construct an algorithm for sampling from this prior. Finally, the empirical performance of our models is demonstrated on examples of real network data.

Hidden at the root : statistical methods for population size estimation on trees (2023)

The full abstract for this thesis is available in the body of the thesis, and will be available when the embargo expires.

Bayesian adjustments for disease misclassification in epidemiological studies of health administrative databases, with applications to multiple sclerosis research (2019)

With disease information routinely established from diagnostic codes or prescriptions in health administrative databases, the topic of outcome misclassification is gaining importance in epidemiological research. Motivated by a Canada-wide observational study into the prodromal phase of multiple sclerosis (MS), this thesis considers the setting of a matched exposure-disease association study where the disease is measured with error. We initially focus on the special case of a pair-matched case-control study. Assuming non-differential misclassification of study participants, we give a closed-form expression for asymptotic biases in odds ratios arising under naive analyses of misclassified data, and propose a Bayesian model to correct association estimates for misclassification bias. For identifiability, the model relies on information from a validation cohort of correctly classified case-control pairs, and also requires prior knowledge about the predictive values of the classifier. In a simulation study, the model shows improved point and interval estimates relative to the naive analysis, but is also found to be overly restrictive in a real data application. In light of these concerns, we propose a generalized model for misclassified data that extends to the case of differential misclassification and allows for a variable number of controls per case. Instead of prior information about the classification process, the model relies on individual-level estimates of each participant's true disease status, which were obtained from a counting process mixture model of MS-specific healthcare utilization in our motivating example. Lastly, we consider the problem of assessing the non-differential misclassification assumption in situations where the exposure is suspected to impact the classification accuracy of cases and controls, but information on the true disease status is unavailable.
Motivated by the non-identified nature of the problem, we consider a Bayesian analysis and examine the utility of Bayes factors to provide evidence against the null hypothesis of non-differential misclassification. Simulation studies show that for a range of realistic misclassification scenarios, and under mildly informative prior distributions, posterior distributions of the exposure effect on classification accuracy exhibit sufficient updating to detect differential misclassification with moderate to strong evidence.

If journals embraced conditional equivalence testing, would research be better? (2019)

We consider the reliability of published science: the probability that scientific claims put forth are true. Low reliability within many scientific fields is of major concern for researchers, scientific journals and the public at large. In the first part of this thesis, we introduce a publication policy that incorporates "conditional equivalence testing" (CET), a two-stage testing scheme in which standard null-hypothesis significance testing is followed, if the null hypothesis is not rejected, by testing for equivalence. The idea of CET has the potential to address recent concerns about reproducibility and the limited publication of null results. We detail the implementation of CET, investigate similarities with a Bayesian testing scheme, and outline the basis for how a scientific journal could proceed to reduce publication bias while remaining relevant. In the second part of this thesis, we consider proposals to adopt measures of "greater statistical stringency," including suggestions to require larger sample sizes and to lower the highly criticized "p
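
The two-stage structure of CET described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it uses a z-test on a normally distributed estimate, takes the equivalence stage to be a two one-sided tests (TOST) procedure with a user-chosen `margin`, and applies the same `alpha` in both stages. The function name and the result labels are hypothetical.

```python
from statistics import NormalDist


def conditional_equivalence_test(estimate, std_error, margin, alpha=0.05):
    """Return (label, p_value) for a two-stage CET on one estimate.

    Stage 1 tests H0: effect = 0 with a two-sided z-test.
    Stage 2 runs only if stage 1 does not reject, and tests
    H0: |effect| >= margin with two one-sided tests (assumed here).
    """
    z = NormalDist()

    # Stage 1: standard null-hypothesis significance test.
    p_nhst = 2 * (1 - z.cdf(abs(estimate / std_error)))
    if p_nhst < alpha:
        return "positive", p_nhst

    # Stage 2: equivalence test, conditional on a non-significant stage 1.
    p_lower = 1 - z.cdf((estimate + margin) / std_error)  # H0: effect <= -margin
    p_upper = z.cdf((estimate - margin) / std_error)      # H0: effect >= +margin
    p_equiv = max(p_lower, p_upper)
    if p_equiv < alpha:
        return "negative", p_equiv
    return "inconclusive", p_equiv
```

The point of the second stage is that a non-significant result is split into two cases: evidence that the effect is negligibly small, and genuinely inconclusive data.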

The Gene-Environment Independence Assumption in the Analysis of Case-Control Data (2017)

In this thesis, we consider the problem of exploiting the gene-environment independence assumption in a case-control study inferring the joint effect of genotype and environmental exposure on disease risk. We first take a detour and develop the constrained maximum likelihood estimation theory for parameters arising from a partially identified model, where some parameters of the model may only be identified through constraints imposed by additional assumptions. We show that, under certain conditions, the constrained maximum likelihood estimator exists and locally maximizes the likelihood function subject to constraints. Moreover, we study the asymptotic distribution of the estimator and propose a numerical algorithm for estimating parameters. Next, we use the frequentist approach to analyze case-control data under the gene-environment independence assumption. By transforming the problem into a constrained maximum likelihood estimation problem, we are able to derive the asymptotic distribution of the estimator in a closed form. We then show that exploiting the gene-environment independence assumption indeed improves estimation efficiency. Also, we propose an easy-to-implement numerical algorithm for finding estimates in practice. Furthermore, we approach the problem in a Bayesian framework. By introducing a different parameterization of the underlying model for case-control data, we are able to define a prior structure reflecting the gene-environment independence assumption and develop an efficient numerical algorithm for the computation of the posterior distribution. The proposed Bayesian method is further generalized to address the concern about the validity of the gene-environment independence assumption. Finally, we consider a special variant of the standard case-control design, the case-only design, and study the analysis of case-only data under the gene-environment independence assumption and the rare disease assumption. 
We show that the Bayesian method for analyzing case-control data is readily applicable for the analysis of case-only data, allowing the flexibility of incorporating different prior beliefs on disease prevalence.
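
For orientation, the classical frequentist case-only estimator is easy to state: under gene-environment independence and a rare disease, the odds ratio between genotype and exposure among cases alone approximates the multiplicative gene-environment interaction. The Python sketch below computes that odds ratio with a Woolf-type confidence interval. It is a textbook illustration only, not the Bayesian method developed in the thesis, and the function and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def case_only_interaction(g1e1, g1e0, g0e1, g0e0, level=0.95):
    """Case-only estimate of the multiplicative G x E interaction.

    Arguments are counts of cases in each genotype/exposure cell,
    e.g. g1e0 = carriers (G=1) who are unexposed (E=0).
    Returns (odds_ratio, lower, upper).
    """
    odds_ratio = (g1e1 * g0e0) / (g1e0 * g0e1)
    # Woolf standard error of the log odds ratio.
    se_log = math.sqrt(1 / g1e1 + 1 / g1e0 + 1 / g0e1 + 1 / g0e0)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

The estimate is only as good as the independence assumption: if genotype and exposure are associated in the source population, that association is indistinguishable from interaction in case-only data.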

Causal Inference Approaches for Dealing with Time-Dependent Confounding in Longitudinal Studies, with Applications to Multiple Sclerosis Research (2015)

Marginal structural Cox models (MSCMs) have gained popularity in analyzing longitudinal data in the presence of 'time-dependent confounding', primarily in the context of HIV/AIDS and related conditions. This thesis is motivated by issues arising in connection with dealing with time-dependent confounding while assessing the effects of beta-interferon drug exposure on disease progression in relapsing-remitting multiple sclerosis (MS) patients in the real-world clinical practice setting. In the context of this chronic, yet fluctuating disease, MSCMs were used to adjust for the time-varying confounders, such as MS relapses, as well as baseline characteristics, through the use of inverse probability weighting (IPW). Using a large cohort of 1,697 relapsing-remitting MS patients in British Columbia, Canada (1995-2008), no strong association between beta-interferon exposure and the hazard of disability progression was found (hazard ratio 1.36, 95% confidence interval 0.95, 1.94). We also investigated whether it is possible to improve the MSCM weight estimation techniques by using statistical learning methods, such as bagging, boosting and support vector machines. Statistical learning methods require fewer assumptions and have been found to estimate propensity scores with better covariate balance. As propensity scores and IPWs in MSCM are functionally related, we also studied the usefulness of statistical learning methods via a series of simulation studies. The IPWs estimated from the boosting approach were associated with less bias and better coverage compared to the IPWs estimated from the conventional logistic regression approach. Additionally, two alternative approaches, prescription time-distribution matching (PTDM) and the sequential Cox approach, proposed in the literature to deal with immortal time bias and time-dependent confounding respectively, were compared via a series of simulations. 
The PTDM approach was found to be not as effective as the Cox model (with treatment considered as a time-dependent exposure) in minimizing immortal time bias. The sequential Cox approach was, however, found to be an effective method to minimize immortal time bias, but not as effective as an MSCM, in the presence of time-dependent confounding. These methods were used to re-analyze the MS dataset to show their applicability. The findings from the simulation studies were also used to guide the data analyses.
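
To make the weighting step concrete, the Python sketch below computes stabilized inverse probability weights for one subject across successive visits. It assumes the per-visit probabilities of the treatment actually received have already been estimated elsewhere (for example by logistic regression or boosting, as compared in the thesis); the function and argument names are hypothetical.

```python
def stabilized_weights(numerator_probs, denominator_probs):
    """Cumulative stabilized IPW for one subject.

    numerator_probs[t]:   P(observed treatment at visit t | treatment history)
    denominator_probs[t]: P(observed treatment at visit t | treatment history
                            and time-varying confounders)
    Returns the list of cumulative weights, one per visit.
    """
    weights = []
    running = 1.0
    for p_num, p_den in zip(numerator_probs, denominator_probs):
        # The weight at visit t is the product of the ratios up to t.
        running *= p_num / p_den
        weights.append(running)
    return weights
```

For instance, `stabilized_weights([0.5, 0.5], [0.25, 0.5])` returns `[2.0, 2.0]`: the subject received a treatment at the first visit that was unlikely given their confounders, so their follow-up is up-weighted from then on.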

Model and Inference Issues Related to Exposure-Disease Relationships (2014)

The goal of my thesis is to make contributions on some statistical issues related to epidemiological investigations of exposure-disease relationships. Firstly, when the exposure data contain missing values and measurement errors, we build a Bayesian hierarchical model for relating disease to a potentially harmful exposure while accommodating these flaws. The traditional imputation method, called the group-based exposure assessment method, uses the group exposure mean to impute the individual exposure in that group, where the group indicator indicates that the exposure levels tend to vary more across groups and less within groups. We compare our method with the traditional method through simulation studies, a real data application, and theoretical calculation. We focus on cohort studies where a logistic disease model is appropriate and where group exposure means can be treated as fixed effects. The results show a variety of advantages of the fully Bayesian approach, and provide recommendations on situations where the traditional method may not be suitable to use. Secondly, we investigate a number of issues surrounding inference and the shape of the exposure-disease relationship. Presuming that the relationship can be expressed in terms of regression coefficients and a shape parameter, we investigate how well the shape can be inferred in settings which might typify epidemiologic investigations and risk assessment. We also consider a suitable definition of the average effect of exposure, and investigate how precisely this can be inferred. We also examine the extent to which exposure measurement error distorts inference about the shape of the exposure-disease relationship. All these investigations require a family of exposure-disease relationships indexed by a shape parameter. For this purpose, we employ a family based on the Box-Cox transformation. Thirdly, matching is commonly used to reduce confounding due to lack of randomization in the experimental design.
However, ignoring measurement errors in matching variables will introduce systematically biased matching results. Therefore, we recommend fitting a trajectory model to the observed covariate and then using the estimated true values from the model to do the matching. In this way, we can improve the quality of matching in most cases.

Bayesian methods for alleviating identification issues with applications in health and insurance areas (2013)

In areas such as health and insurance, there can be data limitations that may cause an identification problem in statistical modeling. Ignoring the issues may result in bias in statistical inference. Bayesian methods have been proven to be useful in alleviating identification issues by incorporating prior knowledge. In health areas, the existence of hard-to-reach populations in survey sampling will cause a bias in population estimates of disease prevalence, medical expenditures and health care utilizations. For the three types of measures, we propose four Bayesian models based on binomial, gamma, zero-inflated Poisson and zero-inflated negative binomial distributions. Large-sample limits of the posterior mean and standard deviation are obtained for population estimators. By extensive simulation studies, we demonstrate that the posteriors are converging to their large-sample limits in a manner comparable to that of an identified model. Under the regression context, the existence of hard-to-reach populations will cause a bias in assessing risk factors such as smoking. For the corresponding regression models, we obtain theoretical results on the limiting posteriors. Case studies are conducted on several well-known survey datasets. Our work confirms that sensible results can be obtained using Bayesian inference, despite the nonidentifiability caused by hard-to-reach populations. In insurance, there are specific issues such as misrepresentation on risk factors that may result in biased estimates of insurance premiums. In particular, for a binary risk factor, the misclassification occurs only in one direction. We propose three insurance prediction models based on Poisson, gamma and Bernoulli distributions to account for the effect. By theoretical studies on the form of posterior distributions and method of moment estimators, we confirm that model identification depends on the distribution of the response.
Furthermore, we propose a binary model with the misclassified variable used as a response. Through simulation studies for the four models, we demonstrate that acknowledging the misclassification improves the accuracy in parameter estimation. For road collision modeling, measurement errors in annual traffic volumes may cause an attenuation effect in regression coefficients. We propose two Bayesian models, and theoretically confirm that the gamma models are identified. Simulation studies are conducted for finite sample scenarios.

Modeling dependencies in multivariate data (2013)

In multivariate regression, researchers are interested in modeling a correlated multivariate response variable as a function of covariates. The response of interest can be multidimensional; the correlation between the elements of the multivariate response can be very complex. In many applications, the association between the elements of the multivariate response is typically treated as a nuisance parameter. The focus is on estimating efficiently the regression coefficients, in order to study the average change in the mean response as a function of predictors. However, in many cases, the estimation of the covariance and, where applicable, the temporal dynamics of the multidimensional response is the main interest, such as the case in finance, for example. Moreover, the correct specification of the covariance matrix is important for the efficient estimation of the regression coefficients. These complex models usually involve some parameters that are static and some dynamic. Until recently, the simultaneous estimation of dynamic and static parameters in the same model has been difficult. The introduction of particle MCMC algorithms by Andrieu and Doucet (2002) has allowed for the possibility of considering such models. In this thesis, we propose a general framework for jointly estimating the covariance matrix of multivariate data as well as the regression coefficients. This is done under different settings, for different dimensions and measurement scales.

Imperfect Variables: A Conceptual Framework for the Combined Problem of Missing Data and Mismeasured Variables with Application to Generalized Linear Models (2009)

No abstract available.

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Integrating representative and non-representative survey data for efficient inference (2024)

Non-representative surveys are commonly used and widely available but suffer from selection bias that generally cannot be entirely eliminated using weighting techniques. Instead, we propose a Bayesian method to synthesize longitudinal representative unbiased surveys with non-representative biased surveys by estimating the degree of selection bias over time. We show using a simulation study that synthesizing biased and unbiased surveys together out-performs using the unbiased surveys alone, even if the selection bias may evolve in a complex manner over time. Using COVID-19 vaccination data, we are able to synthesize two large sample biased surveys with an unbiased survey to reduce uncertainty in now-casting and inference estimates while simultaneously retaining the empirical credible interval coverage. Ultimately, we are able to conceptually obtain the properties of a large sample unbiased survey if the assumed unbiased survey, used to anchor the estimates, is unbiased for all time-points.

The consequences of prior misspecification in Bayesian adjustment for confounders (2023)

Inferring the causal relationship between a treatment and a response is complicated in non-randomized studies owing to the effects of potentially confounding variables. Previous works have demonstrated that misspecifying the set of potential confounders in a causal analysis can have significant consequences for causal effect estimation. Bayesian Adjustment for Confounders (BAC) is a Bayesian approach to variable selection, whereby a mixture of posteriors is used to combine the causal effect estimates from each model corresponding to a combination of the potential confounders. Our work uses Monte Carlo simulation techniques in order to numerically compute the inflation in the average mean squared error due to prior misspecification in the BAC methodology over repeated experiments in a saturated probability model case study. Our findings shed light on future areas for research, and provide users of the BAC methodology with advice on selecting an appropriate prior model for their studies.

Quantifying the utility of personalized treatment decision rules: extending and comparing two metrics for summarizing the heterogeneity of treatment effects (2021)

The treatment benefit prediction model is a type of clinical prediction model that quantifies the magnitude of treatment benefit given an individual's unique characteristics. As the topic of treatment effect modelling is relatively new, quantifying and summarizing the performance of treatment benefit models are not well studied. The "concordance-statistic for benefit" and the "concentration of benefit index" are two newly developed metrics that evaluate the discriminative ability of the treatment benefit prediction. However, the similarities and differences between these two metrics are not yet explored. We compare and contrast the metrics from conceptual, theoretical, and empirical perspectives and illustrate the application of the metrics. We consider the common scenario of a logistic regression model for a binary response developed based on data from a randomized controlled trial with two treatment arms. This dissertation provides two major contributions: first, the two metrics are expanded into three pairs of metrics, each having a particular scope; second, it provides results of theoretical and simulation studies that compare and contrast the construct and empirical behaviour of these metrics. We found that the heterogeneity of treatment effect appropriately influences these metrics. Metrics related to the "concordance-statistic for benefit" are sensitive to the unobservable correlation between counterfactual outcomes. In a case study, we quantify the metrics in a randomized controlled trial of acute myocardial infarction therapies on 30-day mortality. We conclude that these metrics help understand the heterogeneity of treatment effect and the consequent impact on treatment decision-making.

Incorporating partial adherence into the principal stratification analysis framework (2019)

Participants in pragmatic clinical trials often partially adhere to treatment. In the presence of partial adherence, simple statistical analyses of binary adherence (receiving either full or no treatment) introduce biases. We developed a framework which expands the principal stratification approach to allow partial adherers to have their own principal stratum and treatment level. We derived consistent estimates for bounds on population values of interest. A Monte Carlo posterior sampling method was derived that is computationally faster than Markov Chain Monte Carlo sampling, with confirmed equivalent results. Simulations indicate that the two methods agree with each other and are superior in most cases to the biased estimators created through standard principal stratification. The results suggest that these new methods may lead to increased accuracy of inference in settings where study participants only partially adhere to assigned treatment.

Approximation of the formal Bayesian model comparison using the extended conditional predictive ordinate criterion (2017)

The optimal method for Bayesian model comparison is the formal Bayes factor (BF), according to decision theory. The formal BF is computationally troublesome for more complex models. If predictive distributions under the competing models do not have a closed form, a cross-validation idea, called the conditional predictive ordinate (CPO) criterion, can be used. In the cross-validation sense, this is a "leave-out one" approach. CPO can be calculated directly from the Monte Carlo (MC) outputs, and the resulting Bayesian model comparison is called the pseudo Bayes factor (PBF). We can get closer to the formal Bayesian model comparison by increasing the "leave-out size", and at "leave-out all" we recover the formal BF. But the MC error increases with increasing "leave-out size". In this study, we examine this for linear and logistic regression models. Our study reveals that the Bayesian model comparison can favour a different model for PBF compared to BF when comparing two close linear models. So, larger "leave-out sizes" are preferred, which provide results close to the optimal BF. On the other hand, MC-sample-based formal Bayesian model comparisons are computed with more MC error for increasing "leave-out sizes"; this is observed by comparing with the available closed-form results. Still, considering a reasonable error, we can use a "leave-out size" greater than one instead of fixing it at one. These findings can be extended to logistic models where a closed-form solution is unavailable.
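
The leave-one-out case mentioned in this abstract can be computed directly from posterior draws, using the standard identity that CPO for observation i is the harmonic mean of its likelihood over the draws. The Python sketch below is a minimal illustration of that identity and of the resulting log pseudo Bayes factor. It covers only the leave-one-out case, not the extended criterion studied in the thesis, and the function names and the input layout are assumptions.

```python
import math


def log_pseudo_marginal(lik):
    """Sum of log CPO values for one model.

    lik[s][i] is the likelihood p(y_i | theta_s) of observation i
    evaluated at posterior draw s (all values must be positive).
    """
    n_draws = len(lik)
    n_obs = len(lik[0])
    total = 0.0
    for i in range(n_obs):
        # CPO_i = 1 / mean_s( 1 / p(y_i | theta_s) ), a harmonic mean.
        mean_inverse = sum(1.0 / lik[s][i] for s in range(n_draws)) / n_draws
        total += -math.log(mean_inverse)
    return total


def log_pseudo_bayes_factor(lik_model_1, lik_model_2):
    """Log PBF of model 1 against model 2; positive values favour model 1."""
    return log_pseudo_marginal(lik_model_1) - log_pseudo_marginal(lik_model_2)
```

Harmonic means are sensitive to draws with very small likelihood values, which is one concrete source of the Monte Carlo error discussed in the abstract.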

Poisson Process Infinite Relational Model: A Bayesian nonparametric model for transactional data (2016)

Transactional data consists of instantaneously occurring observations made on ordered pairs of entities. It can be represented as a network (or more specifically, a directed multigraph) with edges possessing unique timestamps. This thesis explores a Bayesian nonparametric model for discovering latent class-structure in transactional data. Moreover, by pooling information within clusters of entities, it can be used to infer the underlying dynamics of the time-series data. Blundell, Beck, and Heller (2012) originally proposed this model, calling it the Poisson Process Infinite Relational Model; however, this thesis derives and elaborates on the necessary procedures to implement a fully Bayesian approximate inference scheme. Additionally, experimental results are used to validate the computational correctness of the inference algorithm. Further experiments on synthetic data evaluate the model's clustering performance and assess predictive ability. Real data from historical records of militarized disputes between nations test the model's capacity to learn varying degrees of structured relationships.


Extensions to the Multiplier Method for Inferring Population Size (2014)

Estimating population size is an important task for epidemiologists and ecologists alike, for purposes of resource planning and policy making. One method is the "multiplier method", which uses information about a binary trait to infer the size of a population. The first half of this thesis presents a likelihood-based estimator which generalizes the multiplier method to accommodate multiple traits as well as any number of categories (strata) in a trait. The asymptotic variance of this likelihood-based estimator is obtained through the Fisher information, and its behaviour under varying study designs is determined. The statistical advantage of using additional traits is most pronounced when the traits are uncorrelated and of low prevalence, and diminishes when the number of traits becomes large. The use of highly stratified traits, however, does not appear to provide much advantage over using binary traits. Finally, a Bayesian implementation of this method is applied to both simulated data and real data pertaining to an injection-drug user population. The second half of this thesis is a first systematic approach to quantifying the uncertainty in marginal count data that is an essential component of the multiplier method. A migration model that captures the stochastic mechanism giving rise to uncertainty is proposed. The migration model is applied, in conjunction with the multi-trait multiplier method, to real data from the British Columbia Centre for Disease Control.
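For orientation, the classical single-trait multiplier estimate can be sketched as follows. This shows only the basic binary-trait case, not the multi-trait likelihood estimator developed in the thesis. The function name, the example counts, and the choice to treat the marginal count as fixed (so that only sampling error in the trait proportion enters the delta-method standard error) are all assumptions made for this example.

```python
import math

def multiplier_estimate(marginal_count, sample_with_trait, sample_size):
    """Single-trait multiplier estimate of population size.

    marginal_count: number of population members known to have the trait
        (for example from a service register); treated as fixed here.
    sample_with_trait / sample_size: sample estimate of the trait proportion.
    Returns (estimate, delta-method standard error).
    """
    p_hat = sample_with_trait / sample_size
    n_hat = marginal_count / p_hat
    # Var(1 / p_hat) is approximately (1 - p) / (n * p^3) for a binomial proportion.
    se = marginal_count * math.sqrt((1.0 - p_hat) / (sample_size * p_hat ** 3))
    return n_hat, se

# Hypothetical numbers: 500 people on a register, 50 of 200 sampled report the trait.
n_hat, se = multiplier_estimate(500, 50, 200)
```

With these made-up counts the sample proportion is 0.25, so the estimate is 500 / 0.25 = 2000. The standard error grows quickly as the proportion shrinks, and it ignores any uncertainty in the marginal count itself, which is the issue the second half of the thesis addresses.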


Costs and Benefits of Environmental Data in Investigations of Gene-Disease Associations (2012)

The inclusion of environmental exposure data may be beneficial, in terms of statistical power, to the investigation of a gene-disease association when one exists. However, resources invested in obtaining exposure data could instead be applied to measure disease status and genotype on more subjects. In a cohort study setting, we consider the tradeoff between measuring only disease status and genotype for a larger study sample and measuring disease status, genotype, and environmental exposure for a smaller study sample, under the ‘Mendelian randomization’ assumption that the environmental exposure is independent of genotype in the study population. We focus on the power of tests for gene-disease association, applied in situations where a gene modifies risk of disease due to particular exposure without a main effect of gene on disease. Our results are equally applicable to exploratory genome-wide association studies and more hypothesis-driven candidate gene investigations. We further consider the impact of misclassification for environmental exposures. We find that under a wide range of circumstances research resources should be allocated to genotyping larger groups of individuals, to achieve a higher power for detecting the presence of gene-environment interactions by studying gene-disease association.


Topics on the Effect of Non-Differential Exposure Misclassification (2012)

There is quite an extensive literature on the deleterious impact of exposure misclassification when inferring exposure-disease associations, and on statistical methods to mitigate this impact. When the exposure is a continuous variable or a binary variable, a general mismeasurement phenomenon is attenuation in the strength of the relationship between exposure and outcome. However, few have investigated the effect of misclassification on a polychotomous variable. Using Bayesian methods, we investigate how misclassification affects the exposure-disease association under different settings of the classification matrix. We also apply a trend test and assess the effect of misclassification on the power of the test. In addition, since virtually all work on the impact of exposure misclassification presumes the simplest situation where both the true status and the classified status are binary, this work diverges from the norm in considering classification into three categories when the actual exposure status is simply binary. Intuitively, the classification states might be labeled as 'unlikely exposed', 'maybe exposed', and 'likely exposed'. While this situation has been discussed informally in the literature, we provide some theory concerning what can be learned about the exposure-disease relationship, under various assumptions about the classification scheme. We focus on the challenging situation whereby no validation data is available from which to infer classification probabilities, but some prior assertions about these probabilities might be justified.


Time-varying exposure subject to misclassification: bias characterization and adjustment (2010)

Measurement error occurs frequently in observational studies investigating the relationship between exposure variables and a clinical outcome. Error-prone observations on the explanatory variable may lead to biased estimation and loss of power in detecting the impact of an exposure variable. When the exposure variable is time-varying, the impact of misclassification is complicated and significant. This increases uncertainty in assessing the consequences of ignoring measurement error associated with observed data, and brings difficulties to adjustment for misclassification. In this study we considered situations in which the exposure is time-varying and nondifferential misclassification occurs independently over time. We determined how misclassification biases the exposure-outcome relationship through probabilistic arguments and then characterized the effect of misclassification as the model parameters vary. We show that misclassification of time-varying exposure measurements has a complicated effect when estimating the exposure-disease relationship. In particular, the bias toward the null seen in the static case is not observed. After misclassification had been characterized, we developed a means to adjust for misclassification by recreating, with greatest likelihood, the exposure path of each subject. Our adjustment uses hidden Markov chain theory to quickly and efficiently reduce the number of misclassified states and reduce the effect of misclassification on estimating the disease-exposure relationship. The method we propose makes use of only the observed misclassified exposure data, and no validation data needs to be obtained. This is achieved by estimating switching probabilities and misclassification probabilities from the observed data. When these estimates are obtained, the effect of misclassification can be determined through the characterization of the effect of misclassification presented previously. We can also directly adjust for misclassification by recreating the most likely exposure path using the Viterbi algorithm. The methods developed in this dissertation allow the effect of misclassification on estimating the exposure-disease relationship to be determined. They account for misclassification by reducing the number of misclassified states and allow the exposure-disease relationship to be estimated significantly more accurately. They do this without the use of validation data and are easy to implement in existing statistical software.
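The path-reconstruction step can be illustrated with a generic Viterbi decoder for a two-state hidden Markov model, where the hidden state is the true exposure and the observation is the possibly misclassified exposure. This is a minimal sketch, not the dissertation's implementation: the switching and misclassification probabilities below are hypothetical fixed values, whereas the abstract describes estimating them from the observed data.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely hidden path given observed (misclassified) states.

    log_init[k]     : log P(true state at time 0 = k)
    log_trans[j, k] : log P(next true state = k | current true state = j)
    log_emit[k, o]  : log P(observed = o | true state = k)
    """
    obs = np.asarray(obs)
    n_steps, n_states = len(obs), len(log_init)
    score = np.empty((n_steps, n_states))
    back = np.zeros((n_steps, n_states), dtype=int)
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        cand = score[t - 1][:, None] + log_trans  # cand[j, k]: best path ending j -> k
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    path = np.empty(n_steps, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n_steps - 2, -1, -1):  # trace back-pointers from the end
        path[t] = back[t + 1, path[t + 1]]
    return path

# Hypothetical parameters: exposure rarely switches (5% per step) and is
# recorded correctly 80% of the time in either state.
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.95, 0.05], [0.05, 0.95]])
log_emit = np.log([[0.8, 0.2], [0.2, 0.8]])
observed = [0, 0, 1, 0, 0]
path = viterbi(observed, log_init, log_trans, log_emit)
```

With these values the isolated observed `1` is decoded as a misclassification, so the reconstructed path stays in state 0 throughout: two switches are less probable than a single recording error.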


Paul Gustafson's Profile

Publications on Google Scholar

Membership Status

Member of G+PS

Program Affiliations

Statistics

Academic Unit(s)

Department of Statistics

