Bruno Zumbo

Prospective Graduate Students / Postdocs

This faculty member is currently not looking for graduate students or Postdoctoral Fellows. Please do not contact the faculty member with any such requests.


Research Interests

Psychometrics and Test Theory; Mathematical sciences of measurement
Latent Variable Models, Item Response Theory, Factor Analysis, Mixed models
Validity Theory and Validation
Multivariate Analysis

Research Options

I am available and interested in collaborations (e.g. clusters, grants).
I am interested in and conduct interdisciplinary research.


Bruno D. Zumbo, Professor & Distinguished University Scholar
Canada Research Chair in Psychometrics and Measurement (Tier 1)
University of British Columbia

Research Methodology

Statistical Methods
Mathematical Sciences
Computational methods

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Impact of scoring and response format on inferences from electronic surveys for rating scales (2020)

No abstract available.

On the measurement of social belonging and its connection to migration background (2020)

No abstract available.

Violations of unidimensionality and local independence in measures intended as unidimensional: assessing levels of violations and the accuracy in unidimensional IRT model estimates (2019)

No abstract available.

On monte carlo simulation algorithms for research in psychometrics (2017)

No abstract available.

On the impact of negatively keyed items on the assessment of the unidimensionality of psychological tests and measures (2017)

No abstract available.

The Impact of Predictor Variable(s) with Skewed Cell Probabilities on the Wald Test in Binary Logistic Regression (2017)

No abstract available.

Ordinal Generalizability Theory Using an Underlying Latent Variable Framework (2015)

This dissertation introduces a method for estimating the variance components required in the use of generalizability theory (GT) with categorical ratings (e.g., ordinal variables). Traditionally, variance components in GT are estimated using statistical techniques that treat ordinal variables as continuous. This may lead to bias in the estimation of variance components and the resulting reliability coefficients (called G-coefficients). This dissertation demonstrates that variance components can be estimated using a structural equation modeling (SEM) technique called covariance structural modeling (CSM) of a polychoric or tetrachoric correlation matrix, which accounts for the metric of ordinal variables. The dissertation provides a proof of concept of this method, which will be called ordinal GT, using real data in the computation of a relative G-coefficient, and a simulation study presenting the relative merits of ordinal to conventional G-coefficients from ordinal data. The results demonstrate that ordinal GT is viable using CSM of the polychoric matrix of ordinal data. In addition, using a Monte Carlo simulation, the relative G-coefficients when ordinal data are naively treated as continuous are compared to when they are correctly treated as ordinal. The number of response categories, magnitude of the theoretical G-coefficient, and skewness of the item response distributions were varied across experimental conditions for: (i) a two-facet crossed G-study design, and (ii) a one-facet partially nested G-study design. The results reveal that when ordinal data were treated as continuous, the empirical G-coefficients were consistently lower than their theoretical values. This was true regardless of the number of response categories, magnitude of the theoretical G-coefficient, and skewness. In contrast, the ordinal G-coefficients performed much better in all conditions.
This dissertation shows that using CSM to model the polychoric correlation matrix provides better estimates of variance components in the GT of ordinal variables. It offers researchers a new statistical avenue for computing relative G-coefficients when using ordinal variables.
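
As a rough illustration of the quantity this abstract discusses, the sketch below computes a relative G-coefficient from variance components for a fully crossed two-facet design. The facet labels (items, occasions), the function name, and the numeric values are hypothetical. The function takes the variance components as given; it does not perform the polychoric or CSM estimation the dissertation describes.

```python
def relative_g_two_facet(var_p, var_pi, var_po, var_pio_e, n_i, n_o):
    """Relative G-coefficient for a crossed persons x items x occasions design.

    var_p      : person (universe-score) variance
    var_pi     : person-by-item interaction variance
    var_po     : person-by-occasion interaction variance
    var_pio_e  : three-way interaction confounded with residual error
    n_i, n_o   : number of items and occasions averaged over
    """
    # Relative error variance: every component that interacts with persons,
    # divided by the number of conditions it is averaged over.
    relative_error = var_pi / n_i + var_po / n_o + var_pio_e / (n_i * n_o)
    return var_p / (var_p + relative_error)


# Hypothetical variance components, for illustration only.
g = relative_g_two_facet(0.5, 0.2, 0.1, 0.4, n_i=4, n_o=2)
print(round(g, 4))
```

In this framing, biased variance components (for example, from treating ordinal ratings as continuous) feed directly into a biased coefficient, which is the concern the abstract raises.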


Documenting the impact of outliers on decisions about the number of factors in exploratory factor analysis (2012)

The overall purpose of this dissertation is to investigate how outliers affect the decisions about the number of factors in exploratory factor analysis (EFA) as determined by four widely used and/or highly recommended methods. Very few studies have looked into this issue, and their conclusions are contradictory, with studies disagreeing as to whether outliers result in extra factors or a reduced number of factors. For this dissertation I systematically studied the impact of outliers arising from different sources and matched outlier simulation models with different types of outliers. Chapter 1 provides an overview of the gap between statistical theory regarding outliers and researchers’ day-to-day practice and their understanding of the effects of outliers. Chapter 2 presents a review of EFA with an emphasis on the four commonly used or highly recommended decision methods on the number of factors as well as a review of outliers which includes the sources of outliers and problems of outliers in factor analysis. Chapter 3 examines the effects of outliers arising from errors using the deterministic and slippage models. The results revealed that outliers can inflate, deflate, or have no effect on the decisions about the number of factors, depending on the decision method used and the magnitude and number of outliers. Chapter 4 investigates the effects of outliers arising from an unintended and unknowingly included subpopulation using the mixture contamination model. The general conclusions are similar to chapter 3, but chapter 4 also reveals that symmetric and asymmetric contamination have different effects on different decision methods and that the effects of outliers do not depend on sample size. Chapter 5 provides a general discussion of the findings of this dissertation, describes four novel contributions, and points out the limitations of the present research as well as the future research directions.
This dissertation aims to bridge the gap from day-to-day researchers’ practice and understanding of the effects of outliers to current outlier research that emphasizes robust statistics. The findings of this dissertation address the contradictory conclusions made in previous studies.


Effectiveness of an integrated mindfulness-based anxiety group intervention with university students who self-report anxiety : a small-N, mixed method design (2011)

Anxiety is a common mental health challenge seen at a university counselling centre. The Integrated Mindfulness-based Anxiety Group (IMAG) was a 10-session therapy program designed for use at a university counselling centre to work with university students who struggle with anxiety. IMAG integrated core mindfulness components from three prominent therapy programs; mindfulness was trained through both mindfulness meditative practices and skills. A mixed-method, Small-N design study investigated the effectiveness of the IMAG. Seventeen university students grappling with self-reported anxiety participated in this study. The dependent variables of anxiety symptoms, general clinical symptoms, and mindfulness were monitored across the study. Eleven of these participants also were interviewed three to six months after the end of the IMAG. There were four data analytic strategies used to assess effectiveness and change. First, the Participant and Group Practice Analyses showed that formal meditation techniques were the top-practiced activities in both intervention and follow-up phases; it also was shown that participants making the most change were those who practiced the longest per practice day. Second, the Small-N Visual Analyses, the principal research analysis, showed very few functional relationships between the IMAG and the three dependent variables. Third, the Within-subject analyses showed many significant changes both at the intervention’s end and during follow-up, with the average effect sizes being in the medium range. Finally, the Thematic Analysis showed themes in the categories of change, challenge, and mindfulness. The Change category contained themes pertaining to (1) the types of change experienced by the participants and (2) the contexts and criteria that seemed to support change.
The Challenge category contained themes about (1) the challenges related to the practices, (2) challenges related to the group, and (3) challenges related to the context of the participants. Although there were changes shown in the Within-subject analyses, the Small-N analysis provided only weak evidence, thus no effectiveness claim can be made for the IMAG. The study’s limitations as well as future research suggestions are provided. The study’s conclusions make recommendations to improve the IMAG to make it more robust and responsive to dealing with university students struggling with anxiety challenges.


Impact of Differential Item Functioning on Statistical Conclusions (2010)

Differential item functioning (DIF), sometimes called item bias, has been widely studied in educational and psychological measurement; however, to date, research has focused on the definitions of, and the methods for, detecting DIF. It is well accepted that the presence of DIF may degrade the validity of a test. There is relatively little known, however, about the impact of DIF on later statistical decisions when one uses the observed test scores in data analyses and corresponding statistical hypothesis tests. This dissertation investigated the impact of DIF on later statistical decisions based on the observed total test (or scale) score. Very little is known in the literature about the impact of DIF on the Type I error rate and effect size of, for instance, the independent samples t-test on the observed total test scores. Five studies were conducted: studies one to three investigated the impact of unidirectional DIF (i.e., DIF amplification) on the Type I error rate and effect size of the independent samples t-test; studies four and five investigated the DIF cancellation effects on the Type I error rate and effect size of the independent samples t-test. The Type I error rate and effect size were defined in terms of latent population means rather than observed sample means. The results showed that the amplification and cancellation effects among uniform DIF items did transfer to the test level. Both the Type I error rate and effect size were inflated. The degree of inflation depends on the number of DIF items, magnitude of DIF, sample sizes, and interactions among these factors. These findings highlight the importance of screening for DIF before conducting any further statistical analysis. They also offer advice to practicing researchers about when and how much the presence of DIF will affect their statistical conclusions based on the total observed test scores.


The Satisfaction with Life Scale adapted for Children : investigating the structural, external, and substantive aspects of construct validity (2010)

Measuring and monitoring children’s satisfaction with life is of great significance for improving children’s lives. In order to do this, validated measures to assess children’s satisfaction with life are necessary. This dissertation describes a program of research for the validation of the Satisfaction with Life Scale adapted for Children (SWLS-C). The introductory chapter provides a theoretical background for subjective well-being and validity/validation research and definitions of key terms. The first manuscript presents psychometric findings on the structural and external aspects of construct validity. A stratified random sample of 1233 students in grades 4 to 7 (48% girls, mean age of 11.7 years) provided data on the SWLS-C and measures of optimism, self-concept, self-efficacy, depression, empathic concern, and perspective taking. The SWLS-C demonstrated a unidimensional factor structure, high internal consistency, and evidence of convergent and discriminant validity. Furthermore, differential item functioning and differential scale functioning analyses indicated that the SWLS-C measures satisfaction with life in the same way for different groups of children. The second manuscript investigated the substantive aspect of construct validity for the SWLS-C by examining the cognitive processes of children when responding to the items. Think-aloud protocol interviews were conducted with 55 students in grades 4 to 7 (58% girls, mean age of 11.0 years) and content analysis was used to analyze the data. In their responses, children mainly used an ‘absolute strategy’ (statements indicating the presence/absence of something they consider important for their satisfaction with life) or a ‘relative strategy’ (statements indicating comparative judgments). The absolute statements primarily referred to social relationships, personal characteristics, time use, and possessions.
In the relative statements, children primarily compared what they have to what (a) they want, (b) they had in the past, (c) other people have, and (d) they feel they need. The results are in line with multiple discrepancies theory (Michalos, 1985) and previous empirical findings. These two studies provide evidence for the meaningfulness of the inferences of the SWLS-C scores. The concluding chapter highlights the contributions of the dissertation, discusses limitations of the presented research, and delineates a future validation program for the SWLS-C.


Validating Policy Ratings: The substantive aspect of construct validity for ratings of school tobacco policies (2010)

This dissertation investigated the substantive aspect of construct validity in the context of Canadian school tobacco policy ratings. The objective was to provide a better understanding of score meaning via the process of expert rater responding while rating school tobacco policies. Study one described Canadian school tobacco policies and identified policy characteristics. Written tobacco policies (N=196) were obtained from schools and boards across 10 Canadian provinces that participated in the Youth Smoking Survey. Policies were coded to identify characteristics associated with effectiveness in preventing student tobacco use. Smoking prevention education and cessation access were identified as key policy components that need to be addressed more strongly. Policy characteristics identified in study one formed the basis for study two. The objective of study two was to examine the cognitive processes that generate raters’ responses, identify rating obstacles and how raters overcome them. A think-aloud protocol was conducted with two expert tobacco policy raters who rated 12 tobacco policies using the Stephens & English rubric. Policies were sampled to reflect characteristics (type, length and comprehensiveness) identified in study one. Transcripts were coded to identify super-categories (rater behaviors), main categories (major cognitive processes at the item level) and subcategories to describe main processes in more detail. Categories and their interrelationships, rating obstacles and raters’ coping strategies are presented and a series of cognitive process models of rating is proposed. Findings suggest that raters use similar main processes explainable by similar sub-processes regardless of policy type rated. There was variation in rating obstacles and rater coping when different policy types were rated. 
The cognitive process models contribute to the substantive aspect of construct validity by providing explanations for score variation and enhancing understanding of score meaning. Explanation is sufficient when policies are comprehensive but is limited if based on short, less comprehensive policies. Implications for practice and policy recommendations are discussed.


Investigating tests for equal variances (2009)

One of the central messages of this dissertation is that (a) unequal variances may be more prevalent than typically recognized in educational and policy research, and (b) when considering tests of equal variances, one needs to be cautious about what is being referred to as “Levene’s test” because Levene’s test is actually a family of techniques. Depending on which of the Levene tests is implemented, and particularly with the mean-based Levene’s test found in widely used software like SPSS, one may be using a statistical technique that is as bad as (if not worse than) the F test which the Levene test was intended to replace. The primary goals of this dissertation are to (a) demonstrate that the current statistical practice of testing for equality of variances in hypothesis testing (as prescribed by textbooks and statistical software programs) is insufficient, (b) introduce a new non-parametric statistical test for homogeneity of variances, and (c) compare the Type I error rate and power of the non-parametric Levene test with those of the median version of the Levene test. Under all conditions investigated, both tests maintained their nominal Type I error rates. As population distributions become more skewed, the non-parametric Levene test becomes more powerful than the median version of the Levene test. These results promise to impact applied statistical practice by informing researchers about the relative efficiencies of the two tests. This dissertation concludes with remarks about the implications of the findings, and the future work that has arisen from the results.
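
To make the "family of techniques" point concrete, the sketch below computes a Levene-type statistic with the group mean or the group median as the center, plus a rank-based variant. The rank-based version is an assumption about how a non-parametric Levene test can be formulated (rank the pooled data, then apply the mean-based statistic to the ranks); the abstract does not spell out the procedure. Function names and data are hypothetical, and only the test statistic is computed, not a p-value.

```python
from statistics import mean, median


def levene_w(groups, center=mean):
    """Levene-type W: one-way ANOVA F statistic computed on the absolute
    deviations of each observation from its group's center."""
    z = []
    for g in groups:
        c = center(g)
        z.append([abs(v - c) for v in g])
    n_total = sum(len(g) for g in z)
    k = len(z)
    group_means = [mean(g) for g in z]
    grand = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, group_means))
    within = sum((v - m) ** 2 for g, m in zip(z, group_means) for v in g)
    return (n_total - k) / (k - 1) * between / within


def pooled_ranks(groups):
    """Replace each value by its rank in the pooled sample (ties share the average rank)."""
    pooled = sorted(v for g in groups for v in g)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    return [[rank_of[v] for v in g] for g in groups]


def rank_based_levene_w(groups):
    """Assumed rank-based variant: mean-based Levene statistic on pooled ranks."""
    return levene_w(pooled_ranks(groups), center=mean)


data = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]  # hypothetical scores for two groups
w_mean = levene_w(data, center=mean)
w_median = levene_w(data, center=median)
w_rank = rank_based_levene_w(data)
print(w_mean, w_median, w_rank)
```

Each statistic would be referred to an F distribution with k - 1 and N - k degrees of freedom; the variants can disagree on the same data, which is why it matters which "Levene's test" a software package reports.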


Predictors of grade 3 French immersion students' reading comprehension : the role of morphological awareness, vocabulary and second language cultural knowledge (2009)

Research findings point to reading comprehension as an important mediator of academic achievement for French immersion students (Hogan, Catts, & Little, 2005). This research investigated the best predictors of word reading and reading comprehension in French as a second language in 72 Grade 3 students of an early French immersion programme. The present research is based on Bernhardt’s (2005) model of second language reading, which views reading comprehension as an interactive-compensatory process. Four main questions guided this program of study: (1) What is the best predictor of word reading among phonological awareness, spelling, verbal working memory, vocabulary and morphological awareness in Grade 3 French immersion students? (2) What is the best predictor of reading comprehension among phonological awareness, spelling, verbal working memory, vocabulary and morphological awareness in Grade 3 French immersion students? (3) What is the relative role of second language cultural knowledge compared to phonological awareness, spelling, verbal working memory, vocabulary and morphological awareness in Grade 3 French immersion students’ reading comprehension? and (4) What do French immersion Grade 3 students perceive as different in a culturally less and more familiar text that affected their reading comprehension and which cultural context do they prefer and why? Results from hierarchical regression analyses showed that phonological awareness and spelling predicted word reading, whereas morphological awareness predicted reading comprehension of isolated sentences. Reading comprehension of a narrative text with more familiar cultural emphasis was predicted by receptive vocabulary (EVIP). Reading comprehension of a narrative text with less familiar cultural emphasis was predicted by second language cultural knowledge, followed by morphological awareness.
However, participants perceived the culturally more familiar passage as easier and perceived the culturally less familiar passage as more engaging. Thus, results from the study appear to confirm that reading is an interactive compensatory process. Several theoretical, pedagogical and programme development implications are drawn from the present research.


Validation of multilevel constructs : methods and empirical findings for the Early Development Instrument (2009)

A growing number of assessment, testing and evaluation programs gather individual measures but, by design, do not make inferences or decisions about individuals but rather for an aggregate such as a school, school district, neighbourhood, or province. In light of this, a multilevel construct can be defined as a phenomenon that is potentially meaningful both at the level of individuals and at one or more levels of aggregation. The purposes of this dissertation are to highlight the foundations of multilevel construct validation, describe two methodological approaches and associated analytic techniques, and then apply these approaches and techniques to the multilevel construct validation of a widely used school readiness measure called the Early Development Instrument (EDI). Validation evidence is presented regarding the multilevel covariance structure of the EDI, the appropriateness of aggregation to classroom and neighbourhood levels, and the effects of teacher and classroom characteristics on these structural patterns. To appropriately assess the multilevel factor structure of the categorical EDI items, a new fit index was created. A good-fitting unidimensional model was found for each scale at the level of individual students, with no notable improvements after taking clustering into account. However, at the class and neighbourhood levels of aggregation, the physical and emotional EDI scales did not show essential unidimensionality. Teacher and/or classroom influences accounted for between 19% and 25% of the total variance. EDI emotional scores were higher for teachers with graduate training, while communications scores were higher for younger teachers. Teachers tended to rate students more absolutely, rather than relative to other children in the class, when class size was small. These results are discussed in the context of the theoretical framework of the EDI, with suggestions for future validation work.


Pratt's Importance Measures in Factor Analysis: A New Technique for Interpreting Oblique Factor Models (2008)

No abstract available.

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

How does the uncaptured uniqueness of survey respondents impact the analysis of group differences? (2022)

No abstract available.

The concept of drift and operationalization of its detection in simulated data (2017)

No abstract available.

Decision Rules Based on Hypothesis Tests and Effect Sizes for Logistic Regression Differential Item Functioning (2015)

Logistic Regression (LR) has been a technique used for the detection of items exhibiting differential item functioning (DIF). When it was introduced in 1990, the LR was conceptualized as strictly a test of statistical significance. This led to the over-identification of items as DIF, generally not exhibiting practically (psychometrically) significant differences. The use of blended decision rules – where effect sizes are used in addition to statistical significance in the decision-making process – was proposed to address this issue. Previous work in the literature attempted to align a decision rule grounded in the Mantel-Haenszel (M-H) technique to LR. However, this work is unable to replicate previously recommended cut-offs, through the use of the same methodology on a different data set. It is possible that cut-off values may be dataset specific, which also opens the question of whether universal cut-off values for effect sizes for DIF are a realistic expectation.


Research Design and Effect Size: A Meta-Analysis of Mood Disorder Experimental Trials (2015)

The design of experimental studies can have a significant influence on effect size; however, this influence is rarely given enough consideration during the interpretation and comparison of research results. This paper examines whether there is a significant difference between the effect sizes from placebo-controlled versus treatment-controlled trials. This issue was studied by conducting a meta-analysis of approximately 37 RCTs of mood disorder therapies. The results of this methodological investigation confirmed that there is a statistically significant difference between the weighted effect sizes from the two groups of studies that were compared. These results support the claim that the type of control group is an important factor to consider in the design and interpretation of experimental studies. This analysis is a methodological contribution as it addresses how the type of control group in an RCT impacts the outcome of a study, and more specifically the effect size. The outcome of this research also challenges the effectiveness of treatments that have been tested against only one type of control in experimental studies.


On the estimation of the polychoric correlation coefficient via Markov Chain Monte Carlo methods (2013)

Bayesian statistics is an alternative approach to traditional frequentist statistics that is rapidly gaining adherents across different scientific fields. Although initially only accessible to statisticians or mathematically-sophisticated data analysts, advances in modern computational power are helping to make this new paradigm approachable to the everyday researcher and this dissemination is helping open doors to problems that have remained unsolvable or whose solution was extremely complicated through the use of classical statistics. In spite of this, many researchers in the behavioural or educational sciences are either unaware of this new approach or just vaguely familiar with some of its basic tenets. The primary purpose of this thesis is to take a well-known problem in psychometrics, the estimation of the polychoric correlation coefficient, and solve it using Bayesian statistics through the method developed by Albert (1992). Through the use of computer simulations this method is compared to traditional maximum likelihood estimation across various sample sizes, skewness levels and numbers of discretisation points for the latent variable, highlighting the cases where the Bayesian approach is superior, inferior or equally effective to the maximum likelihood approach. Another issue that is investigated is a sensitivity analysis of sorts of the prior probability distributions where a skewed (bivariate log-normal) and symmetric (bivariate normal) priors are used to calculate the polychoric correlation coefficient when feeding them data with varying degrees of skewness, helping demonstrate to the reader how changing the prior distribution for certain kinds of data helps or hinders the estimation process. The most important results of these studies are discussed as well as future implications for the use of Bayesian statistics in psychometrics.


False positives in multiple regression: highlighting the consequences of measurement error in the independent variables (2012)

Type I error rates in multiple regression, and hence the chance for false positive research findings in the literature, can be drastically inflated when the analyses include independent variables measured with error. Although the bias caused by random measurement error in multiple regression is widely recognized, there has been little discussion of the impact on hypothesis tests outside of the statistical literature. The primary purpose of this thesis is to raise awareness of the problem among methodologists and researchers by demonstrating, in a non-technical manner, the nature and extent of the inflation in Type I error rates for educational and psychological research contexts. This thesis uses computer simulations to demonstrate that, for commonly encountered scenarios, the Type I error rate in a multiple regression model where the independent variables are correlated and measured with random error can approach 1.0, even if the nominal Type I error rate is 0.05. Because nearly all quantitative data in educational and psychological research contain some level of random measurement error, and because multiple regression is one of the most widely used data analytic techniques, this problem should be a serious concern for methodologists and applied researchers. The most important factors causing the problem are summarized, and the implications for research and pedagogy are discussed.
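
A minimal Monte Carlo sketch of the phenomenon this abstract describes, not the thesis's own simulation design. It assumes two correlated true predictors, only the first of which affects the outcome, and adds random error to the first predictor only. The sample size, correlation, slope, reliability values, and all function names are illustrative assumptions, and the test of the second predictor's (truly zero) slope uses a normal approximation to the t critical value.

```python
import random
from statistics import NormalDist


def t_for_x2(y, x1, x2):
    """t statistic for the x2 slope in the OLS model y ~ 1 + x1 + x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - my for v in y]
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    s11 = sum(v * v for v in a)
    s22 = sum(v * v for v in b)
    s12 = sum(p * q for p, q in zip(a, b))
    s1y = sum(p * q for p, q in zip(a, yc))
    s2y = sum(p * q for p, q in zip(b, yc))
    syy = sum(v * v for v in yc)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = syy - b1 * s1y - b2 * s2y           # residual sum of squares
    se_b2 = (rss / (n - 3) * s11 / det) ** 0.5
    return b2 / se_b2


def rejection_rate(reliability, n=200, reps=400, rho=0.7, beta1=0.5, seed=1):
    """Share of replications rejecting H0: slope of x2 = 0, although it is true."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)         # normal approximation to the t cutoff
    err_sd = (1 / reliability - 1) ** 0.5      # error SD for a unit-variance true score
    rejections = 0
    for _ in range(reps):
        y, x1_obs, x2 = [], [], []
        for _ in range(n):
            t1 = rng.gauss(0, 1)
            t2 = rho * t1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
            y.append(beta1 * t1 + rng.gauss(0, 1))   # x2 has no true effect
            x1_obs.append(t1 + rng.gauss(0, err_sd)) # x1 observed with error
            x2.append(t2)
        if abs(t_for_x2(y, x1_obs, x2)) > crit:
            rejections += 1
    return rejections / reps


rate_clean = rejection_rate(reliability=1.0)
rate_noisy = rejection_rate(reliability=0.6)
print(rate_clean, rate_noisy)
```

Under these assumed settings, the rejection rate for the null predictor should sit near the nominal 0.05 when the first predictor is error-free and rise well above it once that predictor's reliability drops, because the second predictor absorbs the part of the true effect that the noisy measure fails to capture.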


Does the screening version of the Psychopathy Checklist measure the same disorder in males and females? (2011)

The Psychopathy Checklist: Screening Version (PCL:SV) is a tool used to measure the construct of psychopathy in males and females. From a psychometric standpoint, the PCL:SV is administered and scored in the same manner for male and female respondents. The aim of the current study is to investigate using scale and item level statistical techniques if the PCL:SV measures the construct of psychopathy equivalently for males and females. Given the Likert-type nature of the item responses, both Pearson and polychoric correlation techniques were employed in order to compare the more commonly used Pearson to the psychometrically correct polychoric. At the scale level, some PCL:SV items loaded differently for males and females, and not in a manner found in the literature. At the item level, only four items displayed DIF, and the DIF for these items was minimal. These findings suggest that the PCL:SV is measuring the same construct of psychopathy for males and females, but more clearly defines the construct of psychopathy for male respondents. This may, in part, be due to the ways in which males and females express psychopathy; in which case the construct of psychopathy itself needs to be revisited.


