Doctor of Philosophy in Computer Science (PhD)
Visualizing Public Health Data
Why do people visualize data? People visualize data either to consume or produce information relevant to a domain-specific problem or interest. Visualization design and evaluation involves a mapping between domain problems or interests and appropriate visual encoding and interaction design choices. This mapping translates a domain-specific situation into abstract visualization tasks, which allows for succinct descriptions of tasks and task sequences in terms of why data is visualized, what dependencies a task might have in terms of input and output, and how the task is supported in terms of visual encoding and interaction design choices. Describing tasks in this way facilitates the comparison and cross-pollination of visualization design choices across application domains; the mapping also applies in reverse, whenever visualization researchers aim to contextualize novel visualization techniques. In this dissertation, we present multiple instances of visualization task abstraction, each integrating our proposed typology of abstract visualization tasks. We apply this typology as an analysis tool in an interview study of individuals who visualize dimensionally reduced data in different application domains, in a post-deployment field study evaluation of a visual analysis tool in the domain of investigative journalism, and in a visualization design study in the domain of energy management. In the interview study, we draw upon and demonstrate the descriptive power of our typology to classify five task sequences relating to visualizing dimensionally reduced data. This classification is intended to inform the design of new tools and techniques for visualizing this form of data. In the field study, we draw upon and demonstrate the descriptive and evaluative power of our typology to evaluate Overview, a visualization tool for investigating large text document collections. 
After analyzing its adoption by investigative journalists, we characterize two abstract tasks relating to document mining and present seven lessons relating to the design of visualization tools for document data. In the design study, we demonstrate the descriptive, evaluative, and generative power of our typology and identify matches and mismatches between visualization design choices and three abstract tasks relating to time series data. Finally, we reflect upon the impact of our task typology.
In this thesis, we explore ways to make practical extensions to Dimensionality Reduction (DR) algorithms with the goal of addressing challenging, real-world cases. The first case we consider is how to provide guidance to users employing DR methods in their data analysis. We specifically target users who are not experts in the mathematical concepts behind DR algorithms. We first identify two levels of guidance: global and local. Global user guidance helps non-experts select and arrange a sequence of analysis algorithms. Local user guidance helps users select appropriate algorithm parameter choices and interpret algorithm output. We then present a software system, DimStiller, that incorporates both types of guidance, validating it on several use cases. The second case we consider is that of using DR to analyze datasets consisting of documents. In order to modify DR algorithms to handle document datasets effectively, we first analyze the geometric structure of document datasets. Our analysis describes the ways document datasets differ from other kinds of datasets. We then leverage these geometric properties for speed and quality by incorporating ideas from text querying into DR and other algorithms for data analysis. We then present the Overview prototype, a proof-of-concept document analysis system. Overview synthesizes both the goals of designing systems for data analysts who are DR novices, and performing DR on document data. The third case we consider is that of costly distance functions, where the method used to derive the true proximity between two data points is computationally expensive. Using standard approaches to DR in this important use case can result in either unnecessarily protracted runtimes or long periods of user monitoring. To address the case of costly distances, we develop an algorithm framework, Glint, which efficiently manages the number of distance function calculations for the Multidimensional Scaling class of DR algorithms. 
We then show that Glint implementations of Multidimensional Scaling algorithms achieve substantial speed improvements or remove the need for human monitoring.
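To illustrate why the number of distance function calculations matters, here is a minimal Python sketch. It is not the Glint framework, whose internals the abstract does not describe; it only shows a hypothetical caching wrapper (`CachedDistance`, a name assumed for illustration) that computes each unordered pair at most once and counts how many costly calls were actually made. The Euclidean function stands in for an expensive distance.

```python
class CachedDistance:
    """Hypothetical wrapper: compute each unordered pair at most once."""

    def __init__(self, points, distance_fn):
        self.points = points
        self.distance_fn = distance_fn  # assumed to be symmetric and costly
        self.cache = {}
        self.calls = 0  # number of real (uncached) distance computations

    def __call__(self, i, j):
        if i == j:
            return 0.0
        # Normalize the key so (i, j) and (j, i) share one cache entry.
        key = (i, j) if i < j else (j, i)
        if key not in self.cache:
            self.calls += 1
            a, b = self.points[key[0]], self.points[key[1]]
            self.cache[key] = self.distance_fn(a, b)
        return self.cache[key]


def euclidean(a, b):
    """Stand-in for an expensive distance function."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = CachedDistance(points, euclidean)
```

Querying `dist(0, 1)` and then `dist(1, 0)` triggers only one real computation, which can be checked through `dist.calls`. An iterative layout algorithm that revisits the same pairs many times would benefit in the same way.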
A graph consists of a set and a binary relation on that set. Each element of the set is a node of the graph, while each element of the binary relation is an edge of the graph that encodes a relationship between two nodes. Graphs are pervasive in many areas of science, engineering, and the social sciences: servers on the Internet are connected, proteins interact in large biological systems, social networks encode the relationships between people, and functions call each other in a program. In these domains, the graphs can become very large, consisting of hundreds of thousands of nodes and millions of edges. Graph drawing approaches endeavour to place these nodes in two- or three-dimensional space with the intention of fostering an understanding of the binary relation by a human being examining the image. However, many of these approaches to drawing do not exploit higher-level structures in the graph beyond the nodes and edges. Frequently, these structures can be exploited for drawing. As an example, consider a large computer network where nodes are servers and edges are connections between those servers. If a user would like to understand how servers at UBC connect to the rest of the network, a drawing that accentuates the set of nodes representing those servers may be more helpful than an approach where all nodes are drawn in the same way. In a feature-based approach, features are subgraphs exploited for the purposes of drawing. We endeavour to depict not only the binary relation, but the high-level relationships between features. This thesis extensively explores a feature-based approach to graph visualization and demonstrates the viability of tools that aid in the visualization of large graphs. Our contributions lie in presenting and evaluating novel techniques and algorithms for graph visualization. We implement five systems in order to empirically evaluate these techniques and algorithms, comparing them to previous approaches.
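The definition above (a set plus a binary relation on it) and the idea of a feature as a subgraph can be sketched in a few lines of Python. This is an illustrative sketch only: the node names and the helpers `is_graph` and `induced_subgraph` are assumptions made for this example, and treating a feature as the subgraph induced by a chosen node set is one simple choice, not necessarily the thesis's.

```python
def is_graph(nodes, edges):
    """Check that every edge is a pair of elements drawn from the node set."""
    return all(u in nodes and v in nodes for (u, v) in edges)


def induced_subgraph(nodes, edges, feature_nodes):
    """Return the subgraph induced by feature_nodes: the chosen nodes
    together with every edge whose endpoints both lie in that set."""
    kept_nodes = nodes & feature_nodes
    kept_edges = {(u, v) for (u, v) in edges
                  if u in kept_nodes and v in kept_nodes}
    return kept_nodes, kept_edges


# Hypothetical network: two campus servers and two external ones.
nodes = {"ubc1", "ubc2", "ext1", "ext2"}
edges = {("ubc1", "ubc2"), ("ubc2", "ext1"), ("ext1", "ext2")}

# A feature: the subgraph made up of the campus servers.
campus_nodes, campus_edges = induced_subgraph(nodes, edges, {"ubc1", "ubc2"})
```

Here `campus_edges` contains only the connection between the two campus servers. The edge from `ubc2` to `ext1` crosses the feature boundary, and it is this kind of relationship between a feature and the rest of the graph that a feature-based drawing could emphasize.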
Large data sets are difficult to analyze. Visualization has been proposed to assist exploratory data analysis (EDA) as our visual systems can process signals in parallel to quickly detect patterns. Nonetheless, designing an effective visual analytic tool remains a challenge. This challenge is partly due to our incomplete understanding of how common visualization techniques are used by human operators during analyses, either in laboratory settings or in the workplace. This thesis aims to further understand how visualizations can be used to support EDA. More specifically, we studied techniques that display multiple levels of visual information resolutions (VIRs) for analyses using a range of methods. The first study is a summary synthesis conducted to obtain a snapshot of knowledge in multiple-VIR use and to identify research questions for the thesis: (1) low-VIR use and creation; (2) spatial arrangements of VIRs. The next two studies are laboratory studies to investigate the visual memory cost of image transformations frequently used to create low-VIR displays and overview use with single-level data displayed in multiple-VIR interfaces. For a more well-rounded evaluation, we needed to study these techniques in ecologically valid settings. We therefore selected the application domain of web session log analysis and applied our knowledge from our first three evaluations to build a tool called Session Viewer. Taking the multiple coordinated view and overview + detail approaches, Session Viewer displays multiple levels of web session log data and multiple views of session populations to facilitate data analysis from the high-level statistical to the low-level detailed session analysis approaches. Our fourth and last study for this thesis is a field evaluation conducted at Google Inc. with seven session analysts using Session Viewer to analyze their own data with their own tasks. 
Study observations suggested that displaying web session logs at multiple levels using the overview + detail technique helped bridge between high-level statistical and low-level detailed session analyses, and the simultaneous display of multiple session populations at all data levels using multiple views allowed quick comparisons between session populations. We also identified design and deployment considerations to meet the needs of diverse data sources and analysis styles.
Path tracing is a common task in many real-world uses of graphs that display networks of relationships. Despite previous work evaluating how factors such as edge-edge crossings impact the readability of graph layouts, what makes one path-tracing task more difficult than another is not well understood. To address this question, we conducted an observational user study with 12 participants completing a path-tracing task. Our extensive qualitative analysis of the study data led to a detailed characterization of common path-tracing behaviours. We then created a predictive model of the paths that users are most likely to search, which we name the search set, based on the behaviours we observed. To validate our predictive behavioural model, and to demonstrate how the search set could be used, we conducted a careful comparison of graph readability factors through a hierarchical multiple regression analysis.
Scientists use DNA sequence differences between an individual's genome and a standard reference genome to study the genetic basis of disease. Such differences are called sequence variants, and determining their impact in the cell is difficult because it requires reasoning about both the type and location of the variant across several levels of biological context. In this design study, we worked with four analysts to design a visualization tool supporting variant impact assessment for three different tasks. We contribute data and task abstractions for the problem of variant impact assessment, and the carefully justified design and implementation of the Variant View tool. Variant View features an information-dense visual encoding that provides maximal information at the overview level, in contrast to the extensive navigation required by currently-prevalent genome browsers. We provide initial evidence that the tool simplified and accelerated workflows for these three tasks through three case studies. Finally, we reflect on the lessons learned in creating and refining data and task abstractions that allow for concise overviews of sprawling information spaces that can reduce or remove the need for the memory-intensive use of navigation.