Ivan Beschastnikh: Associate Professor at Department of Computer Science, UBC Faculty of Science

Prospective Graduate Students / Postdocs

This faculty member is currently not actively recruiting graduate students or Postdoctoral Fellows, but might consider co-supervision together with another faculty member.

Associate Professor

Faculty of Science

Research Classification

Computer and information sciences

Research Interests

software engineering

distributed systems

cloud computing

software analysis

Machine Learning

Relevant Thesis-Based Degree Programs

View all programs

Affiliations to Research Centres, Institutes & Clusters

Blockchain at UBC

Data Science Institute

Designing for People (DFP)

Institute for Computing, Information and Cognitive Systems (ICICS)

Research Options

I am available and interested in collaborations (e.g. clusters, grants).

I am interested in and conduct interdisciplinary research.

I am interested in working with undergraduate students on research projects.

Open All

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Datacenter resource scheduling for networked cloud applications (2021)

Cloud computing is an integral part of modern life, which became increasingly apparent during the COVID-19 pandemic. Applications that run on the cloud facilitate many of our daily activities, including education, retail, and high quality video calls that keep us connected. These applications run on one or more Virtual Machines (VM), where networked cloud applications can benefit from inter-VM network bandwidth guarantees. For example, an entire class of network-intensive big-data processing applications run more quickly with sufficient network bandwidth guarantees.However, offering inter-VM bandwidth guarantees creates challenges both for resource allocation latency and datacenter utilization, because the resource scheduler must satisfy per-VM resource demands and inter-VM bandwidth requirements.This dissertation demonstrates that it is feasible to offer inter-VM bandwidth guarantees as a first class cloud service. We develop several algorithms that allow efficient sharing of datacenter network bandwidth across tenants. These algorithms maintain high datacenter utilization while offering low allocation latency. Specifically, we propose constraint-solver-based algorithms that scale well to datacenters with hundreds of servers and heuristic-based algorithms that scale well to large-scale datacenters with thousands of servers. We demonstrate the practicality of these algorithms by integrating them into the OpenStack cloud management framework. We also construct a realistic cloud workload with bandwidth requirements, which we use to evaluate the efficiency of our resource scheduling algorithms.We demonstrate that selling inter-VM network bandwidth guarantees as a service increases cloud provider revenue. Furthermore, it is possible to do so without changing cloud affordability for the tenants due to shortened job completion times for the tenant applications. Savings from the shortened VM lifetimes can be used to cover the network bandwidth guarantees service cost, which allows tenants to complete their job faster without paying extra. For example, we show that cloud providers can generate up to 63% extra revenue compared to the case when they do not offer network bandwidth guarantees.

View record

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

GlueFL : reconciling client sampling and model masking for bandwidth efficient federated learning (2023)

Federated learning (FL) is an effective technique to directly involve edge devices in machine learning (ML) training while preserving client privacy. However, the substantial communication overhead of FL makes training challenging when edge devices have limited network bandwidth. Existing work to optimize FL bandwidth overlooks downstream transmission and does not account for FL client sampling. We propose GlueFL, a framework that incorporates new client sampling and model compression algorithms to mitigate low download bandwidths of FL clients. GlueFL prioritizes recently used clients and bounds the number of changed positions in compression masks in each round. We analyse FL convergence under GlueFL’s sticky sampling, and show that our proposed weighted aggregation preserves unbiasedness of updates and convergence. We evaluate GlueFL empirically, and demonstrate downstream bandwidth and training time savings on three public datasets. On average, our evaluation shows that GlueFL spends 29% less training time with a 27% less downstream bandwidth overhead as compared to three state-of-the-art strategies.

View record

Privacy and conflicting identities in the context of Punjabi Canadians (2023)

Many of us have experienced the need to filter the information we share with the world. There is a dearth of research in the field of privacy in the context of non-WEIRD populations and immigrants living in WEIRD (Western, Educated, Industrialized, Rich, and Democratic) societies. In this thesis,I explore how Punjabi Canadian adults navigate the potential conflicts between their two cultural identities and how this affects their digital privacy decisions. I identify the social and cultural factors that influence the privacy choices of Punjabi Canadians, and how these individuals react to the resulting social sanctions. These sanctions may limit their freedom and their right to privacy both in physical and virtual settings. I conducted 12 interviews with Punjabi Canadians and used constructivist grounded theory to elicit themes related to the “What,” “Why,” and “How” of their privacy choices and behaviours. I found that the collectivist nature of the Punjabi culture is reflected in their privacy regulation practices, which involve the family as well as the individual. There was an asymmetry of information flow between the Punjabi and Canadian sides of an individual’s social circle. This was highlighted by the norm of physical and social gender segregation among Punjabi Canadians, where the Punjabi side was more often kept in the dark about activities that involved gender mixing. I also found that women participants were especially fluent in privacy-preserving practices, which may be due to their being held to a higher standard than men, facing greater surveillance,and having more familial responsibilities and restrictions on their clothing and mobility. The Punjabi culture represents a significant portion of South Asian culture while the Canadian culture represents North American culture. Hence, the findings of this study are relevant to immigrants in North America, especially from South Asia. Apart from immigrants, they are also transferable to other populations experiencing conflicting identities.

View record

Attacking transaction relay in MimbleWimble blockchains (2021)

Blockchain-based networks are often concerned with privacy. Two common types of privacy in blockchain networks are (1) transaction source privacy, and (2) transaction content privacy. Research has shown that Bitcoin, the most prominent cryptocurrency, cannot easily provide these privacy types. Hence, new protocols have been proposed. For example, Dandelion++ is a solution to the source privacy vulnerability in Bitcoin. Practical systems, however, need to provide multiple privacy guarantees at the same time. To the best of our knowledge, source privacy and content privacy have not been considered simultaneously in the literature. We conjecture that cryptocurrencies that use Dandelion++ for transaction relay could be susceptible to attacks against both types of privacy and also to performance attacks. Our focus in this project is on the implementations of the MimbleWimble cryptocurrency protocol such as Beam. We have designed and implemented three different attacks against these existing privacy-focused protocols. In the first attack, the adversary uses information obtained from an incoming transaction for improved detection of the transaction source. In the second attack, to increase the latency of an incoming transaction, the adversary adds an excessive delay before forwarding the transaction. In the third attack, the adversary exploits the aggregation protocol in MimbleWimble to launch a denial of service attack on an incoming transaction. We have validated our proposed attacks in a private test network of Beam nodes and a network simulator.

View record

Large scale federated analytics and differential privacy budget preservation (2021)

This thesis presents two contributions. The first contribution deals with the problem of siloed data collection and prohibitive data acquisition costs. These costs limit the size and diversity of datasets used in health research. Access to larger and more diverse datasets improves the understanding of disease heterogeneity and facilitates inference of relationships between surgical and pathological findings with symptomatic indicators and outcomes. Unfortunately, freely enabling access to these datasets has the potential of leaking private information, such as medical records, even when these datasets have been stripped of personally identifiable information.In the first part of this thesis, we present LEAP, a data analytics platform with support for federated learning. LEAP allows users to analyze data distributed across multiple institutions in a private and secure manner, without leaking sensitive patient information. LEAP achieves this through an infrastructure that maintains privacy by design and brings the computation to the data, instead of bringing the data to the computation. LEAP adds an overhead of up to 2.5X, training Resnet-18 with 15 participating sites, when compared to a centralized model. Despite this overhead, LEAP achieves convergence of the model’s accuracy within 20% of the time taken for the centralized model to converge.One of the techniques used by LEAP to preserve the privacy of sensitive queries is differential privacy. Successive DP queries to a dataset depletes the privacy budget. When the privacy budget is depleted, data curators must block access to the underlying dataset to prevent private information from leaking. In the second part of this thesis, we present a system called the SmartCache. The SmartCache optimizes the use of the privacy budget by interpolating old query results to help answer new queries using a synthetic dataset. Queries answered from the synthetic dataset have a smaller privacy cost, so more queries can be answered before the budget runs out. For statistical queries, the SmartCache saved 30%-50% of the budget for threshold values of 0.99 and 0.999, and for gradient queries it consumed 70% less of the privacy budget when training a fully connected model.

View record

An indexed type system for (2020)

Downloading and executing untrusted code is inherently unsafe, but also something that happens often on the internet. Therefore, untrusted code often requires run-time checks to ensure safety during execution. These checks compromise performance and may be unnecessary. We present the Wasm-prechk language, an assembly language based on WebAssembly that is intended to ensure safety while justifying the static elimination of run-time checks.

View record

Biscotti - a ledger for private and secure peer to peer machine learning (2020)

Federated Learning is the current state of the art in supporting secure multi-party machine learning (ML): data is maintained on the owner's device and the updates to the model areaggregated through a secure protocol. However, this process assumes a trusted centralized infrastructure for coordination, and clients must trust that the central service does not use the byproducts of client data. In addition to this, a group of malicious clients could also harm the performance of the model by carrying out a poisoning attack.As a response, we propose Biscotti: a fully decentralized peer to peer (P2P) approach to multi-party ML, which uses blockchain and cryptographic primitives to coordinate a privacy-preserving ML process between peering clients. Our evaluation demonstrates that Biscotti is scalable, fault tolerant, and defends against known attacks. For example, Biscotti is able to protect the privacy of an individual client's update and the performance of the global model at scale when 30% ofadversaries are trying to poison the model.

View record

Dara the explorer: coverage based exploration for model checking of distributed systems in Go (2020)

Distributed systems form the backbone of moderncloud systems. Failures in distributed systems can cause massive lossesin the form of service outages and loss of revenue impacting real users ofsystems. Thus, it is imperative to find bugs in distributed systemsbefore they are used in production systems.However, debugging distributed systems continues toelude us. Use of abstract modelling languages such asTLA+, PlusCal, Coq, and SPIN that check the correctness of models of distributedsystems have become popular in recent years butthey require a considerable amount of developer effort and do not necessarilyfind all the bugs in the implementation of the system.Model checkers that explore all possible executionsof the implementation of a distributed system suffer from state space explosion,rendering them impractical as they are inefficiently scalable.To alleviate this, we propose Dara, a model checker designedfor Go systems that uses a novel coverage-based strategy forordering exploration of paths in the state space of the systemaccording to the amount of code coveredacross nodes.Dara can find and reproduce concurrency bugs in go systems.

View record

Compiling distributed system specifications into implementations (2019)

Distributed systems are notoriously difficult to get right: the inherently asynchronous nature of their execution gives rise to a fundamentally non-deterministic system. Previous work has shown that the use of formal methods to reason about distributed systems is a promising area of research. In particular, model-checking allows developers to verify system models against a correctness specification by performing an exhaustive search over the system's possible executions. However, the transition from a trusted specification to a valid implementation is a manual, error-prone process that could introduce subtle, hard to find bugs.In this work, we aim to bridge this gap by automatically translating a specification into an implementation that refines it. Specifically, we leverage the clear separation between application-specific logic and abstract components in the Modular PlusCal language to define an interface that concrete implementations must follow in order to implement the abstractions. Our evaluation shows that this approach is able to handle complex specifications and generate systems that preserve the correctness properties verified via model-checking.

View record

Corresponding formal specifications with distributed systems (2019)

As the need for computing resources grows, providers are increasingly relying ondistributed systems to render their services. However, distributed systems are hardto design and implement. As an aid for design and implementation, formal verifica-tion has seen a growing interest in industry. For example, Amazon uses TemporalLogic of Actions plus (TLA⁺) and PlusCal specification languages and tool chainto formally verify manually created specifications of their web services [8].Nevertheless, there is currently no tool to automatically establish a correspon-dence between a PlusCal specification with a concrete implementation. Further-more, PlusCal was not designed with modularity in mind, so a large PlusCal spec-ification cannot be decomposed into smaller ones for ease of modification. Thisthesis proposes an extension to PlusCal, named Modular PlusCal, as well as acompiler, named PGo, which compiles Modular PlusCal and PlusCal specificationsinto Go programs. Modular PlusCal introduces new constructs, such as archetypesand mapping macros, to provide isolation and, as a result, modularity. By auto-matically compiling PlusCal and Modular PlusCal specifications into distributedsystem implementations, PGo reduces the burden on programmers trying to ensurethe correctness of their distributed systems.

View record

Cross-device access control with Trusted Capsules (2019)

Users desire control over their data even as they share them across deviceboundaries. At the moment, they rely on ad-hoc solutions such as sending self destructible data with ephemeral messaging apps such as SnapChat. We presentTrusted Capsules, a general cross-device access control abstraction for files. Itbundles sensitive files with the policies that govern their accesses into units we callcapsules. Capsules appear as regular files in the system. When an app opens one,its policy is executed in ARM TrustZone, a hardware-based trusted execution environment, to determine if access should be allowed or denied. As Trusted Capsulesis based on a pragmatic threat model, it works with unmodified apps that users havecome to rely on, unlike existing work. We show that policies in Trusted Capsulesare expressible and that the slowdowns in our approach are limited to the openingand closing of capsules. Once an app opens a capsule, its read throughput of thefile is identical to regular non-capsule files.

View record

Data-driven data center traffic control (2019)

Recent networking research has identified that data-driven congestion control(CC) can be more efficient than traditional CC in data centers (DCs). Deep reinforcementlearning (RL), in particular, has the potential to learn optimal networkpolicies. However, RL suffers from instability and over-fitting, deficiencieswhich so far render it unacceptable for use in DC networks. We analyzethe requirements for data-driven policies to succeed in the DC context.And, we present a new emulator, Iroko, which supports different networktopologies, DC traffic engineering algorithms, and deploymentscenarios. Iroko interfaces with the OpenAI gym toolkit, which allows for fastand fair evaluation of RL against traditional algorithms under equal conditions.We present initial benchmarks of three deep RL algorithms against TCP New Vegasand DCTCP. The results show that out-of-the-box algorithms are able to learn aCC policy with comparative performance to TCP in our emulator. We make Irokoopen-source and publicly available: https://github.com/dcgym/iroko.

View record

Erlay: efficient transaction relay in Bitcoin (2019)

Bitcoin is a top-ranked cryptocurrency that has experienced huge growth and survived numerous attacks. The protocols making up Bitcoin must therefore accommodate the growth of the network and ensure security. However, Bitcoin’s transaction dissemination protocol has mostly evaded optimization. This protocol is based on flooding and though it is secure and fault-tolerant, it is also highly inefficient. Specifically, our measurements indicate that 43% of the traffic generated by transaction dissemination in the Bitcoin network is redundant.In this paper we introduce a new transaction dissemination protocol called Erlay. Erlay is a hybrid protocol that combines limited flooding with intermittent reconciliation. We evaluated Erlay in simulation and by implementing and deploying it at scale. Compared to Bitcoin’s current protocols, Erlay reduces the bandwidth used to announce transactions by 84% without significantly affecting privacy or propagation speed. In addition, Erlay retains the existing Bitcoin security guarantees and is more scalable relative to the number of nodes in the network and their connectivity. Erlay is currently being investigated by the Bitcoin community for future use with the Bitcoin protocol.

View record

Priority-based parameter propagation for distributed deep neural network training (2019)

Data parallel training is commonly used for scaling distributed Deep Neural Network ( DNN ) training. However, the performance benefits are often limited by the communication-heavy parameter synchronization step. In this work, we take advantage of the domain specific knowledge of DNN training and overlap parameter synchronization with computation in order to improve the training performance. We make two key observations: (1) the optimal data representation granularity for the communication may differ from that used by the underlying DNN model implementation and (2) different parameters can afford different synchronization delays. Based on these observations, we propose a new synchronization mechanism called Priority-based Parameter Propagation (P3). P3 synchronizes parameters at a finer granularity and schedules data transmission in such a way that the training process incurs minimal communication delay. We show that P3 can improve the training throughput of ResNet-50, Sockeye and VGG-19 by as much as 25%, 38% and 66% respectively on clusters with realistic network bandwidth.

View record

The unbalancing act: proxy preservation for censorship resistance systems (2019)

Internet censorship is a form of digital authoritarianism in certain countries that restrict access to the internet. Internet freedom, to a degree, is possible even in these countries by means of proxies maintained outside of the censor's boundaries. These proxies can be compromised by censors who pose as legitimate users to discover proxies. Censors are powerful adversaries and may block access to any proxy once they know about it.We propose a novel technique to address the proxy distribution problem in this thesis. We introduce the needle algorithm that preserves proxies by limiting their distribution. We show that it is a useful mechanism for both preserving proxies and maintaining client service under a censorship threat model. We examine characteristics of the needle algorithm in a simulation. Three measures are important under the censorship threat model; the enumeration or discovery of all proxies, load balancing guarantees, and the collateral damage of innocent bystanders. We compare the results of these experiments with two well-known algorithms, uniform random and power of 2 choices, as well as Tor's bridgedb proxy assignment mechanism.

View record

XSnare: application-specific, cross-site scripting protection (2019)

We present XSnare, a fully client-side Cross-site Scripting (xss) solution,implemented as a Firefox extension. Our approach takes advantage of availableprevious knowledge of a web application’s Hypertext Markup Language(html) template content, as well as the rich context available inthe Document Object Model (dom) to block xss attacks. XSnare preventsxss exploits by using a database of exploit descriptions, which are writtenwith the help of previously recorded Common Vulnerabilities and Exposuress(cves). cves for xss are widely available and are one of the mainways to tackle zero-day exploits. XSnare effectively singles out potentialinjection points for exploits in the html and sanitizes content to preventmalicious payloads from appearing in the dom.XSnare can protect application users before application developers releasepatches and before server operators apply them.We evaluate our approach by studying 105 recent cves related to xss attacks,and find that our tool defends against 94.2% of these exploits. To thebest of our knowledge, XSnare is the first protection mechanism for xss thatis application-specific, and based on publicly available cve information. Weshow that XSnare’s specificity protects users against exploits which evadeother, more generic, anti-xss approaches.Our performance evaluation shows that our extension’s overhead on webpage loading time is less than 10% for 72.6% of the sites in the Moz Top500 list.

View record

Dancing in the dark: private multi-party machine learning in an untrusted setting (2018)

The problem of machine learning (ML) over distributed data sourcesarises in a variety of domains. Unfortunately, today's distributed MLsystems use an unsophisticated threat model: data sources must trust acentral ML process.We propose a brokered learning abstraction that provides datasources with provable privacy guarantees while allowing them tocontribute data towards a globally-learned model in an untrustedsetting. We realize this abstraction by building on the state of theart in multi-party distributed ML and differential privacy methods toconstruct TorMentor, a system that is deployed as a hiddenservice over an anonymous communication protocol.We define a new threat model by characterizing, developing andevaluating new attacks in the brokered learning setting, along witheffective defenses for these attacks. We show that TorMentoreffectively protects data sources against known ML attacks whileproviding them with a tunable trade-off between model accuracy andprivacy.We evaluate TorMentor with local and geo-distributed deployments onAzure. In an experiment with 200 clients and 14 megabytes of data perclient our prototype trained a logistic regression model usingstochastic gradient descent in 65 seconds.

View record

Inferring and asserting distributed invariants (2018)

Distributed systems are difficult to debug and understand. A key reason for this isdistributed state, which is not easily accessible and must be pieced together fromthe states of the individual nodes in the system.We propose Dinv, an automatic approach to help developers of distributed sys-tems uncover the runtime distributed state properties of their systems. Dinv usesstatic and dynamic program analyses to infer relations between variables at differ-ent nodes. For example, in a leader election algorithm, Dinv can relate the variableleader at different nodes to derive the invariant ∀ nodes i, j, leader i = leader j .This can increase the developer’s confidence in the correctness of their system.The developer can also use Dinv to convert an inferred invariant into a distributedruntime assertion on distributed state.We applied Dinv to several popular distributed systems, such as etcd Raft,Hashicorp Serf, and Taipei-Torrent, which have between 1.7K and 144K LOC andare widely used. Dinv derived useful invariants for these systems, including invari-ants that capture the correctness of distributed routing strategies, leadership, andkey hash distribution. We also used Dinv to assert correctness of the inferred etcdRaft invariants at runtime, using these asserts to detect injected silent bugs.

View record

Inter-process communication in disaggregated datacenters (2018)

Disaggregation is a promising new datacenter (DC) architecture which aims to mit- igate mounting DC costs. Disaggregated datacenters (DDCs) disaggregate tradi- tional server components into distinct resources. Disaggregation also poses an interesting paradigm shift. Namely, a DDC possesses traits akin to a distributed system, as resources no longer fate-share: a CPU can fail independently of another CPU. It is not unreasonable to assume that these disaggregated resources will still be presented to a user as a single machine. This requirement has implications for disaggregated system design. For example, what happens if a CPU fails during a remote cross-processor procedure call?This is not a new question, as distributed systems, multi-processor systems, and high performance computing (HPC) systems, have grappled with this challenge. We look at how this challenge translates to a disaggregated context, in particular, focusing on the remote procedure call (RPC) abstraction. We design a disaggregated system, Bifröst, to ensure exactly-once semantics for procedure calls under failure scenarios and provide strict memory consistency. We analyze the overhead of Bifröst compared to an equivalent RPC implementation in Thrift. Although, we find that Bifröst has a higher overhead than Thrift, its results are still promising, showing that we can achieve greater functionality than Thrift with a slightly higher overhead.

View record

"Not able to resist the urge" : social insider attacks on Facebook (2017)

Facebook accounts are secured against unauthorized access through passwords, and through device-level security. Those defenses, however, may not be sufficient to prevent social insider attacks, where attackers know their victims, and gain access to their accounts using the victim's device. To characterize these attacks, we ran two Amazon Mechanical Turk studies geographically restricting participant pool to US only. Our major goal was to establish social insider attack prevalence and characteristics to justify a call to action for better protective and preventative countermeasures against it. In the first study involving 1308 participants, we used the list experiment, a quantitative method to estimate that 24% of participants had perpetrated social insider attacks, and that 21% had been victims to it (and knew about it). In the second, qualitative study with 45 participants, we collected stories detailing personal experiences with such attacks. Using thematic analysis, we typified attacks around 5 motivations (fun, curiosity, jealousy, animosity and utility), and explored dimensions associated with each type. Our combined findings indicate a number of trends in social insider attacks. We found that they are common, they can be perpetrated by almost all social relations and often have serious emotional consequences. Effective mitigation would require a variety of approaches as well as better user awareness. Based on the results of our experiments, we propose methodological steps to study the perception of severity of social insider attacks. In this procedure, we include an experimental design of the study and its possible limitations. The study consists of presenting stories collected in the previously mentioned second study to a new cohort of participants. It the asks them to provide a Likert Scale rating and justification for how severe they perceive the attack in the story to be if they were the victim as well as how likely they feel they might be a victim to such an attack. Lastly, we discuss possible future work in creating countermeasures to social insider attacks, their viability and limitations. We conclude that no single technique is complete solution. Instead mitigation will require a number of techniques in combination to be effective.

View record

Cross-platform data integrity and confidentiality with graduated access control (2017)

Security of data is tightly coupled to its access policy. However, in practice, a data owner has control of his data’s access policies only as far as the boundaries of his own systems. We introduce graduated access control, which provides mobile, programmable, and dynamically-resolving policies for access control that extends a data owner’s policies across system boundaries. We realize this through a novel data-centric abstraction called trusted capsules and its associated system, the trusted data monitor. A trusted capsule couples data and policy into a single mobile unit. A capsule is backwards-compatible and is indistinguishable from any regular file to applications. In coordination with the trusted data monitor, a capsule provides data integrity and confidentiality on remote devices, strong authentication to a trusted capsule service, and supports nuanced and dynamic access control decisions on remote systems. We implemented our data monitor using ARM TrustZone. We show that graduated access control can express novel and useful real world policies, such as revocation, remote monitoring, and risk-adaptable disclosure. We illustrate trusted capsules for different file formats, including JPEG, FODT, MP4 and PDF. Wealso show compatibility with unmodified applications such as LibreOffice Writer, Evince, GpicView and VLC. In general, we found that applications operating on trusted capsules have varying performance, which depends on file size, application access patterns, and policy complexity.

View record

Qualitative Repository Analysis with RepoGrams (2015)

The availability of open source software projects has created an enormous opportunity for empirical evaluations in software engineering research. However, this availability requires that researchers judiciously select an appropriate set of evaluation targets and properly document this rationale. This selection process is often critical as it can be used to argue for the generalizability of the evaluated tool or method.To understand the selection criteria that researchers use in their work we systematically read 55 research papers appearing in six major software engineering conferences. Using a grounded theory approach we iteratively developed a codebook and coded these papers along five different dimensions, all of which relate to how the authors select evaluation targets in their work. Our results indicate that most authors relied on qualitative and subjective features to select their evaluation targets. Building on these results we developed a tool called RepoGrams, which supports researchers in comparing and contrasting source code repositories of multiple software projects and helps them in selecting appropriate evaluation targets for their studies. We describe RepoGrams's design and implementation, and evaluate it in two user studies with 74 undergraduate students and 14 software engineering researchers who used RepoGrams to understand, compare, and contrast various metrics on source code repositories. For example, a researcher interested in evaluating a tool might want to show that it is useful for both software projects that are written using a single programming language, as well as ones that are written using dozens of programming languages. RepoGrams allows the researcher to find a set of software projects that are diverse with respect to this metric. We also evaluate the amount of effort required by researchers to extend RepoGrams for their own research projects in a case study with 2 researchers. We find that RepoGrams helps software engineering researchers understand and compare characteristics of a project's source repository and that RepoGrams can be used by non-expert users to investigate project histories. The tool is designed primarily for software engineering researchers who are interested in analyzing and comparing source code repositories across multiple dimensions.

View record