Margo Seltzer
Research Interests
Relevant Thesis-Based Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Graduate Student Supervision
Doctoral Student Supervision
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
In an era of hardware diversity, managing applications' allocated memory is a complex task that can have significant performance repercussions. Non-uniformity in the memory hierarchy, along with the heterogeneity and asymmetry of chip designs, makes the cost of memory accesses unpredictable unless the allocated memory is managed carefully. Poor memory allocation and placement on symmetric, non-uniform memory access (NUMA) server systems can cause interconnect link congestion and memory controller contention, which can drastically impact performance. Asymmetric systems consisting of CPU and GPU cores suffer similarly, depending on how memory is allocated and what types of cores access it. Furthermore, modern chip designs integrate compute units into the memory modules to bring compute closer to memory and eliminate the high cost of transferring data. Hence, accessing memory efficiently is becoming increasingly challenging as systems evolve toward heterogeneity.

Our contribution is a detailed analysis of, and insight into, software and hardware memory management techniques on modern symmetric server-class processors, on asymmetric heterogeneous processors consisting of integrated CPU and GPU cores, and on recently commercially available Processing In Memory (PIM) systems. In particular, we examine three types of systems affected by memory access bottlenecks: first NUMA systems, then integrated CPU-GPU systems, and finally PIM systems. We analyze these systems and suggest possible solutions to mitigate memory access issues.

For NUMA systems, we propose a holistic memory management algorithm that intelligently distributes memory pages to reduce congestion and improve performance. We then provide a detailed analysis of memory management methods on integrated CPU-GPU systems, focusing on performance and functionality trade-offs. Our goal is to expose the performance impact of memory management techniques and assess the viability and advantage of running applications with complex behaviors on such integrated CPU-GPU systems. Finally, we examine PIM systems through a case study of image decoding, a critical stage for many deep learning applications. We show that the extreme compute scalability of PIM systems can be harnessed to accelerate image decoding, with performance potential that can surpass CPUs and GPUs.
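To make the page-placement idea concrete, here is a minimal Python sketch of congestion-aware placement on a NUMA machine: each page goes to the memory controller with the lowest projected load instead of being interleaved round-robin. The cost model, function names, and parameters are hypothetical illustrations, not the algorithm from the dissertation.

# Minimal sketch of congestion-aware page placement on a NUMA system.
# The load model and all names are illustrative assumptions.

def place_pages(num_pages, nodes, access_rate):
    """Assign each page to the NUMA node with the lowest projected
    memory-controller load, adapting to uneven access rates instead
    of using a fixed round-robin interleave."""
    load = {n: 0.0 for n in nodes}          # projected controller load per node
    placement = {}
    for page in range(num_pages):
        rate = access_rate(page)            # expected accesses/sec for this page
        target = min(load, key=load.get)    # least-loaded memory controller
        placement[page] = target
        load[target] += rate
    return placement, load

# Toy usage: pages 0-7 are "hot", the rest cold.
placement, load = place_pages(
    num_pages=32,
    nodes=[0, 1, 2, 3],
    access_rate=lambda p: 100.0 if p < 8 else 1.0,
)
print(placement)
print(load)   # hot pages end up spread across controllers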
View record
People store increasing amounts of personal data digitally, from emails to credit cards. Two prevalent places this data is stored are cloud platforms hosted by third parties and mobile devices, which are easily lost or stolen and which run any of millions of untrusted third-party applications.

We explore security through isolation as a means to protect the sensitive data residing on cloud and mobile platforms. We carefully consider the attributes of each platform and the specifics of the attacks we are trying to protect against to select isolation mechanisms that provide the necessary security benefit without incurring an undue performance penalty.

Today's cloud platforms provide isolation through virtualization boundaries, which are typically managed by a monolithic control VM. We decompose such monolithic entities to reduce the attack surface. We break apart the control VM of Xen, a mature virtualization platform, into least-privilege components. We leverage this disaggregation to restart these components frequently, reducing the time window for attacks.

Today's mobile platforms provide isolation through passwords and process boundaries. However, these protection mechanisms do little once an attacker can access physical memory directly. We encrypt sensitive data while it is in memory to prevent direct physical access to it. We leverage cache locking to provide a safe location, embedded within the system chip itself, in which to decrypt application data as it is required.

Sharing data between applications is crucial for mobile platforms and is achieved using inter-process communication (IPC). An attacker who gains control of the OS also gains access to all this shared data. We encrypt IPC using a security monitor that operates outside the OS. Leveraging previous work on strong application boundaries, we provide end-to-end encrypted IPC, preventing a compromised OS from accessing this sensitive data.

We demonstrate three systems. First, we disaggregate Xen's monolithic control VM, improving security at a performance cost of 2% or less for most benchmarks. Second, we protect sensitive data on mobile devices from physical memory attacks while keeping performance within 5% of baseline for normal Android application usage. Third, we protect all IPC on Android devices with no noticeable performance overhead.
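The end-to-end encrypted IPC idea can be sketched in a few lines: sender and receiver share a key provisioned by a trusted monitor, so a compromised OS relaying the message sees only ciphertext. This conceptual sketch uses the third-party Python `cryptography` package for AES-GCM; the actual system operates below the OS and is not a Python library, and all identifiers here are hypothetical.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)   # provisioned by the security monitor
aead = AESGCM(key)

def send_ipc(plaintext: bytes, sender_id: bytes) -> tuple[bytes, bytes]:
    nonce = os.urandom(12)
    # sender_id is bound as associated data so the OS cannot splice messages
    return nonce, aead.encrypt(nonce, plaintext, sender_id)

def recv_ipc(nonce: bytes, ciphertext: bytes, sender_id: bytes) -> bytes:
    return aead.decrypt(nonce, ciphertext, sender_id)  # raises if tampered with

nonce, ct = send_ipc(b"sensitive payload", b"app.bank")
print(recv_ipc(nonce, ct, b"app.bank"))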
View record
Master's Student Supervision
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Graph partitioning plays a pivotal role in various distributed graph processing applications, including graph analytics, graph neural network training, and distributed graph databases. A good graph partitioner reduces workload execution time, worker imbalance, and network overhead. Graphs that require distributed settings are often too large to fit in the main memory of a single machine. This challenge renders traditional in-memory graph partitioners infeasible, leading to the emergence of streaming solutions. Streaming partitioners produce lower-quality partitions, because they work from partial information and must make premature decisions before they have a complete view of a vertex's neighborhood. We introduce CUTTANA, a streaming graph partitioner that partitions massive graphs (Web/Twitter scale) with superior quality compared to existing streaming solutions. CUTTANA uses a novel buffering technique that prevents the premature assignment of vertices to partitions and a scalable coarsening and refinement technique that enables a complete graph view, improving the intermediate assignment made by a streaming partitioner. We implemented a parallel version of CUTTANA that offers nearly the same partitioning latency as existing streaming partitioners. Our experimental analysis shows that CUTTANA consistently yields better partitioning quality than existing state-of-the-art streaming vertex partitioners in terms of both edge-cut and communication volume metrics. We also evaluate the workload latencies that result from using CUTTANA and other partitioners in distributed graph analytics and databases. CUTTANA outperforms the other methods in most scenarios (algorithms, datasets). In analytics applications, CUTTANA improves runtime performance by up to 59% compared to various streaming partitioners (HDRF, Fennel, Ginger, HeiStream). In graph database tasks, CUTTANA results in higher query throughput by up to 23%, without hurting tail latency.
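The buffering idea can be illustrated with a short sketch: a vertex's assignment is deferred until enough of its neighborhood has been observed, then it is placed greedily. The threshold and scoring below are illustrative stand-ins, not CUTTANA's actual policy.

from collections import defaultdict

def buffered_partition(edge_stream, k, buffer_threshold=3):
    neighbors = defaultdict(set)   # buffered partial neighborhoods
    assignment = {}
    sizes = [0] * k

    def flush(v):
        # Greedy: favor the partition holding most of v's known neighbors,
        # penalized by partition size to keep the load balanced.
        scores = [0.0] * k
        for u in neighbors[v]:
            if u in assignment:
                scores[assignment[u]] += 1.0
        best = max(range(k), key=lambda p: scores[p] - 0.01 * sizes[p])
        assignment[v] = best
        sizes[best] += 1

    for u, v in edge_stream:
        neighbors[u].add(v)
        neighbors[v].add(u)
        for w in (u, v):
            if w not in assignment and len(neighbors[w]) >= buffer_threshold:
                flush(w)   # assign only once the neighborhood is known
    for w in list(neighbors):       # flush whatever is still buffered
        if w not in assignment:
            flush(w)
    return assignment

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(buffered_partition(edges, k=2))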
View record
Temporal graphs have emerged as a fundamental tool for modeling dynamic systems across diverse domains. However, existing research predominantly centers on pairwise interactions, while many real-world complex systems involve interactions among multiple entities. Hypergraphs allow edges to connect any number of vertices, enabling the representation of higher-order structures in data. Consequently, extracting and learning patterns of temporal, higher-order interactions is important in domains such as social network analysis, neuroscience, and finance. This thesis makes two contributions. First, we introduce CAT-WALK, a method for representation learning on temporal hypergraphs. Then, we illustrate the value of temporal hypergraph modeling and CAT-WALK in neuroscience applications by introducing HYPERBRAIN, a framework for detecting abnormal co-activations in brain networks. CAT-WALK is an inductive method that uses a novel higher-order random walk to learn hyperedge representations. It uses a novel permutation-invariant pooling strategy in conjunction with a set-based anonymization process to hide the identity of hyperedges. Additionally, we present a straightforward yet effective neural network model for encoding hyperedges. Through extensive experiments, we demonstrate the efficacy of CAT-WALK in 1) predicting future hyperedges and 2) classifying nodes. The second part of this thesis demonstrates how we can encode brain networks as hypergraphs and use CAT-WALK to analyze them. HYPERBRAIN, our anomaly detection framework for temporal hypergraph brain networks, first represents fMRI time series data as temporal hypergraphs and then uses CAT-WALK for hypergraph representation learning. By customizing both the temporal higher-order walk and the training approach for the analysis of brain networks, HYPERBRAIN can effectively learn the structural and temporal properties of these networks and identify anomalous hyperedges in the brains of individuals with disorders. We evaluate the performance of HYPERBRAIN in both synthetic and real-world settings for Attention Deficit Hyperactivity Disorder (ADHD) and Autism Spectrum Disorder.
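A temporal, higher-order walk can be sketched as follows: each step moves from one hyperedge to an earlier hyperedge that shares a vertex, so walks respect time order. The uniform transition rule below is a simplified stand-in for CAT-WALK's actual walk definition, and all names are illustrative.

import random

def temporal_hyperwalk(hyperedges, start, length, rng=random.Random(0)):
    """hyperedges: list of (timestamp, frozenset_of_vertices) tuples."""
    walk = [start]
    t_cur, members_cur = hyperedges[start]
    for _ in range(length - 1):
        # candidates: strictly earlier hyperedges sharing at least one vertex
        candidates = [
            i for i, (t, members) in enumerate(hyperedges)
            if t < t_cur and members & members_cur
        ]
        if not candidates:
            break
        nxt = rng.choice(candidates)   # uniform here; the real walk is weighted
        walk.append(nxt)
        t_cur, members_cur = hyperedges[nxt]
    return walk

hedges = [
    (1, frozenset({1, 2, 3})),
    (2, frozenset({2, 4})),
    (3, frozenset({4, 5, 6})),
    (4, frozenset({3, 5})),
]
print(temporal_hyperwalk(hedges, start=3, length=3))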
View record
Graph partitioning is a crucial problem for processing and analyzing large graphs. The two main classes of graph partitioners are in-memory partitioners and streaming partitioners. In-memory partitioners require that the entire graph be read into memory before being partitioned into some number of partitions. Conversely, a streaming partitioner reads the graph one edge at a time and immediately places the source and destination vertices in a partition, unless they have already been placed. While in-memory partitioners produce higher-quality partitions, they require a large amount of memory and are slow, which makes them less practical for larger graphs. Streaming partitioners can partition large graphs quickly. However, because they lack a complete view of the graph, their placement decisions are not as good as those of in-memory partitioners, leading to lower-quality partitions. We designed and evaluated a two-phase partitioning algorithm (2GP) that combines the best of both worlds. 2GP has two phases: first, it reads a portion of the graph into memory and partitions it using an in-memory partitioner; then, it partitions the remaining graph using a streaming partitioner. 2GP achieves better partition quality than a state-of-the-art streaming partitioner, with significantly better performance.
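The two-phase structure can be sketched briefly: partition an in-memory prefix of the edge stream first, then stream the rest, placing each new vertex greedily next to already-placed neighbors (an LDG-style heuristic). 2GP's actual in-memory and streaming components differ; this only illustrates the shape of the algorithm.

from collections import defaultdict

def two_phase_partition(edges, k, prefix=4):
    adj = defaultdict(set)
    assignment, sizes = {}, [0] * k

    def place(v):
        # score = placed neighbors in partition p, minus a balance penalty
        scores = [sum(assignment.get(u) == p for u in adj[v]) for p in range(k)]
        best = max(range(k), key=lambda p: scores[p] - 0.5 * sizes[p])
        assignment[v] = best
        sizes[best] += 1

    # Phase 1: buffer a prefix in memory, then place its vertices by degree
    # (a stand-in for a real in-memory partitioner).
    for u, v in edges[:prefix]:
        adj[u].add(v); adj[v].add(u)
    for v in sorted(adj, key=lambda x: -len(adj[x])):
        place(v)

    # Phase 2: stream the remaining edges, placing unseen vertices on arrival.
    for u, v in edges[prefix:]:
        adj[u].add(v); adj[v].add(u)
        for w in (u, v):
            if w not in assignment:
                place(w)
    return assignment

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(two_phase_partition(edges, k=2))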
View record
The problem of identifying anomalies in dynamic networks is a fundamental task with a wide range of applications, from understanding brain diseases and disorders to fraud detection in financial networks. However, it raises critical challenges due to the complex nature of anomalies, the lack of ground-truth knowledge, and the complex, dynamic interactions in the network. Most existing approaches study simple networks with a single type of connection. However, many complex systems exhibit natural relationships with different types of connections, yielding multiplex networks. We first propose ANOMULY, a graph neural network-based unsupervised edge anomaly detection framework for multiplex dynamic networks. ANOMULY learns node encodings for each relation type separately and then uses an attention mechanism that incorporates information across the different types of relations. To improve generalizability and scalability, we further propose ADMIRE, an inductive and unsupervised anomaly detection method that extracts the causality behind the existence of connections via temporal network motifs. To extract these motifs, ADMIRE uses two different causal multiplex walks, inter-view and intra-view, that automatically extract and learn temporal multiplex network motifs. Despite the outstanding performance of ADMIRE, using it in sensitive decision-making tasks requires explanations for the model's predictions. Accordingly, we introduce an interpretable, weighted optimal sparse decision tree model, ADMIRE++, that mimics ADMIRE to provide explanations for its predictions. With extensive experiments, we show the efficiency and effectiveness of our approaches in detecting anomalous connections in various domains, including social and financial networks. We further focus on understanding abnormal brain activity in people living with Parkinson's Disease, Attention Deficit Hyperactivity Disorder, and Autism Spectrum Disorder, to show how these methods can assist in identifying biomarkers for these diseases.
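The cross-relation attention step can be illustrated with a tiny numpy sketch: a node has one embedding per relation type (one per network layer), and attention weights decide how much each layer contributes to the fused embedding. Dimensions and the scoring function are illustrative, not ANOMULY's architecture.

import numpy as np

rng = np.random.default_rng(0)
num_layers, dim = 3, 8
layer_embs = rng.normal(size=(num_layers, dim))  # one embedding per relation type
w = rng.normal(size=dim)                         # learned scoring vector (assumed)

scores = layer_embs @ w                          # one score per relation type
attn = np.exp(scores - scores.max())
attn /= attn.sum()                               # softmax over relation types
fused = attn @ layer_embs                        # attention-weighted fusion

print(attn)    # how much each relation type contributes
print(fused)   # the node's fused multiplex embedding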
View record
A self-driving laboratory is a cyber-physical system that uses software-controlled laboratory equipment, such as robot arms and smart devices, to permit autonomous experimentation. Intelligent systems in these laboratories can independently conduct experiments, analyze results, and identify a subsequent experiment to run. However, self-driving laboratories are vulnerable to security attacks due to their dependence on networked communication. Further, a naive researcher could make human errors while prototyping new experiments. Either an attacker or a naive researcher could create dangerous situations that pose risks to the safety and security of self-driving laboratories; for instance, they could make a robot arm crash into expensive equipment or launch dangerous experiments. We present Sarwat, a rule-based intrusion detection system (IDS) for self-driving laboratories, designed to prevent unsafe behavior. Sarwat uses a set of rules that define the actions allowed in a self-driving laboratory. If an action inside the laboratory violates any of the rules, Sarwat raises an alarm. Sarwat achieves an overall detection rate of 75%, making it effective for most of the unsafe scenarios we identified. We conducted a pilot study to evaluate the user-friendliness of Sarwat and found that its initial setup in a self-driving laboratory requires our assistance; once configured, however, it is easy to maintain, making it valuable for training new users and prototyping new experiments. Additionally, Sarwat introduces a minimal latency overhead of 1.5% to the ongoing experiment workflows of a self-driving laboratory. Therefore, Sarwat allows researchers in a self-driving laboratory to perform experiments safely and securely.
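The rule-checking idea can be sketched as predicates over proposed actions, where any violation raises an alarm before the action executes. The rule names, action fields, and thresholds below are hypothetical examples, not Sarwat's actual rule set.

RULES = [
    ("arm-speed-limit",
     lambda a: a.get("device") != "robot_arm" or a.get("speed", 0) <= 0.5),
    ("no-heating-unattended",
     lambda a: not (a.get("device") == "hotplate" and a.get("temp_c", 0) > 150
                    and not a.get("operator_present", False))),
    ("workspace-bounds",
     lambda a: a.get("device") != "robot_arm"
               or all(0.0 <= c <= 1.0 for c in a.get("target_xyz", (0, 0, 0)))),
]

def check(action):
    # collect every rule the proposed action violates
    violations = [name for name, ok in RULES if not ok(action)]
    if violations:
        print(f"ALARM: {action['device']} blocked, violated {violations}")
        return False
    return True

check({"device": "robot_arm", "speed": 0.9, "target_xyz": (0.2, 0.4, 1.6)})
check({"device": "hotplate", "temp_c": 120, "operator_present": False})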
View record
Graph-structured data, which models complex relationships, describes data in myriad domains, such as social network analysis, protein structure analysis, and supply chain management. As the size of graph data grows, researchers develop modern systems to handle massive datasets with trillions of entries. To accelerate processing, systems optimize the data layout of vertices and edges in memory or on disk, but, to date, researchers have treated the performance improvements due to vertex and edge orderings separately. This work investigates the interaction between vertex and edge orderings. We explore whether combining these orderings improves performance. An extensive performance study of different orderings finds that a specific combination, the SlashBurn vertex order and the Hilbert edge order, consistently provides the fastest runtimes for scale-free graphs. These results motivate our development of a parallelized SlashBurn and the Rhubarb edge ordering and blocking technique. Using 14 cores, Parallel SlashBurn is up to 11.96× faster than the sequential implementation. Rhubarb enables the scalable implementation of parallel edge-centric algorithms using the Hilbert curve and offers an end-to-end performance speedup of more than 2.5× for the Collaborative Filtering application over modern parallel graph processing systems.
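Hilbert edge ordering can be made concrete with the standard xy-to-Hilbert conversion: treating each edge (u, v) as a point in the adjacency matrix and sorting by its Hilbert-curve index keeps edges with nearby endpoints adjacent in memory, which is the cache-locality effect exploited here. This is the classic rotate-and-reflect algorithm; Rhubarb's blocking layer is not shown.

def hilbert_index(order, x, y):
    """Position of grid point (x, y) along a Hilbert curve covering a
    2**order x 2**order grid."""
    d = 0
    s = 1 << (order - 1)
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/reflect the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s >>= 1
    return d

edges = [(0, 7), (1, 1), (6, 2), (2, 3), (7, 7), (0, 1)]
edges.sort(key=lambda e: hilbert_index(3, e[0], e[1]))
print(edges)   # edges now laid out along the Hilbert curve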
View record
Reactive programs are programs that process external events, such as signals and messages. Device drivers and cloud microservices are examples of reactive programs. Systems built from reactive programs are concurrent and exhibit a high degree of nondeterminism, making non-exhaustive testing inadequate. A higher-level language can be used to write a specification: a formal definition of a program's desired behaviour. Such specifications may be easier to reason about, but proofs of the specification say nothing of a low-level implementation.

Program synthesis is one way to bridge this gap. A synthesizer searches a space of candidate programs for an implementation that satisfies the specification. In general, the number of candidate programs is infinite, making synthesis undecidable. Even a bounded search, e.g., on program length, soon becomes intractable as that search space grows exponentially.

This work introduces COMET, a system for synthesizing reactive programs from unbounded nondeterministic iterative transformations (UNITY) specifications. Recent work in symbolic execution and solver-aided synthesis has advanced the state of the art, but symbolic techniques also lead to exponential blowup. COMET seeks to avoid exponential blowup by constraining the search space and synthesizing in small steps. COMET synthesizes non-trivial programs for sequential and concurrent languages by applying three techniques: symbolic execution of the specification into guarded traces, intermediate target trace synthesis, and recursive synthesis of target expressions. To evaluate this approach, I synthesize Arduino and Verilog programs from UNITY specifications of the Paxos consensus proposer and acceptor roles.
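A toy enumerative search conveys the search-in-small-steps idea: enumerate small expressions and keep the first one consistent with traces derived from a specification. COMET's actual pipeline (UNITY symbolic execution, guarded traces, trace synthesis) is far richer; the grammar and traces below are invented for illustration.

import itertools

# Traces: (input state, expected output) pairs, as if extracted from a spec.
traces = [({"x": 1}, 3), ({"x": 2}, 5), ({"x": 5}, 11)]

LEAVES = ["x", "1", "2"]
OPS = ["+", "*"]

def candidates(depth):
    """Enumerate expressions up to the given nesting depth."""
    if depth == 0:
        yield from LEAVES
        return
    yield from candidates(depth - 1)
    for op, l, r in itertools.product(OPS, candidates(depth - 1), candidates(depth - 1)):
        yield f"({l} {op} {r})"

def synthesize(max_depth=2):
    # Search small programs first; return the first expression that
    # matches every trace.
    for depth in range(max_depth + 1):
        for expr in candidates(depth):
            if all(eval(expr, {}, env) == out for env, out in traces):
                return expr
    return None

print(synthesize())   # "(x + (x + 1))", equivalent to 2*x + 1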
View record
Recent studies demonstrated that the reproducibility of previously published computational experiments is inadequate. Many published computational experiments are not reproducible because their computational environment was never recorded or preserved. This environment consists of artifacts such as the packages installed in the language, the libraries installed on the host system, file names, and the directory hierarchy. Researchers have created reproducibility tools to help mitigate this problem, but these tools do nothing for the experiments that already exist in online repositories. The situation is not improving: researchers continue to publish results every year without using reproducibility tools, likely due to benign neglect, as it is commonly believed that publishing the code and data is sufficient for reproducibility. To clarify the gap between what existing reproducibility tools can do and the problem posed by already-published experiments, we define a framework that distinguishes between actions a researcher takes to facilitate reproducibility while the computational environment is still available and actions a researcher takes to enable reproduction of an experiment after that environment has been lost. The difference between these approaches lies in the availability of the computational environment: researchers who provide access to the original computational environment perform proactive reproducibility, while those who do not enable only retroactive reproducibility. We present Reproducibility as a Service (RaaS), which is, to our knowledge, the first reproducibility tool explicitly designed to facilitate retroactive reproducibility. We demonstrate how RaaS can fix many of the common errors found in R scripts on Harvard's Dataverse and preserve the recreated computational environment.
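One retroactive-reproducibility step can be sketched directly: scan an R script that shipped without environment metadata for library()/require() calls and emit install commands to rebuild a compatible environment. RaaS does considerably more (error-driven iteration, environment preservation); this sketch shows only the dependency-recovery idea, with an invented example script.

import re

R_SCRIPT = """
library(ggplot2)
require(dplyr)
data <- read.csv("results.csv")
suppressMessages(library(tidyr))
"""

# Recover declared dependencies from the script text alone.
pkgs = sorted(set(re.findall(
    r'(?:library|require)\(\s*["\']?(\w[\w.]*)["\']?\s*\)', R_SCRIPT)))
install_cmd = f'install.packages(c({", ".join(repr(p) for p in pkgs)}))'
print(pkgs)         # ['dplyr', 'ggplot2', 'tidyr']
print(install_cmd)  # an R command that rebuilds the missing packages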
View record
The advent of the Internet of Things (IoT) makes it possible for tiny devices with sensing and communication capabilities to be interconnected and to interact with the cyber-physical world. However, these tiny devices are typically powered by batteries and have limited memory, so they cannot run commodity operating systems designed for general-purpose computers, such as Windows and Linux. Embedded operating systems addressed this issue and established a solid foundation for developers writing applications for these tiny devices. IoT devices are deployed everywhere, from smart home appliances to self-driving vehicles, and their applications impose ever-increasing and more heterogeneous demands on software architecture. Many special-purpose and embedded operating systems have been built to satisfy these wildly different requirements, from early sensor network operating systems, such as TinyOS and Contiki, to more modern robot and real-time control systems, such as FreeRTOS and Zephyr. However, the rapid evolution and heterogeneity of IoT applications call for a different solution. Specifically, this work introduces Tinkertoy, a set of standard operating system components from which developers can assemble a custom system. Not only does the custom system provide precisely the functionality needed by an application, but it does so in up to four times less memory than other IoT operating systems while maintaining comparable performance.
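The assemble-from-components idea can be sketched as dependency resolution over a component catalog: an application declares the OS services it needs, and only those components plus their transitive dependencies are linked into the image, keeping the memory footprint minimal. The component names, dependencies, and sizes below are hypothetical, not Tinkertoy's actual catalog.

COMPONENTS = {
    "scheduler_rr":  {"deps": [],                "bytes": 900},
    "timers":        {"deps": ["scheduler_rr"],  "bytes": 400},
    "uart":          {"deps": [],                "bytes": 300},
    "net_udp":       {"deps": ["timers"],        "bytes": 1200},
    "fs_flash":      {"deps": [],                "bytes": 1500},
}

def assemble(required):
    """Select the requested components and their transitive dependencies."""
    selected, stack = set(), list(required)
    while stack:
        c = stack.pop()
        if c not in selected:
            selected.add(c)
            stack.extend(COMPONENTS[c]["deps"])
    return selected, sum(COMPONENTS[c]["bytes"] for c in selected)

image, footprint = assemble(["net_udp", "uart"])    # a sensing app's needs
print(sorted(image), f"{footprint} bytes")          # fs_flash is never linked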
View record