Relevant Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - Nov 2020)
Fundamental data analytics tasks are often simple -- many useful and actionable insights can be garnered by simply filtering, grouping, and summarizing data. However the sheer volume of data to be analyzed, demands of a multi-user operating environment, and limitations of general purpose processors make it challenging to perform these operations efficiently at scale. This thesis presents two techniques that address these challenges to improve the response time of data analytics tasks: (1) Newly emerging programmable network processors can perform data analytics tasks at terabits per second. However, existing data analytics systems, like Apache Spark, cannot readily use network processors because network processors are very limited and cannot execute the tasks generated by existing analytics systems. Using network processors for analytics requires re-designing how existing systems compile and execute data processing tasks. This thesis introduces Jumpgate, a system that enables existing data processing systems to execute relational queries using network processors. Jumpgate compiles client requests to a novel execution model that coordinates execution on heterogeneous network processors. Jumpgate can already improve performance by 1.12-3x on industry standard benchmarks, and paves the way for adopting network processors for data analytics tasks. (2) Analytics systems often process similar queries, either submitted by the same or different users. Cross program memoization (CPM) is a technique to re-use results of prior computations across programs and users. However, CPM is often not enabled because prior implementations have high overhead and are unable to re-use the output of user-defined functions (UDFs). This thesis presents KeyChain, a CPM implementation that identifies equivalent UDFs and has low overhead so CPM can be always be enabled.
Parallel programming is hard and programmers still struggle to write code for shared memory multicore architectures that is both free of concurrency errors and efficient. Tools have advanced, but for tasks that are not embarrassingly parallel, or suitable for a limited model such as map/reduce, there is little help. We aim to address some major aspects of this still underserved area.We construct a model for parallelism, Data not Code (DnC), by starting with the observation that a majority of performance and problems in parallel programming are rooted in the manipulation of data, and that a better approach is to schedule data, not code. Data items don’t exist in a vacuum but are instead organized into collections, so we focus on concurrent access to these collections from both task and data parallel operations. These concepts are already embraced by many programming models and languages, such as map/reduce, GraphLab and SQL. We seek to bring the excellent principles embodied in these models, such as declarative data-centric syntax and the myriad of optimizations that it enables, to conventional programming languages, like C++, making them available in a larger variety of contexts.To make this possible, we define new language constructs and augment proven techniques from databases for accessing arbitrary parts of a collection in a familiar and expressive manner. These not only provide the programmer with constructs that are easy to use and reason about, but simultaneously allow us to better extract and analyze programmer intentions to automatically produce code with complex runtime optimizations.We present Cadmium, a proof of concept DnC language to demonstrate the effectiveness of our model. We implement a variety of programs and show that, without explicit parallel programming, they scale well on multicore architectures. We show performance competitive with, and often superior to, fine-grained locks, the most widely used method of preventing error-inducing data access in parallel operations.
Over the past decades, core speeds have been improving at a much higher rate than memory bandwidth. This has caused the performance bottlenecks in modern software to shift from computation to data transfers. Hardware caches were designed to mitigate this problem, based on the principles of temporal and spatial locality. However, with the increasingly irregular access patterns in software, locality is difficult to preserve. Researchers and practitioners devote a lot of time and effort to improving memory performance from the software side. This is done either by restructuring the code to make access patterns more regular, or by changing the layout of data in memory to better accommodate caching policies. Experts often use correlations between the access pattern of an algorithm and properties of the objects it operates on to devise new ways to lay data out in memory. Prior work has shown the memory layout design process to be largely manual and difficult enough to result in high level publications. Our contribution is a set of tools, techniques and algorithms for automatic extraction of correlations between data and access patterns of programs. In order to collect a sufficient level of details about memory accesses, we present a compiler-based access instrumentation framework called DINAMITE. Further, we introduce access graphs, a novel way of representing spatial locality properties of programs which are generated from memory access traces. We use access graphs as a basis for Hierarchical Memory Layouts -- a novel algorithm for estimating performance improvements to be gained from better data layouts. Finally, we present our Data-Driven Spatial Locality techniques which use the information available from previous steps to automatically extract the correlations between data and access patterns commonly used by experts to inform better layout design.
The problem of placement of threads, or virtual cores, on physical cores in a multicore system has been studied for over a decade. Despite this effort, we still do not know how to assign virtual to physical cores on a non-uniform memory access (NUMA) system so as to meet a performance target while minimizing resource consumption. Prior work has made large strides in this area, but these solutions either addressed hardware with specific properties, leaving us unable to generalize the models to other systems, or modeled much simpler effects than the actual performance in different placements.An interdependent problem is how to place memory on NUMA systems. Poor memory placement causes congestion on interconnect links, contention for memory controllers, and ultimately long memory access times and poor performance. Commonly used operating system techniques for NUMA memory placement fail to achieve optimal performance in many cases.Our contribution is a general framework for reasoning about workload placement and memory placement on machines with shared resources. This framework enables us to automatically build an accurate performance model for any machine with a hierarchy of known shared resources. Using our methodology, data center operators can minimize the number of NUMA (CPU+memory) nodes allocated for an application or a service, while ensuring that it meets performance objectives. More broadly, the methodology empowers them to efficiently “pack” virtual containers on the physical hardware. We also present an effective solution for placing memory that avoids congestion on interconnects due to memory traffic and additionally selects the best page size that balances translation lookaside buffer (TLB) effects against more granular memory placement. The solutions proposed can significantly improve performance and work at the operating system level so they do not require changes to applications.