Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - May 2019)
Over the past decades, core speeds have improved at a much higher rate than memory bandwidth. This has caused the performance bottlenecks in modern software to shift from computation to data transfers. Hardware caches were designed to mitigate this problem, based on the principles of temporal and spatial locality. However, with the increasingly irregular access patterns in software, locality is difficult to preserve. Researchers and practitioners devote considerable time and effort to improving memory performance from the software side, either by restructuring code to make access patterns more regular, or by changing the layout of data in memory to better accommodate caching policies. Experts often exploit correlations between an algorithm's access pattern and the properties of the objects it operates on to devise new ways of laying data out in memory. Prior work has shown the memory layout design process to be largely manual and difficult enough that novel layouts merit publication at top venues. Our contribution is a set of tools, techniques and algorithms for automatically extracting correlations between data and the access patterns of programs. To collect a sufficient level of detail about memory accesses, we present DINAMITE, a compiler-based access instrumentation framework. Further, we introduce access graphs, a novel representation of a program's spatial locality properties, generated from memory access traces. We use access graphs as the basis for Hierarchical Memory Layouts -- a novel algorithm for estimating the performance improvements to be gained from better data layouts. Finally, we present our Data-Driven Spatial Locality techniques, which use the information from the previous steps to automatically extract the data/access-pattern correlations commonly used by experts to inform better layout design.
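To make the access-graph idea concrete, here is a minimal sketch of how a weighted graph could be built from a memory-access trace: nodes are program objects, and an edge between two objects is weighted by how often they are accessed close together in time, a proxy for the benefit of co-locating them in memory. The function name, the window parameter, and the trace format are illustrative assumptions, not the thesis's actual construction.

```python
from collections import defaultdict

def build_access_graph(trace, window=2):
    """Build a weighted access graph from a memory-access trace.

    trace: sequence of object identifiers, one per memory access,
           as an instrumentation tool such as DINAMITE might record.
    window: two accesses within this distance are treated as
            spatially correlated. (Illustrative sketch only; the
    actual access-graph construction in the thesis may differ.)
    """
    edges = defaultdict(int)
    for i, a in enumerate(trace):
        # Count co-occurrences of `a` with each object accessed
        # within the next `window - 1` accesses.
        for b in trace[i + 1 : i + window]:
            if a != b:
                edges[tuple(sorted((a, b)))] += 1
    return dict(edges)

# Hypothetical trace: objects A and B are accessed in alternation,
# so the edge (A, B) accumulates the highest weight.
graph = build_access_graph(["A", "B", "A", "B", "C", "A"])
```

Heavily weighted edges then suggest which objects a layout algorithm should place on the same cache line or page.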
The problem of placing threads, or virtual cores, on physical cores in a multicore system has been studied for over a decade. Despite this effort, we still do not know how to assign virtual to physical cores on a non-uniform memory access (NUMA) system so as to meet a performance target while minimizing resource consumption. Prior work has made large strides in this area, but these solutions either addressed hardware with specific properties, leaving us unable to generalize the models to other systems, or modeled much simpler effects than the actual performance in different placements.

An interdependent problem is how to place memory on NUMA systems. Poor memory placement causes congestion on interconnect links, contention for memory controllers, and ultimately long memory access times and poor performance. Commonly used operating system techniques for NUMA memory placement fail to achieve optimal performance in many cases.

Our contribution is a general framework for reasoning about workload placement and memory placement on machines with shared resources. This framework enables us to automatically build an accurate performance model for any machine with a hierarchy of known shared resources. Using our methodology, data center operators can minimize the number of NUMA (CPU+memory) nodes allocated to an application or service while ensuring that it meets its performance objectives. More broadly, the methodology empowers them to efficiently “pack” virtual containers onto the physical hardware. We also present an effective solution for placing memory that avoids congestion on interconnects due to memory traffic and additionally selects the best page size, balancing translation lookaside buffer (TLB) effects against more granular memory placement. The proposed solutions can significantly improve performance and work at the operating system level, so they require no changes to applications.
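The "packing" goal above can be illustrated with a toy first-fit-decreasing sketch: given per-container resource demands and a per-node capacity standing in for the performance objective, allocate the fewest NUMA nodes that keep every node within capacity. The function name, the scalar demand model, and the capacity abstraction are assumptions for illustration; the thesis's model-driven placement is considerably more sophisticated.

```python
def pack_onto_nodes(demands, node_capacity):
    """Greedily pack container demands onto the fewest NUMA nodes.

    demands: resource demand per container (a single scalar here,
             a deliberate simplification of a real performance model).
    node_capacity: capacity of one NUMA (CPU+memory) node; staying
             under it stands in for meeting the performance target.
    Returns the number of nodes allocated (first-fit decreasing).
    """
    nodes = []  # residual capacity of each allocated node
    for d in sorted(demands, reverse=True):
        for i, free in enumerate(nodes):
            if d <= free:          # fits on an existing node
                nodes[i] = free - d
                break
        else:
            nodes.append(node_capacity - d)  # allocate a new node
    return len(nodes)

# Four containers with demands summing to 2.0 fit on two unit nodes.
pack_onto_nodes([0.6, 0.5, 0.5, 0.4], 1.0)
```

A production placer would replace the scalar demand with a learned performance model of the machine's shared-resource hierarchy, but the objective, fewest nodes subject to a performance constraint, is the same.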