Tor Aamodt

Professor

Relevant Degree Programs

 

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2019)
Architectural support for inter-thread synchronization in SIMT architectures (2018)

Single-Instruction, Multiple-Thread (SIMT) architectures have seen widespread interest in accelerating data-parallel applications. In the SIMT model, small groups of scalar threads operate in lockstep. Within each group, current SIMT implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These thread scheduling constraints may cause a program that is deadlock-free on a multiple-instruction, multiple-data (MIMD) architecture to deadlock on a SIMT machine. Further, fine-grained synchronization is often implemented using busy-wait loops. However, busy-wait synchronization incurs significant overheads, and existing CPU solutions do not readily translate to SIMT architectures. In this thesis, we tackle these challenges. First, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. The static detection has a false detection rate of 4%-5%. The automated transformation has an average performance overhead of 8.2%-10.9% compared to manual transformation. We also propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG. Our hardware approach performs on par with the compiler transformation but avoids key limitations of the compiler-only solution. We show that this hardware can be further extended to support concurrent multi-path execution to improve the performance of divergent applications. Finally, we propose a hardware warp scheduling policy informed by a novel hardware mechanism for accurately detecting busy-wait synchronization on GPUs. When employed, it deprioritizes spinning warps, achieving a speedup of 42.7% over Greedy-Then-Oldest scheduling.
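
As an illustrative sketch (not code from the thesis), the CUDA fragment below shows the busy-wait idiom commonly identified as a source of SIMT deadlock on stack-based reconvergence hardware: the warp reconverges at the loop exit, so the thread that wins the lock can be held back waiting for threads that are still spinning on the lock it holds.

    // Hypothetical example: a spin lock that is deadlock-free on a MIMD
    // machine but can hang on stack-based SIMT hardware, because the winning
    // thread is forced to wait at the reconvergence point (the loop exit)
    // while the losing threads spin on a lock it has not yet released.
    __global__ void spin_lock_deadlock(int *lock, int *counter) {
        while (atomicCAS(lock, 0, 1) != 0) {
            // busy-wait: fine under MIMD scheduling, unsafe under lockstep SIMT
        }
        *counter += 1;           // critical section
        atomicExch(lock, 0);     // release sits after the reconvergence point
    }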

View record

GPU computing architecture for irregular parallelism (2015)

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the underutilization of SIMD hardware in GPUs due to the application's irregular control flow behavior, known as branch divergence. Another major development effort is to expose the available parallelism in the application as thousands of concurrent threads without introducing data races or deadlocks. GPU software developers may need to spend significant effort verifying the data synchronization mechanisms used in their applications. Despite various research studies indicating the potential benefits, the risks involved may discourage software developers from employing GPUs for this class of applications. This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence. Second, Kilo TM is a cost-effective, energy-efficient solution for supporting transactional memory (TM) on GPUs. With TM, programmers can use transactions instead of fine-grained locks to create deadlock-free, maintainable, yet aggressively parallelized code. In our evaluations, Kilo TM achieves a 192X speedup over coarse-grained locking and captures 66% of the performance of fine-grained locking with 34% energy overhead.
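
As a hedged illustration (my own sketch, not code from the dissertation), the CUDA kernel below shows the kind of fine-grained, per-bucket locking that transactional memory is meant to replace; with TM, the acquire/release pair around the update would become a single speculative transaction.

    // Sketch of fine-grained locking on a GPU hash table; with transactional
    // memory the lock acquire/release would be replaced by one transaction.
    __global__ void insert_with_bucket_locks(int *bucket_locks, int *buckets,
                                             const int *keys, int n_keys,
                                             int n_buckets) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n_keys) return;
        int b = keys[tid] % n_buckets;
        bool done = false;
        while (!done) {
            if (atomicCAS(&bucket_locks[b], 0, 1) == 0) {  // try to take the lock
                buckets[b] += 1;                 // critical section (placeholder update)
                atomicExch(&bucket_locks[b], 0); // release inside the divergent path
                done = true;
            }
        }
    }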

View record

Locality and scheduling in the massively multithreaded era (2015)

Massively parallel processing devices, like Graphics Processing Units (GPUs), have the ability to accelerate highly parallel workloads in an energy-efficient manner. However, executing irregular or less tuned workloads poses performance and energy-efficiency challenges on contemporary GPUs. These inefficiencies come from two primary sources: ineffective management of locality and decreased functional unit utilization. To mitigate these effects, GPU programmers are encouraged to restructure their code to fit the underlying hardware architecture, which affects the portability of their code and complicates the GPU programming process. This dissertation proposes three novel GPU microarchitecture enhancements for mitigating both the locality and utilization problems on an important class of irregular GPU applications. The first mechanism, Cache-Conscious Warp Scheduling (CCWS), is an adaptive hardware mechanism that makes use of a novel locality detector to capture memory reference locality that is lost by other schedulers due to excessive contention for cache capacity. On cache-sensitive, irregular GPU workloads, CCWS provides a 63% speedup over previous scheduling techniques. This dissertation uses CCWS to demonstrate that improvements to the hardware thread scheduling policy in massively multithreaded systems offer a promising new design space to explore in locality management. The second mechanism, Divergence-Aware Warp Scheduling (DAWS), introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture locality in loops. We demonstrate that the predictive, pre-emptive nature of DAWS can provide an additional 26% performance improvement over CCWS. This dissertation also demonstrates that DAWS can effectively shift the burden of locality management from software to hardware by increasing the performance of simpler and more portable code on the GPU. Finally, this dissertation details a Variable Warp-Size Architecture (VWS), which improves the performance of irregular applications by 35%. VWS improves irregular code by using a smaller warp size while maintaining the performance and energy-efficiency of regular code by ganging the execution of these smaller warps together in the warp scheduler.
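
To make the scheduling idea concrete, here is a minimal host-side C++ sketch of the intuition behind CCWS (an assumption-laden simplification, not the hardware described in the dissertation): each warp carries a lost-locality score, and the scheduler only lets a subset of warps issue so that the combined score fits a fixed budget, effectively reserving cache capacity for the warps that need it most.

    // Illustrative sketch only: throttle warps so the sum of lost-locality
    // scores among issuing warps stays within a budget.
    #include <vector>
    #include <algorithm>

    struct Warp { int id; int lost_locality_score; };

    // Return the ids of warps allowed to issue this cycle.
    std::vector<int> ccws_issue_set(std::vector<Warp> warps, int score_budget) {
        // Prefer warps with the highest scores (they most need cache space),
        // and cut off the remaining warps once the budget is exhausted.
        std::sort(warps.begin(), warps.end(),
                  [](const Warp &a, const Warp &b) {
                      return a.lost_locality_score > b.lost_locality_score;
                  });
        std::vector<int> allowed;
        int used = 0;
        for (const Warp &w : warps) {
            if (used + w.lost_locality_score > score_budget && !allowed.empty())
                break;                       // throttle the rest of the warps
            used += w.lost_locality_score;
            allowed.push_back(w.id);
        }
        return allowed;
    }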

View record

Designing network-on-chips for throughput accelerators (2014)

Physical limits of power usage for integrated circuits have steered the microprocessor industry towards parallel architectures in the past decade. Modern Graphics Processing Units (GPUs) are a form of parallel processor that uses chip area more effectively than traditional single-threaded architectures by favouring application throughput over latency. Modern GPUs can be used as throughput accelerators: accelerating massively parallel non-graphics applications. As the number of compute cores in throughput accelerators increases, so does the importance of efficient memory subsystem design. In this dissertation, we present system-level microarchitectural analysis and optimizations with an emphasis on the memory subsystem of throughput accelerators that employ Bulk-Synchronous-Parallel programming models such as CUDA and OpenCL. We model the whole throughput accelerator as a closed-loop system in order to capture the effects of complex interactions of microarchitectural components: we simulate components such as compute cores, the on-chip network, and memory controllers with cycle-level accuracy. For this purpose, the first version of the GPGPU-Sim simulator capable of running unmodified applications by emulating NVIDIA's virtual instruction set was developed. We use this simulator to model and analyze several applications and explore various microarchitectural tradeoffs for throughput accelerators to better suit these applications. Based on our observations, we identify the Network-on-Chip (NoC) component of the memory subsystem as our main optimization target and set out to design throughput-effective NoCs for future throughput accelerators. We provide a new framework for NoC researchers to ensure that optimizations are "throughput effective", that is, they improve parallel application-level performance per unit chip area. We then use this framework to guide the development of several optimizations. Accelerator workloads demand high off-chip memory bandwidth, resulting in a many-to-few-to-many traffic pattern. Leveraging this observation, we reduce NoC area by proposing a checkerboard NoC which utilizes routers with limited connectivity. Additionally, we improve performance by increasing the terminal bandwidth of memory controller nodes to better handle frequent read-reply traffic. Furthermore, we propose a double checkerboard inverted NoC organization which maintains the benefits of these optimizations while having a simpler routing mechanism and a smaller area, resulting in a 24.3% improvement in average application throughput per unit area.
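
The checkerboard idea can be pictured with a small sketch (illustrative assumptions only; the placement rule and mesh size below are not taken from the dissertation): mesh positions of one parity get full routers, while the rest get cheaper routers with limited connectivity.

    // Illustrative sketch: print which nodes of a small mesh would get full
    // routers ('F') versus limited-connectivity routers ('.') under a simple
    // checkerboard parity rule.
    #include <cstdio>

    bool is_full_router(int x, int y) {
        return ((x + y) % 2) == 0;   // checkerboard parity (assumed rule)
    }

    int main() {
        const int W = 6, H = 6;      // mesh size chosen only for the example
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x)
                std::printf("%c ", is_full_router(x, y) ? 'F' : '.');
            std::printf("\n");
        }
        return 0;
    }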

View record

Master's Student Supervision (2010 - 2018)
A mix-grained architecture for improving HLS-generated controllers on FPGAs (2017)

With the recent slowdowns in traditional technology scaling, hardware accelerators, such as Field Programmable Gate Arrays (FPGAs), offer the potential for improved performance and energy efficiency compared to general purpose processing systems. While FPGAs were traditionally used for applications such as signal processing, they have recently gained popularity in new, larger-scale domains, such as cloud computing. However, despite their performance and power efficiency, programming FPGAs remains a hard task due to the difficulties involved with the low-level design flow for FPGAs. High-Level Synthesis (HLS) tools aim to assist with this time-consuming task by supporting higher-level programming models, which significantly increases design productivity. This also makes using FPGAs for large-scale design development of evolving applications more feasible. In this thesis we explore the potential of modifying the current FPGA architecture to better support the designs generated by HLS tools. We propose a specialized mix-grained architecture for Finite State Machine (FSM) implementation that can be integrated into existing FPGA architectures. The proposed mix-grained architecture exploits the characteristics of the controller units generated by HLS tools to reduce the control-path area of the design. We show that our proposed architecture reduces the area of the next-state calculation in FSMs by more than 3X without impacting performance, and often reduces its critical path delay.
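
For context, the hedged sketch below (hypothetical states, not from the thesis) shows the kind of next-state logic an HLS-generated FSM controller contains; this next-state calculation is what a specialized fabric would implement more compactly than generic FPGA soft logic.

    // Illustrative sketch of an HLS-style controller: a state register plus
    // next-state logic. The states and conditions here are made up.
    enum State { FETCH, EXEC, WRITEBACK, DONE };

    State next_state(State s, bool mem_ready, bool last_iter) {
        switch (s) {                       // next-state calculation
            case FETCH:     return mem_ready ? EXEC : FETCH;
            case EXEC:      return WRITEBACK;
            case WRITEBACK: return last_iter ? DONE : FETCH;
            default:        return DONE;
        }
    }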

View record

Accelerating web search using GPUs (2015)

The amount of content on the Internet is growing rapidly, as is the number of online users. As a consequence, web search engines need to continually increase their computing capacity and index size while maintaining low search latency and avoiding a significant rise in the cost per query. To serve this larger number of online users, web search engines rely on large distributed systems in data centers. They partition their data across hundreds of thousands of independent commodity servers called Index Serving Nodes (ISNs). These ISNs work together to serve search queries as a single coherent system in a distributed manner. The choice of a large number of commodity servers over a smaller number of supercomputers is driven by the need for scalability, high availability/reliability, performance, and cost efficiency. To serve a larger index, web search engines can be scaled either vertically or horizontally [michael2007scale]. Vertical scaling enables ranking more documents per query within a single node by employing machines with higher single-thread and throughput performance and with bigger, faster memory. Horizontal scaling supports a larger index by adding more index serving nodes, at the cost of increased synchronization and aggregation overhead and power consumption. This thesis evaluates the potential for achieving better vertical scaling by using Graphics Processing Units (GPUs) to reduce the document-ranking latency per query at a reasonable initial cost increase. It achieves this by exploiting the parallelism in ranking the numerous candidate documents that match a query and offloading this work to the GPU. We evaluate this approach using hundreds of rankers from a commercial web search engine on real production data. Our results show an 8.8x harmonic-mean reduction in latency and a 2x improvement in power efficiency when ranking 10,000 web documents per query for a variety of rankers, using C++AMP and a commodity GPU.
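
As a rough illustration of the offloaded work (a sketch with an assumed linear ranking model, standing in for the proprietary rankers and the C++AMP code used in the thesis), a GPU kernel can score all candidate documents for a query in parallel, one thread per document.

    // Illustrative CUDA sketch: score candidate documents for one query in
    // parallel; the linear feature model is a stand-in for real rankers.
    __global__ void rank_documents(const float *features,   // [n_docs][n_feat]
                                   const float *weights,    // [n_feat]
                                   float *scores, int n_docs, int n_feat) {
        int doc = blockIdx.x * blockDim.x + threadIdx.x;
        if (doc >= n_docs) return;
        float s = 0.0f;
        for (int f = 0; f < n_feat; ++f)
            s += features[doc * n_feat + f] * weights[f];
        scores[doc] = s;    // top-k selection of results would follow
    }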

View record

Inter-core locality aware memory access scheduling (2015)

Graphics Processing Units (GPUs) run thousands of parallel threads and achieve high Memory Level Parallelism (MLP). To support high MLP, a structure called a Miss-Status Holding Register (MSHR) handles multiple in-flight miss requests. When multiple cores send requests to the same cache line, the requests are merged into one last level cache MSHR entry and only one memory request is sent to the Dynamic Random-Access Memory (DRAM). We call this inter-core locality. The main reason for inter-core locality is that multiple cores access shared read-only data within the same cache line. By prioritizing memory requests that have high inter-core locality, more threads can resume execution. Many memory access scheduling policies have been proposed for general-purpose multi-core processors and GPUs. However, some of these policies do not consider the characteristics of GPUs and others do not utilize inter-core locality information. In this thesis, we analyze the reasons that inter-core locality exists and show that requests with more inter-core locality have a higher impact on performance. To exploit inter-core locality, we enable the GPU DRAM controller to be aware of inter-core locality by using Level 2 (L2) cache MSHR information. We propose a memory scheduling policy to coordinate the last level cache MSHR and the DRAM controller: 1) we introduce a structure to enable the DRAM controller to be aware of L2 cache MSHR information; 2) we propose a memory scheduling policy that uses L2 cache MSHR information; and 3) to prevent starvation, we introduce age information into the scheduling policy. Our evaluation shows a 28% memory request latency reduction and an 11% performance improvement on average for high inter-core locality benchmarks.
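
A simplified host-side sketch of the scheduling idea (illustrative only, not the proposed hardware) might look like the following: among ready DRAM requests, prefer the one whose L2 MSHR entry has the most merged per-core requests, with an age threshold acting as the starvation guard.

    // Illustrative sketch: pick the next DRAM request by inter-core locality
    // (number of cores merged into the request's MSHR entry), with aging.
    #include <vector>
    #include <cstdint>

    struct MemRequest {
        uint64_t addr;
        int merged_cores;   // inter-core locality: cores waiting on this line
        uint64_t age;       // cycles since arrival, used to prevent starvation
    };

    const MemRequest *pick_next(const std::vector<MemRequest> &ready,
                                uint64_t age_threshold) {
        const MemRequest *best = nullptr;
        for (const MemRequest &r : ready) {
            if (r.age >= age_threshold) return &r;     // starvation guard first
            if (!best || r.merged_cores > best->merged_cores ||
                (r.merged_cores == best->merged_cores && r.age > best->age))
                best = &r;
        }
        return best;
    }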

View record

Deterministic execution on GPU architectures (2013)

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one’s ability to test for correctness. This non-reproducibility is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. Here we propose a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this thesis we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet’s design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only a 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
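
A small CUDA example of the kind of nondeterminism a deterministic GPU environment targets (my illustration, not from the thesis): with floating-point atomics, the accumulation order depends on warp scheduling, so the rounded result can differ between runs even for identical inputs.

    // Illustrative sketch: the commit order of these atomic adds varies run to
    // run, so the floating-point sum is not bit-reproducible.
    __global__ void nondeterministic_sum(const float *x, float *sum, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(sum, x[i]);   // order depends on warp scheduling
    }
    // A deterministic scheme instead fixes the accumulation order, e.g. a tree
    // reduction or per-thread partial sums combined in thread-id order.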

View record

Improving GPU programming models through hardware cache coherence (2013)

Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over contemporary chip multiprocessors (CMPs) on massively parallel programs. The lack of well-defined GPU memory models, however, prevents support of high-level languages like C++ and Java and negatively impacts GPU programmability. This thesis proposes to improve GPU programmability by adding support for a well-defined memory consistency model through hardware cache coherence. We show that GPU coherence introduces a new set of challenges different from those posed by scalable cache coherence for CMPs. First, introducing conventional directory coherence protocols adds unnecessary coherence traffic overhead to existing GPU applications. Second, the massively multithreaded GPU architecture presents significant storage overheads for buffering thousands of in-flight coherence requests. Third, these protocols increase the verification complexity of the GPU memory system. Recent research on Library Cache Coherence (LCC) explored the use of time-based approaches in CMP coherence protocols. This thesis describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present two implementations of TC, called TC-Strong and TC-Weak. TC-Strong implements an optimized version of LCC, while TC-Weak uses a novel timestamp-based memory fence mechanism to implement Release Consistency on GPUs. TC-Weak eliminates TC-Strong’s trade-off between stalling stores and increasing L1 miss rates, improving performance and reducing interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications requiring coherence by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also show that write-through protocols outperform a writeback protocol on a GPU, as the latter suffers from increased traffic due to unnecessary refills of write-once data.
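
The time-based idea can be sketched in a few lines of host C++ (an illustrative simplification under assumed structures, not the thesis RTL): each L1 copy is valid only until a lease expires against a globally synchronized counter, so invalidation requires no messages.

    // Illustrative sketch of lease-based, time-driven coherence.
    #include <cstdint>

    struct CacheLine {
        uint64_t tag;
        uint64_t lease_expiry;   // global time at which this copy self-invalidates
        bool     valid;
    };

    bool l1_hit(const CacheLine &line, uint64_t tag, uint64_t global_time) {
        return line.valid && line.tag == tag && global_time < line.lease_expiry;
    }

    // On an L2 fill, hand out a fixed-length lease; writers (or, in a weaker
    // scheme, the next memory fence) wait until outstanding leases expire.
    uint64_t grant_lease(uint64_t global_time, uint64_t lease_len) {
        return global_time + lease_len;
    }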

View record

Optimizing network-on-chips for FPGAs (2013)

As larger System-on-Chip (SoC) designs are attempted on Field Programmable Gate Arrays (FPGAs), the need for a low cost and high performance Network-on-Chip (NoC) grows. Virtual Channel (VC) routers provide desirable traits for an NoC, such as higher throughput and deadlock prevention, but at significant resource cost when implemented on an FPGA. This thesis presents an FPGA-specific optimization to reduce resource utilization. We propose sharing Block RAMs between multiple router ports to store the VC buffers, which otherwise consume substantial logic resources, and present the Block RAM Split (BRS) router architecture that implements the proposed optimization. We evaluate the performance of the modifications using synthetic traffic patterns on mesh and torus networks and synthesize the NoCs to determine overall resource usage and maximum clock frequency. We find that the additional logic to support sharing Block RAMs has little impact on Adaptive Logic Module (ALM) usage in designs that currently use Block RAMs, while decreasing Block RAM usage by as much as 40%. In comparison to CONNECT, a router design that does not use Block RAMs, a 71% reduction in ALM usage is shown to be possible. This resource reduction comes at the cost of a 15% reduction in saturation throughput for uniform random traffic and a 50% decrease for the worst-case neighbour traffic pattern on a mesh network. The throughput penalty under the neighbour traffic pattern can be reduced to 3% if a torus network is used. In all cases, there is little change in network latency at low load. BRS is capable of running at 161.71 MHz, a decrease of only 4% from the base VC router design. Determining the optimum NoC topology is a challenging task. This thesis also presents initial work towards an analytical model to assist in finding the best topology for an FPGA NoC.

View record

On replay and hazards in graphics processing units (2012)

Graphics Processing Units (GPUs) have the potential to execute programs more efficiently, in both time and energy, by devoting more of their hardware to functional units. GPGPU-Sim is modified to simulate an aggressively modern compute accelerator and used to investigate the impact of hazards in massively multithreaded accelerator architectures such as those represented by modern GPUs employing a single-instruction, multiple-thread (SIMT) execution model. Hazards are events during a program's execution that limit performance. We explore design tradeoffs in hazard handling. We find that in common architectures, hazards that stall the pipeline prevent other, unrelated threads from making forward progress. This thesis explores an alternative organization, called replay, in which instructions that cannot proceed through the memory stage are squashed and later replayed so that instructions from other threads can make forward progress. This replay architecture can behave pathologically in some cases, replaying excessively without making forward progress. A method that predicts when hazards may occur, and thus can reduce the number of unnecessary replays and improve performance, is proposed and evaluated. This method is found to improve performance by up to 13.3% over the stalling baseline and by up to 3.3% over replay.
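
The replay organization can be sketched as follows (an illustrative host-side simplification, not the simulated microarchitecture): an instruction that hits a structural hazard at the memory stage is squashed into a replay queue instead of stalling the pipeline for every other warp.

    // Illustrative sketch: squash-and-replay at the memory stage.
    #include <queue>

    struct Inst { int warp_id; int pc; };

    struct MemStage {
        std::queue<Inst> replay_queue;

        // Returns true if the instruction completed, false if it was replayed.
        bool issue(const Inst &inst, bool mshr_available) {
            if (!mshr_available) {          // structural hazard
                replay_queue.push(inst);    // squash and retry later;
                return false;               // other warps keep issuing meanwhile
            }
            /* perform the cache/memory access here */
            return true;
        }
    };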

View record

Improving the performance of post-silicon trace generation (2011)

As microprocessor designs become more complex, the task of finding errors in the design becomes more difficult. Most design errors are caught before the chip is fabricated; however, some may make it into the fabricated design. The process of determining what is wrong when the fabricated chip of a new design behaves incorrectly is called post-silicon debug (also known as silicon validation). One of the challenges of post-silicon debug is the lack of observability into the state of a fabricated chip. BackSpace is a proposal for tackling the observability challenge. It does this by generating a post-silicon trace using a combination of on-chip monitoring hardware and off-chip formal analysis. To add one state to the trace, BackSpace generates a set of possible predecessor states by analysing the design; these candidate states are then tested one at a time. The testing is performed by loading a candidate state into a circuit that compares it with the current state of the chip, and running the chip. If the state is reached during execution, it is added to the trace. The process of testing states one at a time is time consuming. This thesis shows that correlation information characterizing the application running on the chip can reduce the number of runs of the chip by up to 51%. New post-silicon trace generation algorithms, BackSpace-2bp and BackSpace-3bp, are also proposed that do not need to generate the set of possible predecessor states. This leads to a speedup relative to practical implementations of BackSpace and also reduces the hardware overhead needed to generate post-silicon traces by 94.4% relative to the state-of-the-art implementation of BackSpace. For evaluation, BackSpace and the new algorithms were implemented on a superscalar out-of-order processor using the Alpha instruction set. Non-determinism was introduced in the form of variable memory latency. The effects of non-determinism on the processor and the trace generation algorithms are also evaluated.
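
The candidate-testing loop can be sketched as follows (an illustrative simplification with hypothetical types and callbacks, not the BackSpace implementation): order candidate predecessor states by a correlation-derived likelihood and re-run the chip until one matches, which is where the reduction in chip runs comes from.

    // Illustrative sketch: test candidate predecessor states, likeliest first.
    #include <vector>
    #include <algorithm>
    #include <functional>

    struct State { int id; double likelihood; };   // abstracted latch values

    // `try_on_chip` stands in for programming the on-chip comparator with a
    // candidate state and re-running the workload; it returns true on a match.
    bool extend_trace(std::vector<State> candidates,
                      const std::function<bool(const State &)> &try_on_chip,
                      State &out) {
        // Testing likelier candidates first is what cuts the number of runs.
        std::sort(candidates.begin(), candidates.end(),
                  [](const State &a, const State &b) {
                      return a.likelihood > b.likelihood;
                  });
        for (const State &c : candidates)
            if (try_on_chip(c)) { out = c; return true; }
        return false;
    }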

View record

Towards synchronization in SIMT architectures (2011)

No abstract available.

GPU Compute Memory Systems (2010)

No abstract available.

 
 
