Tor Aamodt

Professor


Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Accelerating deep learning with lossy compression (2022)

Parallel hardware accelerators, for example Graphics Processing Units (GPUs), have limited on-chip memory capacity and off-chip bandwidth. Neural network workloads can generally be separated into the preparation (training) and usage (inference) of the network. The training of deep neural networks is especially hampered by memory limitations, as increasing model size (and thus memory) is a common technique to improve accuracy. Previous approaches have examined lossy compression and offloading to reduce activation memory consumption, the primary memory consumer for most models. However, prior works apply these techniques without exploiting the spatial properties of many neural networks. Furthermore, there is no known relationship between the trained accuracy of a network and the compression rate, leading to expensive searches for compression rates. In this dissertation, we begin by examining lossy spatial compression using JPEG for ACTivations (JPEG-ACT). JPEG-ACT uses custom-tuned error sensitivities to target machine perception and a hardware accelerator to offload compressed activation values during training. Using the JPEG-ACT accelerator results in an 8.5x memory reduction over uncompressed training and 2.3x performance versus the next-best offload accelerator. Following this, we examine the theoretical relationship between compression errors and accuracy. Prior approaches used expensive tuning to determine the compression/accuracy trade-off after training. In this dissertation, accuracy guarantees can instead be set before training, which lowers runtime and makes explicit how much accuracy is given up for compression. Compression is set using Activation Compression with Guaranteed Convergence (AC-GC) error bounds, with a performance overhead of 0.4% over compressed training. Combining these bounds with various compression methods results in 15.1x compression on average, with theoretical guarantees of convergence. These guarantees give AC-GC better potential to function well on current and yet-to-be-developed models. By reducing activation memory consumption, JPEG-ACT and AC-GC allow faster iteration through possible neural network models, advancing the field to new applications and better models.
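
The core idea described above, keeping only a lossily compressed copy of each activation and decompressing it when the backward pass needs it, can be sketched in a few lines of PyTorch. This is a hypothetical 8-bit quantization example for a single ReLU layer, not the JPEG-based compressor or the AC-GC error bounds themselves; the framework and names are assumptions for illustration.

```python
# Hypothetical sketch: compress the activation saved for the backward pass,
# which is where most training memory goes. Not the JPEG-ACT/AC-GC design.
import torch

class CompressedReLU(torch.autograd.Function):
    """ReLU that stores its saved activation in a lossy 8-bit form."""

    @staticmethod
    def forward(ctx, x):
        y = x.clamp(min=0)
        # Lossy compression of the saved activation: per-tensor 8-bit quantization.
        scale = y.abs().max().item() / 127.0 + 1e-12
        ctx.scale = scale
        ctx.save_for_backward((y / scale).round().to(torch.int8))
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y_q,) = ctx.saved_tensors
        y = y_q.to(grad_out.dtype) * ctx.scale   # decompress on demand
        return grad_out * (y > 0)                # ReLU gradient mask

x = torch.randn(4, 8, requires_grad=True)
CompressedReLU.apply(x).sum().backward()
print(x.grad.shape)
```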


Models and techniques for designing mobile system-on-chip devices (2020)

Mobile SoCs have become ubiquitous computing platforms, and, in recent years, they have become increasingly heterogeneous and complex. A typical SoC today includes CPUs, GPUs, image processors, video encoders/decoders, and AI engines. This dissertation addresses some of the challenges associated with SoCs in three pieces of work. The first piece of work develops a cycle-accurate model, Emerald, which provides a platform for studying system-level SoC interactions while including the impact of graphics. Our cycle-accurate infrastructure builds upon well-established tools, GPGPU-Sim and gem5, with support for graphics and GPGPU workloads, and full system simulation with Android. We present two case studies using Emerald. First, we use Emerald's full-system mode to highlight the importance of system-wide interactions by studying and analyzing memory organization and scheduling in SoCs. Second, we use Emerald's standalone mode to evaluate a dynamic mechanism for balancing the shading work assigned to GPU cores. Our dynamic mechanism speeds up frame rendering by 7.3-19% compared to static load-balancing. The second work highlights the time-variant traffic asymmetry in heterogeneous SoCs. We analyze the impact of this asymmetry on network performance and propose interleaved source injection (ISI), an interconnect topology and associated flow control mechanism to manage time-varying asymmetric network traffic. We evaluate ISI using stochastic traffic patterns and a set of traces that emulate mobile use cases with traffic from various IP blocks. We show that ISI increases saturation throughput by 80-184% for a 12% increase in NoC area. In the last piece of work, we study the compression properties of framebuffer surfaces and highlight the characteristics of surfaces generated by different applications. We use our analysis to propose Dynamic Color Palettes (DCP), a hardware scheme that dynamically constructs color palettes and employs them to efficiently compress framebuffer surfaces. We evaluated DCP against a set of 124 workloads and found that DCP improves compression rates by 91% for UI and 20% for 2D applications compared to previous proposals. We also propose a hybrid scheme (HDCP) that combines DCP with a generic compression scheme. HDCP outperforms previous proposals by 161%, 124% and 83% for UI, 2D, and 3D applications, respectively.
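
As a rough illustration of the palette idea behind DCP, the sketch below builds a small colour palette from the most frequent pixel values in a framebuffer tile and stores palette hits as short indices, falling back to raw pixels otherwise. The tile contents, palette size, and bit widths are assumptions for illustration, not the hardware scheme evaluated above.

```python
# Hypothetical sketch of palette-based framebuffer compression in the spirit of DCP
# (not the hardware scheme itself): pixels whose colour appears in a small,
# frequency-based palette are stored as short indices; other pixels stay raw.
from collections import Counter

def compress_tile(pixels, palette_size=16):
    """pixels: list of 32-bit RGBA values for one framebuffer tile."""
    palette = [c for c, _ in Counter(pixels).most_common(palette_size)]
    index_of = {c: i for i, c in enumerate(palette)}
    encoded = [(True, index_of[p]) if p in index_of else (False, p) for p in pixels]
    # 4-bit index per palette hit, 32 bits per miss, plus the palette itself.
    bits = sum(4 if hit else 32 for hit, _ in encoded) + 32 * len(palette)
    return palette, encoded, bits / (32 * len(pixels))  # compressed/original ratio

palette, encoded, ratio = compress_tile([0xFF000000] * 60 + [0xFFFFFFFF] * 4)
print(f"compressed to {ratio:.2%} of original size")
```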


Software-hardware co-design for energy efficient datacenter computing (2019)

Datacenters have become commonplace computing environments used to offload applications from distributed local machines to centralized environments. Datacenters offer increased performance and efficiency, reliability and security guarantees, and reduced costs relative to independently operating the computing equipment. The growing trend over the last decade towards server-side (cloud) computing in the datacenter has resulted in increasingly higher demands for performance and efficiency. Graphics processing units (GPUs) are massively parallel, highly efficient accelerators, which can provide significant improvements to applications with ample parallelism and structured behavior. While server-based applications contain varying degrees of parallelism and are economically appealing for GPU acceleration, they often do not adhere to the specific properties expected of an application to obtain the benefits offered by the GPU. This dissertation explores the potential for using GPUs as energy-efficient accelerators for traditional server-based applications in the datacenter through a software-hardware co-design. It first evaluates a popular key-value store server application, Memcached, demonstrating that the GPU can outperform the CPU by 7.5x for the core Memcached processing. However, the core processing of a networking application is only part of the end-to-end computation required at the server. This dissertation then proposes a GPU-accelerated software networking framework, GNoM, which offloads all of the network and application processing to the GPU. GNoM facilitates the design of MemcachedGPU, an end-to-end Memcached implementation on contemporary Ethernet and GPU hardware. MemcachedGPU achieves 10 Gbit line-rate processing at the smallest request size with 95th-percentile latencies under 1.1 milliseconds and efficiencies under 12 microjoules per request. GNoM highlights limitations in the traditional GPU programming model, which relies on a CPU for managing GPU tasks. Consequently, the CPU may be unnecessarily involved on the critical path, affecting overall performance, efficiency, and the potential for CPU workload consolidation. To address these limitations, this dissertation proposes an event-driven GPU programming model and set of hardware modifications, EDGE, which enables any device in a heterogeneous system to directly manage the execution of pre-registered GPU tasks through interrupts. EDGE employs a fine-grained GPU preemption mechanism that reuses existing GPU compute resources to begin processing interrupts in under 50 GPU cycles.


Architectural support for inter-thread synchronization in SIMT architectures (2018)

Single-Instruction Multiple-Threads (SIMT) architectures have seen widespread interest in accelerating data parallel applications. In the SIMT model, small groups of scalar threads operate in lockstep. Within each group, current SIMT implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These thread scheduling constraints may cause a deadlock-free program on a multiple-instruction multiple-data architecture to deadlock on a SIMT machine. Further, fine-grained synchronization is often implemented using busy-wait loops. However, busy-wait synchronization incurs significant overheads and existing CPU solutions do not readily translate to SIMT architectures. In this thesis, we tackle these challenges. First, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. The static detection has a false detection rate of 4%-5%. The automated transformation has an average performance overhead of 8.2%-10.9% compared to manual transformation. We also propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG. Our hardware approach performs on par with the compiler transformation but avoids key limitations in the compiler-only solution. We show that this hardware can be further extended to support concurrent multi-path execution to improve the performance of divergent applications. Finally, we propose a hardware warp scheduling policy informed by a novel hardware mechanism for accurately detecting busy-wait synchronization on GPUs. When employed, it deprioritizes spinning warps, achieving a speedup of 42.7% over Greedy Then Oldest scheduling.
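
To make the last point concrete, here is a deliberately simplified software model of detecting busy-wait loops and deprioritizing spinning warps in the scheduler; the detection heuristic (the same load at the same PC returning the same value) and all names are illustrative assumptions, not the hardware mechanism proposed in the thesis.

```python
# Hypothetical, simplified model of spin detection plus spin-aware scheduling.
class SpinAwareScheduler:
    def __init__(self):
        self.last_load = {}          # warp -> (pc, addr, value) of its last load
        self.spinning = set()        # warps currently flagged as busy-waiting

    def observe_load(self, warp, pc, addr, value):
        # A warp re-executing the same load and seeing the same value is likely
        # spinning on a flag/lock it cannot acquire yet.
        if self.last_load.get(warp) == (pc, addr, value):
            self.spinning.add(warp)
        else:
            self.spinning.discard(warp)
        self.last_load[warp] = (pc, addr, value)

    def pick(self, ready_warps):
        # Prefer warps doing useful work; fall back to spinning warps (oldest first).
        useful = [w for w in ready_warps if w not in self.spinning]
        return min(useful) if useful else min(ready_warps)

sched = SpinAwareScheduler()
sched.observe_load(warp=2, pc=0x40, addr=0x1000, value=0)   # first look at the lock
sched.observe_load(warp=2, pc=0x40, addr=0x1000, value=0)   # same load, same value -> spin
print(sched.pick(ready_warps=[2, 3]))                       # 3: non-spinning warp preferred
```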


GPU Computing Architecture for Irregular Parallelism (2015)

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the underutilization of SIMD hardware in GPUs due to the application's irregular control flow behavior, known as branch divergence. Another major development effort is to expose the available parallelism in the application as thousands of concurrent threads without introducing data races or deadlocks. GPU software developers may need to spend significant effort verifying the data synchronization mechanisms used in their applications. Despite various research studies indicating the potential benefits, the risks involved may discourage software developers from employing GPUs for this class of applications. This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence. Second, Kilo TM is a cost-effective, energy-efficient solution for supporting transactional memory (TM) on GPUs. With TM, programmers can use transactions instead of fine-grained locks to create deadlock-free, maintainable, yet aggressively parallelized code. In our evaluations, Kilo TM achieves 192X speedup over coarse-grained locking and captures 66% of the performance of fine-grained locking with 34% energy overhead.
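
For readers unfamiliar with transactional memory, the following toy model shows a commit-time, value-based validation style of TM in plain Python: reads and writes are buffered per transaction, and a transaction only commits if the values it read are unchanged. It is an illustration of the programming model the abstract refers to, not the Kilo TM hardware.

```python
# Hypothetical software model of lazy, value-based transactional memory.
memory = {"count": 0}

class Transaction:
    def __init__(self):
        self.read_set, self.write_set = {}, {}

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self.read_set.setdefault(addr, memory[addr])
        return self.read_set[addr]

    def write(self, addr, value):
        self.write_set[addr] = value      # buffered until commit (lazy versioning)

    def commit(self):
        # Value-based validation: abort if any location we read has since changed.
        if any(memory[a] != v for a, v in self.read_set.items()):
            return False
        memory.update(self.write_set)
        return True

def atomic(update):
    while True:                            # retry loop replaces fine-grained locks
        tx = Transaction()
        update(tx)
        if tx.commit():
            return

atomic(lambda tx: tx.write("count", tx.read("count") + 1))
print(memory["count"])  # 1
```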


Locality and Scheduling in the Massively Multithreaded Era (2015)

Massively parallel processing devices, like Graphics Processing Units (GPUs), have the ability to accelerate highly parallel workloads in an energy-efficient manner. However, executing irregular or less tuned workloads poses performance and energy-efficiency challenges on contemporary GPUs. These inefficiencies come from two primary sources: ineffective management of locality and decreased functional unit utilization. To decrease these effects, GPU programmers are encouraged to restructure their code to fit the underlying hardware architecture, which affects the portability of their code and complicates the GPU programming process. This dissertation proposes three novel GPU microarchitecture enhancements for mitigating both the locality and utilization problems on an important class of irregular GPU applications. The first mechanism, Cache-Conscious Warp Scheduling (CCWS), is an adaptive hardware mechanism that makes use of a novel locality detector to capture memory reference locality that is lost by other schedulers due to excessive contention for cache capacity. On cache-sensitive, irregular GPU workloads, CCWS provides a 63% speedup over previous scheduling techniques. This dissertation uses CCWS to demonstrate that improvements to the hardware thread scheduling policy in massively multithreaded systems offer a promising new design space to explore in locality management. The second mechanism, Divergence-Aware Warp Scheduling (DAWS), introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture locality in loops. We demonstrate that the predictive, pre-emptive nature of DAWS can provide an additional 26% performance improvement over CCWS. This dissertation also demonstrates that DAWS can effectively shift the burden of locality management from software to hardware by increasing the performance of simpler and more portable code on the GPU. Finally, this dissertation details a Variable Warp-Size Architecture (VWS) which improves the performance of irregular applications by 35%. VWS improves irregular code by using a smaller warp size while maintaining the performance and energy-efficiency of regular code by ganging the execution of these smaller warps together in the warp scheduler.
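
A much-simplified software model of cache-conscious warp throttling in the spirit of CCWS is sketched below: when warps are observed re-requesting lines they recently evicted (lost intra-warp locality), the scheduler shrinks the set of warps allowed to issue loads. The scoring and back-off policy here are arbitrary illustrative choices, not the published mechanism.

```python
# Hypothetical, much-simplified illustration of cache-conscious warp throttling.
class CacheConsciousScheduler:
    def __init__(self, warps, max_active):
        self.warps = list(warps)
        self.max_active = max_active
        self.lost_locality = {w: 0 for w in self.warps}   # per-warp lost-locality score

    def report_victim_hit(self, warp):
        # The locality detector saw this warp re-request a line it recently evicted.
        self.lost_locality[warp] += 1

    def decay(self):
        for w in self.warps:
            self.lost_locality[w] = max(0, self.lost_locality[w] - 1)

    def schedulable(self):
        # The more locality is being lost overall, the fewer warps may issue loads;
        # warps losing the most locality keep priority so their working sets fit.
        pressure = sum(self.lost_locality.values())
        active = max(1, self.max_active - pressure)
        ranked = sorted(self.warps, key=lambda w: -self.lost_locality[w])
        return ranked[:active]

sched = CacheConsciousScheduler(warps=range(8), max_active=8)
for _ in range(3):
    sched.report_victim_hit(warp=2)
print(sched.schedulable())   # warp 2 kept; the active set shrinks under pressure
```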


Designing network-on-chips for throughput accelerators (2014)

Physical limits of power usage for integrated circuits have steered the microprocessor industry towards parallel architectures in the past decade. Modern Graphics Processing Units (GPUs) are a form of parallel processor that harness chip area more effectively than traditional single-threaded architectures by favouring application throughput over latency. Modern GPUs can be used as throughput accelerators: accelerating massively parallel non-graphics applications. As the number of compute cores in throughput accelerators increases, so does the importance of efficient memory subsystem design. In this dissertation, we present system-level microarchitectural analysis and optimizations with an emphasis on the memory subsystem of throughput accelerators that employ Bulk-Synchronous-Parallel programming models such as CUDA and OpenCL. We model the whole throughput accelerator as a closed-loop system in order to capture the effects of complex interactions of microarchitectural components: we simulate components such as compute cores, on-chip network and memory controllers with cycle-level accuracy. For this purpose, we developed the first version of the GPGPU-Sim simulator capable of running unmodified applications by emulating NVIDIA's virtual instruction set. We use this simulator to model and analyze several applications and explore various microarchitectural tradeoffs for throughput accelerators to better suit these applications. Based on our observations, we identify the Network-on-Chip (NoC) component of the memory subsystem as our main optimization target and set out to design throughput-effective NoCs for future throughput accelerators. We provide a new framework for NoC researchers to ensure that optimizations are "throughput effective", namely, that parallel application-level performance improves per unit chip area. We then use this framework to guide the development of several optimizations. Accelerator workloads demand high off-chip memory bandwidth, resulting in a many-to-few-to-many traffic pattern. Leveraging this observation, we reduce NoC area by proposing a checkerboard NoC which utilizes routers with limited connectivity. Additionally, we improve performance by increasing the terminal bandwidth of memory controller nodes to better handle frequent read-reply traffic. Furthermore, we propose a double checkerboard inverted NoC organization which maintains the benefits of these optimizations while having a simpler routing mechanism and smaller area, and results in a 24.3% improvement in average application throughput per unit area.


Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Accelerating network function virtualization (2021)

Network function virtualization (NFV) [50] is increasingly used to implement network operations traditionally implemented in customized ASICs. NFV employs commodity, general-purpose computer hardware located in a datacenter. General-purpose computing hardware has performance limitations that restrict the scope of NFV. An emerging solution is to leverage programmable accelerators such as GPUs and FPGAs for NFV. However, traditional computer architecture research on NFV applications is challenging due to the lack of NFV benchmark suites. This dissertation presents a set of candidates for an NFV benchmark suite, based on an analysis of the most common NFV applications. We then study NFV acceleration on GPUs. We identify overheads that are especially pronounced in Service Function Chains (SFC), which are common in NFV. This dissertation proposes GPUChain, a mechanism to enhance SFC acceleration on GPUs by moving the chaining capability onto the GPU. GPUChain avoids the significant latency overheads incurred by a CPU-centric SFC execution model. GPUChain achieves an average latency reduction of 44% and improves throughput by 168% versus a GPU supporting pre-registered kernels that are triggered to launch by an external device [110], while incurring only 0.007% area overhead.


Hybrid rendering: in pursuit of real-time raytracing (2020)

Hybrid rendering combines ray-tracing and rasterization graphics techniques to generate visually accurate, photorealistic computer-generated images at a tight real-time frame rate. This thesis presents contributions to the field of hybrid rendering by introducing an open-source ray-traced ambient occlusion workload, by quantifying the performance trade-offs of this workload, and by evaluating one promising hardware technique to accelerate real-time ray tracing. In this thesis, we introduce the first open-source implementation of a viewpoint-independent ray-traced ambient occlusion GPU compute benchmark. We study this workload using a cycle-level GPU simulator and present the trade-offs between performance and quality. We show that some ray-traced ambient occlusion is possible on a real-time frame budget but that the full-quality effect is still too computationally expensive for today's GPU architectures. This thesis also provides a limit study of a new promising technique, Hash-Based Ray Path Prediction (HRPP), which exploits the similarity between rays to predict leaf nodes and thereby avoid redundant acceleration structure traversals. Our data shows that acceleration structure traversal consumes a significant proportion of the ray-tracing rendering time regardless of the platform or the target image quality. Our study quantifies unused ray locality and evaluates the theoretical potential for improved ray traversal performance for both coherent and seemingly incoherent rays. We show that HRPP can skip, on average, 40% of all hit-all traversal computations. We further show that the significant memory overhead, on the order of megabytes, inhibits this technique from being feasible for current architectures.
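
The prediction idea behind HRPP can be sketched as a hash table keyed on a quantized ray origin and direction that returns previously reached leaf nodes for similar rays to try first. The quantization granularity, table policy, and names below are assumptions for illustration, not the design studied in the thesis.

```python
# Hypothetical sketch of hash-based ray path prediction: similar rays hash to a
# table of previously reached leaf nodes, giving later rays traversal shortcuts.
def ray_hash(origin, direction, cell=1.0, angle_bins=16):
    q_o = tuple(int(c // cell) for c in origin)                        # quantized origin
    q_d = tuple(int((d + 1.0) / 2.0 * angle_bins) for d in direction)  # quantized direction
    return hash((q_o, q_d))

class RayPathPredictor:
    def __init__(self):
        self.table = {}                       # hash -> set of predicted leaf ids

    def predict(self, origin, direction):
        return self.table.get(ray_hash(origin, direction), set())

    def record(self, origin, direction, leaf_id):
        self.table.setdefault(ray_hash(origin, direction), set()).add(leaf_id)

predictor = RayPathPredictor()
predictor.record((0.1, 0.2, 0.3), (0.0, 0.0, 1.0), leaf_id=42)
# A nearby, similarly oriented ray gets leaf 42 as a traversal shortcut candidate.
print(predictor.predict((0.15, 0.25, 0.35), (0.0, 0.0, 1.0)))   # {42}
```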


Resprop: reused sparsified backpropagation (2020)

The success of Convolutional Neural Networks (CNNs) in various applications is accompanied by a significant increase in computation and training time. In this work, we focus on accelerating training by observing that about 90% of gradients are reusable during training. Leveraging this observation, we propose a new algorithm, Reuse-Sparse-Backprop (ReSprop), as a method to sparsify gradient vectors during CNN training. ReSprop maintains state-of-the-art accuracy on the CIFAR-10, CIFAR-100, and ImageNet datasets with less than 1.1% accuracy loss while enabling a reduction in back-propagation computations by a factor of 10x, resulting in a 2.7x overall speedup in training. As the computation reduction introduced by ReSprop is accomplished by introducing fine-grained sparsity that reduces computation efficiency on GPUs, we introduce a generic sparse convolution neural network accelerator (GSCN), which is designed to accelerate sparse back-propagation convolutions. When combined with ReSprop, GSCN achieves 8.0x and 7.2x speedup in the backward pass on ResNet34 and VGG16 versus a GTX 1080 Ti GPU.
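
As background for why fine-grained gradient sparsity reduces backward-pass work, the sketch below zeroes all but the largest-magnitude 10% of an output-gradient tensor before it would feed the back-propagation convolutions. It is a generic PyTorch illustration, not ReSprop's specific reuse rule or the GSCN accelerator.

```python
# Generic gradient-sparsification sketch (illustrative only, not ReSprop itself).
import torch

def sparsify(grad, keep_ratio=0.1):
    # Keep only the top keep_ratio fraction of entries by magnitude.
    k = max(1, int(keep_ratio * grad.numel()))
    threshold = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
    return grad * (grad.abs() >= threshold)

g = torch.randn(2, 16, 8, 8)           # an output gradient of a conv layer
g_sparse = sparsify(g, keep_ratio=0.1)
print((g_sparse != 0).float().mean())  # ~0.10: roughly 90% of entries removed
```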


Sparse weight activation training (2020)

Neural network training is computationally and memory intensive. Sparse training can reduce the burden, but it can affect network convergence. In this work, we propose a novel CNN training algorithm, Sparse Weight Activation Training (SWAT). SWAT is: (1) more computation- and memory-efficient than conventional training, (2) able to learn a sparse network topology directly, and (3) adaptable to learn a structured or unstructured sparse topology. SWAT is developed based on insights derived from an empirical sensitivity analysis of network training on six different network architectures and three different datasets. Empirically, we find network convergence is robust to the elimination of small-magnitude weights during the forward pass and small-magnitude weights and activations during the backward pass. SWAT obtains efficiency by constraining the forward and backward pass during training. SWAT dynamically searches for a sparse topology. The dynamic search of the weights allows SWAT to train a wide variety of architectures such as ResNet, VGG, DenseNet and WideResNet up to 90% sparsity. SWAT demonstrates similar or better performance on the CIFAR-10, CIFAR-100, and ImageNet datasets compared to other pruning and sparse learning algorithms. Moreover, SWAT reduces total computations during training by 50% to 90% and reduces the memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights.
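
The masking SWAT describes can be illustrated with a small PyTorch autograd function that drops small-magnitude weights in the forward pass and small-magnitude weights and activations in the tensors saved for the backward pass; the linear layer, thresholds, and names are illustrative assumptions rather than the published implementation.

```python
# Hypothetical, minimal sketch of forward/backward magnitude masking a la SWAT.
import torch

def topk_mask(t, keep_ratio):
    k = max(1, int(keep_ratio * t.numel()))
    thresh = t.abs().flatten().kthvalue(t.numel() - k + 1).values
    return (t.abs() >= thresh).to(t.dtype)

class SWATLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w, keep):
        w_sparse = w * topk_mask(w, keep)   # small weights dropped (forward & backward)
        x_sparse = x * topk_mask(x, keep)   # small activations dropped (backward only)
        ctx.save_for_backward(x_sparse, w_sparse)
        return x @ w_sparse.t()

    @staticmethod
    def backward(ctx, grad_out):
        x_sparse, w_sparse = ctx.saved_tensors
        return grad_out @ w_sparse, grad_out.t() @ x_sparse, None

x = torch.randn(32, 64, requires_grad=True)
w = torch.randn(16, 64, requires_grad=True)
SWATLinear.apply(x, w, 0.2).sum().backward()
print(x.grad.shape, w.grad.shape)   # torch.Size([32, 64]) torch.Size([16, 64])
```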


Faster convolutional neural network training via approximate memoization (2018)

Deep Convolutional Neural Networks have found wide application, but their training time can be significant. We find that between successive epochs during training, many neurons compute nearly the same output when presented with the same input. This presents an opportunity to skip computation in the forward pass on the later epoch via memoization. This dissertation explores the potential of such an approach by investigating the correlation of neuron activations between training epochs. We develop an implementation of activation memoization, STAN, that takes into account the lockstep behavior of threads executing together in single-instruction, multiple-thread Graphics Processing Units (GPUs). Finally, we discuss the trade-off between speedup and accuracy by showing that STAN achieves a 1.3-4X convolution speedup over our baseline GPU implementation at the expense of 2.7-7% loss in accuracy. When STAN is applied to Hyperband, which can be robust to drops in accuracy, an overall training speedup of 7%-16% can be achieved with a minimal change in test error (-0.2 to +0.2).
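
The memoization idea can be illustrated with a toy cache that reuses last epoch's layer output when the layer's input for the same training sample is nearly unchanged. The tolerance, cache key, and stand-in layer function below are placeholders, not the GPU-aware implementation developed in the thesis.

```python
# Hypothetical illustration of approximate memoization across training epochs.
import numpy as np

class MemoizedLayer:
    def __init__(self, fn, tol=1e-2):
        self.fn, self.tol = fn, tol
        self.cache = {}              # sample_id -> (input, output) from last epoch
        self.hits = 0

    def __call__(self, sample_id, x):
        if sample_id in self.cache:
            prev_x, prev_y = self.cache[sample_id]
            if np.max(np.abs(x - prev_x)) < self.tol:
                self.hits += 1       # skip the computation in the forward pass
                return prev_y
        y = self.fn(x)
        self.cache[sample_id] = (x, y)
        return y

layer = MemoizedLayer(fn=lambda x: np.maximum(x, 0))   # stand-in for a conv layer
x = np.random.randn(8)
layer(sample_id=0, x=x)             # epoch 1: computed and cached
layer(sample_id=0, x=x + 1e-3)      # epoch 2: close enough -> reused
print(layer.hits)                   # 1
```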


A mix-grained architecture for improving HLS-generated controllers on FPGAs (2017)

With the recent slowdowns in traditional technology scaling, hardware accelerators, such as Field Programmable Gate Arrays (FPGAs), offer the potential for improved performance and energy efficiency compared to general-purpose processing systems. While FPGAs were traditionally used for applications such as signal processing, they have recently gained popularity in new, larger-scale domains, such as cloud computing. However, despite their performance and power efficiency, programming FPGAs remains a hard task due to the difficulties involved with the low-level design flow for FPGAs. High-Level Synthesis (HLS) tools aim to assist with this time-consuming task by supporting higher-level programming models, which significantly increases design productivity. This also makes the use of FPGAs for large-scale design development for evolving applications more feasible. In this thesis we explore the potential of modifying the current FPGA architecture to better support the designs generated by HLS tools. We propose a specialized mix-grained architecture for Finite State Machine (FSM) implementation that can be integrated into existing FPGA architectures. The proposed mix-grained architecture exploits the characteristics of the controller units generated by HLS tools to reduce the control-path area of the design. We show that our proposed architecture reduces the area of the next-state calculation in FSMs by more than 3X without impacting performance, and often reduces the critical path delay of the next-state calculation.


Accelerating Web Search using GPUs (2015)

The amount of content on the Internet is growing rapidly, as is the number of online Internet users. As a consequence, web search engines need to continually increase their computing capabilities and data while maintaining low search latency and avoiding a significant rise in the cost per query. To serve this larger number of online users, web search engines utilize a large distributed system in their data centers. They partition their data across hundreds of thousands of independent commodity servers called Index Serving Nodes (ISNs). These ISNs work together to serve search queries as a single coherent system in a distributed manner. The choice of a high number of commodity servers vs. a smaller number of supercomputers is due to the need for scalability, high availability/reliability, performance, and cost efficiency. To serve larger data, web search engines can be scaled either vertically or horizontally [michael2007scale]. Vertical scaling enables ranking more documents per query within a single node by employing machines with higher single-thread and throughput performance with bigger and faster memory. Horizontal scaling supports a larger index by adding more index serving nodes at the cost of increased synchronization, aggregation overhead, and power consumption. This thesis evaluates the potential for achieving better vertical scaling by using Graphics Processing Units (GPUs) to reduce the document-ranking latency per query at a reasonable increase in initial cost. It achieves this by exploiting the parallelism in ranking the numerous potential documents that match a query and offloading this work to the GPU. We evaluate this approach using hundreds of rankers from a commercial web search engine on real production data. Our results show an 8.8x harmonic mean reduction in latency and a 2x improvement in power efficiency when ranking 10,000 web documents per query for a variety of rankers using C++ AMP and a commodity GPU.


Inter-core locality aware memory access scheduling (2015)

Graphics Processing Units (GPUs) run thousands of parallel threads and achieve high Memory Level Parallelism (MLP). To support high MLP, a structure called a Miss-Status Holding Register (MSHR) handles multiple in-flight miss requests. When multiple cores send requests to the same cache line, the requests are merged into one last-level cache MSHR entry and only one memory request is sent to the Dynamic Random-Access Memory (DRAM). We call this inter-core locality. The main reason for inter-core locality is that multiple cores access shared read-only data within the same cache line. By prioritizing memory requests that have high inter-core locality, more threads resume execution. Many memory access scheduling policies have been proposed for general-purpose multi-core processors and GPUs. However, some of these policies do not consider the characteristics of GPUs and others do not utilize inter-core locality information. In this thesis, we analyze the reasons that inter-core locality exists and show that requests with more inter-core locality have a higher impact on performance. To exploit inter-core locality, we enable the GPU DRAM controller to be aware of inter-core locality by using Level 2 (L2) cache MSHR information. We propose a memory scheduling policy to coordinate the last-level cache MSHR and the DRAM controller. 1) We introduce a structure to enable the DRAM controller to be aware of L2 cache MSHR information. 2) We propose a memory scheduling policy that uses L2 cache MSHR information. 3) To prevent starvation, we introduce age information into the scheduling policy. Our evaluation shows a 28% memory request latency reduction and an 11% performance improvement on average for high inter-core-locality benchmarks.
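
A simplified model of the proposed policy is sketched below: the DRAM scheduler sees how many cores are merged into each L2 MSHR entry and serves high-merge requests first, with an age cap to avoid starvation. The field names and the age threshold are assumptions for illustration.

```python
# Hypothetical model of inter-core-locality-aware DRAM scheduling with an age cap.
import heapq

AGE_LIMIT = 64   # requests older than this are served regardless of merge count

class LocalityAwareDram:
    def __init__(self):
        self.queue = []   # (priority key, insertion order, request)
        self.order = 0

    def enqueue(self, request, merge_count, age):
        # Older-than-limit requests jump ahead; otherwise higher merge count wins.
        key = (0, 0) if age > AGE_LIMIT else (1, -merge_count)
        heapq.heappush(self.queue, (key, self.order, request))
        self.order += 1

    def issue(self):
        return heapq.heappop(self.queue)[2]

dram = LocalityAwareDram()
dram.enqueue("lineA (1 core waiting)", merge_count=1, age=3)
dram.enqueue("lineB (6 cores waiting)", merge_count=6, age=3)
print(dram.issue())   # lineB first: unblocking it resumes more cores
```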


Deterministic execution on GPU architectures (2013)

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. Here we propose a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this thesis we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.


Improving GPU programming models through hardware cache coherence (2013)

Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over contemporary chip multiprocessors (CMPs) on massively parallel programs. The lack of well-defined GPU memory models, however, prevents support of high-level languages like C++ and Java, and negatively impacts their programmability. This thesis proposes to improve GPU programmability by adding support for a well-defined memory consistency model through hardware cache coherence. We show that GPU coherence introduces a new set of challenges different from that posed by scalable cache coherence for CMPs. First, introducing conventional directory coherence protocols adds unnecessary coherence traffic overhead to existing GPU applications. Second, the massively multithreaded GPU architecture presents significant storage overheads for buffering thousands of in-flight coherence requests. Third, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC), explored the use of time-based approaches in CMP coherence protocols. This thesis describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present two implementations of TC, called TC-Strong and TC-Weak. TC-Strong implements an optimized version of LCC, while TC-Weak uses a novel timestamp based memory fence mechanism to implement Release Consistency on GPUs. TC-Weak eliminates TC-Strong’s trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications requiring coherence by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also show that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.
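
A toy software model can convey the time-based idea behind TC: each read is granted a lease against a globally synchronized counter and L1 copies self-invalidate when the lease expires, so the L2 never sends invalidations. The lease length, the API, and modelling a stalled write by advancing the counter are illustrative simplifications, not the TC-Strong/TC-Weak hardware.

```python
# Hypothetical sketch of time-based (lease) coherence in the spirit of TC.
class TimeCoherentL2:
    def __init__(self, lease_len=100):
        self.now = 0
        self.lease_len = lease_len
        self.data = {}
        self.lease = {}                  # line -> time until which L1 copies may exist

    def read(self, line):
        # Grant the reader a copy that self-invalidates at `expires`; track the max lease.
        expires = self.now + self.lease_len
        self.lease[line] = max(self.lease.get(line, 0), expires)
        return self.data.get(line, 0), expires

    def write(self, line, value):
        # TC-Strong style: the write waits until all outstanding L1 copies have expired.
        # (TC-Weak instead lets the write proceed and makes a later memory fence wait.)
        self.now = max(self.now, self.lease.get(line, 0))
        self.data[line] = value
        return self.now                  # time at which the write is globally visible

l2 = TimeCoherentL2()
value, expires = l2.read("X")            # some core caches line X until `expires`
done_at = l2.write("X", 7)               # write becomes visible only past that lease
print(expires, done_at)                  # done_at >= expires
```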


Optimizing network-on-chips for FPGAs (2013)

As larger System-on-Chip (SoC) designs are attempted on Field Programmable Gate Arrays (FPGAs), the need for a low-cost and high-performance Network-on-Chip (NoC) grows. Virtual Channel (VC) routers provide desirable traits for an NoC, such as higher throughput and deadlock prevention, but at significant resource cost when implemented on an FPGA. This thesis presents an FPGA-specific optimization to reduce resource utilization. We propose sharing Block RAMs between multiple router ports to store the VC buffers, which otherwise consume a large amount of logic resources, and present the Block RAM Split (BRS) router architecture that implements the proposed optimization. We evaluate the performance of the modifications using synthetic traffic patterns on mesh and torus networks and synthesize the NoCs to determine overall resource usage and maximum clock frequency. We find that the additional logic to support sharing Block RAMs has little impact on Adaptive Logic Module (ALM) usage in designs that currently use Block RAMs, while at the same time decreasing Block RAM usage by as much as 40%. In comparison to CONNECT, a router design that does not use Block RAMs, a 71% reduction in ALM usage is shown to be possible. This resource reduction comes at the cost of a 15% reduction in the saturation throughput for uniform random traffic and a 50% decrease in the worst-case neighbour traffic pattern on a mesh network. The throughput penalty from the neighbour traffic pattern can be reduced to 3% if a torus network is used. In all cases, there is little change in network latency at low load. BRS is capable of running at 161.71 MHz, which is a decrease of only 4% from the base VC router design. Determining the optimum NoC topology is a challenging task. This thesis also proposes initial work towards the creation of an analytical model to assist with finding the best topology to use in an FPGA NoC.


On Replay and Hazards in Graphics Processing Units (2012)

Graphics Processing Units (GPUs) have the potential for more efficient execution of programs, both in time and in energy. They can achieve this by devoting more of their hardware to functional units. GPGPU-Sim is modified to simulate an aggressively modern compute accelerator and used to investigate the impact of hazards in massively multithreaded accelerator architectures such as those represented by modern GPUs employing a single-instruction, multiple-thread (SIMT) execution model. Hazards are events that occur during the execution of a program and limit performance. We explore design trade-offs in hazard handling. We find that in common architectures, hazards that stall the pipeline cause other unrelated threads to be unable to make forward progress. This thesis explores an alternative organization, called replay, in which instructions that cannot proceed through the memory stage are squashed and later replayed so that instructions from other threads can make forward progress. This replay architecture can behave pathologically in some cases by excessively replaying without forward progress. A method which predicts when hazards may occur, and thus can reduce the number of unnecessary replays and improve performance, is proposed and evaluated. This method is found to improve performance by up to 13.3% over the stalling baseline, and by up to 3.3% over replay.
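
The replay organization can be illustrated with a toy model in which an instruction that cannot complete the memory stage is squashed into a replay queue rather than stalling the pipeline, and a trivial predictor holds back instructions that recently replayed. The predictor and names are illustrative assumptions, not the evaluated design.

```python
# Hypothetical toy model of replay instead of stall, plus a naive hazard predictor.
from collections import deque

class ReplayMemoryStage:
    def __init__(self):
        self.replay_queue = deque()
        self.likely_hazard = set()            # trivial predictor: PCs that replayed before

    def issue(self, warp, pc, resources_free):
        if pc in self.likely_hazard and not resources_free:
            return "hold"                     # prediction: would replay anyway, do not issue
        if resources_free:
            self.likely_hazard.discard(pc)
            return "complete"
        self.replay_queue.append((warp, pc))  # squash and retry later; others proceed
        self.likely_hazard.add(pc)
        return "replayed"

stage = ReplayMemoryStage()
print(stage.issue(warp=0, pc=0x80, resources_free=False))   # replayed
print(stage.issue(warp=1, pc=0x90, resources_free=True))    # complete (not blocked by warp 0)
print(stage.issue(warp=0, pc=0x80, resources_free=False))   # hold (predicted to replay)
```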


Improving the performance of post-silicon trace generation (2011)

As microprocessor designs become more complex, the task of finding errors in the design becomes more difficult. Most design errors will be caught before the chip is fabricated; however, some may make it into the fabricated design. The process of determining what is wrong when the fabricated chip of a new design behaves incorrectly is called post-silicon debug (also known as silicon validation). One of the challenges of post-silicon debug is the lack of observability into the state of a fabricated chip. BackSpace is a proposal for tackling the observability challenge. It does this by generating a post-silicon trace using a combination of on-chip monitoring hardware and off-chip formal analysis. To add one state to the trace, BackSpace generates a set of possible predecessor states by analysing the design, which are then tested one at a time. The testing is performed by loading a candidate state into a circuit that compares it with the current state of the chip, and running the chip. If the state is reached during execution, then it is added to the trace. The process of testing states one at a time is time consuming. This thesis shows that correlation information characterizing the application running on the chip can reduce the number of runs of the chip by up to 51%. New post-silicon trace generation algorithms, BackSpace-2bp and BackSpace-3bp, are also proposed that do not need to generate the set of possible predecessor states. This leads to a speedup relative to practical implementations of BackSpace and also reduces the hardware overhead needed to generate post-silicon traces by 94.4% relative to the state-of-the-art implementation of BackSpace. To evaluate them, BackSpace and the new algorithms were implemented on a superscalar out-of-order processor using the Alpha instruction set. Non-determinism was introduced in the form of variable memory latency. The effects of non-determinism on the processor and the trace generation algorithms are also evaluated.


Towards synchronization in SIMT architectures (2011)

An important class of compute accelerators are graphics processing units (GPUs). Popular programming models for non-graphics computation on GPUs, such as CUDA and OpenCL, provide an abstraction of many parallel scalar threads. Contemporary GPU hardware groups 32 to 64 scalar threads as a single warp or wavefront and executes this group of scalar threads in lockstep. The inherent mismatch between the scalar programming model and the vector hardware creates a challenge when developing applications that employ synchronization on the GPU. This challenge arises from the use of a hardware stack to manage control flow divergence among scalar threads. This thesis explains the porting of the Apriori benchmark to a GPU, which led to the research on synchronization in SIMT hardware. It then proposes instruction set and hardware changes that simplify the implementation of mutual exclusion when porting multiple-instruction, multiple-data (MIMD) programs with synchronization to accelerators employing single-instruction, multiple-thread (SIMT) hardware. These instructions, when compared with more complex software-only solutions, achieve similar performance. This thesis also implements and evaluates queue-based mutual exclusion on SIMT hardware.


GPU Compute Memory Systems (2010)

Modern Graphics Processing Units (GPUs) offer orders of magnitude more raw computing power than contemporary CPUs by using many simpler in-order single-instruction, multiple-data (SIMD) cores optimized for multi-thread performance rather than single-thread performance. As such, GPUs operate much closer to the "Memory Wall", thus requiring much more careful memory management. This thesis proposes changes to the memory system of our detailed GPU performance simulator, GPGPU-Sim, to allow proper simulation of general-purpose applications written using NVIDIA's Compute Unified Device Architecture (CUDA) framework. To test these changes, fourteen CUDA applications with varying degrees of memory intensity were collected. With these changes, we show that our simulator predicts performance of commodity GPU hardware with 86% correlation. Furthermore, we show that increasing chip resources to allow more threads to run concurrently does not necessarily increase performance due to increased contention for the shared memory system. Moreover, this thesis proposes a hybrid analytical DRAM performance model that uses memory address traces to predict the efficiency of a DRAM system when using a conventional First-Ready First-Come First-Serve (FR-FCFS) memory scheduling policy. To stress the proposed model, a massively multithreaded architecture based upon contemporary high-end GPUs is simulated to generate the memory address trace needed. The results show that the hybrid analytical model predicts DRAM efficiency to within 11.2% absolute error when arithmetically averaged across a memory-intensive subset of the CUDA applications introduced in the first part of this thesis. Finally, this thesis proposes a complexity-effective solution to memory scheduling that recovers most of the performance loss incurred by a naive in-order First-In First-Out (FIFO) DRAM scheduler compared to an aggressive out-of-order FR-FCFS scheduler. While FR-FCFS scheduling re-orders memory requests to improve row access locality, we instead employ an interconnection network arbitration scheme that preserves the inherently high row access locality of memory request streams from individual "shader cores" and, in doing so, achieve DRAM efficiency and system performance close to that of FR-FCFS with a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, ring, and mesh networks and show that, when coupled with a banked FIFO in-order scheduler, it obtains up to 91.0% of the performance obtainable with an out-of-order memory scheduler with eight-entry DRAM controller queues.
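
For reference, the FR-FCFS policy discussed above can be modelled in a few lines: requests that hit the currently open DRAM row are served before older requests that would require a row switch. This is a pedagogical sketch of the standard policy, not GPGPU-Sim's DRAM timing model.

```python
# Illustrative model of FR-FCFS (First-Ready, First-Come First-Served) for one bank.
class FrFcfsBank:
    def __init__(self):
        self.open_row = None
        self.queue = []                      # requests in arrival (FCFS) order

    def enqueue(self, req_id, row):
        self.queue.append((req_id, row))

    def issue(self):
        # First-Ready: any request to the already-open row goes first...
        for i, (req_id, row) in enumerate(self.queue):
            if row == self.open_row:
                return self.queue.pop(i)
        # ...otherwise fall back to the oldest request and open its row.
        req_id, row = self.queue.pop(0)
        self.open_row = row
        return req_id, row

bank = FrFcfsBank()
bank.enqueue("A", row=1)
bank.enqueue("B", row=2)
print(bank.issue())        # ('A', 1): oldest first, opens row 1
bank.enqueue("C", row=1)
print(bank.issue())        # ('C', 1): row-buffer hit bypasses the older request B
```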
