Tor Aamodt


Relevant Degree Programs


Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - Nov 2020)
Models and techniques for designing mobile system-on-chip devices (2020)

Mobile SoCs have become ubiquitous computing platforms, and, in recent years, they have become increasingly heterogeneous and complex. A typical SoC today includes CPUs, GPUs, image processors, video encoders/decoders, and AI engines. This dissertation addresses some of the challenges associated with SoCs in three pieces of work. The first piece of work develops a cycle-accurate model, Emerald, which provides a platform for studying system-level SoC interactions while including the impact of graphics. Our cycle-accurate infrastructure builds upon well-established tools, GPGPU-Sim and gem5, with support for graphics and GPGPU workloads, and full system simulation with Android. We present two case studies using Emerald. First, we use Emerald's full-system mode to highlight the importance of system-wide interactions by studying and analyzing memory organization and scheduling in SoCs. Second, we use Emerald's standalone mode to evaluate a dynamic mechanism for balancing the shading work assigned to GPU cores. Our dynamic mechanism speeds up frame rendering by 7.3-19% compared to static load-balancing. The second work highlights the time-variant traffic asymmetry in heterogeneous SoCs. We analyze the impact of this asymmetry on network performance and propose interleaved source injection (ISI), an interconnect topology and associated flow control mechanism to manage time-varying asymmetric network traffic. We evaluate ISI using stochastic traffic patterns and a set of traces that emulate mobile use cases with traffic from various IP blocks. We show that ISI increases saturation throughput by 80-184% for 12% increase in NoC area. In the last piece of work, we study the compression properties of framebuffer surfaces and highlight the characteristics of surfaces generated by different applications. We use our analysis to propose Dynamic Color Palettes (DCP), a hardware scheme that dynamically constructs color palettes and employs them to efficiently compress framebuffer surfaces. We evaluated DCP against a set of 124 workloads and found that DCP improves compression rates by 91% for UI and 20% for 2D applications compared to previous proposals. We also propose a hybrid scheme (HDCP) that combines DCP with a generic compression scheme. HDCP outperforms previous proposals by 161%, 124% and 83% for UI, 2D, and 3D applications, respectively.

View record

Software-hardware co-design for energy efficient datacenter computing (2019)

Datacenters have become commonplace computing environments used to offload applications from distributed local machines to centralized environments. Datacenters offer increased performance and efficiency, reliability and security guarantees, and reduced costs relative to independently operating the computing equipment. The growing trend over the last decade towards server-side (cloud) computing in the datacenter has resulted in increasingly higher demands for performance and efficiency. Graphics processing units (GPUs) are massively parallel, highly efficient accelerators, which can provide significant improvements to applications with ample parallelism and structured behavior. While server-based applications contain varying degrees of parallelism and are economically appealing for GPU acceleration, they often do not adhere to the specific properties expected of an application to obtain the benefits offered by the GPU.This dissertation explores the potential for using GPUs as energy-efficient accelerators for traditional server-based applications in the datacenter through a software-hardware co-design. It first evaluates a popular key-value store server application, Memcached, demonstrating that the GPU can outperform the CPU by 7.5x for the core Memcached processing. However, the core processing of a networking application is only part of the end-to-end computation required at the server. This dissertation then proposes a GPU-accelerated software networking framework, GNoM, which offloads all of the network and application processing to the GPU. GNoM facilitates the design of MemcachedGPU, an end-to-end Memcached implementation on contemporary Ethernet and GPU hardware. MemcachedGPU achieves 10 Gbit line-rate processing at the smallest request size with 95-percentile latencies under 1.1 milliseconds and efficiencies under 12 microjoules per request. GNoM highlights limitations in the traditional GPU programming model, which relies on a CPU for managing GPU tasks. Consequently, the CPU may be unnecessarily involved on the critical path, affecting overall performance, efficiency, and the potential for CPU workload consolidation. To address these limitations, this dissertation proposes an event-driven GPU programming model and set of hardware modifications, EDGE, which enables any device in a heterogeneous system to directly manage the execution of pre-registered GPU tasks through interrupts. EDGE employs a fine-grained GPU preemption mechanism that reuses existing GPU compute resources to begin processing interrupts in under 50 GPU cycles.

View record

Architectural support for inter-thread synchronization in SIMT architectures (2018)

Single-Instruction Multiple-Threads (SIMT) architectures have seen widespread interest in accelerating data parallel applications. In the SIMT model, small groups of scalar threads operate in lockstep. Within each group, current SIMT implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These thread scheduling constraints may cause a deadlock-free program on a multiple-instruction multiple-data architecture to deadlock on a SIMT machine. Further, fine-grained synchronization is often implemented using busy-wait loops. However, busy-wait synchronization incurs significant overheads and existing CPU solutions do not readily translate to SIMT architectures. In this thesis, we tackle these challenges. First, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. The static detection has a false detection rate of 4%-5%. The automated transformation has an average performance overhead of 8.2%-10.9% compared to manual transformation. We also propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG. Our hardware approach performs on par with the compiler transformation but avoids key limitations in the compiler only solution. We show that this hardware can be further extended to support concurrent multi-path execution to improve the performance of divergent applications. Finally, We propose a hardware warp scheduling policy informed by a novel hardware mechanism for accurately detecting busy-wait synchronization on GPUs. When employed, it deprioritizes spinning warps achieving a speedup of 42.7% over Greedy Then Oldest scheduling.

View record

GPU Computing Architecture for Irregular Parallelism (2015)

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the underutilization of SIMD hardware in GPUs due to the application’s irregular control flow behavior, known as branch divergence. Another major development effort is to expose the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks. The GPU software developers may need to spend significant effort verifying the data synchronization mechanisms used in their applications. Despite various research studies indicating the potential benefits, the risks involved may discourage software developers from employing GPUs for this class of applications.This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence. Second, Kilo TM is a cost effective, energy efficient solution for supporting transactional memory (TM) on GPUs. With TM, programmers can uses transactions instead of fine-grained locks to create deadlock-free, maintainable, yet aggressively-parallelized code. In our evaluations, Kilo TM achieves 192X speedup over coarse-grained locking and captures 66% of the performance of fine-grained locking with 34% energy overhead.

View record

Locality and Scheduling in the Massively Multithreaded Era (2015)

Massively parallel processing devices, like Graphics Processing Units (GPUs), have the ability to accelerate highly parallel workloads in an energy-efficient manner. However, executing irregular or less tuned workloads poses performance and energy-efficiency challenges on contemporary GPUs. These inefficiencies come from two primary sources: ineffective management of locality and decreased functional unit utilization. To decrease these effects, GPU programmers are encouraged to restructure their code to fit the underlying hardware architecture which affects the portability of their code and complicates the GPU programming process. Thisdissertation proposes three novel GPU microarchitecture enhancements for mitigating both the locality and utilization problems on an important class of irregular GPU applications. The first mechanism, Cache-Conscious Warp Scheduling (CCWS), is an adaptive hardware mechanism that makes use of a novel locality detector to capture memory reference locality that is lost by other schedulers due to excessive contention for cache capacity. On cache-sensitive, irregular GPU workloads, CCWS provides a 63% speedup over previous scheduling techniques. This dissertation uses CCWS to demonstrate that improvements to the hardware thread scheduling policy in massively multithreaded systems offer a promising new design space to explore in locality management. The second mechanism, Divergence-Aware Warp Scheduling (DAWS), introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture locality in loops. We demonstrate that the predictive, pre-emptive nature of DAWS can provide an additional 26% performance improvement over CCWS. This dissertation also demonstrates that DAWS can effectively shift the burden of locality management from software to hardware by increasing the performance of simpler and more portable code on the GPU. Finally, this dissertation details a Variable Warp-Size Architecture (VWS) which improves the performance of irregular applications by 35%. VWS improves irregular code by using a smaller warp size while maintaining the performance and energy-efficiency of regular code by ganging the execution of these smaller warps together in the warp scheduler.

View record

Designing network-on-chips for throughput accelerators (2014)

Physical limits of power usage for integrated circuits have steered the microprocessor industry towards parallel architectures in the past decade. Modern Graphics Processing Units (GPU) are a form of parallel processor that harness chip area more effectively compared to traditional single threaded architectures by favouring application throughput over latency. Modern GPUs can be used as throughput accelerators: accelerating massively parallel non-graphics applications. As the number of compute cores in throughput accelerators increases, so does the importance of efficient memory subsystem design. In this dissertation, we present system-level microarchitectural analysis and optimizations with an emphasis on the memory subsystem of throughput accelerators that employ Bulk-Synchronous-Parallel programming models such as CUDA and OpenCL. We model the whole throughput accelerator as a closed-loop system in order to capture the effects of complex interactions of microarchitectural components: we simulate components such as compute cores, on-chip network and memory controllers with cycle-level accuracy. For this purpose, the first version of GPGPU-Sim simulator that was capable of running unmodified applications by emulating NVIDIA's virtual instruction set was developed. We use this simulator to model and analyze several applications and explore various microarchitectural tradeoffs for throughput accelerators to better suit these applications. Based on our observations, we identify the Network-on-Chip (NoC) component of memory subsystem as our main optimization target and set out to design throughput effective NoCs for future throughput accelerators. We provide a new framework for NoC researchers to ensure the optimizations are "throughput effective", namely, parallel application-level performance improves per unit chip area. We then use this framework to guide the development of several optimizations. Accelerator workloads demand high off-chip memory bandwidth resulting in a many-to-few-to-many traffic pattern. Leveraging this observation, we reduce NoC area by proposing a checkerboard NoC which utilizes routers with limited connectivity. Additionally, we improve performance by increasing the terminal bandwidth of memory controller nodes to better handle frequent read-reply traffic. Furthermore, we propose a double checkerboard inverted NoC organization which maintains the benefits of these optimizations while having a simpler routing mechanism and smaller area and results in a 24.3% improvement in average application throughput per unit area.

View record

Master's Student Supervision (2010 - 2018)
A mix-grained architecture for improving HLS-generated controllers on FPGAs (2017)

With the recent slowdowns in traditional technology scaling, hardware accelerators, such as Field Programmable Gate Arrays (FPGAs), offer the potential for improved performance and energy efficiency compared to general purpose processing systems. While FPGAs were traditionally used for applications such as signal processing, they have recently gained popularity in new, larger scale domains, such as cloud computing. However, despite their performance and power efficiency, programming FPGAs remains a hard task due to the difficulties involved with the low-level design flow for FPGAs. High-Level Synthesis (HLS) tools aim to assist with this time-consuming task by supporting higher level programming models which significantly increases design productivity. This also makes the use of FPGAs for large scale design development for evolving applications more feasible.In this thesis we explore the potential of modifying the current FPGA architecture to better support the designs generated by HLS tools. We propose a specialized mix-grained architecture for Finite State Machine (FSM) implementation that can be integrated into existing FPGA architectures. The proposed mix-grained architecture exploits the characteristics of the controller units generated by HLS tools to reduce the control-path area of the design. We show that our proposed architecture reduces the area of the next state calculation in FSMs by more than 3X without impacting the performance and often reducing the critical path delay of the next state calculation in FSMs.

View record

Accelerating Web Search using GPUs (2015)

The amount of content on the Internet is growing rapidly as well as the number of the online Internet users. As a consequence, web search engines need to increase their computing capabilities and data continually while maintaining low search latency and without a significant rise in the cost per query. To serve this larger numbers of online users, web search engines utilize a large distributed system in the data centers. They partition their data across several hundred of thousands of independent commodity servers called Index Serving Nodes (ISNs). These ISNs work together to serve search queries as a single coherent system in a distributed manner. The choice of a high number of commodity servers vs. a smaller number of supercomputers is due to the need for scalability, high availability/reliability, performance, and cost efficiency. For the web search engines to serve a larger data, the web search engines can be scaled either vertically or horizontally~\cite{michael2007scale}.Vertical scaling enables ranking more documents per query within a single node by employing machines with higher single thread and throughput performance with bigger and faster memory. Horizontal scaling supports a larger index by adding more index serving nodes at the cost of increased synchronization, aggregation overhead, and power consumption. This thesis evaluates the potential for achieving better vertical scaling by using Graphics processing unit (GPUs) to reduce the documents ranking latency per query at a reasonable initial cost increase.It achieves this by exploiting the parallelism in ranking the numerous potential documents that match a query to offload to the GPU. We evaluate this approach using hundreds of rankers from a commercial web search engine on real production data. Our results show an 8.8x harmonic mean reduction in the latency and 2x power efficiency when ranking 10000 web documents per query for a variety of rankers using C++AMP and a commodity GPU.

View record

Inter-core locality aware memory access scheduling (2015)

Graphics Processing Units (GPUs) run thousands of parallel threads and achieve high Memory Level Parallelism (MLP). To support high MLP, a structure called a Miss-Status Holding Register (MSHR) handles multiple in-flight miss requests. When multiple cores send requests to the same cache line, the requests are merged into one last level cache MSHR entry and only one memory request is sent to the Dynamic Random-Access Memory (DRAM). We call this inter-core locality. The main reason for inter-core locality is that multiple cores access shared read-only data within the same cache line. By prioritizing memory requests that have high inter-core locality, more threads resume execution. Many memory access scheduling policies have been proposed for general-purpose multi-core processors and GPUs. However, some of these policies do not consider the characteristic of GPUs and others do not utilize inter-core locality information.In this thesis, we analyze the reasons that inter-core locality exists and show that requests with more inter-core locality have a higher impact performance. To exploit inter-core locality, we enable the GPU DRAM controller to be aware of inter-core locality by using Level 2 (L2) cache MSHR information. We propose a memory scheduling policy to coordinate the last level cache MSHR and the DRAM controller. 1) We introduce a structure to enable the DRAM to be aware of L2 cache MSHR information. 2) We propose a memory scheduling policy to use L2 cache MSHR information. 3) To prevent starvation, we introduce age information to the scheduling policy.Our evaluation shows a 28% memory request latency reduction and an 11% performance improvement on the average for high inter-core locality benchmarks.

View record

Deterministic execution on GPU architectures (2013)

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one’s ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. Here we propose a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this thesis we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet’s design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.

View record

Improving GPU programming models through hardware cache coherence (2013)

Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over contemporary chip multiprocessors (CMPs) on massively parallel programs. The lack of well-defined GPU memory models, however, prevents support of high-level languages like C++ and Java, and negatively impacts their programmability. This thesis proposes to improve GPU programmability by adding support for a well-defined memory consistency model through hardware cache coherence. We show that GPU coherence introduces a new set of challenges different from that posed by scalable cache coherence for CMPs. First, introducing conventional directory coherence protocols adds unnecessary coherence traffic overhead to existing GPU applications. Second, the massively multithreaded GPU architecture presents significant storage overheads for buffering thousands of in-flight coherence requests. Third, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC), explored the use of time-based approaches in CMP coherence protocols. This thesis describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present two implementations of TC, called TC-Strong and TC-Weak. TC-Strong implements an optimized version of LCC, while TC-Weak uses a novel timestamp based memory fence mechanism to implement Release Consistency on GPUs. TC-Weak eliminates TC-Strong’s trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications requiring coherence by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also show that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.

View record

Optimizing network-on-chips for FPGAs (2013)

As larger System-on-Chip (SoC) designs are attempted on Field Programmable Gate Arrays (FPGAs), the need for a low cost and high performance Network-on-Chip (NoC) grows. Virtual Channel (VC) routers provide desirable traits for an NoC such as higher throughput and deadlock prevention but at significant resource cost when implemented on an FPGA. This thesis presents an FPGA specific optimization to reduce resource utilization. We propose sharing Block RAMs between multiple router ports to store the high logic resource consuming VC buffers and present the Block RAM Split (BRS) router architecture that implements the proposed optimization. We evaluate the performance of the modifications using synthetic traffic patterns on mesh and torus networks and synthesize the NoCs to determine overall resource usage and maximum clock frequency. We find that the additional logic to support sharing Block RAMs has little impact on Adaptive Logic Module (ALM) usage in designs that currently use Block RAMs while at the same time decreasing Block RAM usage by as much as 40%. In comparison to CONNECT, a router design that does not use Block RAMs, a 71% reduction in ALM usage is shown to be possible. This resource reduction comes at the cost of a 15% reduction in the saturation throughput for uniform random traffic and a 50% decrease in the worst case neighbour traffic pattern on a mesh network. The throughput penalty from the neighbour traffic pattern can be reduced to 3% if a torus network is used. In all cases, there is little change in network latency at low load. BRS is capable of running at 161.71 MHz which is a decrease of only 4% from the base VC router design. Determining the optimum NoC topology is a challenging task. This thesis also proposes initial work towards the creation of an analytical model to assist with finding the best topology to use in an FPGA NoC.

View record

On Replay and Hazards in Graphics Processing Units (2012)

Graphics Processing Units (GPUs) have potential for more efficient execution of programs, both time wise and energy wise. They can achieve this by devoting more of their hardware for functional units. GPGPU-Sim is modified to simulate an aggressively modern compute accelerator and used to investigate the impact of hazards in massively multithreaded accelerator architectures such as those represented by modern GPUs employing a single-instruction, multiple thread (SIMT) execution model. Hazards are events that occur in the execution of a program which limit performance. We explore design tradeoffs in hazard handling. We find that in common architectures hazards that stall the pipeline cause other unrelated threads to be unable to make forward progress. This thesis explores an alternative organization, called replay, in which instructions that cannot proceed through the memory stage are squashed and later replayed so that instructions from other threads can make forward progress. This replay architecture can behave pathologically in some cases by excessively replaying without forward progress. A method which predicts when hazards may occur and thus can reduce the number of unnecessary replays and improve performance is proposed and evaluated. This method is found to improve performance up to 13.3% over the stalling baseline, and up to 3.3% over replay.

View record

Improving the performance of post-silicon trace generation (2011)

As microprocessor designs become more complex, the task of finding errors in the design becomes more difficult. Most design errors will be caught before the chip is fabricated, however, there may be some that make it into the fabricated design. The process of determining what is wrong when the fabricated chip of a new design behaves incorrectly is called post-silicon debug (also known as silicon validation). One of the challenges of post-silicon debug is the lack of observability into the state of a fabricated chip. BackSpace is a proposal for tackling the observability challenge. It does this by generating a post-silicon trace using a combination of on-chip monitoring hardware and off-chip formal analysis. To add one state to the trace, BackSpace generates a set of possible predecessor states by analysing the design which are then tested one at a time. The testing is performed by loading a candidate state into a circuit that compares it with the current state of the chip, and running the chip. If the state is reached during execution, then it is added to the trace. The process of testing states one at a time is time consuming. This thesis shows that correlation information characterizing the application running on the chip can reduce the number of runs of the chip by up to 51%. New post-silicon trace generation algorithms, BackSpace-2bp and BackSpace-3bp, are also proposed that do not need to generate the set of possible predecessor states. This leads to a speedup relative to practical implementations of BackSpace and also reduces the hardware overhead needed to generate post-silicon traces by 94.4% relative to the state of the art implementation of BackSpace. To evaluate these new algorithms BackSpace, and the new algorithms, were implemented on a superscalar out-of-order processor using the Alpha instruction set. Non-determinism was introduced in the form of variable memory latency. The effects of non-determinism on the processor and the trace generation algorithms are also evaluated.

View record

Towards synchronization in SIMT architectures (2011)

An important class of compute accelerators are graphics processing units (GPUs). Popular programming models for non-graphics computation on GPUs, such as CUDA and OpenCL, provide an abstraction of many parallel scalar threads. Contemporary GPU hardware groups 32 to 64 scalar threads as a single warp or wavefront and executes this group of scalar threads in lockstep. The inherent mismatch between scalar programming model and vector hardware creates a challenge when developing applications that employ synchronization on the GPU. This challenge arises from the use of a hardware stack to manage control flow divergence among scalar threads.This thesis explains the porting of the Apriori benchmark to a GPU which ledto the research on synchronization in SIMT hardware. It then proposes instruction set and hardware changes that simplify the implementation of mutual exclusion when porting multiple-instruction, multiple data (MIMD) programs with synchronization to accelerators employing single-instruction, multiple thread (SIMT) hardware. These instructions when compared with more complex software only solutions, achieve similar performance. This thesis also implements and evaluates queue based mutual exclusion on SIMT hardware.

View record

GPU Compute Memory Systems (2010)

Modern Graphic Process Units (GPUs) offer orders of magnitude more raw computing powerthan contemporary CPUs by using many simpler in-order single-instruction, multiple-data (SIMD)cores optimized for multi-thread performance rather than single-thread performance. As such,GPUs operate much closer to the "Memory Wall", thus requiring much more careful memorymanagement.This thesis proposes changes to the memory system of our detailed GPU performance simulator,GPGPU-Sim, to allow proper simulation of general-purpose applications written using NVIDIA'sCompute Unified Device Architecture (CUDA) framework. To test these changes, fourteen CUDAapplications with varying degrees of memory intensity were collected. With these changes, weshow that our simulator predicts performance of commodity GPU hardware with 86% correlation.Furthermore, we show that increasing chip resources to allow more threads to run concurrently doesnot necessarily increase performance due to increased contention for the shared memory system.Moreover, this thesis proposes a hybrid analytical DRAM performance model that uses memoryaddress traces to predict the efficiency of a DRAM system when using a conventional First-ReadyFirst-Come First-Serve (FR-FCFS) memory scheduling policy. To stress the proposed model, amassively multithreaded architecture based upon contemporary high-end GPUs is simulated togenerate the memory address trace needed. The results show that the hybrid analytical modelpredicts DRAM efficiency to within 11.2% absolute error when arithmetically averaged across amemory-intensive subset of the CUDA applications introduced in the first part of this thesis.Finally, this thesis proposes a complexity-effective solution to memory scheduling that recoversmost of the performance loss incurred by a naive in-order First-in First-out (FIFO) DRAM schedulercompared to an aggressive out-of-order FR-FCFS scheduler. While FR-FCFS scheduling re-ordersmemory requests to improve row access locality, we instead employ an interconnection network arbitration scheme that preserves the inherently high row access locality of memory request streamsfrom individual "shader cores" and, in doing so, achieve DRAM efficiency and system performanceclose to that of FR-FCFS with a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, ring, and mesh networks and show that, when coupled with a bankedFIFO in-order scheduler, it obtains up to 91.0% of the performance obtainable with an out-of-ordermemory scheduler with eight-entry DRAM controller queues.

View record


If this is your researcher profile you can log in to the Faculty & Staff portal to update your details and provide recruitment preferences.


Learn about our faculties, research, and more than 300 programs in our 2021 Graduate Viewbook!