Doctor of Philosophy in Electrical and Computer Engineering (PhD)
Protecting Mobile Phone Users Against Malware
The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications. Unlike latency-critical Central Processing Units (CPUs), throughput-oriented GPUs provide high performance by exploiting massive application parallelism. In parallel programming, synchronization is necessary to exchange information for inter-thread dependency. However, inefficient synchronization support can serialize thread execution and restrict parallelism significantly. Considering parallelism is key to GPU performance, we aim to provide efficient and reliable synchronization support for both single-GPU and multi-GPU systems. To achieve this target, this dissertation explores multiple abstraction layers of computer systems, including programming models, memory consistency models, cache coherence protocols, and application specific knowledges of graphics rendering.First, to reduce programming burden without introducing data-races, we propose Relativistic Cache Coherence (RCC) to enforce Sequential Consistency (SC). By avoiding stalls of write permission acquisition with logical timestamps, RCC is 30% faster than the best prior SC proposal, and only 7% slower than the best non-SC design. Second, we introduce GETM, the first GPU Hardware Transactional Memory (HTM) with eager conflict detection, to help programmers implement deadlock-free, yet aggressively parallel code. Compared to the best prior GPU HTM, GETM is up to 2.1× (1.2× gmean) faster, area overheads are 3.6× lower, and power overheads are 2.2× lower. Third, we design HMG, a hierarchical cache coherence protocol for multi-GPU systems. By leveraging the latest scoped memory model, HMG not only can avoid full cache invalidation of software coherence protocol, but also filters out write invalidation acknowledgments and transient coherence states. Despite minimal hardware overhead, HMG can achieve 97% of the performance of an idealized caching system. Finally, we propose CHOPIN, a novel Split Frame Rendering (SFR) scheme by taking advantage of the parallelism of image composition. CHOPIN can eliminate the performance overheads of primitive duplication and sequential primitive distribution that exist in previous work. CHOPIN outperforms the best prior SFR implementation by up to 56% (25% gmean) in an 8-GPU system.
Improving the spectral efficiency is a key challenge to meet the increasing demandfor higher capacity over communication channels. Faster than Nyquist (FTN) signaling(first proposed in 1970s) has recently regained popularity due to its abilityto improve signal’s spectral efficiency. FTN can achieve a symbol rate faster thanNyquist rate; therefore, it has been widely investigated in high-capacity wirelessand optical communications.FTN signaling comes at the cost of intersymbol interference (ISI), which requirescomplex decoder at the receiver to recover the received symbol from ISInoise. This decoder is a bottleneck of the system throughput since it is on the criticalpath of receiving a noise free signal. Several decoders have been proposed inthe literature to mitigate the ISI noise, however, these systems require a significantcomputational complexity to equalize the received frames.In this thesis, we propose a hardware architecture that implement an FTN decodersystem. This system consists of a Maximum A-posteriori (MAP) equalizerand low-density parity (LDPC) decoder that work together iteratively to mitigateISI noise and channel noise. The MAP equalizer is mainly responsible for equalizingISI noise, while the LDPC decoder uses parity checks to remove randomchannel noise.We study the design trade-offs for each block, evaluate it separately, and proposea high-throughput hardware architecture for the FTN decoder system. Weevaluate this architecture on Xilinx UltraScale+ (xcvu13p) device. Our MAP equalizerachieves a throughput up to 602 Mbps per processing element (PE). We alsodesign an LDPC decoder that achieves a throughput of up to 520 Mbps per PE. TheFTN decoder architecture achieves a throughput of up to 2.16 Gbps.
Caches are essential to today's microprocessors. They close the huge speed gap between processors and memories. However, cache design presents an important tradeoff. A bigger cache size should increase performance and allow processors to perform faster, but it is also limited by its silicon, area, and power consumption costs. Today's caches often use half of the silicon area in processor chips and consume a lot of power. Instead of physically increasing the cache size, effective cache capacity can be substantially increased if the data inside the cache is compressed.Current cache compression techniques focus only on one granularity, either compressing inside one cache line, or compressing similar cache lines together. In this work, we combine both compression techniques to leverage both inter-line and intra-line compression. We find that combining both techniques results in better compression than previously described methods, and also maintains the same performance as a normal uncompressed cache when running incompressible applications. We study and address the design considerations and tradeoffs that arise from such design. We address issues related to the design like cache structure and replacement policies. Then we present an implementation that achieves the best possible compression and performance while maintaining overheads as low as possible.
In recent years, neural networks have regained popularity in a variety of fields such as image recognition and speech transcription. As deep neural networks grow more popular for solving everyday tasks, deployment on small embedded devices — such as phones — is becoming increasingly popular. Moreover, many applications — such as face recognition or health applications — require personalization, which means that networks must be retrained after they have been deployed.Because today’s state-of-the-art networks are too large to fit on mobile devices and exceed mobile device power envelopes, techniques such as pruning and quantization have been developed to allow pre-trained networks to be shrunk by about an order of magnitude. However, they all assume that the network is first fully trained off-line on datacenter-class GPUs, then pruned in a post-processing step, and only then deployed to the mobile device.In this thesis, we introduce DropBack, a technique that significantly reduces the storage and computation required during both inference and training. In contrast to existing pruning schemes, which retain the weights with the largest values and set the rest to zero, DropBack identifies the weights that have changed the most, and recomputes the original initialization values for all other weights. This means that only the most important weights must be stored in off-chip memory both during inference and training, reducing off-chip memory accesses (responsible for a majority of the power usage) by up to 72×.Crucially, networks pruned using DropBack maintain high accuracy even for challenging network architectures: indeed, on modern, compact network architectures such as Densenet and WRN-28-10, DropBack outperforms the current state-of-the- art pruning techniques in both accuracy and off-chip memory storage required for weights. On the CIFAR-10 dataset, we observe 5× reduction in weights on an already 9×-reduced VGG-16 network, which we call VGG-S, and 4.5× on Densenet and WRN-28-10 — all with zero or negligible accuracy loss — or 19×, 27×, and 36×, respectively, with a minor impact on accuracy. When the recomputed initial weights are decayed to zero, the weight memory footprint of WRN-28-10 can be reduced up to 72×.