Doctor of Philosophy in Electrical and Computer Engineering (PhD)
Protecting Mobile Phone Users Against Malware
Caches are essential to today's microprocessors because they bridge the large speed gap between processors and memory. However, cache design presents an important tradeoff: a larger cache should improve performance, but its size is limited by silicon area and power consumption costs. Today's caches often occupy half of the silicon area of a processor chip and consume substantial power. Instead of physically enlarging the cache, effective cache capacity can be substantially increased by compressing the data inside it.
Current cache compression techniques focus on only one granularity, either compressing within one cache line (intra-line) or compressing similar cache lines together (inter-line). In this work, we combine both techniques to leverage both inter-line and intra-line compression. We find that combining them yields better compression than previously described methods while maintaining the same performance as a normal uncompressed cache when running incompressible applications. We study the design considerations and tradeoffs that arise from such a design, including cache structure and replacement policies. We then present an implementation that achieves the best possible compression and performance while keeping overheads as low as possible.
In recent years, neural networks have regained popularity in a variety of fields such as image recognition and speech transcription. As deep neural networks are used for more everyday tasks, deployment on small embedded devices such as phones is becoming increasingly common. Moreover, many applications, such as face recognition or health applications, require personalization, which means that networks must be retrained after they have been deployed.
Because today’s state-of-the-art networks are too large to fit on mobile devices and exceed mobile device power envelopes, techniques such as pruning and quantization have been developed to allow pre-trained networks to be shrunk by about an order of magnitude. However, they all assume that the network is first fully trained off-line on datacenter-class GPUs, then pruned in a post-processing step, and only then deployed to the mobile device.
In this thesis, we introduce DropBack, a technique that significantly reduces the storage and computation required during both inference and training. In contrast to existing pruning schemes, which retain the weights with the largest values and set the rest to zero, DropBack identifies the weights that have changed the most, and recomputes the original initialization values for all other weights. This means that only the most important weights must be stored in off-chip memory during both inference and training, reducing off-chip memory accesses (responsible for a majority of the power usage) by up to 72×.
Crucially, networks pruned using DropBack maintain high accuracy even for challenging network architectures: indeed, on modern, compact network architectures such as DenseNet and WRN-28-10, DropBack outperforms the current state-of-the-art pruning techniques in both accuracy and off-chip memory storage required for weights.
On the CIFAR-10 dataset, we observe a 5× reduction in weights on an already 9×-reduced VGG-16 network, which we call VGG-S, and a 4.5× reduction on DenseNet and WRN-28-10, all with zero or negligible accuracy loss; with a minor impact on accuracy, the reductions reach 19×, 27×, and 36×, respectively. When the recomputed initial weights are decayed to zero, the weight memory footprint of WRN-28-10 can be reduced by up to 72×.
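The core idea described above (store only the weights that have moved furthest from their initial values, and regenerate every other weight from its initialization) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The names `init_weight` and `DropBackSketch`, the per-index seeded Gaussian initializer, the seed-mixing constant, and the re-selection of the top weights on every step are all assumptions made for illustration only.

```python
import random
from typing import Dict, List


def init_weight(seed: int, index: int) -> float:
    """Regenerate the initial value of one weight from the seed alone.

    Assumption: a per-index seeded Gaussian stands in for whatever
    regenerable initializer a real system would use.
    """
    return random.Random(seed * 1_000_003 + index).gauss(0.0, 0.1)


class DropBackSketch:
    """Hypothetical layer that stores at most `budget` trained weights."""

    def __init__(self, size: int, budget: int, seed: int) -> None:
        self.size = size
        self.budget = budget
        self.seed = seed
        # Only these entries would need to live in memory;
        # every other weight is recomputed on demand.
        self.tracked: Dict[int, float] = {}

    def weight(self, index: int) -> float:
        """Return the stored value if tracked, else the recomputed initial value."""
        if index in self.tracked:
            return self.tracked[index]
        return init_weight(self.seed, index)

    def step(self, grads: List[float], lr: float) -> None:
        """Apply one SGD update, then keep only the most-changed weights."""
        updated = [self.weight(i) - lr * g for i, g in enumerate(grads)]
        # Distance of each candidate weight from its initial value.
        drift = [abs(w - init_weight(self.seed, i)) for i, w in enumerate(updated)]
        keep = sorted(range(self.size), key=lambda i: drift[i], reverse=True)
        # Untracked weights implicitly fall back to their initial values.
        self.tracked = {i: updated[i] for i in keep[: self.budget]}


layer = DropBackSketch(size=6, budget=2, seed=0)
layer.step(grads=[0.0, 5.0, 0.0, -3.0, 0.1, 0.0], lr=0.1)
print(sorted(layer.tracked))  # [1, 3]
```

In this toy run only two of six weights are stored. Index 4 received a small gradient, but its update is discarded and the weight reverts to its regenerated initial value, which is the behavior the abstract contrasts with magnitude-based pruning.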