Prashant Nair

Assistant Professor

Research Interests

Reliability, security, and performance-power efficient memory systems
System-level and architecture-level optimization to enable efficient and practical quantum computers

Graduate Student Supervision

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Lightweight mitigation against transient cache side-channel attacks (2023)

Today, nearly all modern devices, including smartphones, PCs, and cloud servers, benefit significantly from two architectural advancements that speed up computation. First, speculative execution allows instructions to be executed ahead of time; second, cache memory stores data close to the processor to reduce the latency of memory operations. Unfortunately, these features also make computer systems vulnerable to security holes known as transient side-channel attacks. These attacks exploit predictive mechanisms and use mis-speculated instructions to modify the cache persistently as a side channel. Such attacks enable leakage of private data, such as cryptographic keys. Full-fledged mitigations against transient attacks have come at a significant performance cost. They either try to stop potential leakage at the source, where instructions are speculatively executed, and thereby penalize many “innocent” executions, or they laboriously restore past states in the memory hierarchy. This research focuses on mitigating transient side-channel attacks while minimizing performance loss. Our approach combines significantly more efficient protection schemes inside the cache hardware, the destination, to stop potential leaks. We identify and leverage a specific memory access pattern to detect an ongoing cache side-channel attack and localize the protection to where it occurs. We propose FantôMiss, a mitigation strategy that allows speculative cache state changes but also tracks the cache lines modified under speculation. Then, for subsequent accesses to those speculated cache lines, we generate fake cache misses that traverse the memory hierarchy and conceal speculative cache states, thereby closing the side channel. Crucially, the increased latency is not imposed upon all load instructions but only on loads that read potential leak sources. As a result, FantôMiss significantly outperforms previously proposed mitigations, with execution-time overheads of just 0.5% and 1.6% (geometric mean) across the PARSEC and SPEC benchmark suites.
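The tracking idea in this abstract can be illustrated with a toy model. The sketch below is not the thesis's hardware design: the class name `ToyCache`, the latency constants, and the rule that a tag is cleared after one fake miss are all assumptions made for illustration. It only shows how tagging speculatively filled lines lets the cache hide their presence from the next access.

```python
from dataclasses import dataclass, field

HIT_LATENCY = 4      # assumed cycles for a cache hit (illustrative value)
MISS_LATENCY = 200   # assumed cycles for a full memory-hierarchy traversal


@dataclass
class ToyCache:
    """Toy fully associative cache that tags lines filled under speculation."""
    lines: set = field(default_factory=set)
    tagged: set = field(default_factory=set)  # lines filled speculatively

    def load(self, addr: int, speculative: bool = False) -> int:
        """Return the latency observed by the requester for this load."""
        if addr not in self.lines:
            # Real miss: fill the line and remember if it was speculative.
            self.lines.add(addr)
            if speculative:
                self.tagged.add(addr)
            return MISS_LATENCY
        if addr in self.tagged:
            # Fake miss: the line is present, but the requester sees miss
            # latency, so the speculative fill is not observable.
            # Assumption: the tag is cleared once the fake miss is paid.
            self.tagged.discard(addr)
            return MISS_LATENCY
        return HIT_LATENCY


cache = ToyCache()
cache.load(0x40, speculative=True)   # speculative fill, line gets tagged
probe = cache.load(0x40)             # attacker's probe sees a miss, not a hit
```

Only loads to tagged lines pay the extra latency; loads to lines filled non-speculatively hit as usual, which mirrors the abstract's point that the penalty is localized.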


Structural coding : a low-cost scheme to protect CNNs from large-granularity memory errors (2023)

Convolutional Neural Networks (CNNs) are broadly used in safety-critical applications such as autonomous vehicles. While demonstrating high accuracy, CNN models are vulnerable to Dynamic Random Access Memory (DRAM) errors that corrupt their parameters, thereby degrading their accuracy. Unfortunately, existing techniques for protecting CNNs from memory errors are either costly or incomplete, meaning that they fail to protect against large-granularity, multi-bit DRAM errors. In this thesis, we propose a software-implemented coding scheme, Structural Coding, which achieves a three-orders-of-magnitude reduction in Silent Data Corruption (SDC) rates of CNNs under large-granularity memory errors. Its error correction coverage is also significantly higher than that of other software techniques for protecting CNNs from memory faults. Additionally, its average performance overhead on a real machine is less than 3%. The memory footprint overhead of Structural Coding is

Bo-tree: a dynamic merkle tree for enabling scalable memories (2022)

Securing off-chip main memories is an integral component of trusted-execution environments like Intel SGX. Unfortunately, secure memories that offer replay protection face performance overheads due to the additional memory accesses from traversing multiple levels of the counter integrity tree. While recent works try to reduce these overheads by enabling shorter, higher-arity trees, these approaches do not scale with memory capacities and tree heights. Thus, we need to develop efficient techniques that will continue to maintain short tree heights and sustain low performance overheads as memory sizes grow. In this thesis, we propose Bo-Tree, an efficient integrity tree design that achieves near-zero traversal overheads. Unlike prior works that are restricted to static tree structures, Bo-Tree dynamically detects “hot” data blocks and collects their counters in a logically separate, smaller tree that can efficiently take advantage of the on-chip metadata cache. A hardware mechanism dynamically adds and removes counters to and from this hot tree, making the optimization transparent to software. To track frequently accessed blocks with minimal overhead, Bo-Tree uses a probabilistic counting mechanism. In experiments on a 32GB DDR4 secure memory, Bo-Tree provides an average speedup of 17.1% over the prior work SC-64 across all SPEC-2006 and GAP workloads, while incurring a memory capacity overhead of
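The Bo-Tree abstract mentions probabilistic counting for finding hot blocks but does not give its details. The following is a generic sketch of one common form of that idea, not Bo-Tree's actual mechanism: each access increments a block's counter only with some sampling probability, and a block is promoted once its counter reaches a threshold. The class name `HotBlockTracker`, the parameters `sample_prob` and `threshold`, and the promotion rule are all hypothetical.

```python
import random


class HotBlockTracker:
    """Sampled access counting: cheap, approximate hot-block detection."""

    def __init__(self, sample_prob: float = 0.1, threshold: int = 4, seed: int = 0):
        self.sample_prob = sample_prob   # chance an access updates a counter
        self.threshold = threshold       # sampled hits needed for promotion
        self.rng = random.Random(seed)   # seeded for reproducible runs
        self.counts = {}                 # block -> sampled access count
        self.hot = set()                 # blocks promoted to the "hot" set

    def access(self, block: int) -> bool:
        """Record one access; return True if the block is currently hot."""
        if block in self.hot:
            return True
        # Only a fraction of accesses touch the counter, so rarely used
        # blocks seldom accumulate enough sampled hits to be promoted.
        if self.rng.random() < self.sample_prob:
            self.counts[block] = self.counts.get(block, 0) + 1
            if self.counts[block] >= self.threshold:
                self.hot.add(block)
                del self.counts[block]
        return block in self.hot


tracker = HotBlockTracker(sample_prob=1.0, threshold=3)
results = [tracker.access(42) for _ in range(3)]
```

With `sample_prob=1.0` the tracker degenerates to exact counting, which makes the behavior easy to check; lower probabilities trade accuracy for fewer counter updates. Demotion of blocks that cool down, which the abstract says the hardware also handles, is left out of this sketch.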

Accelerating input dispatching for deep learning recommendation models training (2021)

Deep-learning and time-series based recommendation models require copious amounts of compute for the deep learning part and large memory capacities for their embedding table portion. Training these models typically involves using GPUs to accelerate the deep learning phase but restricts the memory-intensive embedding tables to the CPUs. This causes data to be constantly transferred between the CPU and GPUs, which limits the overall throughput of the training process. This thesis offers a heterogeneous acceleration pipeline, called Hotline, by leveraging the insight that only a small number of embedding entries are accessed frequently and can easily fit in a single GPU’s local memory. Hotline aims to pipeline the training mini-batches by efficiently utilizing (1) the main memory for infrequently accessed embeddings and (2) the GPUs’ local memory for frequently accessed embeddings and their compute for the entire recommender model, while stitching their execution together through a novel hardware accelerator that gathers required working parameters and dispatches training inputs. The Hotline accelerator processes multiple input mini-batches to collect and dispatch the ones that access the frequently accessed embeddings directly to GPUs. For inputs that require infrequently accessed embeddings, Hotline hides the CPU-GPU transfer time by proactively obtaining them from the main memory. This enables recommendation system training, for all of its mini-batches, to be performed on low-capacity, high-throughput GPUs. Results on real-world datasets and recommender models show that Hotline reduces the average training time by 3.45× in comparison to an XDL baseline when using 4 GPUs. Moreover, Hotline increases the overall training throughput to 20.8 epochs/hr in comparison to 5.3 epochs/hr for the Criteo Terabyte dataset.
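The dispatching step described in this abstract can be pictured in software, although Hotline itself is a hardware accelerator. The sketch below assumes each training input is a list of embedding indices and that the set of frequently accessed ("hot") indices is already known; the function name `split_minibatch` and the data layout are hypothetical.

```python
def split_minibatch(batch, hot_ids):
    """Partition inputs into those touching only hot embeddings and the rest.

    batch   : list of inputs, each a list of embedding indices (assumed layout)
    hot_ids : set of indices whose embeddings live in GPU memory (assumed known)
    """
    hot_inputs, cold_inputs = [], []
    for sample in batch:
        if all(idx in hot_ids for idx in sample):
            hot_inputs.append(sample)    # can go straight to the GPU
        else:
            cold_inputs.append(sample)   # needs embeddings from main memory
    return hot_inputs, cold_inputs


batch = [[1, 2], [1, 9], [2]]
hot_inputs, cold_inputs = split_minibatch(batch, hot_ids={1, 2})
```

In this toy run, the inputs `[1, 2]` and `[2]` need no CPU-side embeddings, while `[1, 9]` would wait for index 9 to be fetched. The abstract's point is that this fetch can overlap with GPU work on the hot inputs.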


Accelerating recommendation system training by leveraging popular choices (2021)

Recommendation systems have been deployed in e-commerce and online advertising to surface items that users are likely to want. To this end, various deep learning-based recommendation models have been employed, such as the Deep Learning Recommendation Model (DLRM) at Facebook. The input of such a model can be categorized into sparse and dense representations. The former is the numerical representation of items and users with discrete parameters, whereas the latter refers to continuous input such as time or age. Such models comprise two main components: computation-intensive components like the multilayer perceptron (MLP) and memory-intensive components like embedding tables, which store the numerical representation of sparse features. Training these large-scale recommendation models requires ever-increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this thesis takes a deep dive into the semantics of training data and the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed: a few embedding entries are accessed up to 10000× more often than others. In this thesis, we focus on improving the end-to-end training performance using this insight and offer a framework called Frequently Accessed Embeddings (FAE). We propose a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. We choose DLRM and XDL as the baselines. Both of these models have been commercialized and are well established in the industry: DLRM has been deployed by Facebook and XDL by Alibaba. We choose XDL because of its high CPU utilization and because it is a notably scalable solution for training recommendation models. Experiments on production-scale recommendation models with real-world datasets show that FAE reduces the overall training time by 2.3× and 1.52× in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
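The skew observation above suggests a simple way to pick which embeddings to keep on the GPU. The sketch below assumes an access trace of embedding indices and a GPU budget expressed as a number of entries; the function name `select_hot_embeddings` and both parameters are hypothetical, and the thesis may classify hot embeddings differently.

```python
from collections import Counter


def select_hot_embeddings(access_trace, gpu_budget_entries):
    """Return the most frequently accessed embedding indices that fit the budget.

    access_trace       : iterable of embedding indices seen during profiling
    gpu_budget_entries : how many embedding rows fit in GPU memory (assumed unit)
    """
    counts = Counter(access_trace)
    return {idx for idx, _ in counts.most_common(gpu_budget_entries)}


def hot_coverage(access_trace, hot_ids):
    """Fraction of all accesses served by the hot set."""
    trace = list(access_trace)
    return sum(1 for idx in trace if idx in hot_ids) / len(trace)


# A skewed toy trace: two popular entries dominate the accesses.
trace = [7] * 100 + [3] * 50 + [1, 2, 4, 5]
hot = select_hot_embeddings(trace, gpu_budget_entries=2)
coverage = hot_coverage(trace, hot)
```

With a highly skewed trace, a small hot set covers most accesses (here 150 of 154), which is why placing only those rows in GPU memory can remove most CPU-to-GPU embedding transfers.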


 
 
