Prashant Nair

Assistant Professor

Research Interests

Reliability, security, and performance-power efficient memory systems
System-level and architecture-level optimization to enable efficient and practical quantum computers


Graduate Student Supervision

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Bo-tree: a dynamic merkle tree for enabling scalable memories (2022)

Securing off-chip main memories is an integral component of trusted-execution environments like Intel SGX. Unfortunately, secure memories that offer replay protection face performance overheads due to the additional memory accesses from traversing multiple levels of the counter integrity tree. While recent works try to reduce these overheads by enabling shorter, higher-arity trees, these approaches do not scale with memory capacities and tree heights. Thus, we need to develop efficient techniques that continue to maintain short tree heights and sustain low performance overheads as memory sizes grow. In this thesis, we propose Bo-Tree, an efficient integrity tree design that achieves near-zero traversal overheads. Unlike prior works that are restricted to static tree structures, Bo-Tree dynamically detects “hot” data blocks and collects their counters in a logically separate, smaller tree that can efficiently take advantage of the on-chip metadata cache. A hardware mechanism dynamically adds counters to and removes them from this hot tree, making the optimization transparent to software. To track frequently accessed blocks with minimal overhead, Bo-Tree uses a probabilistic counting mechanism. In our experiments on a 32GB DDR4 secure memory, Bo-Tree provides an average speedup of 17.1% over the prior work SC-64 across all SPEC-2006 and GAP workloads, while incurring a memory capacity overhead of
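The probabilistic counting idea above can be illustrated with a minimal sketch. This is not the thesis's hardware mechanism: the sampling rate, threshold, and names below are assumptions chosen for illustration. Each access bumps a small saturating counter only with probability 1/SAMPLE_RATE, so per-block state stays tiny while genuinely hot blocks still saturate quickly.

```python
import random

# Hypothetical sketch of probabilistic "hot block" tracking: instead of
# an exact access count per block, each access increments a saturating
# counter only with probability 1/SAMPLE_RATE.
SAMPLE_RATE = 8      # assumed sampling period (not from the thesis)
HOT_THRESHOLD = 4    # assumed counter value at which a block is "hot"

counters = {}        # block address -> saturating counter

def record_access(block_addr):
    """Probabilistically bump the counter; return True once hot."""
    if random.randrange(SAMPLE_RATE) == 0:
        counters[block_addr] = min(counters.get(block_addr, 0) + 1,
                                   HOT_THRESHOLD)
    return counters.get(block_addr, 0) >= HOT_THRESHOLD

# A block accessed many times is far more likely to be promoted to the
# small "hot" tree than one touched only a handful of times.
hot = False
for _ in range(10_000):
    hot = record_access(0x1000) or hot
print(hot)  # almost surely True after 10,000 accesses
```

Because only a coin flip and a small counter are needed per access, this kind of tracking adds minimal overhead compared to exact per-block counts.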

Accelerating input dispatching for deep learning recommendation models training (2021)

Deep-learning and time-series based recommendation models require copious amounts of compute for the deep learning part and large memory capacities for their embedding table portion. Training these models typically involves using GPUs to accelerate the deep learning phase while restricting the memory-intensive embedding tables to the CPUs. This causes data to be constantly transferred between the CPU and GPUs, which limits the overall throughput of the training process. This thesis offers a heterogeneous acceleration pipeline, called Hotline, that leverages the insight that only a small number of embedding entries are accessed frequently and can easily fit in a single GPU's local memory. Hotline pipelines the training mini-batches by efficiently utilizing (1) the main memory for infrequently accessed embeddings and (2) the GPUs' local memory for frequently accessed embeddings and their compute for the entire recommender model, whilst stitching their execution through a novel hardware accelerator that gathers the required working parameters and dispatches training inputs. The Hotline accelerator processes multiple input mini-batches to collect the ones that access only frequently accessed embeddings and dispatch them directly to the GPUs. For inputs that require infrequently accessed embeddings, Hotline hides the CPU-GPU transfer time by proactively obtaining them from main memory. This enables recommendation system training, for its entirety of mini-batches, to be performed on low-capacity, high-throughput GPUs. Results on real-world datasets and recommender models show that Hotline reduces the average training time by 3.45× in comparison to an XDL baseline when using 4 GPUs. Moreover, Hotline increases the overall training throughput to 20.8 epochs/hr, in comparison to 5.3 epochs/hr, for the Criteo Terabyte dataset.
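The dispatch step described above can be sketched in a few lines. This is a software toy, not the Hotline hardware accelerator; the hot-ID set and function names are assumptions. A mini-batch is partitioned into inputs whose sparse IDs all hit the GPU-resident "hot" embedding set, and inputs that need at least one infrequently accessed embedding gathered from CPU main memory.

```python
# Minimal sketch of hot/cold input dispatch for recommendation training.
# HOT_IDS stands in for the frequently accessed embedding rows assumed
# to already reside in GPU local memory.
HOT_IDS = {1, 2, 3, 7}

def dispatch(minibatch):
    """Partition inputs into (gpu_ready, needs_cpu_gather)."""
    gpu_ready, needs_gather = [], []
    for sparse_ids in minibatch:
        if all(i in HOT_IDS for i in sparse_ids):
            gpu_ready.append(sparse_ids)      # embeddings already on GPU
        else:
            needs_gather.append(sparse_ids)   # prefetch cold rows from CPU
    return gpu_ready, needs_gather

batch = [[1, 2], [3, 42], [7], [9, 1]]
ready, gather = dispatch(batch)
print(ready)    # [[1, 2], [7]]
print(gather)   # [[3, 42], [9, 1]]
```

Overlapping the cold-row gather for one mini-batch with GPU compute on another is what hides the CPU-GPU transfer time in the pipeline described above.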


Accelerating recommendation system training by leveraging popular choices (2021)

Recommendation systems have been deployed in e-commerce and online advertising to surface items desired from the user's perspective. To this end, various deep learning-based recommendation models have been employed, such as the Deep Learning Recommendation Model (DLRM) at Facebook. The inputs of such a model are categorized as dense and sparse representations: the former are continuous inputs such as time or age, while the latter are discrete numerical representations of items and users. Such models comprise two main components: computation-intensive components like the multilayer perceptron (MLP) and memory-intensive components like the embedding tables, which store the numerical representations of the sparse features. Training these large-scale recommendation models requires ever-increasing data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, the large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this thesis dives deep into the semantics of the training data and the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed: a few embedding entries are accessed up to 10000× more often than the rest. In this thesis, we focus on improving the end-to-end training performance using this insight and offer a framework called Frequently Accessed Embeddings (FAE). We propose a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory to store the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. We choose DLRM and XDL as baselines. Both models have been commercialized and are well established in industry: DLRM has been deployed by Facebook, and XDL by Alibaba. We choose XDL because of its high CPU utilization and its notably scalable approach to training recommendation models. Experiments on production-scale recommendation models with real-world datasets show that FAE reduces the overall training time by 2.3× and 1.52× in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
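The skew-driven layout idea can be illustrated with a toy sketch. The names, budget, and sampling scheme here are assumptions for illustration, not FAE's actual implementation: embedding accesses are counted over a sample of training inputs, and the most frequently accessed rows are assigned to the small GPU partition while the rest stay in CPU memory.

```python
from collections import Counter

# Toy hot-embedding-aware layout split: count accesses over sampled
# inputs, then place the top rows (by access count) in GPU memory.
GPU_BUDGET = 2   # assumed number of embedding rows that fit on the GPU

def split_layout(sampled_inputs, gpu_budget=GPU_BUDGET):
    """Return (hot, cold) sets of embedding row IDs."""
    counts = Counter(i for ids in sampled_inputs for i in ids)
    hot = {i for i, _ in counts.most_common(gpu_budget)}
    cold = set(counts) - hot
    return hot, cold

sample = [[5, 9], [5], [5, 1], [9], [5, 9, 2]]
hot, cold = split_layout(sample)
print(sorted(hot))    # [5, 9]  -- the heavily reused rows go to the GPU
print(sorted(cold))   # [1, 2]
```

With accesses skewed as heavily as the abstract describes (up to 10000× for popular entries), even a small GPU-resident partition can absorb most embedding lookups, which is what cuts the CPU-to-GPU transfer volume.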


