Prashant Nair

Assistant Professor

Research Interests

Reliability, security, and performance-power efficient memory systems
System-level and architecture-level optimization to enable efficient and practical quantum computers

Graduate Student Supervision

Master's Student Supervision (2010 - 2021)
Accelerating input dispatching for deep learning recommendation models training (2021)

Deep-learning and time-series based recommendation models require copious amounts of compute for the deep learning part and large memory capacities for their embedding-table portion. Training these models typically involves using GPUs to accelerate the deep learning phase while restricting the memory-intensive embedding tables to the CPUs. This causes data to be constantly transferred between the CPU and GPUs, which limits the overall throughput of the training process. This thesis offers a heterogeneous acceleration pipeline, called Hotline, that leverages the insight that only a small number of embedding entries are accessed frequently and can easily fit in a single GPU's local memory. Hotline pipelines the training mini-batches by efficiently utilizing (1) the main memory for infrequently accessed embeddings and (2) the GPUs' local memory for frequently accessed embeddings, along with their compute for the entire recommender model, while stitching their execution together through a novel hardware accelerator that gathers the required working parameters and dispatches training inputs.

The Hotline accelerator processes multiple input mini-batches to collect the ones that access only the frequently accessed embeddings and dispatch them directly to the GPUs. For inputs that require infrequently accessed embeddings, Hotline hides the CPU-GPU transfer time by proactively obtaining them from the main memory. This enables recommendation system training, for the entirety of its mini-batches, to be performed on low-capacity, high-throughput GPUs. Results on real-world datasets and recommender models show that Hotline reduces the average training time by 3.45× in comparison to an XDL baseline when using 4 GPUs. Moreover, Hotline increases the overall training throughput to 20.8 epochs/hr, compared to 5.3 epochs/hr, on the Criteo Terabyte dataset.
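
The routing idea at the heart of this pipeline can be illustrated in a few lines. The sketch below is a minimal, hypothetical illustration rather than the thesis's actual implementation; the names HOT_IDS and route_minibatch are ours, and the hot set is assumed to have been identified by prior profiling.

```python
# Hypothetical sketch of Hotline's hot/cold routing idea (illustrative
# names, not the thesis's API). Assume profiling found that embedding
# rows 0..999 are "hot" and have been pinned in GPU memory.
HOT_IDS = set(range(1000))

def route_minibatch(batch):
    """Split samples by whether all of their embedding lookups are hot.

    batch: list of samples, each a list of embedding-row indices.
    Returns (hot_samples, cold_samples).
    """
    hot, cold = [], []
    for sample in batch:
        # A sample can go straight to the GPU only if every row it
        # touches is already resident in GPU memory.
        (hot if all(i in HOT_IDS for i in sample) else cold).append(sample)
    return hot, cold

hot, cold = route_minibatch([[3, 17, 42], [3, 5_000_000, 42]])
print(len(hot), len(cold))  # -> 1 1
```

Hot samples are dispatched directly to the GPUs; for cold samples, the missing rows are prefetched from main memory in the background so the CPU-GPU transfer time stays hidden from the training pipeline.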


Accelerating recommendation system training by leveraging popular choices (2021)

Recommendation systems have been deployed in e-commerce and online advertising to surface the items that users desire. To this end, various deep-learning-based recommendation models have been employed, such as Facebook's Deep Learning Recommendation Model (DLRM). The inputs of such a model can be categorized into dense and sparse representations. The former refers to continuous inputs such as time or age, while the latter provides numerical representations of items and users through discrete identifiers. Such models comprise two main components: computation-intensive components such as the multilayer perceptron (MLP), and memory-intensive components such as the embedding tables, which store the numerical representations of the sparse features. Training these large-scale recommendation models requires ever-growing amounts of data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, the large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this thesis dives deep into the semantics of the training data and the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed: only a few embedding entries are accessed up to 10,000× more often than the rest.

In this thesis, we focus on improving the end-to-end training performance using this insight and offer a framework called Frequently Accessed Embeddings (FAE). We propose a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. We choose DLRM and XDL as the baselines. Both of these models have been commercialized and are well established in the industry: DLRM has been deployed by Facebook and XDL by Alibaba. We choose XDL because of its high CPU utilization and because it is a notably scalable solution for training recommendation models. Experiments on production-scale recommendation models with real-world datasets show that FAE reduces the overall training time by 2.3× and 1.52× in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
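
The access skew that FAE exploits can be sketched just as briefly. The snippet below is an illustrative approximation, not FAE's actual interface; pick_hot_embeddings and gpu_budget_rows are hypothetical names, and the profiling pass over a sample of the training data is an assumption.

```python
from collections import Counter

def pick_hot_embeddings(sample_batches, gpu_budget_rows):
    """Profile a sample of the training data and return the set of the
    most frequently accessed embedding rows that fits the GPU budget."""
    counts = Counter()
    for batch in sample_batches:
        for sample in batch:          # each sample lists the rows it looks up
            counts.update(sample)
    return {row for row, _ in counts.most_common(gpu_budget_rows)}

# Because accesses are highly skewed, even a tiny budget captures most lookups.
sample_batches = [[[1, 2, 7], [1, 3, 7]], [[1, 7, 9], [2, 7, 8]]]
print(sorted(pick_hot_embeddings(sample_batches, gpu_budget_rows=2)))  # -> [1, 7]
```

The returned hot rows would be placed in GPU memory while the remaining cold rows stay in CPU memory, mirroring the hot-embedding-aware layout described above.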

