Assistant Professor
Relevant Thesis-Based Degree Programs
Membership Status
Member of G+PS
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Significant advancements have recently been made in image and video generative models. Among these, diffusion models have demonstrated a strong capability for generating high-quality images and videos, thus inviting significant study within the field. However, despite these exciting achievements, diffusion models for visual content generation still face numerous challenges. In this thesis, we focus on two key challenges facing diffusion models and propose potential solutions to address them.
First, metrics for assessing generative models remain relatively underexplored, particularly in the domain of video generation. To bridge this research gap, we propose the Fréchet Video Motion Distance (FVMD) metric, which focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key-point tracking and then measure the similarity between these features via the Fréchet distance. We conduct a sensitivity analysis by injecting noise into real videos to verify the effectiveness of FVMD. Further, we carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics.
Second, diffusion models face challenges in compositionality and interpretability. While humans understand images structurally, generative models typically generate all pixels simultaneously. Latent Diffusion Models, widely used in this domain, rely on continuous latent variables from Variational Autoencoders (VAEs), which lack interpretability and structure. To address this, we propose DiffuseDRAW, a novel framework incorporating structured latent variables with diffusion models. Our approach integrates non-parametric structured latent variables from NP-DRAW with discrete vector-quantized representations from VQ-GAN. Built upon VQ-GAN, our model transforms input images into combined discrete latent variables and applies a diffusion model in the discrete latent space. We model dependencies between structured and discrete latent variables using a Transformer backbone with cross-conditioning. Experiments on CIFAR-10 and LSUN datasets demonstrate that our model outperforms prior structured generative models and competes with state-of-the-art diffusion models. Moreover, its compositionality and interpretability offer significant advantages in zero-shot latent space editing.
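As a rough illustration of the Fréchet distance step mentioned in the abstract above, the sketch below computes the standard closed-form squared Fréchet distance between Gaussians fitted to two sets of feature vectors, as is commonly done for FID. It is not the thesis's implementation: the function name `frechet_distance`, the random stand-in "motion features", and the choice of the squared Gaussian form are all assumptions made for illustration, and the key-point tracking that would produce real motion features is not shown.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_dims) with n_dims >= 2.
    Hypothetical helper for illustration; not the FVMD reference code.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in features (assumption): random vectors in place of motion features.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
shifted_feats = real_feats + 3.0  # same covariance, mean shifted by 3 per dim

same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, shifted_feats)
```

Comparing a feature set with itself gives a distance near zero, while a pure mean shift contributes only the squared norm of the shift, which makes the two terms of the formula easy to inspect separately.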
View record
Ejection fraction (EF) serves as a critical indicator of cardiac function, traditionally assessed through expert clinicians' manual interpretation of echocardiograms. However, the labor-intensive nature of this process, along with inter-observer variability and data scarcity, highlights the need for automated and scalable solutions. This thesis explores the application of video diffusion models to generate synthetic echocardiograms as a means to augment limited datasets, thereby enhancing ejection fraction estimation models. By leveraging synthetic data generation, we aim to address data scarcity, enhance model performance, and validate the effectiveness of synthetic data in echocardiography (echo).
The proposed methodology integrates diffusion models with echo video data to create realistic cardiac echocardiograms tailored to each patient. We also develop a data augmentation framework aimed at improving EF estimation. Extensive experiments are conducted to evaluate the contribution of the synthetic data to EF prediction accuracy, focusing on scenarios with limited labeled data. Our results demonstrate that incorporating diffusion-augmented training data leads to improvements in both the accuracy and robustness of automated EF estimation models.
In addition, we investigate and present various strategies for the rapid generation of synthetic echocardiograms through model distillation. Our preliminary findings establish a foundation for future research in real-time echocardiogram synthesis, facilitating applications in clinical training and procedural guidance.
Ultimately, this thesis provides a novel approach to controlled synthetic data-driven augmentation, contributing to the broader field of cardiac imaging by enabling more efficient and precise diagnostic tools. This work advances the potential for scalable, Artificial Intelligence (AI)-driven cardiac assessments, offering enhanced accessibility to high-quality care in both high-resource and low-resource clinical environments.
View record
There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed analysis, we explain which questions benefit, and which don’t, from contextualized commonsense knowledge from COMET.
View record
In this thesis, we investigate the ability of neural networks, particularly Transformers, to reason and memorize. First, we focus on graph neural networks and Transformers, and analyze their performance on algorithmic reasoning tasks. We show that while models can achieve high accuracy on data from the same distribution as their training data, their performance drops significantly when faced with new, out-of-distribution data. We further show that even high benchmark performance may be misleading and that the true reasoning capability of these models remains limited. We identify several challenges involved in achieving true reasoning abilities and generalization to new data. We propose solutions to some of these challenges, including fixing input representation issues, using hybrid models, and enlarging the training dataset. We also examine the expressivity of Transformers, providing a theoretical analysis of their ability to memorize data points. The results show a linear relationship between a Transformer's memory capacity and both the number of its attention heads and the input's context size.
View record