Complete these steps before you reach out to a faculty member!
- Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
- Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step, while others match successful applicants with faculty members within the first year of study. This is indicated either in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.
- Identify specific faculty members who are conducting research in your specific area of interest.
- Establish that your research interests align with the faculty member’s research interests.
- Read up on the faculty members in the program and the research being conducted in the department.
- Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.
- Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
- Do not send non-specific, mass emails to everyone in the department hoping for a match.
- Address the faculty members by name. Your contact should be genuine rather than generic.
- Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Be sure to craft compelling answers to these questions.
- Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
- Demonstrate that you are familiar with their research:
  - Convey the specific ways you are a good fit for the program.
  - Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
- Be enthusiastic, but don’t overdo it.
G+PS regularly provides virtual sessions that focus on admission requirements and procedures and offer tips on how to improve your application.
Graduate Student Supervision
Master's Student Supervision (2010 - 2021)
Despite significant progress, controlled generation of complex images with interacting people remains difficult. Existing layout-to-image generation methods fall short of synthesizing realistic person instances, while pose-guided generation approaches focus on a single person and assume simple or known backgrounds. To tackle these limitations, we propose a new problem, Person in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts, with user control over both. The context is specified by the bounding box object layout, which lacks shape information, while the pose of the person(s) is specified by keypoints, which are sparse annotations. To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared compositional structural space, which encodes shape, location and appearance information for both context and person structures in a disentangled manner. This structural space is then decoded to the image space using a multi-level feature modulation strategy, and learned in a self-supervised manner from image collections and their corresponding inputs. Extensive experiments on two large-scale datasets (COCO-Stuff and Visual Genome) demonstrate that our framework outperforms state-of-the-art methods with respect to synthesis quality.
Sequence decoding is one of the core components of most visual-lingual models. However, typical neural decoders when faced with decoding multiple, possibly correlated, sequences of tokens resort to simple independent decoding schemes. In this work, we introduce a consistent multiple sequence decoding architecture, which, while relatively simple, is general and allows for consistent and simultaneous decoding of an arbitrary number of sequences. Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders. This context is then utilized as a secondary input, in addition to previously generated output, to make a prediction at a given step of decoding. Self-attention, in the GNN, is used to modulate the fusion mechanism locally at each node and each step in the decoding process. We show the efficacy of our consistent multiple sequence decoder on the task of dense relational captioning and illustrate state-of-the-art performance (improvement of 5.2% in mAP) on the task. More importantly, we illustrate that the decoded sentences, for the same regions, are more consistent (improvement of 9.5% in consistency score), while across images and regions maintain diversity.
Generation of realistic high-resolution videos of human subjects is a challenging and important task in computer vision. In this thesis, we focus on human motion transfer -- generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video. Our GAN-based architecture, DwNet, leverages a dense intermediate pose-guided representation and a refinement process to warp the required subject appearance, in the form of the texture, from a source image into a desired pose. Temporal consistency is maintained by further conditioning the decoding process within a GAN on the previously generated frame. In this way a video is generated in an iterative and recurrent fashion. We illustrate the efficacy of our approach by showing state-of-the-art quantitative and qualitative performance on two benchmark datasets: TaiChi and Fashion Modeling. The latter was collected by us and is made publicly available to the community. We also show how our proposed method can be further improved by using a recent segmentation-mask-based architecture, such as SPADE, and how to battle temporal inconsistency in video synthesis using a temporal discriminator. Supplementary material available at: http://hdl.handle.net/2429/77282.
In this work, we address the problem of food ingredient detection from meal images, which is an intermediate step for generating cooking instructions. Although image-based object detection is a familiar task in computer vision and has been studied extensively in the last decades, the existing models are not suitable for detecting food ingredients. Normally objects in an image are explicit, but ingredients in food photos are most often invisible (integrated) and hence need to be inferred in a much more contextual manner. To this end, we explore an end-to-end neural framework with the core property of learning the relationships between ingredient pairs. We incorporate a Transformer module followed by a Gated Graph Attention Network (GGAT) to determine the ingredient list for the input dish image. This framework encodes ingredients in a contextual yet order-less manner. Furthermore, we validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Recipe1M dataset.
Autonomous systems deployed in human environments must have the ability to understand and anticipate the motion and behavior of dynamic targets. More specifically, predicting the future positions of agents and planning future actions based on these predictions is a key component of such systems. This is a challenging task because the motion behavior of each agent depends not only on its own goal intent, but also on the presence and actions of surrounding agents, social relations between agents, social rules and conventions, and environment characteristics such as topology and geometry. We are especially interested in the problem of human motion trajectory prediction in real-world, social environments where potential interactions affect the way people move. One such environment is a basketball game with dynamic and complex movements driven by various social interactions. In this work, we focus on player motion trajectory prediction in real basketball games. We view the problem of trajectory prediction as a sequence prediction task where our goal is to predict the future positions of players using their past positions. Following the success of recurrent neural network models for sequence prediction tasks, we investigate the ability of these models to predict motion trajectories of players. More specifically, we propose a graph-based pooling procedure that uses relation networks and incorporates it with long short-term memory networks. We study the effect of different graph structures on the accuracy of predictions. We evaluate the different variations of our model on three datasets: two publicly available pedestrian datasets of ETH and UCY, as well as a real-world basketball dataset. Our model outperforms vanilla LSTM and Social-LSTM baselines on both of these datasets.
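To make the sequence-prediction framing above concrete, here is a minimal sketch of how a single trajectory might be cut into (past positions, future positions) training pairs. The function names, the window lengths, and the average displacement error metric are illustrative assumptions; the abstract does not state which data layout or metric the thesis used.

```python
import math


def make_windows(trajectory, obs_len, pred_len):
    """Split one trajectory into (observed past, future target) pairs.

    `trajectory` is a list of (x, y) positions at consecutive time steps.
    Window lengths are hypothetical parameters chosen for illustration.
    """
    windows = []
    total = obs_len + pred_len
    for start in range(len(trajectory) - total + 1):
        past = trajectory[start:start + obs_len]
        future = trajectory[start + obs_len:start + total]
        windows.append((past, future))
    return windows


def average_displacement_error(predicted, target):
    """Mean Euclidean distance between matching predicted and true positions.

    Assumed metric for illustration only.
    """
    distances = [
        math.hypot(p[0] - t[0], p[1] - t[1])
        for p, t in zip(predicted, target)
    ]
    return sum(distances) / len(distances)


# A player moving one unit per step along the x-axis.
trajectory = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
windows = make_windows(trajectory, obs_len=3, pred_len=2)
```

A sequence model such as an LSTM would consume each `past` list and be trained to output the matching `future` list; the sketch covers only the data framing, not the graph-based pooling procedure.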
The problem of visual grounding has attracted much attention in recent years due to its pivotal role in more general visio-lingual high-level reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in this area, the performance of most approaches has been hindered by the precision of bounding box proposals obtained in the early stages of the recent pipelines. To address this limitation, we propose a general progressive query-guided bounding box refinement architecture (OptiBox) that regresses the output of a visual grounding system closer to the ground truth. We apply this architecture in the context of the GroundeR model and the One-Stage Grounding model. The results from the GroundeR model show that our model can provide an additional grounding accuracy gain for a two-stage grounding system. Further, our experiments show that the proposed model can significantly improve bounding box precision when the predicted box of a grounding system deviates from the ground truth.
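The idea of progressively moving a predicted box closer to the ground truth can be illustrated with a small sketch. This is not the OptiBox architecture: the `predict_delta` callable is a hypothetical stand-in for a learned, query-guided regressor, and the demo uses an oracle that moves halfway toward a known target purely to show how intersection-over-union (IoU) changes across refinement steps.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def refine_progressively(box, predict_delta, steps):
    """Apply a predicted coordinate offset to the box `steps` times.

    `predict_delta` is a hypothetical callable; in a real system it would
    be a trained model conditioned on the image and the query.
    """
    history = [tuple(box)]
    for _ in range(steps):
        delta = predict_delta(history[-1])
        history.append(tuple(b + d for b, d in zip(history[-1], delta)))
    return history


# Demo only: an oracle that moves halfway toward a known target box.
target = (5.0, 5.0, 15.0, 15.0)


def halfway_oracle(box):
    return tuple(0.5 * (t - b) for t, b in zip(target, box))


boxes = refine_progressively((0.0, 0.0, 10.0, 10.0), halfway_oracle, steps=3)
ious = [iou(b, target) for b in boxes]
```

Each entry of `ious` is larger than the one before it, which is the behavior a refinement module aims for when the initial prediction deviates from the ground truth.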
Visual attention mechanisms have proven to be important components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has been shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among attention variables, making it difficult to predict consistent attention masks. In this work we develop a novel structured spatial attention mechanism which is end-to-end trainable and can be integrated with any feed-forward convolutional neural network. This proposed AttentionRNN layer explicitly enforces structure over the spatial attention variables by sequentially predicting attention values in the spatial mask in a bi-directional raster-scan and inverse raster-scan order. As a result, each attention value depends not only on local image or contextual information, but also on the previously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks and datasets, including image categorization, question answering and image generation.
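The raster-scan dependency described above can be shown with a toy sketch. It is not the AttentionRNN layer: the scalar weights `w_local` and `w_prev`, the sigmoid update, and the averaging of the two scan directions are all assumptions made for illustration. The only property it shares with the description is that each attention value depends on the local feature and on the value predicted just before it, in both raster and inverse raster order.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scan(features, order, w_local, w_prev):
    """Predict attention values one cell at a time in the given order.

    Each value depends on the local feature and on the previously
    predicted value (0.0 before the first cell).
    """
    height, width = len(features), len(features[0])
    mask = [[0.0] * width for _ in range(height)]
    previous = 0.0
    for row, col in order:
        previous = sigmoid(w_local * features[row][col] + w_prev * previous)
        mask[row][col] = previous
    return mask


def bidirectional_attention(features, w_local=1.0, w_prev=1.0):
    """Average a raster-scan pass and an inverse raster-scan pass."""
    height, width = len(features), len(features[0])
    raster = [(r, c) for r in range(height) for c in range(width)]
    forward = scan(features, raster, w_local, w_prev)
    backward = scan(features, raster[::-1], w_local, w_prev)
    return [
        [0.5 * (forward[r][c] + backward[r][c]) for c in range(width)]
        for r in range(height)
    ]


mask = bidirectional_attention([[0.2, -1.0, 0.5], [1.5, 0.0, -0.3]])
```

Setting `w_prev=0.0` removes the dependency on earlier predictions and reduces the sketch to independent per-cell attention, which is the unstructured case the abstract contrasts against.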
Understanding images beyond salient actions involves reasoning about scene context, objects, and the roles they play in the captured event. Situation recognition has recently been introduced as the task of jointly reasoning about the verbs (actions) and a set of semantic-role and entity (noun) pairs in the form of action frames. Labeling an image with an action frame requires an assignment of values (nouns) to the roles based on the observed image content. Among the inherent challenges are the rich conditional structured dependencies between the output role assignments and the overall semantic sparsity. In this work, we propose a novel mixture-kernel attention graph neural network (GNN) architecture designed to address these challenges. Our GNN enables dynamic graph structure during training and inference, through the use of a graph attention mechanism, and context-aware interactions between role pairs. It also alleviates semantic sparsity by representing graph kernels using a convex combination of a learned basis. We illustrate the efficacy of our model and design choices by conducting experiments on the imSitu benchmark dataset, with accuracy improvements of up to 10% over the state of the art.
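A convex combination of basis kernels, as mentioned above, can be sketched in a few lines. The softmax parameterization of the mixing weights is an assumption (it is one common way to keep weights non-negative and summing to one); the abstract does not say how the weights are produced, and the tiny matrices here are placeholders rather than learned values.

```python
import math


def softmax(logits):
    """Map arbitrary scores to non-negative weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_kernels(basis_kernels, logits):
    """Return the convex combination of equally sized basis matrices."""
    weights = softmax(logits)
    rows, cols = len(basis_kernels[0]), len(basis_kernels[0][0])
    mixed = [[0.0] * cols for _ in range(rows)]
    for weight, kernel in zip(weights, basis_kernels):
        for r in range(rows):
            for c in range(cols):
                mixed[r][c] += weight * kernel[r][c]
    return mixed


# Placeholder basis: an identity matrix and an all-ones matrix.
identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
kernel = mix_kernels([identity, ones], logits=[0.0, 0.0])
```

Because many role pairs can share the same small basis and differ only in their mixing weights, rarely observed pairs still receive a well-defined kernel, which is one way to read the claim about alleviating semantic sparsity.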
In recent years, phrase (or more generally language) grounding has emerged as a fundamental task in computer vision. Phrase grounding is a generalization of more traditional computer vision tasks with the goal of localizing a natural language phrase spatially in a given image. Most recent work uses state-of-the-art deep learning techniques to achieve good performance on this task. However, these approaches do not capture complex dependencies among proposal regions and phrases that are crucial for superior performance on the task. In this work we try to overcome this limitation through a model that makes no assumptions regarding the underlying dependencies in both of the modalities. We present an end-to-end framework for grounding of the phrases in images that uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. This final representation is used to make the grounding decisions. The framework supports many-to-many matching and is able to ground a single phrase to multiple image regions and vice versa. We validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Flickr30k Entities dataset and the ReferIt Game dataset.
Sparse incidence tensors can represent a variety of structured data. For example, we may represent attributed graphs using their node-node, node-edge, or edge-edge incidence matrices. In higher dimensions, incidence tensors can represent simplicial complexes and polytopes. In this work, we formalize incidence tensors, analyze their structure, and present the family of equivariant networks that operate on them. We show that any incidence tensor decomposes into invariant subsets. This decomposition, in turn, leads to a decomposition of the corresponding equivariant layer that allows an efficient and intuitive pooling-and-broadcasting implementation, for both dense and sparse tensors. We demonstrate the effectiveness of this family of networks by reporting state-of-the-art results on graph learning tasks for many targets in the QM9 dataset.
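The three matrix representations of a graph named above can be built directly from an edge list. The sketch below uses a small undirected, unattributed example graph chosen for illustration; the function names are hypothetical, and the thesis's treatment of attributes, higher-order tensors, and equivariant layers is not covered.

```python
def node_node_incidence(num_nodes, edges):
    """Entry [i][j] is 1 when nodes i and j are joined by an edge."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


def node_edge_incidence(num_nodes, edges):
    """Entry [i][k] is 1 when node i is an endpoint of edge k."""
    matrix = [[0] * len(edges) for _ in range(num_nodes)]
    for k, (u, v) in enumerate(edges):
        matrix[u][k] = 1
        matrix[v][k] = 1
    return matrix


def edge_edge_incidence(edges):
    """Entry [k][l] is 1 when distinct edges k and l share a node."""
    count = len(edges)
    matrix = [[0] * count for _ in range(count)]
    for k in range(count):
        for l in range(count):
            if k != l and set(edges[k]) & set(edges[l]):
                matrix[k][l] = 1
    return matrix


# Example graph: a triangle on nodes 0, 1, 2 with a tail from node 2 to node 3.
EDGES = [(0, 1), (1, 2), (2, 0), (2, 3)]
node_node = node_node_incidence(4, EDGES)
node_edge = node_edge_incidence(4, EDGES)
edge_edge = edge_edge_incidence(EDGES)
```

Relabeling the nodes permutes the rows of `node_edge` and both the rows and columns of `node_node` without changing the graph, which is the kind of symmetry an equivariant network operating on these matrices is designed to respect.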