Leonid Sigal


Research Classification

Research Interests

Artificial Intelligence
Computer Science and Statistics
Parametric and Non-Parametric Inference
Computer Vision
Machine Learning
Semantic Recognition
Vision + Natural Language Processing
Visual Recognition and Understanding

Relevant Thesis-Based Degree Programs

Research Options

I am available and interested in collaborations (e.g. clusters, grants).
I am interested in and conduct interdisciplinary research.

Research Methodology

Deep Learning
Convolutional Neural Networks
Recurrent Neural Networks
Generative Models
Semi- and Weakly-supervised Learning
Transfer Learning
Structured Learning


Doctoral students
I support public scholarship, e.g. through the Public Scholars Initiative, and am available to supervise students and Postdocs interested in collaborating with external partners as part of their research.
I support experiential learning experiences, such as internships and work placements, for my graduate students and Postdocs.
I am open to hosting Visiting International Research Students (non-degree, up to 12 months).

Complete these steps before you reach out to a faculty member!

Check requirements
  • Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
  • Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.
Focus your search
  • Identify specific faculty members who are conducting research in your specific area of interest.
  • Establish that your research interests align with the faculty member’s research interests.
    • Read up on the faculty members in the program and the research being conducted in the department.
    • Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.
Make a good impression
  • Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
    • Do not send non-specific, mass emails to everyone in the department hoping for a match.
    • Address the faculty members by name. Your contact should be genuine rather than generic.
  • Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Be sure to craft compelling answers to these questions.
  • Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
  • Demonstrate that you are familiar with their research:
    • Convey the specific ways you are a good fit for the program.
    • Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
  • Be enthusiastic, but don’t overdo it.
Attend an information session

G+PS regularly provides virtual sessions that focus on admission requirements and procedures, along with tips on how to improve your application.




Graduate Student Supervision

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Layered controllable video generation (2023)

Advances in deep generative models have led to impressive results in image and video synthesis. However, synthesis of realistic images/videos, without the ability to control the depicted content in them, has limited practical utility. We introduce layered controllable video generation, where we, without any supervision, decompose the initial frame of a video into foreground and background layers, with which the user can control the video generation process by simply manipulating the foreground mask. The key challenges are the unsupervised foreground-background separation, which is ambiguous, and the ability to anticipate user manipulations with access to only raw video sequences. We address these challenges by proposing a two-stage learning procedure. In the first stage, with a rich set of losses and a dynamic foreground size prior, we learn how to separate the frame into foreground and background layers and, conditioned on these layers, how to generate the next frame using a VQ-VAE generator. In the second stage, we fine-tune this network to anticipate edits to the mask, by fitting a (parameterized) control to the mask from the future frame. We demonstrate the effectiveness of this learning and the more granular control mechanism, while illustrating state-of-the-art performance on two benchmark datasets.


Self-supervision through Random Segments with Autoregressive Coding (RandSAC) (2023)

Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this thesis, we explore the effect various design choices have on the success of applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across segments, predictions are sequential, similar to GPT. We illustrate that randomized serialization of the segments significantly improves the performance and results in a distribution over spatially-long (across-segment) and -short (within-segment) predictions which are effective for feature learning. We illustrate the pertinence of these design choices and explore alternatives on a number of datasets (e.g., CIFAR10, CIFAR100, ImageNet). While our pre-training strategy works with a vanilla Transformer, we also propose a conceptually simple, but highly effective, addition to the decoder that allows learnable skip-connections to the encoder’s feature layers, which further improves the performance.
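The segment-wise prediction order described in the abstract above can be pictured with a small visibility mask. This is only an illustrative sketch, not the thesis implementation: the function names (`random_segment_order`, `segment_visibility`), the flat (non-hierarchical) segments, and the six-token toy example are all assumptions made here. It assumes that a token may use only tokens from segments that come strictly earlier in the randomized serialization, so tokens within one segment are predicted in parallel and segments are predicted in sequence.

```python
import random


def random_segment_order(num_segments, seed=None):
    """Return a randomized serialization of segment ids (hypothetical helper)."""
    order = list(range(num_segments))
    random.Random(seed).shuffle(order)
    return order


def segment_visibility(segment_of_token, segment_order):
    """Build visible[i][j]: True when predicting token i may use token j.

    Assumption: token j is visible only if its segment comes strictly
    earlier than token i's segment in the serialization, so tokens that
    share a segment never see each other.
    """
    rank = {seg: pos for pos, seg in enumerate(segment_order)}
    n = len(segment_of_token)
    return [
        [rank[segment_of_token[j]] < rank[segment_of_token[i]] for j in range(n)]
        for i in range(n)
    ]


# Toy example: six tokens in three segments, serialized as 2 -> 0 -> 1.
segment_of_token = [0, 0, 1, 1, 2, 2]
order = [2, 0, 1]
mask = segment_visibility(segment_of_token, order)
```

In this toy example, the tokens of segment 2 see nothing, the tokens of segment 0 see only segment 2, and the tokens of segment 1 see both earlier segments. Calling `random_segment_order(3)` instead of fixing `order` would draw a different serialization on each call.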


Reinforcement learning in the presence of sensing costs (2022)

In recent years, reinforcement learning (RL) has become an increasingly popular framework for formalizing decision-making problems. Despite its popularity, the use of RL has remained relatively limited in challenging real-world scenarios, due to various unrealistic assumptions made about the environment, such as assuming sufficiently accurate models to train on in simulation, or no significant delays between the execution of an action and receiving the next observation. Such assumptions unavoidably make RL algorithms suffer from poor generalization. In this work, we aim to take a closer look at how incorporating realistic constraints impacts the behaviour of RL agents. In particular, we consider the cost in time and energy of making observations and taking a decision, which is an important aspect of natural environments that is typically overlooked in a traditional RL setup. As a first attempt, we propose to explicitly incorporate the cost of sensing the environment into the RL training loop, and analyze the emerging behaviours of the agent on a suite of simulated gridworld environments.
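One way to picture a sensing cost in an RL loop is a toy environment in which each action also carries a flag saying whether to pay for a fresh observation. The sketch below is an assumption-laden illustration, not the formulation used in the thesis: the class `SensingCostLine`, its 1-D layout, the cost values, and the fixed "sense every k steps" policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SensingCostLine:
    """Hypothetical 1-D gridworld: start at cell 0, goal at the last cell."""

    length: int = 5
    sensing_cost: float = 0.1  # assumed penalty for each fresh observation
    step_cost: float = 0.01    # assumed penalty for each action taken
    position: int = 0
    last_observation: int = 0

    def reset(self):
        self.position = 0
        self.last_observation = 0
        return self.last_observation

    def step(self, move, sense):
        """Apply move (-1 or +1); if sense is True, pay to observe the new state."""
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = (1.0 if done else 0.0) - self.step_cost
        if sense:
            reward -= self.sensing_cost
            self.last_observation = self.position
        # Without sensing, the agent only gets its stale observation back.
        return self.last_observation, reward, done


def run_episode(env, sense_every):
    """Walk right until the goal, sensing once every `sense_every` steps."""
    env.reset()
    total, done, t = 0.0, False, 0
    while not done:
        _, reward, done = env.step(+1, sense=(t % sense_every == 0))
        total += reward
        t += 1
    return total
```

Under these assumed costs, a policy that senses on every step collects less return than one that senses rarely, which is the kind of trade-off a learned agent would have to balance against acting on stale observations.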


Person in context synthesis with compositional structural space (2021)

Despite significant progress, controlled generation of complex images with interacting people remains difficult. Existing layout-to-image generation methods fall short of synthesizing realistic person instances, while pose-guided generation approaches focus on a single person and assume simple or known backgrounds. To tackle these limitations, we propose a new problem, Person in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts, with user control over both. The context is specified by the bounding box object layout, which lacks shape information, while the pose of the person(s) is specified by keypoints, which are sparse annotations. To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared compositional structural space, which encodes shape, location and appearance information for both context and person structures in a disentangled manner. This structural space is then decoded to the image space using a multi-level feature modulation strategy, and learned in a self-supervised manner from image collections and their corresponding inputs. Extensive experiments on two large-scale datasets (COCO-Stuff and Visual Genome) demonstrate that our framework outperforms state-of-the-art methods with respect to synthesis quality.


Consistent multiple sequence decoding (2020)

Sequence decoding is one of the core components of most visual-lingual models. However, typical neural decoders, when faced with decoding multiple, possibly correlated, sequences of tokens, resort to simple independent decoding schemes. In this work, we introduce a consistent multiple sequence decoding architecture, which, while relatively simple, is general and allows for consistent and simultaneous decoding of an arbitrary number of sequences. Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders. This context is then utilized as a secondary input, in addition to previously generated output, to make a prediction at a given step of decoding. Self-attention, in the GNN, is used to modulate the fusion mechanism locally at each node and each step in the decoding process. We show the efficacy of our consistent multiple sequence decoder on the task of dense relational captioning and illustrate state-of-the-art performance (improvement of 5.2% in mAP) on the task. More importantly, we illustrate that the decoded sentences, for the same regions, are more consistent (improvement of 9.5% in consistency score), while maintaining diversity across images and regions.


Generative adversarial networks for pose-guided human video generation (2020)

Generation of realistic high-resolution videos of human subjects is a challenging and important task in computer vision. In this thesis, we focus on human motion transfer -- generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video. Our GAN-based architecture, DwNet, leverages a dense intermediate pose-guided representation and a refinement process to warp the required subject appearance, in the form of the texture, from a source image into a desired pose. Temporal consistency is maintained by further conditioning the decoding process within a GAN on the previously generated frame. In this way a video is generated in an iterative and recurrent fashion. We illustrate the efficacy of our approach by showing state-of-the-art quantitative and qualitative performance on two benchmark datasets: TaiChi and Fashion Modeling. The latter was collected by us and is made publicly available to the community. We also show how our proposed method can be further improved by using a recent segmentation-mask-based architecture, such as SPADE, and how to battle temporal inconsistency in video synthesis using a temporal discriminator. Supplementary material available at: http://hdl.handle.net/2429/77282.


Graph-based food ingredient detection (2020)

In this work, we address the problem of food ingredient detection from meal images, which is an intermediate step for generating cooking instructions. Although image-based object detection is a familiar task in computer vision and has been studied extensively in the last decades, the existing models are not suitable for detecting food ingredients. Normally objects in an image are explicit, but ingredients in food photos are most often invisible (integrated) and hence need to be inferred in a much more contextual manner. To this end, we explore an end-to-end neural framework with the core property of learning the relationships between ingredient pairs. We incorporate a Transformer module followed by a Gated Graph Attention Network (GGAT) to determine the ingredient list for the input dish image. This framework encodes ingredients in a contextual yet order-less manner. Furthermore, we validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Recipe1M dataset.


Team LSTM: player trajectory prediction in basketball games using graph-based LSTM networks (2020)

Autonomous systems deployed in human environments must have the ability to understand and anticipate the motion and behavior of dynamic targets. More specifically, predicting the future positions of agents and planning future actions based on these predictions is a key component of such systems. This is a challenging task because the motion behavior of each agent depends not only on its own goal intent, but also on the presence and actions of surrounding agents, social relations between agents, social rules and conventions, and environment characteristics such as topology and geometry. We are especially interested in the problem of human motion trajectory prediction in real-world, social environments where potential interactions affect the way people move. One such environment is a basketball game with dynamic and complex movements driven by various social interactions. In this work, we focus on player motion trajectory prediction in real basketball games. We view the problem of trajectory prediction as a sequence prediction task where our goal is to predict the future positions of players using their past positions. Following the success of recurrent neural network models for sequence prediction tasks, we investigate the ability of these models to predict motion trajectories of players. More specifically, we propose a graph-based pooling procedure that uses relation networks and incorporates it with long short-term memory networks. We study the effect of different graph structures on the accuracy of predictions. We evaluate the different variations of our model on three datasets: two publicly available pedestrian datasets of ETH and UCY, as well as a real-world basketball dataset. Our model outperforms vanilla LSTM and Social-LSTM baselines on both of these datasets.


Visual grounding through iterative refinement (2020)

The problem of visual grounding has attracted much attention in recent years due to its pivotal role in more general visio-lingual high-level reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in this area, the performance of most approaches has been hindered by the precision of bounding box proposals obtained in the early stages of recent pipelines. To address this limitation, we propose a general progressive query-guided bounding box refinement architecture (OptiBox) that regresses the output of a visual grounding system closer to the ground truth. We apply this architecture in the context of the GroundeR model and the One-Stage Grounding model. The results from the GroundeR model show that our model can provide an additional grounding accuracy gain for a two-stage grounding system. Further, our experiments show that the proposed model can significantly improve bounding box precision when the predicted box of a grounding system deviates from the ground truth.


Enforcing structure in visual attention (2019)

Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has been shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among attention variables, making it difficult to predict consistent attention masks. In this work we develop a novel structured spatial attention mechanism which is end-to-end trainable and can be integrated with any feed-forward convolutional neural network. This proposed AttentionRNN layer explicitly enforces structure over the spatial attention variables by sequentially predicting attention values in the spatial mask in a bi-directional raster-scan and inverse raster-scan order. As a result, each attention value depends not only on local image or contextual information, but also on the previously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks and datasets, including image categorization, question answering and image generation.


Graph Neural Network for Situation Recognition (2019)

Understanding images beyond salient actions involves reasoning about scene context, objects, and the roles they play in the captured event. Situation recognition has recently been introduced as the task of jointly reasoning about the verbs (actions) and a set of semantic-role and entity (noun) pairs in the form of action frames. Labeling an image with an action frame requires an assignment of values (nouns) to the roles based on the observed image content. Among the inherent challenges are the rich conditional structured dependencies between the output role assignments and the overall semantic sparsity. In this work, we propose a novel mixture-kernel attention graph neural network (GNN) architecture designed to address these challenges. Our GNN enables dynamic graph structure during training and inference, through the use of a graph attention mechanism, and context-aware interactions between role pairs. It also alleviates semantic sparsity by representing graph kernels using a convex combination of a learned basis. We illustrate the efficacy of our model and design choices by conducting experiments on the imSitu benchmark dataset, with accuracy improvements of up to 10% over the state of the art.


Graph-based language grounding (2019)

In recent years, phrase (or more generally language) grounding has emerged as a fundamental task in computer vision. Phrase grounding is a generalization of more traditional computer vision tasks with the goal of localizing a natural language phrase spatially in a given image. Most recent works use state-of-the-art deep learning techniques to achieve good performance on this task. However, they do not capture complex dependencies among proposal regions and phrases that are crucial for superior performance on the task. In this work we try to overcome this limitation through a model that makes no assumptions regarding the underlying dependencies in either modality. We present an end-to-end framework for grounding phrases in images that uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. This final representation is used to make the grounding decisions. The framework supports many-to-many matching and is able to ground a single phrase to multiple image regions and vice versa. We validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Flickr30k Entities dataset and the ReferIt Game dataset.


Incidence networks for Geometric Deep Learning (2019)

Sparse incidence tensors can represent a variety of structured data. For example, we may represent attributed graphs using their node-node, node-edge, or edge-edge incidence matrices. In higher dimensions, incidence tensors can represent simplicial complexes and polytopes. In this work, we formalize incidence tensors, analyze their structure, and present the family of equivariant networks that operate on them. We show that any incidence tensor decomposes into invariant subsets. This decomposition, in turn, leads to a decomposition of the corresponding equivariant layer that allows an efficient and intuitive pooling-and-broadcasting implementation, for both dense and sparse tensors. We demonstrate the effectiveness of this family of networks by reporting state-of-the-art results on graph learning tasks for many targets in the QM9 dataset.




