James Little



Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - April 2022)
Pragmatic investigations of applied deep learning in computer vision applications (2021)

Deep neural networks have dominated performance benchmarks on numerous machine learning tasks. These models now power the core technology of a growing list of products such as Google Search, Google Translate, Apple Siri, and even Snapchat, to name a few. We first address two challenges in the real-world applications of deep neural networks in computer vision: data scarcity and prediction reliability. We present a new approach to data collection through synthetic data via video games that is cost-effective and can produce high-quality labelled training data on a large scale. We validate the effectiveness of synthetic data on multiple problems through cross-dataset evaluation and simple adaptive techniques. We also examine the reliability of neural network predictions in computer vision problems and show that these models are fragile on out-of-distribution test data. Motivated by statistical learning theory, we argue that it is necessary to detect out-of-distribution samples before relying on the predictions. To facilitate the development of reliable out-of-distribution sample detectors, we present a less biased evaluation framework. Using our framework, we thoroughly evaluate over ten methods drawn from data mining, deep learning, and Bayesian inference. We show that on real-world problems, none of the evaluated methods can reliably certify a prediction. Finally, we explore the applications of deep neural networks in high-resolution portrait production pipelines. We introduce AutoPortrait, a pipeline that performs professional-grade colour correction, portrait cropping, and portrait retouching in under two seconds. We release the first large-scale professional retouching dataset.
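As an illustration of the kind of detector such an evaluation framework would cover, the classic maximum-softmax-probability baseline flags a prediction as out-of-distribution when the model's top-class confidence is low. This is a minimal pure-Python sketch, not the thesis's framework, and the threshold value is illustrative.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def flag_out_of_distribution(logits, threshold=0.9):
    # Abstain (flag as OOD) when the top-class probability is low.
    # The threshold is illustrative; in practice it would be tuned
    # on held-out in-distribution data.
    return max(softmax(logits)) < threshold

print(flag_out_of_distribution([8.0, 0.5, 0.2]))  # False: confident prediction
print(flag_out_of_distribution([1.1, 1.0, 0.9]))  # True: near-uniform, abstain
```

The thesis's point is precisely that score-thresholding detectors of this kind fail to reliably certify predictions on real-world problems.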

View record

Algorithms for large-scale multi-codebook quantization (2019)

Combinatorial vector compression is the task of expressing a set of vectors as accurately as possible in terms of discrete entries in multiple bases. The problem is of interest in the context of large-scale similarity search, as it provides a memory-efficient, yet ready-to-use compact representation of high-dimensional data on which vector similarities such as Euclidean distances and dot products can be efficiently approximated. Combinatorial compression poses a series of challenging optimization problems that are often a barrier to its deployment on very large-scale systems (e.g., of over a billion entries). In this thesis we explore algorithms and optimization techniques that make combinatorial compression more accurate and efficient in practice, and thus provide a practical alternative to current methods for large-scale similarity search.
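The flavour of the problem can be sketched in a few lines: store each vector as one index into each of several codebooks, and reconstruct it as the sum of the selected codewords. The greedy residual encoder below is a toy stand-in for the joint optimization the thesis develops, and the codebooks are hypothetical.

```python
def encode(x, codebooks):
    # Greedily pick, for each codebook, the codeword closest to the
    # current residual (a toy stand-in for the joint optimization
    # studied in the thesis).
    residual = list(x)
    codes = []
    for cb in codebooks:
        best = min(range(len(cb)),
                   key=lambda j: sum((r - c) ** 2
                                     for r, c in zip(residual, cb[j])))
        codes.append(best)
        residual = [r - c for r, c in zip(residual, cb[best])]
    return codes

def decode(codes, codebooks):
    # Reconstruction is the sum of the selected codewords.
    out = [0.0] * len(codebooks[0][0])
    for cb, j in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[j])]
    return out

# Two hypothetical 2-D codebooks with two codewords each.
codebooks = [[[1.0, 0.0], [0.0, 1.0]], [[0.25, 0.0], [0.0, 0.25]]]
codes = encode([1.2, 0.1], codebooks)   # -> [0, 0]
approx = decode(codes, codebooks)       # -> [1.25, 0.0]
```

With, say, eight codebooks of 256 codewords each, a vector costs eight bytes of indices instead of four bytes per float dimension, and distances to a query can be approximated from precomputed query-to-codeword distances.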

View record

Computational single-image high dynamic range imaging (2018)

This thesis proposes solutions for increasing the dynamic range (DR)—the number of intensity levels—of a single image captured by a camera with a standard dynamic range (SDR). The DR in a natural scene is usually too high for SDR cameras to capture, even with optimum exposure settings. The intensity values of bright objects (highlights) that are above the maximum exposure capacity get clipped due to sensor over-exposure, while objects that are too dark (shades) appear dark and noisy in the image. Capturing a high number of intensity levels would solve this problem, but this is costly, as it requires the use of a camera with a high dynamic range (HDR). Reconstructing an HDR image from a single SDR image is difficult, if not impossible, to achieve for all imaging situations. For some situations, however, it is possible to restore the scene details, using computational imaging techniques. We investigate three such cases, which also occur commonly in imaging. These cases pose relaxed and well-posed versions of the general single-image high dynamic range imaging (HDRI) problem. The first case occurs when the scene has highlights that occupy a small number of pixels in the image; for example, night scenes. We propose the use of a cross-screen filter, installed at the lens aperture, to spread a small part of the light from the highlights across the rest of the image. In post-processing, we detect the spread-out brightness and use this information to reconstruct the clipped highlights. Second, we investigate the cases when highlights occupy a large part of the scene. The first method is not applicable here. Instead, we propose to apply a spatial filter at the sensor that locally varies the DR of the sensor. In post-processing, we reconstruct an HDR image. The third case occurs when the clipped parts of the image are not white but have a color. In such cases, we restore the missing image details in the clipped color channels by analyzing the scene information available in other color channels in the captured image. For each method, we obtain a maximum-a-posteriori estimate of the unknown HDR image by analyzing and inverting the forward imaging process.
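The third case can be illustrated per-pixel: when one colour channel saturates, its chromaticity can be borrowed from a nearby unclipped pixel of the same surface. This is a deliberately simplified sketch; the thesis instead obtains a maximum-a-posteriori estimate by inverting the full forward imaging model.

```python
def estimate_clipped_value(clipped_pixel, unclipped_neighbor, clip_level=1.0):
    # If the red channel saturated at clip_level but green did not,
    # borrow the red/green ratio from a nearby unclipped pixel of the
    # same surface and rescale.  Purely illustrative: the thesis uses
    # a MAP estimate over the whole image, not per-pixel ratios.
    r_c, g_c, b_c = clipped_pixel
    r_n, g_n, b_n = unclipped_neighbor
    if r_c >= clip_level and g_n > 0:
        r_est = g_c * (r_n / g_n)  # transfer the neighbour's chromaticity
        return (max(r_est, r_c), g_c, b_c)
    return clipped_pixel

# Red clipped at 1.0; the neighbour suggests red should be ~3x green.
restored = estimate_clipped_value((1.0, 0.5, 0.4), (0.9, 0.3, 0.25))
```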

View record

Towards automatic broadcast of team sports (2018)

Sports is the social glue of society as it allows people to interact with each other and appreciate games irrespective of their social status, age and ethnicity. Automatic sports broadcasting produces streaming video from vision sensors without human intervention. The goal is to predict where cameras should look and which camera should be on air. The technique can benefit millions of people, as most viewers participate in sports by watching TV or Internet broadcasting. The target team sports include basketball, soccer and ice hockey, in which team members quickly change their positions during the game, excluding sports like baseball and cricket in which team members have relatively stable positions. Automatic sports broadcasting covers areas of statistics, commentary, camera control and so on. We provide solutions for automatically setting camera parameters such as camera orientation angles and locations using computer vision. We restrict our attention to static pan-tilt-zoom (PTZ) cameras for television or live Internet broadcasting. We propose three essential components of autonomous broadcasting: camera calibration, planning and selection. By learning from human demonstrations, our work can predict camera angles for single-camera systems and camera viewpoints for multi-camera systems. We obtain human demonstrations from existing videos that are generated by professional camera operators. These videos contain camera angles and camera IDs if there are multiple cameras. Because camera angles are not directly available, we first propose two novel camera calibration methods. We evaluate and compare our methods with previous algorithms. Our methods are more accurate and faster than previous algorithms. With labeled data from human operators, we develop two methods for smooth camera planning which predict camera pan angles. The first method directly incorporates temporal consistency into a data-driven predictor. The second method optimizes the camera trajectory in overlapping temporal windows. We show they outperform previous methods in the literature. We also propose two methods for selecting a broadcast camera view from multiple candidate camera views. The first method uses deep features for camera selection. The second method augments the training data with Internet videos. We demonstrate results comparable to selections by human operators in soccer games.
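As a toy illustration of temporal smoothing for camera planning, a moving average over overlapping windows already removes frame-to-frame jitter from per-frame pan-angle predictions. This is only a crude stand-in for the trajectory optimization described in the thesis.

```python
def smooth_pan_angles(raw_angles, window=5):
    # Average per-frame pan-angle predictions (in degrees) over
    # overlapping temporal windows so the virtual camera does not
    # jitter.  A crude stand-in for proper trajectory optimization.
    half = window // 2
    smoothed = []
    for i in range(len(raw_angles)):
        lo, hi = max(0, i - half), min(len(raw_angles), i + half + 1)
        smoothed.append(sum(raw_angles[lo:hi]) / (hi - lo))
    return smoothed

# Jittery per-frame predictions are pulled toward a smooth trajectory.
print(smooth_pan_angles([0.0, 10.0, 0.0, 10.0, 0.0], window=3))
```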

View record

Towards large-scale nonparametric scene parsing of images and video (2017)

In computer vision, scene parsing is the problem of labelling every pixel in an image or video with its semantic category. Its goal is a complete and consistent semantic interpretation of the structure of the real world scene. Scene parsing forms a core component in many emerging technologies such as self-driving vehicles and prosthetic vision, and also informs complementary computer vision tasks such as depth estimation. This thesis presents a novel nonparametric scene parsing framework for images and video. In contrast to conventional practice, our scene parsing framework is built on nonparametric search-based label transfer instead of discriminative classification. We formulate exemplar-based scene parsing for both 2D (from images) and 3D (from video), and demonstrate accurate labelling on standard benchmarks. Since our framework is nonparametric, it is easily extensible to new categories and examples as the database grows. Nonparametric scene parsing is computationally demanding at test time, and requires methods for searching large collections of data that are time and memory efficient. This thesis also presents two novel binary encoding algorithms for large-scale approximate nearest neighbor search: the bank of random rotations is data independent and does not require training, while the supervised sparse projections algorithm targets efficient search of high-dimensional labelled data. We evaluate these algorithms on standard retrieval benchmarks, and then demonstrate their integration into our nonparametric scene parsing framework. Using 256-bit codes, binary encoding reduces search times by an order of magnitude and memory requirements by three orders of magnitude, while maintaining a mean per-class accuracy within 1% on the 3D scene parsing task.
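Data-independent binary encoding can be sketched with random-hyperplane hashing, a simpler cousin of the bank of random rotations: each bit is the sign of the vector's projection onto a random direction, and search reduces to Hamming distance on compact codes. This sketch is illustrative, not the thesis's exact construction.

```python
import random

def make_hash(dim, n_bits, seed=0):
    # One random direction per bit: data independent and training-free,
    # in the spirit of the bank of random rotations described above.
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_bits)]
    def h(x):
        bits = 0
        for i, p in enumerate(planes):
            if sum(a * b for a, b in zip(x, p)) >= 0.0:
                bits |= 1 << i
        return bits
    return h

def hamming(a, b):
    # Distance between two binary codes = number of differing bits.
    return bin(a ^ b).count("1")

h = make_hash(dim=8, n_bits=32, seed=1)
x = [0.5, -1.0, 0.25, 2.0, -0.75, 0.1, -0.2, 1.5]
# Codes depend only on signs of projections, so they are invariant
# to positive scaling of the input vector.
assert h(x) == h([2.0 * v for v in x])
```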

View record

Using Unlabeled 3D Motion Examples for Human Activity Understanding (2016)

We demonstrate how a large collection of unlabeled motion examples can help us in understanding human activities in a video. Recognizing human activity in monocular videos is a central problem in computer vision with wide-ranging applications in robotics, sports analysis, and healthcare. Obtaining annotated data to learn from videos in a supervised manner is tedious, time-consuming, and not scalable to a large number of human actions. To address these issues, we propose an unsupervised, data-driven approach that relies only on 3D motion examples in the form of human motion capture sequences. The first part of the thesis deals with adding view-invariance to the standard action recognition task, i.e., identifying the class of activity given a short video sequence. We learn a view-invariant representation of human motion from 3D examples by generating synthetic features. We demonstrate the effectiveness of our method on a standard dataset with results competitive with the state of the art. Next, we focus on the problem of 3D pose estimation in realistic videos. We present a non-parametric approach that does not rely on a motion model built for a specific action. Thus, our method can deal with video sequences featuring multiple actions. We test our 3D pose estimation pipeline on a challenging professional basketball sequence.

View record

Where Did It Go? Regaining a Lost Target for Robot Visual Servoing (2016)

When a robotic visual servoing/tracking system loses sight of its target, the servo fails due to loss of input. To resolve this problem, a search method is required to generate efficient actions and bring the target back into the camera field of view (FoV) as soon as possible. For high-dimensional platforms like a camera-mounted manipulator, an eye-in-hand system, such a search must address the difficult challenge of generating efficient actions in an online manner while avoiding visibility and kinematic constraints. This work considers two common scenarios of visual servoing/tracking failure: when the target leaves the camera FoV, and when visual occlusions (occlusions, for brevity) disrupt the process. To handle the first scenario, a novel algorithm called lost target search (LTS) is introduced to plan efficient sensor actions online. To handle the second scenario, an improved algorithm called the lost target recovery algorithm (LTRA) allows a robot to look behind an occluder during active visual search and re-acquire its target in an online manner. The overall algorithm is then implemented on a telepresence platform to evaluate the necessity and efficacy of autonomous occlusion handling for remote users. Occlusions can occur when users in remote locations are engaged in physical collaborative tasks, which can lead to frustration and inefficient collaboration. Therefore, two human-subjects experiments are conducted (N=20 and 36, respectively) to investigate the following interlinked research questions: a) what are the impacts of occlusion on telepresence collaborations, and b) can autonomous handling of occlusions improve the telepresence collaboration experience for remote users? Results from the first experiment demonstrate that occlusions introduce a significant social interference that requires collaborators to reorient or reposition themselves. Results from the second experiment then indicate that the use of an autonomous controller yields a remote user experience that is more comparable (in terms of vocal non-verbal behaviors, task performance and perceived workload) to collaborations performed by two co-located parties. These contributions represent a step forward in making robots more autonomous and user-friendly while interacting with human co-workers, a necessary next step for the successful adoption of robots in human environments.

View record

Improving object detection using 3D spatial relationships (2013)

Reliable object detection is one of the most significant hurdles that must be overcome to develop useful household robots. Overall, the goal of this work is to demonstrate how effective 3D qualitative spatial relationships can be for improving object detection. We show that robots can utilize 3D qualitative spatial relationships to improve object detection by differentiating between true and false positive detections. The main body of the thesis focuses on an approach for improving object detection rates that identifies the most likely subset of 3D detections using seven types of 3D relationships and adjusts detection confidence scores to improve the average precision. These seven 3D qualitative spatial relationships are adapted from 2D qualitative spatial reasoning techniques. We learn a model for identifying the most likely subset using a structured support vector machine [Tsochantaridis et al., 2004] from examples of 3D layouts of objects in offices and kitchens. We produce 3D detections from 2D detections using a fiducial marker and images of a scene, and show our model is successful at significantly improving overall detection rates on real-world scenes of both offices and kitchens. After the real-world results, we test our method on synthetic detections where the properties of the 3D detections are controlled. Our approach improves on the model it was based upon, that of [Desai et al., 2009], by utilizing a branch-and-bound tree search to improve both training and inference. Our model relies on sufficient true positive detections in the training data or good localization of the true positive detections. Finally, we analyze the cumulative benefits of the spatial relationships and determine that the most effective spatial relationships depend on both the scene type and localization accuracy. We demonstrate that there is no one relationship that is sufficient on its own or always outperforms others, and that a mixture of relationships is always useful.
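The rescoring idea can be sketched as follows: each detection's confidence is adjusted by its pairwise spatial compatibility with the other detections in the scene. The greedy version below is only a stand-in for the structured SVM and branch-and-bound inference used in the thesis, and the pair scores are hypothetical.

```python
def rescore(detections, pair_score):
    # detections: list of (label, confidence) pairs placed in 3D.
    # Each confidence is boosted by summed pairwise spatial-relation
    # compatibility with the other detections -- a greedy stand-in
    # for the structured SVM + branch-and-bound inference above.
    out = []
    for i, (lab_i, conf_i) in enumerate(detections):
        bonus = sum(pair_score(lab_i, lab_j)
                    for j, (lab_j, _) in enumerate(detections) if j != i)
        out.append((lab_i, conf_i + bonus))
    return out

# Hypothetical compatibility: monitors and keyboards co-occur on desks.
def office_pairs(a, b):
    return 0.2 if {a, b} == {"monitor", "keyboard"} else 0.0

rescored = rescore([("monitor", 0.8), ("keyboard", 0.6)], office_pairs)
```

A true-positive keyboard near a monitor gains confidence, while an isolated false positive would not.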

View record

Visual object recognition for mobile platforms (2013)

A robot must recognize objects in its environment in order to complete numerous tasks. Significant progress has been made in modeling visual appearance for image recognition, but the performance of current state-of-the-art approaches still falls short of that required by applications. This thesis describes visual recognition methods that leverage the spatial information sources available on-board mobile robots, such as the position of the platform in the world and the range data from its sensors, in order to significantly improve performance. Our research includes: a physical robotic platform that is capable of state-of-the-art recognition performance; a re-usable data set that facilitates study of the robotic recognition problem by the scientific community; and a three-dimensional object model that demonstrates improved robustness to clutter. Based on our 3D model, we describe algorithms that integrate information across viewpoints, relate objects to auxiliary 3D sensor information, plan paths to next-best-views, explicitly model object occlusions and reason about the sub-parts of objects in 3D. Our approaches have been proven experimentally on-board the Curious George robot platform, which placed first in an international object recognition challenge for mobile robots for several years. We have also collected a large set of visual experiences from a robot, annotated the true objects in this data and made it public to the research community for use in performance evaluation. A path planning system derived from our model has been shown to hasten confident recognition by allowing informative viewpoints to be observed quickly. In each case studied, our system demonstrates significant improvements in recognition rate, in particular on realistic cluttered scenes, which promises more successful task execution for robotic platforms in the future.

View record

Active Exploration of Training Data for Improved Object Detection (2012)

This thesis concerns the problem of object detection, which is defined as finding all instances of an object class of interest and fitting each of them with a tight bounding window. This seemingly easy task for humans is still extremely difficult for machines. However, recent advances in object detection have enabled machines to categorize many classes of objects. Statistical models are often used for representing an object class of interest. These models learn from extensive training sets and generalize with low error rates to unseen data in a highly generic manner. But these statistical methods have a major drawback in that they require a large amount of training data. We approach this problem by making the process of acquiring labels less tedious and less costly by reducing human labelling effort. Throughout this thesis, we explore means of efficient label acquisition for realizing cheaper training, faster development time, and higher performance of object detectors. We use active learning with our novel interface to combine machine intelligence with human interventions, and effectively improve a state-of-the-art classifier by using additional unlabelled images from the Web. As the approach relies on a small amount of label input from a human oracle, there is still room to further reduce the amount of human effort. An ideal solution is, if possible, to have no humans involved in labelling novel data. Given a sparsely labelled video that contains very few labels, our novel self-learning approach achieves automatic acquisition of additional labels from the unlabelled portion of the video. Our approach combines colour segmentation, object detection and tracking in order to discover potential labels from novel data. We empirically show that our self-learning approach improves the performance of models that detect players in broadcast footage of sports games.

View record

Learning to track and identify players from broadcast sports videos (2012)

Tracking and identifying players in sports videos filmed with a single pan-tilt-zoom camera has many applications, but it is also a challenging problem. This thesis introduces the first intelligent system that tackles this difficult task. The system detects and tracks multiple players, estimates the homography between video frames and the court, and identifies the players. The tracking system is based on the tracking-by-detection philosophy. We first localize players using a player detector, categorize detections based on team colors, and then group them into tracks of specific players. Instead of using visual cues to distinguish between players, we instead rely on their short-term motion patterns. The homography estimation is solved by using a variant of the Iterative Closest Point (ICP) algorithm. Unlike most existing algorithms that rely on matching robust feature points, we propose to match edge points in two images. In addition, we also introduce a technique to update the model online to accommodate logos and patterns in different stadiums. The identification system utilizes both visual and spatial cues, and exploits both temporal and mutual exclusion constraints in a Conditional Random Field. In addition, we propose a novel Linear Programming Relaxation algorithm for predicting the best player identification in a video clip. In order to reduce the amount of labeled training data required to learn the identification system, we pioneer the use of weakly supervised learning with the assistance of play-by-play texts. Experiments show promising results in tracking, homography estimation, and identification. Moreover, weakly supervised learning with play-by-play texts greatly reduces the amount of labeled training data required. Experiments show that we can use weakly supervised learning with merely 200 labels to achieve accuracies similar to a strongly supervised approach, which requires at least 20,000 labels.

View record

Navigation and obstacle avoidance help (NOAH) for elderly wheelchair users with cognitive impairment in long term care (2012)

Cognitive impairments prevent older adults from using powered wheelchairs because of safety concerns, thus reducing mobility and resulting in increased dependence on caregivers. An intelligent powered wheelchair system (NOAH) is proposed to help restore mobility, while ensuring safety. Machine vision and learning techniques are described to help prevent collisions with obstacles, and provide reminders and navigation assistance through adaptive audio prompts. The intelligent wheelchair is initially tested in various controlled environments and simulated scenarios. Finally, the system is tested with older adults with mild-to-moderate cognitive impairment through a single-subject research design. Results demonstrate the high diversity of the target population, and highlight the need for customizable assistive technologies that account for the varying capabilities and requirements of the intended users. We show that the collision avoidance module is able to improve safety for all users by lowering the number of frontal collisions. In addition, the wayfinding module assists users in navigating along shorter routes to the destination. Prompting accuracy is found to be quite high during the study. While compliance with correct prompts is high across all users, we notice a distinct difference in the rates of compliance with incorrect prompts. Results show that users who are unsure about the optimal route rely more heavily on system prompts for assistance, and thus are able to improve their wayfinding performance by following correct prompts. Improvements in wheelchair position estimation accuracy and joystick usability will help improve user performance and satisfaction. Further user studies will help refine user needs and hopefully allow us to increase the mobility and independence of several elderly residents.

View record

Master's Student Supervision (2010 - 2021)
Hierarchical part-based disentanglement of pose and appearance (2021)

Landmarks and keypoints are an important intermediate representation for image understanding and reconstruction. Although many supervised approaches exist, they require labels in the target domain; such labels exist for humans, but only for sparse keypoints, and not for the breadth of object and animal classes present in our rich world. We propose a self-supervised approach for discovering landmarks from unstructured image collections by disentangling the pose and appearance of object parts. In particular, we propose a hierarchical structure that helps to find more meaningful keypoint locations. We demonstrate that our simplifications and hierarchical extensions of prior work are effective, in terms of quantitative 2D keypoint estimation and qualitative image modification operations when applied to persons. Our approach eases the discovery of objects and their parts in domains for which no labeled data exist, and thereby eases downstream tasks such as keypoint estimation, behavior classification for neuroscience applications, and intuitive image editing.

View record

Spatio-temporal relational reasoning for video question answering (2020)

Video question answering is the task of automatically answering questions about videos. Apart from direct practical interest, it provides a good way to benchmark our progress on various tasks in video understanding. A successful algorithm must ground objects of interest and model relationships among them in both the spatial and temporal domains jointly. We show that the existing state-of-the-art approaches, which are based on Convolutional Neural Networks or Recurrent Neural Networks, are not effective at joint reasoning in both spatial and temporal domains. Moreover, they are short-sighted and struggle with long-range dependencies in videos. To address these challenges, we present a novel spatio-temporal reasoning neural module that models complex multi-entity relationships in space and long-term dependencies in time. Our model captures both time-changing object interactions and action dynamics of individual objects in an effective way. We evaluate our module on two benchmark datasets which require spatio-temporal reasoning: TGIF-QA and SVQA. We achieve state-of-the-art performance on both datasets. More significantly, we achieve substantial improvements in some of the most challenging question types, like counting, which demonstrate the effectiveness of our proposed spatio-temporal relational module.

View record

Team LSTM: player trajectory prediction in basketball games using graph-based LSTM networks (2020)

Autonomous systems deployed in human environments must have the ability to understand and anticipate the motion and behavior of dynamic targets. More specifically, predicting the future positions of agents and planning future actions based on these predictions is a key component of such systems. This is a challenging task because the motion behavior of each agent depends not only on its own goal intent, but also on the presence and actions of surrounding agents, social relations between agents, social rules and conventions, and environment characteristics such as topology and geometry. We are especially interested in the problem of human motion trajectory prediction in real-world, social environments where potential interactions affect the way people move. One such environment is a basketball game, with dynamic and complex movements driven by various social interactions. In this work, we focus on player motion trajectory prediction in real basketball games. We view the problem of trajectory prediction as a sequence prediction task where our goal is to predict the future positions of players using their past positions. Following the success of recurrent neural network models for sequence prediction tasks, we investigate the ability of these models to predict the motion trajectories of players. More specifically, we propose a graph-based pooling procedure that uses relation networks, and we incorporate it into long short-term memory networks. We study the effect of different graph structures on the accuracy of predictions. We evaluate the different variations of our model on three datasets: the two publicly available pedestrian datasets ETH and UCY, and a real-world basketball dataset. Our model outperforms vanilla LSTM and Social-LSTM baselines on these datasets.
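The graph-based pooling step can be illustrated in isolation: each player's context vector is an aggregate of its neighbours' hidden states under a chosen graph structure. The sketch below uses a plain mean over neighbours; the thesis's model instead learns the aggregation with relation networks inside an LSTM.

```python
def graph_pool(hidden, edges):
    # hidden: {player_id: feature vector}; edges: directed (i, j)
    # pairs meaning "j is a neighbour of i" in the chosen graph.
    # Each player's pooled context is the mean of its neighbours'
    # hidden states -- a simplified form of the graph-based pooling
    # described above.
    pooled = {}
    for i, feat in hidden.items():
        neigh = [hidden[j] for a, j in edges if a == i]
        if neigh:
            pooled[i] = [sum(vs) / len(vs) for vs in zip(*neigh)]
        else:
            pooled[i] = [0.0] * len(feat)
    return pooled

hidden = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
edges = [(0, 1), (0, 2)]              # player 0 attends to players 1 and 2
context = graph_pool(hidden, edges)   # context[0] == [1.0, 1.5]
```

Different edge sets (fully connected, team-only, nearest-neighbour) correspond to the different graph structures whose effect the thesis studies.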

View record

Visual grounding through iterative refinement (2020)

The problem of visual grounding has attracted much attention in recent years due to its pivotal role in more general visuo-linguistic high-level reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in this area, the performance of most approaches has been hindered by the precision of bounding box proposals obtained in the early stages of recent pipelines. To address this limitation, we propose a general progressive query-guided bounding box refinement architecture (OptiBox) that regresses the output of a visual grounding system closer to the ground truth. We apply this architecture in the context of the GroundeR model and the One-Stage Grounding model. The results from the GroundeR model show that our model can provide an additional grounding accuracy gain for a two-stage grounding system. Further, our experiments show that the proposed model can significantly improve bounding box precision when the predicted box of a grounding system deviates from the ground truth.

View record

Group event recognition in ice hockey (2019)

With the success of deep learning in the computer vision community, most approaches for group activity recognition in sports started relying on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). However, how to model the interactions among players, and the interactions between players and the scene, remains a challenging problem. In order to better model these interactions, we propose two models. Our first model combines the features of all players in a scene through an attention mechanism. The aggregated feature is then concatenated with the feature of the frame and passed through an RNN to generate the final prediction. In our second model, we design a spatial grid feature and a temporal grid feature calculated from the appearance features and motion features of all players in a scene, as well as their locations. We then apply CNNs to the spatial grid feature, the temporal grid feature, the target frame of the scene (the frame at which the event happens), and the stack of optical flow containing the target frame separately. Results from the four streams are fused through score fusion to make the final prediction. The inputs to our models are the target frame image, a stack of optical flow images, bounding boxes of players, and coordinates of players calculated from the homography matrix of the frame. We evaluate the two models on an Ice Hockey dataset, and both produce promising results. We also provide a possible solution for event detection in a more general setting.

View record

Understanding the sources of error for 3D human pose estimation from monocular images and videos (2018)

With the success of deep learning in the field of computer vision, most state-of-the-art approaches to estimating 3D human pose from images or videos rely on training a network end-to-end to regress 3D joint locations or heatmaps from an RGB image. Although most of these approaches provide good results, the major sources of error are often difficult to understand. The errors may come either from incorrect 2D pose estimation or from the incorrect mapping of features from 2D to 3D. In this work, we aim to understand the sources of error in estimating 3D pose from images and videos. Therefore, we have built three different systems. The first takes the 2D joint locations of every frame individually as input and predicts 3D joint positions. To our surprise, we found that by using a simple feed-forward fully connected network with residual connections, ground truth 2D joint locations can be mapped to 3D space at a remarkably low error rate, outperforming the best reported result by almost 30% on the Human3.6M dataset, the largest publicly available dataset of motion capture data. Furthermore, training this network on the outputs of an off-the-shelf 2D pose detector gives us state-of-the-art results when compared with a vast array of systems trained end-to-end. To validate the efficacy of this network, we also trained an end-to-end system that takes an image as input and regresses 3D pose directly. We found that it is harder to train the network end-to-end than to decouple the task. To examine whether temporal information over a sequence improves results, we built a sequence-to-sequence network that takes a sequence of 2D poses as input and predicts a sequence of 3D poses as output. We found that the temporal information improves the results over our first system. We argue that a large portion of the error of 3D pose estimation systems results from error in 2D pose estimation.
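The lifting network's basic unit can be written in a few lines: a residual block of two linear layers with a ReLU, applied to a flattened vector of 2D joints on the way to 3D joint positions. Batch normalization and dropout are omitted, and the weights here are placeholders, not trained parameters.

```python
def linear(x, W, b):
    # y = Wx + b, with the weight matrix given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU and a skip connection: the basic
    # unit of the 2D-to-3D lifting network described above (batch
    # normalization and dropout omitted for brevity).
    h = relu(linear(x, W1, b1))
    return [a + b for a, b in zip(x, linear(h, W2, b2))]

# With an identity first layer and a zero second layer, the block
# reduces to the identity mapping through the skip connection.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
out = residual_block([1.0, -2.0], W1, b1, W2, b2)   # -> [1.0, -2.0]
```

In the full system, the input would be the 2J-dimensional flattened 2D joint coordinates and the final layer would map to 3J outputs.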

View record
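As a rough illustration of the simple lifting network described above, here is a minimal numpy sketch of a feed-forward fully connected network with residual connections that maps 2D joint locations to 3D. The joint count, layer width, and random (untrained) weights are illustrative assumptions, not the thesis's exact architecture.

```python
import numpy as np

N_JOINTS = 16                         # illustrative joint count
D_IN, D_HID, D_OUT = 2 * N_JOINTS, 64, 3 * N_JOINTS

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    # small random weights stand in for trained parameters
    return rng.normal(0, 0.1, (d_in, d_out)), np.zeros(d_out)

W_in, b_in = linear(D_IN, D_HID)
W1, b1 = linear(D_HID, D_HID)
W2, b2 = linear(D_HID, D_HID)
W_out, b_out = linear(D_HID, D_OUT)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(h, Wa, ba, Wb, bb):
    # two linear+ReLU layers with a skip connection, as in simple
    # "lifting" networks that map 2D joints to 3D
    return h + relu(relu(h @ Wa + ba) @ Wb + bb)

def lift_2d_to_3d(joints_2d):
    """joints_2d: (batch, 2*N_JOINTS) -> (batch, 3*N_JOINTS)."""
    h = relu(joints_2d @ W_in + b_in)
    h = residual_block(h, W1, b1, W2, b2)
    return h @ W_out + b_out

pose_2d = rng.normal(size=(4, D_IN))  # a batch of 4 dummy 2D poses
pose_3d = lift_2d_to_3d(pose_2d)
print(pose_3d.shape)                  # (4, 48)
```

A trained version of this shape of network is what the thesis compares against end-to-end image-to-3D regression.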

Classification of puck possession events in ice hockey (2017)

Group activity recognition in sports is often challenging due to the complex dynamics and interactions among players. In this thesis, we propose a deep architecture to classify puck possession events in ice hockey. Our model consists of three distinct phases: feature extraction, feature aggregation, and learning and inference. For feature extraction and aggregation, we use a Convolutional Neural Network (CNN) followed by a late fusion model to extract and aggregate different types of features, including handcrafted homography features that encode the camera information. The output of the CNN is then passed into a Recurrent Neural Network (RNN) for the temporal extension and classification of the events. The proposed model captures context information from the frame features as well as the homography features. The individual attributes of the players and the interactions among them are also incorporated using a pre-trained model and team pooling. Our model requires only the player positions on the image and the homography matrix, and does not need any explicit annotations for individual actions or player trajectories, greatly simplifying the input to the system. We evaluate our model on a new Ice Hockey Dataset and a Volleyball Dataset. Experimental results show that our model produces promising results on both of these challenging datasets with much simpler inputs than previous work.

View record

Multiview Depth-based Pose Estimation (2016)

Commonly used human motion capture systems require the intrusive attachment of markers that are visually tracked with multiple cameras. In this work we present an efficient and inexpensive solution to markerless motion capture using only a few Kinect sensors. We use our system to design a smart home platform with a network of Kinects installed inside the house. Our first contribution is a multiview pose estimation system. Unlike previous work on 3D pose estimation using a single depth camera, we relax constraints on the camera location and do not assume a co-operative user. We apply recent image segmentation techniques with convolutional neural networks to depth images and use curriculum learning to train our system on purely synthetic data. Our method accurately localizes body parts without requiring an explicit shape model. The body joint locations are then recovered by combining evidence from multiple views in real time. Our second contribution is a dataset of 6 million synthetic depth frames for pose estimation from multiple cameras, with varying levels of complexity to make curriculum learning possible. We show the efficacy and applicability of our data generation process through various evaluations. Our final system exceeds the state-of-the-art results on multiview pose estimation on the Berkeley MHAD dataset. Our third contribution is a scalable software platform to coordinate Kinect devices in real time over a network. We use various compression techniques and develop software services that allow communication with multiple Kinects through TCP/IP. The flexibility of our system allows real-time orchestration of up to 10 Kinect devices over Ethernet.

View record
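The multiview step above combines per-view evidence into final joint locations. A minimal sketch of one such fusion rule, a confidence-weighted mean of per-view 3D estimates already expressed in a shared world frame, is shown below; the shapes and the weighting rule are assumptions for illustration, not the thesis's exact method.

```python
import numpy as np

def fuse_views(estimates, confidences):
    """Confidence-weighted mean of per-view joint estimates.

    estimates: (n_views, n_joints, 3) 3D points in a shared world frame
    confidences: (n_views, n_joints) per-view, per-joint weights
    """
    w = confidences / confidences.sum(axis=0, keepdims=True)
    return (w[..., None] * estimates).sum(axis=0)

views = np.stack([
    [[0.0, 0.0, 2.0]],          # view 1 sees the joint at depth 2.0
    [[0.0, 0.0, 2.2]],          # view 2, slightly different depth
])
conf = np.array([[3.0], [1.0]])  # view 1 is trusted 3x more
fused = fuse_views(views, conf)
print(fused)                     # ~ [[0, 0, 2.05]]
```

Weighting by per-view confidence lets a camera with a clear, unoccluded view dominate the estimate for a given joint.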

Compositional Compression of Deep Image Features Using Stacked Quantizers (2015)

In computer vision, it is common for image representations to be stored as high-dimensional real-valued vectors. In many computer vision applications, such as retrieval, classification, registration and reconstruction, the computational bottleneck arises in a process known as feature matching, where, given a query vector, a similarity score has to be computed against many vectors in a (potentially very large) database. For example, it is not uncommon for object retrieval and classification to be performed by matching global representations in collections with thousands or millions of images. A popular approach to reducing the computational and memory requirements of this process is vector quantization. In this work, we first analyze several vector compression methods typically used in the computer vision literature in terms of their computational trade-offs. In particular, we observe that Product Quantization (PQ) and Additive Quantization (AQ) lie at the extremes of a compositional vector compression design choice, where the former assumes complete codebook independence and the latter assumes full codebook dependence. We explore an intermediate approach that exploits a hierarchical structure in the codebooks. This results in a method that is largely competitive with AQ on structured vectors, and outperforms AQ on unstructured vectors while being several orders of magnitude faster. We also perform an extensive evaluation of our method on standard benchmarks of Scale Invariant Feature Transform (SIFT) and GIST descriptors, as well as on new datasets of features obtained from state-of-the-art convolutional neural networks. On benchmarks of low-dimensional deep features, our approach obtains the best results known to date, often requiring less than half the memory of PQ to achieve the same performance.

View record
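The intermediate design described above, where each codebook depends on what earlier codebooks leave unexplained, can be sketched as residual quantization: each stage runs k-means on the residual left by the previous stages, and a vector is encoded as one codeword index per stage. The codebook count and size below are illustrative, and the tiny k-means is a stand-in for a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=10):
    # tiny Lloyd's k-means: random init from the data, then refine
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        a = d.argmin(1)
        for j in range(k):
            if (a == j).any():
                C[j] = X[a == j].mean(0)
    return C

def train_stacked(X, n_stages=4, k=16):
    # learn each codebook on the residual left by the previous stages
    codebooks, R = [], X.copy()
    for _ in range(n_stages):
        C = kmeans(R, k)
        a = ((R[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        R = R - C[a]
        codebooks.append(C)
    return codebooks

def encode(x, codebooks):
    # one codeword index per stage, chosen greedily on the residual
    codes, r = [], x.copy()
    for C in codebooks:
        j = ((C - r) ** 2).sum(-1).argmin()
        codes.append(j)
        r = r - C[j]
    return codes

def decode(codes, codebooks):
    # reconstruction = sum of one codeword per stage
    return sum(C[j] for C, j in zip(codebooks, codes))

X = rng.normal(size=(500, 8))
books = train_stacked(X)
err = np.mean([np.sum((x - decode(encode(x, books), books)) ** 2) for x in X])
base = np.mean([np.sum((x - X.mean(0)) ** 2) for x in X])
print(err < base)   # quantization error well below the trivial baseline
```

With independent codebooks per subspace this collapses to PQ-style encoding, while jointly optimized codebooks give AQ; the sequential residual scheme sits between the two.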

Towards human pose estimation in video sequences (2014)

Recent advances in human pose estimation from single images have attracted wide interest from the computer vision community. However, the problem of pose estimation from monocular video sequences is largely under-represented in the literature despite its wide range of applications, such as action recognition and human-computer interaction. In this thesis we present two novel algorithms for video pose estimation that demonstrate how one can improve the performance of a state-of-the-art single-image articulated human detection algorithm on realistic video sequences. Furthermore, we release the UCF Sports Pose dataset, containing full-body pose annotations of people performing various actions in realistic videos, together with a novel pose evaluation metric that better reflects the performance of the current state of the art. We also release the Video Pose Annotation tool, a highly customizable application that we used to construct the dataset. Finally, we introduce a task-based abstraction for human pose estimation that selects the "best" algorithm for each specific instance based on a task description, defined using an application programming interface that covers a large part of the human pose estimation domain.

View record

Automatic Basketball Tracking in Broadcast Video (2012)

We proposed and implemented an automatic basketball detection and tracking system for broadcast basketball video recorded with a single pan-tilt-zoom camera, using knowledge of player tracking information. The task is challenging because the basketball is blurred by fast camera and ball movement and by broadcast video compression; moreover, the motion pattern of the basketball is complicated and the ball is hard to distinguish from the cluttered background. We incorporated three independent detection approaches to detect the basketball, tracked it with a Kalman filter, and then analyzed the tracklets to select passing/shooting tracklets and infer player possession. We tested the system on 830 frames of broadcast basketball video, and it demonstrated the ability to track some passing/shooting actions and then infer which player controls the ball. The system is a first attempt to extend intelligent basketball tracking systems to include basketball tracking and player possession inference. Our proposed methodologies can be extended to other intelligent sports analysis systems, even when the ball's movement is not constrained to a two-dimensional space.

View record
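The tracking stage above relies on a Kalman filter. A minimal constant-velocity sketch for 2D ball tracking follows; the state layout and noise covariances are illustrative assumptions rather than the thesis's tuned values, and a real detector would replace the synthetic detections.

```python
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],      # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0],       # we observe position only
              [0, 1, 0, 0]], float)
Q = 1e-2 * np.eye(4)              # process noise
R = 1.0 * np.eye(2)               # measurement noise

def kf_step(x, P, z):
    # predict under the constant-velocity motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # update with the detection z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x = np.array([0.0, 0.0, 0.0, 0.0])
P = np.eye(4)
track = []
for t in range(1, 11):            # ball moving at 2 px/frame in x
    z = np.array([2.0 * t, 0.0])
    x, P = kf_step(x, P, z)
    track.append(x[:2].copy())
print(track[-1])                  # close to the true position (20, 0)
```

The predict step is what lets the tracker coast through frames where the blurred ball is missed by all three detectors, with the next confident detection pulling the estimate back.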

Automatic initialization for broadcast sports videos rectification (2012)

Broadcast sports videos can be captured by a static or a moving camera. With a moving camera, however, planar projective transformations (i.e., homographies) must be computed for each frame in a video sequence to compensate for camera motion and viewpoint changes. Recently, a variety of methods have been proposed to estimate the homography between two images based on various correspondences (e.g., point, line, and ellipse matches, and their combinations). Since frame-to-frame homography estimation is an iterative process, it needs an initial estimate, and that estimate has to be accurate enough to guarantee that the method converges to an optimal solution. Although initialization can be done manually for a couple of frames, manual initialization is not feasible when dealing with thousands of images in an entire sports game. Thus, automatic initialization is an important part of the automatic homography estimation process. In this dissertation we address the problem of automatic initialization for homography estimation. More precisely, this thesis comprises four key modules, namely preprocessing, keyframe selection, keyframe matching, and frame-to-frame homography estimation, that work together to automatically initialize any homography estimation method used for broadcast sports videos. The first module removes blurry images, roughly estimates the game-field area within the remaining salient images, and represents it as a set of binary masks. Those binary masks are then fed into the keyframe selection module, which selects a set of representative frames using a robust dimensionality reduction method together with a clustering algorithm. The third module finds the closest keyframe to each input frame using three classifiers whose results are combined by an artificial neural network to improve the overall accuracy of the matching process. The last module takes the input frames and their corresponding closest keyframes and computes the model-to-frame homography for all input frames. Finally, we evaluate the accuracy and robustness of our proposed method on one hockey and two basketball datasets.

View record
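Rectification of the kind described above rests on estimating homographies from correspondences. A minimal sketch of the standard Direct Linear Transform (DLT) from point matches follows; the four synthetic correspondences are illustrative, not taken from the datasets, and a robust pipeline would wrap this in RANSAC.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H with dst ~ H @ src from n >= 4 point matches.

    src, dst: (n, 2) arrays of matched 2D points.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # each correspondence contributes two rows of the DLT system
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the null-space direction of A: the right singular vector
    # belonging to the smallest singular value
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, p):
    # apply H in homogeneous coordinates, then dehomogenize
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
dst = 2 * src + np.array([3.0, 5.0])   # a known scale-plus-shift mapping
H = dlt_homography(src, dst)
print(apply_h(H, (0.5, 0.5)))          # ~ [4, 6]
```

The iterative refinement the abstract mentions starts from an estimate like this one, which is why the automatic initialization modules matter.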

Object Persistence in 3D for Home Robotics (2012)

This document presents an interactive pipeline for collecting, labelling, and re-recognizing movable objects in home-like environments. Inspired by the fact that a human child learns about a movable object by observing it move from time to time, and memorizes its name once a name is associated with it, we have designed an object persistence system that performs similar tasks based on change detection. We utilize 3D registration and change detection systems to distinguish foreground from background. Focusing on dynamic objects (seen from different vantage points) and interactively asking the user for labels endows our system with a database of labelled object segments, which is further used for multi-view instance recognition. We extend temporal interval logic to 3D bounding boxes in order to aggregate regions that contain foreground dynamic objects while simultaneously updating our model of the background. Object segments are extracted by removing the background. Finally, the objects are matched against the existing database, and if no match is found the user is prompted to provide a label. To demonstrate the capabilities of our system, an inexpensive RGB-D sensor (Kinect) is used to collect 3D point clouds. Results show that, for tabletop scenes, our approach effectively separates object regions from the background, and that objects can be successfully modelled and recognized during system operation.

View record

Automatic detection and tracking in underwater environments with marine snow (2011)

This project addresses the automatic detection and tracking of man-made objects in subsea environments with poor visibility and marine snow. Underwater research and engineering is a quickly growing field, and few computer vision techniques specifically address these challenges. The proposed system involves minimizing noise and video artifacts, estimating camera motion, detecting line segments, and tracking targets. Overall, the system performs well under the conditions in the test video, with an equal error rate of approximately 16%. Tests show how parameters may be tuned to account for changes in environmental conditions and to trade off the number of false negatives and false positives. System performance is affected by many factors: the poorest performance occurs under heavy marine snow, low-contrast targets, and fast camera motion, and performance also suffers if the background conditions in the image change. This research makes two contributions. First, we provide a survey of techniques that address similar problems and evaluate their suitability for this application. Second, we integrate existing techniques into a larger system. These techniques include median filtering, Canny edge detection, Hough transforms, Lucas-Kanade first-order optical flow, and particle filtering. Where gaps exist between system components, new methods are developed. Testing evaluates the effects of system parameters and the conditions under which the system is effective.

View record

Using the Structure and Motion of Stereo Point Clouds for the Semantic Segmentation of Images (2010)

The segmentation of images into semantically coherent regions has been approached in many different ways in the more than 40 years since the problem was first addressed. Recently, systems using the motion of point clouds derived from laser depth scanners and structure from motion have been described, but these are monetarily and computationally expensive options. We explore the use of stereo cameras to achieve the same results. The approach is shown to work in an indoor environment, giving results that compare favorably with existing systems. The use of stereo instead of structure from motion is shown to be preferable in this environment, while the choice of stereo algorithm proves highly critical to the quality of the results. The use of aggregated voting regions is explored and shown to moderately improve the results while speeding up the process considerably. Experiments are also run biasing the randomized input to the classifier generation process, which shows further improvements in both performance and execution time. Overall, the approach is shown to be feasible, but not currently practical, for robotic navigation in this environment.

View record


Membership Status

Member of G+PS

Program Affiliations



