Adaptive Hashing for Fast Similarity Search

With the staggering growth in image and video datasets, hashing methods that map the data into Hamming space have shown promise for providing fast similarity search and compact storage; however, many of these methods employ a batch-learning strategy in which the computational cost and memory requirements may become intractable with larger and larger datasets. To overcome these challenges, we propose an online learning algorithm based on stochastic gradient descent in which the hash functions are updated iteratively with streaming data. In experiments with three image retrieval benchmarks, our online algorithm attains retrieval accuracy that is comparable to competing state-of-the-art batch-learning solutions, while being orders of magnitude faster and adaptable to the variations of the data.
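To make the idea of searching in Hamming space concrete, here is a minimal sketch. It is an assumption-laden illustration, not the online algorithm described above: the hash functions are fixed random hyperplanes rather than functions learned by stochastic gradient descent, and the names `make_hyperplanes`, `hash_code`, `hamming`, and `nearest` are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (a stand-in for learned hash functions)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(x, planes):
    """Map a real-valued vector to a binary code packed into a Python int."""
    code = 0
    for w in planes:
        bit = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
        code = (code << 1) | bit
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count set bits."""
    return bin(a ^ b).count("1")


def nearest(query_code, db_codes):
    """Index of the database code closest to the query in Hamming distance."""
    return min(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
```

Each item is stored as a single integer of `n_bits` bits, and comparing two items costs one XOR plus a bit count, which is where the compact storage and fast search come from.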

Human Motion and Gesture Analysis

Space-Time Tree Ensemble For Action Recognition

Human actions are, inherently, structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition. The hierarchical spatio-temporal trees provide a robust mid-level representation for actions. We show that these tree patterns, alone, or in combination with shorter patterns (action words and pairwise patterns) achieve state-of-the-art performance on two challenging datasets: UCF Sports and HighFive. Moreover, trees learned on HighFive are used in recognizing two action classes in a different dataset, Hollywood3D, demonstrating the potential for cross-dataset generality of the trees our approach discovers.

Action Recognition and Localization by Hierarchical Space-Time Segments

We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using a simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.

Computer-Human Interaction and Assistive Technology

This ongoing project focuses on video-based human-computer interaction systems for people who need assistive technology for rehabilitation or as a means to communicate. Our first system, the “Camera Mouse,” provides computer access by tracking the user’s movements with a video camera and translating them into the movements of the mouse pointer on the screen. The system has been commercialized and is in wide use in homes, hospitals, and schools in the U.S. and the U.K. Other systems detect the user’s eye blinks or raised fingers and interpret the communication intent.


Image Segmentation

How to Collect Segmentations for Biomedical Images?

Advances in microscopy and storage technologies have led to large amounts of images of biological structures that, if analyzed, could provide an understanding of fundamental biological processes and, in turn, aid in diagnosing diseases and engineering biomaterials. Segmentation is the most time-consuming image analysis task for human annotators and so is our initial focus of impact. We freely share the resources we are developing to accelerate research in the biomedical community on the automatic analysis of biomedical images. We collaborate with biologists and biomedical engineers to create image libraries representing various image acquisition modalities, biological structure types, magnification levels, and image acquisition parameters. We also define annotation standards to create reliable reference data for algorithm validation and to collect annotations from non-experts using crowdsourcing platforms. Finally, we develop approaches and systems to expedite or replace expert efforts to consistently and efficiently collect high-quality boundaries of biological structures in their images.

Analysis of Segmentation Quality

(WACV 2013 Best Paper Award) Finding the outline of an object in an image is a fundamental step in many computer-vision-based applications. While researchers commonly analyze segmentation quality by measuring how similar an algorithm-generated segmentation is to a gold standard segmentation, they face the challenge of choosing appropriate evaluation measures and gold standard segmentations. We propose a framework to obtain meaningful project-specific segmentation performance indicators, share a freely available toolbox implementing this framework that links existing annotation collection tools with gold standard generation methods and evaluation algorithms, and describe case studies using this toolbox that show how to establish trusted gold standard segmentations for cell and artery images.

Machine Learning

Generalized Majorization-Minimization

Non-convex optimization is ubiquitous in machine learning. Majorization-Minimization (MM) is a systematic procedure for optimizing non-convex functions through an iterative construction and optimization of upper bounds on the objective function. However, we show that the bounding conditions in MM can be overly restrictive. We therefore generalize MM into a new framework for designing optimization algorithms, named Generalized Majorization-Minimization (G-MM). Compared to MM, G-MM is much more flexible, and appears to be less sensitive to initialization. We derive G-MM algorithms for several latent variable models and show that they consistently outperform their MM counterparts.
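The bound-then-minimize loop of plain MM can be sketched on a toy problem. This is only an illustration of classic MM, not of G-MM, and the objective is an assumption chosen for brevity: it minimizes the sum of absolute deviations f(x) = sum |x - a_i| (a convex function, unlike the latent variable models above) using the quadratic upper bound |r| <= r^2 / (2|r_k|) + |r_k| / 2, which touches |r| at the current residual r_k. The function name `mm_median` and the `eps` guard are hypothetical.

```python
def mm_median(points, iters=100, eps=1e-12):
    """Minimize sum(|x - a|) by repeatedly minimizing a quadratic upper bound."""
    x = sum(points) / len(points)  # start from the mean
    history = [sum(abs(x - a) for a in points)]
    for _ in range(iters):
        # Majorization: each |x - a| is bounded by a quadratic with weight
        # 1 / |x_k - a|; eps guards against division by zero.
        weights = [1.0 / max(abs(x - a), eps) for a in points]
        # Minimization: the bound is minimized by a weighted mean.
        x = sum(w * a for w, a in zip(weights, points)) / sum(weights)
        history.append(sum(abs(x - a) for a in points))
    return x, history
```

Because every bound lies above the objective and touches it at the current iterate, the recorded objective values in `history` should not increase from one step to the next, which is the defining property of MM.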

Learning with Differential Geometric Regularization

We study the problem of supervised learning for both binary and multiclass classification from a unified geometric perspective. In particular, we propose a geometric regularization technique to find the submanifold corresponding to an estimator of the class probability P(y|x). The regularization term measures the volume of this submanifold, based on the intuition that overfitting produces rapid local oscillations and hence large volume of the estimator. This technique can be applied to regularize any classification function that satisfies two requirements: firstly, an estimator of the class probability can be obtained; secondly, first and second derivatives of the class probability estimator can be calculated.

Bayesian Online Classifier Ensemble

We propose a Bayesian approach for recursively estimating the classifier weights in online learning of a classifier ensemble. In contrast with past methods, such as stochastic gradient descent or online boosting, our approach estimates the weights by recursively updating its posterior distribution. For a specified class of loss functions, we show that it is possible to formulate a suitably defined likelihood function and hence use the posterior distribution as an approximation to the global empirical loss minimizer. If the stream of training data is sampled from a stationary process, we can also show that our approach admits a superior rate of convergence to the expected loss minimizer than is possible with standard stochastic gradient descent. In experiments with real-world datasets, our formulation often performs better than state-of-the-art stochastic gradient descent and online boosting algorithms.

Object Recognition and Pose Estimation

Discovering Useful Parts for Pose Estimation in Sparsely Annotated Datasets

Our work introduces a novel way to increase pose estimation accuracy by discovering parts from unannotated regions of training images. Discovered parts are used to generate more accurate appearance likelihoods for traditional part-based models like Pictorial Structures and its derivatives. Our experiments on images of a hawkmoth in flight show that our proposed approach significantly improves over existing work for this application, while also being more generally applicable. Our proposed approach localizes landmarks at least twice as accurately as a baseline based on a Mixture of Pictorial Structures (MPS) model. Our unique High-Resolution Moth Flight (HRMF) dataset is made publicly available with annotations.

Parameterizing Object Detectors in the Continuous Pose Space

Object detection and pose estimation are interdependent problems in computer vision. Many past works decouple these problems, either by discretizing the continuous pose and training pose-specific object detectors, or by building pose estimators on top of detector outputs. We propose a structured kernel machine approach to treat object detection and pose estimation jointly in a mutually beneficial way. In our formulation, a unified, continuously parameterized, discriminative appearance model is learned over the entire pose space. We propose a cascaded discrete-continuous algorithm for efficient inference, and give effective online constraint generation strategies for learning our model using structural SVMs. Our method performs better than, or on par with, state-of-the-art methods in the combined task of object detection and pose estimation.

3D Pose Estimation of Bats in the Wild

We propose a model-based multi-view articulated 3D bat pose estimation framework for estimating and subsequently analyzing articulated 3D bat pose in the wild. Key challenges include the large search space associated with articulated 3D pose, the ambiguities that arise from 2D projections of 3D bodies, and the low resolution image data we have available. Our method uses multi-view camera geometry and temporal constraints to reduce the state space of possible articulated 3D bat poses and finds an optimal set using a Markov Random Field based model. Our experiments use real video data of flying bats and gold-standard annotations by a bat biologist. Our results show, for the first time in the literature, articulated 3D pose estimates being generated automatically for video sequences of bats flying in the wild.


Top-down Neural Attention by Excitation Backprop

We aim to model the top-down attention of a CNN classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative.

Unconstrained Salient Object Detection

We aim to do bounding box localization for salient objects in unconstrained images. We propose a system that can output a highly reduced set of detection windows based on a CNN proposal generation model and a novel proposal subset optimization formulation. Our system significantly outperforms existing methods in localizing dominant objects.

Minimum Barrier Salient Object Detection

We propose a highly efficient, yet powerful, salient object detection method based on a fast Minimum Barrier Distance Transform algorithm. Our salient object detection method (MB) achieves state-of-the-art performance and runs at about 80 FPS using a single thread. Furthermore, a technique based on color whitening is proposed to extend our method to leverage the appearance-based backgroundness cue. This extended version (MB+) further improves the performance, while still running at about 50 FPS.

Salient Object Subitizing

People can immediately and precisely identify that an image contains 1, 2, 3 or 4 items by a simple glance. The phenomenon, known as Subitizing, inspires us to pursue the task of Salient Object Subitizing (SOS), i.e. predicting the existence and the number of salient objects in a scene using holistic cues. To study this problem, we propose a new image dataset annotated using an online crowdsourcing marketplace. We show that a proposed subitizing technique using an end-to-end Convolutional Neural Network (CNN) model achieves significantly better than chance performance in matching human labels on our dataset. Finally, we demonstrate the usefulness of the proposed subitizing technique in two computer vision applications: salient object detection and object proposal.

Saliency Detection: A Boolean Map Approach

A novel Boolean Map based Saliency (BMS) model is proposed. An image is characterized by a set of binary images, which are generated by randomly thresholding the image’s color channels. Based on a Gestalt principle of figure-ground segregation, BMS computes saliency maps by analyzing the topological structure of Boolean maps. BMS is simple to implement and efficient to run. Despite its simplicity, BMS consistently achieves state-of-the-art performance compared with ten leading methods on five eye tracking datasets. Furthermore, BMS is also shown to be advantageous in salient object detection.
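The pipeline described above (random thresholds, Boolean maps, a topological test) can be sketched as follows. This is a simplified reading rather than the published BMS model: it assumes the figure-ground cue is "surroundedness", meaning a region counts as figure when it is not connected to the image border, and it omits any refinements of the full method. The names `surrounded_mask` and `bms_saliency`, the 0-255 value range, and the 4-connectivity are assumptions.

```python
import random
from collections import deque


def surrounded_mask(bmap):
    """True for pixels whose same-valued connected region never touches the border."""
    h, w = len(bmap), len(bmap[0])
    reached = [[False] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if y in (0, h - 1) or x in (0, w - 1):
                reached[y][x] = True
                queue.append((y, x))
    # Flood fill inward from the border, staying within regions of equal value.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not reached[ny][nx]
                    and bmap[ny][nx] == bmap[y][x]):
                reached[ny][nx] = True
                queue.append((ny, nx))
    return [[not reached[y][x] for x in range(w)] for y in range(h)]


def bms_saliency(channels, n_maps=32, seed=0):
    """Average the surroundedness masks of randomly thresholded channels."""
    rng = random.Random(seed)
    h, w = len(channels[0]), len(channels[0][0])
    acc = [[0.0] * w for _ in range(h)]
    for _ in range(n_maps):
        chan = rng.choice(channels)      # pick a color channel at random
        theta = rng.uniform(0, 255)      # random threshold (assumes 0-255 values)
        bmap = [[v > theta for v in row] for row in chan]
        mask = surrounded_mask(bmap)
        for y in range(h):
            for x in range(w):
                acc[y][x] += mask[y][x]
    return [[v / n_maps for v in row] for row in acc]
```

Here `channels` is a list of 2D lists, one per color channel. A bright blob on a dark background is surrounded in every Boolean map whose threshold falls between the two intensities, so it accumulates a high score, while anything connected to the border scores zero.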


MEEM: Robust Tracking via Multiple Experts using Entropy Minimization

We propose a multi-expert restoration scheme to address the model drift problem in online tracking. In the proposed scheme, a tracker and its historical snapshots constitute an expert ensemble, where the best expert is selected to restore the current tracker when needed based on a minimum entropy criterion, so as to correct undesirable model updates. In experiments, our tracking method achieves substantially better overall performance than 32 trackers on a benchmark dataset of 50 video sequences under various evaluation settings. In addition, in experiments with a newly collected dataset of challenging sequences, we show that the proposed multi-expert restoration scheme significantly improves the robustness of our base tracker, especially in scenarios with frequent occlusions and repetitive appearance variations.
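The selection step can be illustrated with a deliberately simplified sketch. It is not the criterion used in MEEM, whose exact formulation is not given here: as an assumption, each expert (the current tracker or one of its snapshots) is scored by the Shannon entropy of its softmax-normalized confidences over candidate target locations, and the expert with the least ambiguous response is chosen. The names `softmax`, `entropy`, and `select_expert` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw confidence scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more decisive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_expert(expert_scores):
    """Return the name of the expert whose candidate scores have minimum entropy.

    expert_scores maps an expert name to its scores over candidate locations.
    """
    return min(expert_scores, key=lambda name: entropy(softmax(expert_scores[name])))
```

For example, a snapshot that responds strongly to one candidate would be preferred over a drifted tracker that responds equally to all candidates.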

Randomized Ensemble Tracking

We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of state-of-the-art approaches.

Online Motion Agreement Tracking

We propose a fast online multi-target tracking method, called the motion agreement algorithm, which dynamically selects stable object regions to track. The appearance of each object, here pedestrians, is represented by multiple local patches. For each patch, the algorithm computes a local estimate of the direction of motion. By fusing the agreements between a global estimate of the object motion and each local estimate, the algorithm identifies the object's stable regions and enables robust tracking. The proposed patch-based appearance model was integrated into an efficient online tracking system that uses bipartite matching for data association. The experiments on recent pedestrian tracking benchmark sequences show that the proposed method achieves competitive results compared to state-of-the-art methods, including several offline tracking techniques.
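One way to picture the agreement test is the sketch below. It rests on several assumptions that are not taken from the method itself: the global motion estimate is the mean of the local patch motions, agreement is measured by cosine similarity between directions, and a patch is called stable when its agreement exceeds a fixed threshold. The names `agreement`, `stable_patches`, and `min_agreement` are hypothetical.

```python
import math


def agreement(global_motion, local_motion):
    """Cosine similarity between the global and a local motion direction."""
    gx, gy = global_motion
    lx, ly = local_motion
    ng, nl = math.hypot(gx, gy), math.hypot(lx, ly)
    if ng == 0 or nl == 0:
        return 0.0  # no direction to compare
    return (gx * lx + gy * ly) / (ng * nl)


def stable_patches(local_motions, min_agreement=0.8):
    """Indices of patches whose motion agrees with the (assumed) global estimate."""
    n = len(local_motions)
    global_motion = (
        sum(m[0] for m in local_motions) / n,
        sum(m[1] for m in local_motions) / n,
    )
    return [i for i, m in enumerate(local_motions)
            if agreement(global_motion, m) >= min_agreement]
```

A patch that moves against the rest of the object, for instance because it covers an occluder or background, receives a low agreement score and is left out of the set used for tracking.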

Online Multi-Person Tracking by Tracker Hierarchy

Tracking-by-detection is a widely used paradigm for multi-person tracking but is affected by variations in crowd density, obstacles in the scene, varying illumination, human pose variation, scale changes, etc. We propose an improved tracking-by-detection framework for multi-person tracking where the appearance model is formulated as a template ensemble updated online given detections provided by a pedestrian detector. We employ a hierarchy of trackers to select the most effective tracking strategy and an algorithm to adapt the conditions for trackers’ initialization and termination. Our formulation is online and does not require calibration information. In experiments with four pedestrian tracking benchmark datasets, our formulation attains accuracy that is comparable to, or better than, the state-of-the-art pedestrian trackers that must exploit calibration information and operate offline.

Video-based Analysis of Animal Behavior

Infrared Thermal Video Analysis of Bats

We have used infrared thermal cameras to record Brazilian free-tailed bats in California, Massachusetts, and Texas and developed automated image analysis methods that detect, track, and count emerging bats. Censusing natural populations of bats is important for understanding the ecological and economic impact of these animals on terrestrial ecosystems. Colonies of Brazilian free-tailed bats are of particular interest because they represent some of the largest aggregations of mammals known to mankind. It is challenging to census these bats accurately, since they emerge in large numbers at night.

Past Projects

Large Lexicon Gesture Representation, Recognition, and Retrieval

This project involves research on computer-based recognition of ASL signs. One goal is development of a “look-up” capability for use as part of an interface with a multi-media sign language dictionary. The proposed system will enable a signer either to select a video clip corresponding to an unknown sign, or to produce a sign in front of a camera, for look-up. The computer will then find the best match(es) from its inventory of thousands of ASL signs. Knowledge about linguistic constraints of sign production will be used to improve recognition. Fundamental theoretical challenges include the large scale of the learning task (thousands of different sign classes), the availability of very few training examples per class, and the need for efficient retrieval of gesture/motion patterns in a large database.

Human Pose Estimation

The goal of this effort is to develop algorithms for articulated structure and motion estimation, given one or more image sequences. Articulated motion is exhibited by jointed structures like the human body and hands, as well as linkages more generally. Articulated structure and motion estimation algorithms are being developed that can automatically initialize themselves, estimate multiple plausible interpretations along with their likelihood, and provide reliable performance over extended sequences. To achieve these objectives, concepts from statistical machine learning, graphical models, multiple view geometry, and structure from motion are employed.


Gesture Analysis and Recognition

The aim of this project is to develop techniques for automatic analysis and recognition of human gestural communication. The complexity of simultaneous expression of linguistic information on the hands, the face, and the upper body creates special challenges for computer-based recognition. Results of this effort include algorithms for: localizing and tracking human hands, estimating hand pose and upper body pose, tracking and classifying head motions, and analysis of eye and facial gestures. Algorithms are also being developed for efficiently spotting and recognizing specific gestures of interest in video streams.


Layered Graphical Models for Tracking

Partial occlusions are commonplace in a variety of real world computer vision applications: surveillance, intelligent environments, assistive robotics, autonomous navigation, etc. While occlusion handling methods have been proposed, most methods tend to break down when confronted with numerous occluders in a scene. In this project, we are developing layered image-plane representations for tracking through substantial occlusions. An image-plane representation of motion around an object is associated with a pre-computed graphical model, which can be instantiated efficiently during online tracking.


Detector Families for Detection, Parameter Estimation and Tracking

The main goal of this project is to develop algorithms for simultaneous detection, parameter estimation, and tracking of objects that exhibit high variability. The project focus is on three areas: (1) methods for dimensionality reduction that incorporate knowledge of object dynamics, (2) models that combine a collection of simpler local models to efficiently and accurately approximate nonlinear motion dynamics in a state-based model for tracking, (3) algorithms that can detect an instance of the object class in the image, and at the same time estimate the object’s parameters.


Segmentation of Anatomic Structures

This project aims at developing automatic or semi-automatic methods for localizing and outlining anatomic structures in 2D and 3D data. This includes x-rays, computed tomography (CT) scans and magnetic resonance images (MRI). Our work has focused on structures of the chest, in particular, lungs, ribs, trachea, pulmonary fissures, pulmonary nodules, and blood vessels. A pulmonary fissure is a boundary between the lobes in the lungs. Our fissure segmentation method is based on an iterative, curve-growing process that adaptively weights local image information and prior knowledge of the shape of the fissure.


Detection and Classification in Medical Images

We have developed methods for automatically detecting and measuring pulmonary nodule growth. These growth measurements are essential for lung cancer screening but are currently made by time-consuming, inaccurate and inconsistent manual methods. Facilitating the diagnosis of lung cancer is important, because early detection and resection of small, growing, pulmonary nodules can improve the 5-year survival rate of patients from 15% to 67%.


Registration of Anatomical Structures

In this ongoing project, we develop methods to align anatomical structures in medical image data sets. We have focused on registering structures in the chest, such as lung surfaces and pulmonary nodules. Our approaches use rigid- and deformable-body transformations.


Shape-based Segmentation, Description, and Retrieval

The goal of this project is to develop automated methods for detecting, describing, and indexing shapes that appear in image and video databases. Retrieval by shape is perhaps one of the most challenging aspects of content-based image database search, due to image clutter, segmentation errors, etc. In addition, many shape classes of interest are related through deformations and/or may have variable structure. Methods are being developed that can detect, segment, and describe shapes in images despite clutter, shape deformation, and variable object structure.


Region-based Deformable Appearance Models

The aim of this project is to develop methods for tracking deforming objects. A mesh model is used to model the object’s shape and deformations, and a color texture map is used to model the object’s color appearance. Photometric variations are also modeled. Nonrigid shape registration and motion tracking are achieved by posing the problem in terms of an energy-based, robust minimization procedure, which provides robustness to occlusions, wrinkles, shadows, and specular highlights. The algorithms run at frame rate and are tailored to take advantage of texture mapping hardware available in many workstations, PCs, and game consoles. The Active Blobs framework is one result of this effort.


Motion-based Retrieval and Motion-based Data Mining

The aim of this project is to develop methods for indexing, retrieval, and data mining of motion trajectories in video databases. Computer vision techniques are being devised for detection and tracking of moving objects, as well as estimation of statistical time-series models that describe each object’s motion and can be used in motion-based indexing and retrieval. Algorithms are being developed that can discover clusters and other patterns in the extracted motion time-series data, and identify common versus unusual motion patterns.


Retrieval and Classification Methods

The goal of this project is to develop scalable classification methods that can exploit the information available in large databases of training data. Given an object to classify, one important problem is how to correctly identify the most similar objects in the database. An equally important problem is how to retrieve those objects efficiently, despite having to search a very large space. Results of this effort include algorithms for fast nearest neighbor retrieval under computationally expensive distance measures, optimizing the accuracy of nearest neighbor classifiers, and designing query-sensitive distance measures that automatically identify, in high-dimensional spaces, the dimensions that are the most informative for each query object.


Placement and Control of Cameras in Video Sensor Networks

The goal of this project is to develop methods for determining the optimal position and choice of video cameras to cover a given area and to serve specific vision task(s), and algorithms for prediction, camera control, and scheduling of computer vision tasks within a networked collection of video cameras. A predictive framework is being developed that can accrue a statistical model of temporal associations between events of interest observed within a sensor network. Finally, algorithms are being formulated that can exploit the statistical models in scheduling sensor network resources to accomplish certain tasks, like tracking objects of interest, or identifying all individuals.


Shape and Motion Estimation from Multiple Views

The goal of this project is to automatically construct detailed 3D models of objects given multiple views. In one family of approaches developed in this project, the aim has been to reconstruct a 3D polygonal mesh model and color texture map from multiple views of an object. Efforts have also focused on the problem of estimating an object’s 3D motion field (scene flow) from multiple video streams. These methods explicitly account for uncertainties of the measurements as they affect the accuracy of the recovered model.


Content-based Retrieval of Images on the World Wide Web

The goal of this project is to develop algorithms for searching the web for images. Visual cues (extracted from the image) and textual cues (extracted from the HTML document containing the image) can be exploited. The technical challenges associated with the project are to deal with the staggering scale of the World Wide Web, to formulate effective image representations and indexing strategies for very fast search based on image content, and to develop user interface techniques that make image search fast, intuitive, and accurate. These algorithms have been deployed in the ImageRover system.