Multimode spatiotemporal background modeling for complex scenes

The goal of this research is to model the background in scenes that contain stochastic motion caused, for example, by wind over a water surface, in tree branches, or over grass.

The background model of each pixel is built from the observations of its spatial neighborhood over a recent history, and includes up to $K \geq 1$ modes, ranked in decreasing order of occurrence frequency. Foreground regions are then detected by comparing the intensity of an observed pixel to the high-frequency modes of its background model. Our spatiotemporal background model outperforms traditional related algorithms when a pixel encounters modes that are frequent in its spatial neighborhood without being frequent enough at the pixel position itself.
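To make the idea concrete, here is a minimal sketch of such a per-pixel multimodal model, assuming grayscale frames and a simple quantization of intensities. The parameter names (K, RADIUS, BINS, MATCH_TOL) and the quantization scheme are illustrative choices, not the actual parameters of the method.

```python
# Minimal sketch of a per-pixel multimodal spatiotemporal background model.
# Assumptions (not from the paper): grayscale frames, coarse intensity
# quantization, and a fixed square spatial neighbourhood.
import numpy as np

K = 3          # number of modes kept per pixel
RADIUS = 2     # half-size of the spatial neighbourhood
BINS = 32      # intensity quantization levels
MATCH_TOL = 1  # modes within this many bins count as a match

def build_model(history):
    """history: (T, H, W) uint8 frames without foreground objects.
    Returns, for each pixel, the K most frequent quantized intensities
    observed in its (2*RADIUS+1)^2 spatial neighbourhood over time."""
    T, H, W = history.shape
    q = history // (256 // BINS)                      # quantized intensities
    modes = np.zeros((H, W, K), dtype=np.int64)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - RADIUS), min(H, y + RADIUS + 1)
            x0, x1 = max(0, x - RADIUS), min(W, x + RADIUS + 1)
            counts = np.bincount(q[:, y0:y1, x0:x1].ravel(), minlength=BINS)
            modes[y, x] = np.argsort(counts)[::-1][:K]  # most frequent first
    return modes

def detect_foreground(frame, modes):
    """Flag pixels whose quantized intensity is far from all K modes."""
    q = frame // (256 // BINS)
    dist = np.abs(modes - q[..., None])               # distance to each mode
    return dist.min(axis=-1) > MATCH_TOL              # True = foreground
```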

We have also experimented with an original assessment methodology for evaluating background models. In contrast to conventional evaluation methods, it does not require the collection of ground-truth videos, which generally relies on the manual labeling of foreground regions. Instead, it uses videos that do not contain any foreground object. Collecting such videos is much easier, especially in intrusion detection contexts, which most often deal with empty scenes.
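One possible way to turn this idea into a score is sketched below: on videos guaranteed to contain no foreground object, every detected pixel is a false positive, so no manual ground truth is needed. The function names refer to the illustrative sketch above and are not the actual evaluation protocol.

```python
# Hypothetical scoring routine for the ground-truth-free evaluation idea:
# on empty-scene videos, any flagged pixel is a false detection.
import numpy as np

def false_positive_rate(empty_videos, modes):
    """empty_videos: iterable of (T, H, W) uint8 arrays with no foreground.
    Returns the fraction of pixels wrongly detected as foreground."""
    flagged, total = 0, 0
    for video in empty_videos:
        for frame in video:
            mask = detect_foreground(frame, modes)   # sketch defined above
            flagged += int(mask.sum())
            total += mask.size
    return flagged / total if total else 0.0
```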

Related reference: (Sun et al. 2012)

Multiview people detection

Keeping track of people who occlude each other using a set of widely spaced, calibrated, stationary, and (loosely) synchronized cameras is an important problem, because this kind of setup is common to applications ranging from (sport) event reporting to surveillance in public spaces. Here, we focus on the people detection problem, which is often considered a preliminary step to the tracking problem. Detection is based only on foreground masks, i.e. the color texture of people's visual appearance is not used for detection.

Our proposed detection method assumes that people stand vertically, and sums the cumulative projections of the multiple views' foreground masks onto a set of planes that are parallel to the ground plane. After summation, large projection values indicate the position of a player on the ground plane. This position is used as an anchor for the player bounding box projected in each of the views, as depicted in the figure below.
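A rough sketch of the accumulation step is given below, assuming each view provides a binary foreground mask and a set of homographies mapping that view onto each horizontal plane (here `homographies[v][p]`). The grid size, threshold, and the use of `cv2.warpPerspective` are illustrative; the actual projection geometry depends on the calibration.

```python
# Sketch of ground-plane accumulation of projected foreground masks.
# masks, homographies, grid_size and threshold are hypothetical inputs.
import numpy as np
import cv2

def ground_occupancy(masks, homographies, grid_size=(400, 600)):
    """masks: list of binary foreground masks, one per view (uint8, 0/255).
    homographies[v][p]: 3x3 homography from view v to horizontal plane p,
    expressed in ground-grid coordinates.
    Returns an accumulation map whose peaks suggest people positions."""
    acc = np.zeros(grid_size, dtype=np.float32)
    for v, mask in enumerate(masks):
        for H_vp in homographies[v]:
            warped = cv2.warpPerspective(mask.astype(np.float32), H_vp,
                                         (grid_size[1], grid_size[0]))
            acc += warped
    return acc

def detect_positions(acc, threshold):
    """Keep grid cells whose accumulated projection exceeds a threshold;
    each surviving cell is a candidate (row, col) ground position."""
    ys, xs = np.nonzero(acc > threshold)
    return list(zip(ys, xs))
```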

Considering the people detection problem in a multi-camera environment mitigates the difficulties caused by reflections, occlusions, and shadows, compared to the single-view case. The method can be implemented efficiently, using integral image techniques.
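As a reminder of why integral images help here, the sketch below shows the standard trick: once a cumulative-sum image is built, the sum of any axis-aligned rectangle is obtained in constant time from four lookups, which makes repeated box summations over the masks affordable. This is a generic illustration, not the specific implementation of the method.

```python
# Generic integral-image sketch: constant-time rectangular sums.
import numpy as np

def integral_image(img):
    """Cumulative sum with a zero row/column prepended for easy indexing."""
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) using four lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```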

Related reference: (Delannay et al. 2009)