Given three or four synchronized videos taken at eye level and from different angles, we show that we can effectively detect and track people, even when the only available data comes from the binary output of a simple blob detector and the number of individuals present is a priori unknown.
We start from occupancy probability estimates in a top view and rely on a generative model to produce probability images that can be compared with the actual input images. We then refine the estimates so that the probability images match the binary input images as closely as possible. Finally, having performed this computation independently at each time step, we compute trajectories over time by solving a convex constrained flow problem, which allows us to accurately follow individuals across thousands of frames. Our algorithm yields a metrically accurate trajectory for each individual, in spite of very significant occlusions.
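To make the per-frame step concrete, the following is a minimal sketch of how a probability image could be generated from top-view occupancy estimates and compared with a binary blob image. It is an illustration under stated assumptions, not the model used in this work: the names `synthetic_image` and `mismatch` are hypothetical, each ground location is assumed to project to an axis-aligned rectangle in the camera view, occupancies are assumed independent across locations, and the mean absolute difference is only a stand-in for whatever image distance is actually used.

```python
def synthetic_image(q, boxes, height, width):
    """Probability that each pixel is covered by at least one person.

    q[k] is the occupancy probability of ground location k. boxes[k] is a
    hypothetical half-open (top, left, bottom, right) rectangle standing in
    for the silhouette a person at location k would project to in this view.
    Occupancies are assumed independent across locations.
    """
    # Track the probability that a pixel is covered by nobody.
    empty = [[1.0] * width for _ in range(height)]
    for prob, (top, left, bottom, right) in zip(q, boxes):
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                empty[y][x] *= 1.0 - prob
    # Complement: probability that at least one rectangle covers the pixel.
    return [[1.0 - e for e in row] for row in empty]


def mismatch(synthetic, binary):
    """Mean absolute difference between a probability image and a 0/1 image.

    This measure is an assumption made for illustration only.
    """
    total = sum(
        abs(s - b)
        for s_row, b_row in zip(synthetic, binary)
        for s, b in zip(s_row, b_row)
    )
    return total / (len(binary) * len(binary[0]))
```

Refining the estimates then amounts to adjusting the values in `q` so that `mismatch` decreases in every camera view at once; the sketch leaves the update rule itself unspecified.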
In short, we combine a mathematically well-founded generative model that works in each frame individually with a simple approach to global optimization. This yields excellent performance using very simple models that could be further improved.