ExSample: searching video databases through adaptive sampling of video frames

ExSample's goal is to process object search queries when we have a black box detector model, and a warehouse of video data without indexing. In video datasets, randomly sampling frames to inspect with an object detector can be far more effective a search strategy than any scanning based strategy, as scanning is expensive and contiguous frames redundant for most uses. ExSample exploits the empirical insight that results are usually not uniformly spread over time or over different individual files in video datasets. For example, wheelchairs are more likely to be present in cities rather than highways and more likely to appear during the day than at night in the same way as pedestrians are. Therefore, one way to help find results while saving on processing is to sample frames non-uniformly across the dataset according to this skew. This basic skewed sampling idea however can lead to over sampling, as redundancy in objects is a problem in video: if we sample frames from the same area too much, the positive results are likely to be views of the same objects some time apart.

Because ExSample does not know ahead of time where the skew is, the first problem is to learn this skew as it runs. ExSample approaches this first challenge by splitting the dataset into logical chunks, for example 20-minute intervals if a video is long, or individual video clips if they are shorter than that, and sampling from them, much as a multi-armed bandit model where each arm corresponds to a chunk of the dataset and the reward for an arm draw is finding a positive result. If the distribution of results in the different chunks is skewed, then there is potential for savings by allocating samples to the chunk with the largest number of results.

A naive bandit model approach in this setting would estimate the probability of finding a ``positive'' frame for each chunk, and over time allocate samples to the chunk with the largest hit-ratio, balancing between exploration and exploitation using techniques such as Thompson Sampling. This approach is too greedy, as the hit ratio is unable to distinguish two different wheelchairs appearing a few seconds apart from two frames showing the same wheelchair a few seconds apart. Users always have the option to scan nearby frames if that would be useful to them, hence exploiting this type of low hanging fruit can be left to the user and is not the metric of interest.

ExSample's contribution is to estimate the chances of finding a new result, a quantity that changes over time as frames are sampled from a given chunk, and use this estimate, as well as an estimate of its error to allocate samples.

The full paper (ICDE '22) is available at https://arxiv.org/abs/2005.09141