SeeSaw

An interactive system for ad-hoc searches on image databases

SeeSaw targets the common but vexing scenario of searching an image database for an ad-hoc concept. The goal is to let the user find examples of objects in the database even when those objects are rare and no object detector model exists for them. SeeSaw is motivated by the need to help users develop a detector for ad-hoc classes, or extend an existing one, which is often the reason for searching the database in the first place. In that scenario, the first step is to find a few examples from which to build training and test sets, and SeeSaw aims to help users collect them.

To enable ad-hoc searches, SeeSaw makes use of visual-semantic embedding models such as CLIP. The embedding model lets us extract meaningful features from images, which we can index and look up quickly, and it aligns these representations with text strings at a semantic level, so that searching for the string “wheelchair” is likely to surface images with wheelchair-related content.
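
The snippet below is a minimal sketch of this embedding-and-scoring idea, not SeeSaw's actual code: it embeds a handful of images and a text query with a CLIP model (here the Hugging Face `openai/clip-vit-base-patch32` checkpoint, chosen only as an example) and ranks the images by cosine similarity. The file names are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    # Normalize so a dot product equals cosine similarity.
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(query):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Hypothetical image files; in practice the whole database is embedded
# once ahead of time and the vectors are stored in an index.
image_vecs = embed_images(["frame_0001.jpg", "frame_0002.jpg"])
query_vec = embed_text("wheelchair")
scores = image_vecs @ query_vec.T            # cosine similarity per image
ranking = scores.squeeze(1).argsort(descending=True)  # most relevant first
```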

CLIP as a stand-alone solution for search on your own data falls short in multiple ways. One of them is highly variable accuracy across queries: a search for wheelchairs on the BDD dataset requires looking through more than 100 images to find a handful of examples. As with wheelchairs, there is a long tail of queries, hard to anticipate ahead of time, for which a user may not be able to quickly find results. An important insight behind SeeSaw is that a small amount of user input at query time can go a long way in helping the user find useful results in practice, because the errors are not random: the same error types show up repeatedly. Exploiting this insight helps users who provide input to the system find positive results faster.

User input to SeeSaw takes two forms: text strings describing the type of object of interest in natural language, and region-box feedback marking useful results when any are shown. How this input is integrated into the decision of which images to show next matters, because user input can also make search results worse than a non-interactive baseline. To incorporate feedback constructively, SeeSaw must internally balance the weight given to user feedback against the weight given to CLIP's predictions.
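
As an illustration only (the weighting scheme SeeSaw actually uses is described in the paper linked below), the sketch here blends the zero-shot CLIP score with a score from a simple classifier fit on the user's labeled examples. The `alpha` parameter and the fallback logic are assumptions made for the example, not SeeSaw parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rerank(image_vecs, query_vec, labeled_vecs, labels, alpha=0.5):
    """image_vecs: (N, d) normalized CLIP image embeddings for candidates.
    query_vec: (d,) normalized CLIP text embedding of the query string.
    labeled_vecs, labels: embeddings the user marked positive (1) or negative (0).
    alpha: hypothetical knob trading off user feedback vs. the CLIP prediction."""
    clip_scores = image_vecs @ query_vec                    # zero-shot relevance
    labels = np.asarray(labels)
    if len(labels) >= 2 and 0 < labels.sum() < len(labels):  # need both classes to fit
        clf = LogisticRegression().fit(labeled_vecs, labels)
        feedback_scores = clf.decision_function(image_vecs)
        # In practice the two scores live on different scales and would be
        # calibrated; a raw linear blend is enough to show the idea.
        scores = (1 - alpha) * clip_scores + alpha * feedback_scores
    else:
        scores = clip_scores                                 # fall back to CLIP alone
    return np.argsort(-scores)                               # best candidates first
```

Setting `alpha` too high lets a few noisy labels dominate the ranking, while setting it too low ignores the feedback entirely, which is the balance the paragraph above refers to.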

A working paper with quantitative evaluations of an early version of SeeSaw is available at https://arxiv.org/abs/2208.06497