Index: The Case for Learned Indexes
The idea of replacing core components of a data management system through learned models has far-reaching implications for future systems designs.
Index structures are already models, because they "predict'' the location of a value given a key.
To see this, consider a B-Tree index in an in-memory database over the sorted primary key column (as shown at left). In this case, the B-Tree provides a mapping from a lookup key into a position inside the sorted array of records, with the guarantee that the key of the record at the position is equal or higher than the lookup key. For efficiency reasons, it is common not to index every single key, rather only the key of every nth record, i.e., the first key of a page. This helps to significantly reduce the number of keys the index has to store, without any significant performance penalty. Thus, the B-Tree is a model, in ML terminology a regression tree: it maps a key to a position with a min- and max-error (a min-error of 0 and a max-error of the page-size) and the guarantee that the key can be found in that region if it exists. Consequently, we can replace the index with any other type of model, including deep learning models, as long as they are also able to provide similar strong guarantees about the min- and max-error.
In a recent paper, ``"The Case for Learned Index Structures,'' we took a first step in exploring this idea for various index structures ranging from B-Trees to Bloom filters, with very promising results. For example, our learned B-Trees achieved faster look-up times than highly read-optimized B-Trees, such as FAST, while being significantly smaller in size. Similarly, our learned Bloom filters were 30% smaller on real-world data sets than traditional Bloom filters with the same false-positive rate.
In the future, we plan to significantly expand this idea. We will (1) explore other types of index structures (e.g., range filters, tries), (2) expand the work to multi-dimensional indexing structures, and (3) explore how index/storage co-optimization can help overall read and update performance.
Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18).