Cardinality Estimation

Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Cardinality estimates used in current optimizers are flawed. They generally rely on “attributes independent assumption” and “Join-key uniformly distributed assumption” to minimize latency and overhead. These simplifying assumptions can lead to estimates with several magnitudes of error. Taking these estimates, the optimizer may generate catastrophic plans. Therefore, a surge of research has been proposed to improve cardinality estimation.

We propose using statistical knowledge and machine learning to analyze the data and the query workload in order to generate better cardinality estimation.

FactorJoin: A New Cardinality Estimation Framework for Join Queries

FactorJoin framework is dedicated to generating cardinality estimators that are effective, efficient, and friendly for system deployment (small model size, low training overhead, fast incremental update, and easy to debug and explain).

FactorJoin is the first framework to combine the efforts behind traditional and ML-enhance cardinality estimators: using join-histograms to efficiently handle joins coupled with learning-based methods to accurately capture attribute correlation. Specifically, FactorJoin formulates the problem of estimating the cardinality of join queries in its general form as a factor graph inference problem involving only single-table data distributions. This formulation allows us to use the wisdom from the domain of probabilistic graphical model (PGM) inference as well as design a novel set of effective and efficient optimizations for cardinality estimation.

Experimental results show that FactorJoin has near-optimal effectiveness with very small model size, training overhead, and a very fast incremental update speed. It can achieve comparable or better performance than the state-of-the-art learning-based cardinality estimators with inference latency, training/updating time, and model size close to the traditional methods.

Robust Query Driven Cardinality Estimation under Changing Workloads

Query driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage requirements, fast inference and training, and are easily adaptable for any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy.
We analyze the reasons why models become unpredictable due to workload drift, and introduce modifications to the query representation and neural network training techniques to make query-driven models robust to the effects of workload drift. First, we emulate workload drift in queries involving some unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that are useful even when query or data drift happens. Second, we introduce join bitmaps, which extends sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, we show how both of these ideas can be adapted to handle data updates.

We show significantly greater generalization than past works across different workloads and databases.
For instance, a model trained with our techniques on a simple workload (JobLight-train), with 40K synthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which include queries with up to 16 tables, and improve query runtimes by 2x over PostgreSQL. We show similar robustness results with data updates, and across other workloads. We discuss the situations where we expect, and see, improvements, as well as more challenging workload drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, our models never perform worse than PostgreSQL, while standard query driven models can get much worse than PostgreSQL.