Historically, DBMSes have adopted a monolithic architecture, with the system under control of all aspects of data management, including data placement, layout, scheduling, allocation of computation and memory to queries, as well as query optimization and execution. In the cloud, data services are increasingly disaggregated, with a reliable storage layer (e.g., Amazon S3) managed by the cloud provider, file-based data structured data (e.g., Parquet files), often produced by applications outside the data management layer, and a variety of high-level interfaces to the data (e.g., SQL, machine learning applications, data visualization engines). This enables a separation of concerns, where each layer is managed and scaled independently, unlike the “shared nothing” designs of conventional DBMSes where physical nodes that store data are created for each processing node that operates on this data. The rise in popularity of data processing systems like Spark is driven by their successful adoption of this new style of cloud architecture.
While the success of these new systems demonstrates the advantages of this new way of architecting data-driven applications, disaggregation leaves a great deal of performance on the table. For example, when running TPC-H, the disaggregated version of Amazon Redshift, called Redshift Spectrum is at least 8X slower than Redshift itself. This inefficiency is partly due to the layered design of Spectrum, which requires it to load data from underlying S3 files, preventing it from sharing information about data structure and representation between the storage and query execution layers, and partly due to the fact that S3 is lower bandwidth and higher latency to local storage.
Because important metadata is not preserved by low-level cloud storage formats like Parquet, data processing systems operating on such data often lack performance-related metadata such as histograms, and other statistics that are key to good performance, preventing them from optimizing layouts for efficient storage, especially on lower-performance cloud storage where optimizing access is even more critical.
What is needed is a way to build efficient data systems on the cloud while maintaining the advantages of disaggregation. Our key insight is that cloud data systems can be much more efficient if they have a storage layer rich enough to support modern data storage optimizations that are at the heart of high-performance data analytics systems, including indexing, flexible multi-dimensional partitioning, compression, and ability to adapt to the workloads that run on them. To this end, we are building a cloud-optimized storage format, called self-organizing data containers (SDCs). By self-organizing we mean that the container has the flexibility to take on many possible layouts, and that it adapts to the clients’ workload as they interact with it. Unlike systems that implement transactional operations on cloud object stores like Delta Lake, Hudi, and Iceberg, SDCs are meta-data rich and capture data distributions and access patterns, and can represent complex storage layouts encompassing different partitioning and replication strategies.
Specifically, by tightly coupling meta-data, including distributional statistics, indexes, and access patterns, with the data itself, SDCs naturally contain the information needed to support efficient data access. Most cloud storage systems view data objects as immutable; however, they confound immutability of the logical contents of blocks (i.e., the set of records stored in each file) with the physical layout of data (i.e., whether records are column-oriented, compressed, etc). In SDCs, different physical representations, including summaries, aggregates, different columnar and row-oriented layouts, and data layout optimizations for modern hardware (GPU/CPU) can be represented in a single data object, and those layouts can be transformed and adapted over time as the access patterns shift. Hence, SDCs physically mutate over time, “self-organizing” into the optimal layout, even if the data itself remains immutable. Our vision towards such self-organization is influenced by our work on instance-optimized systems over the past several years like Flood, Tsunami, CopyRight, MTO, SageDB, etc. where we have shown that by building data layouts that adapt to both the data and the queries that run on them, dramatic performance gains are possible.
Once we have SDCs, many existing systems, including conventional relational databases, parallel data processing frameworks, and ML systems can be easily adapted to use our new table format. By building on our prior work on cross-layer optimization like Tupleware, Weld, Tuplex, etc., and instance-optimized storage systems, we believe we can construct next generation cloud-processing systems on SDCs that preserve the flexibility of conventional systems while offering order-of-magnitude performance gains that rival the performance of monolithic systems running on bare-metal hardware.
For more details, please refer to our CIDR 2022 paper.