DSM: A Data Science Management System

Unleashing the full potential of AI algorithms in big data enterprise environments will require a paradigm shift from the point-product algorithms and tools used today to a more integrated, end-to-end system with highly collaborative and visual interfaces. For example, data scientists currently spend at least 80% of their time locating data sets of interest in the enterprise and then transforming and cleaning the results into an integrated whole (or model). Even worse, decisions made during the cleaning process may have profound implications for model quality and, in some situations, can even lead to wrong results.

Furthermore, building a machine learning model is an iterative process. A data scientist will build tens to hundreds of models before arriving at one that meets some acceptance criteria. Not only is this process very time-consuming (an iteration can in some cases take hours to days), but there is also currently no practical way for a data scientist to manage the models built over time. As a result, a data scientist must attempt to “remember” previously constructed models and the insights obtained from them. Finally, building models is often a collaborative process between a data scientist and a domain expert, which current tools support particularly poorly.

As part of DSAIL, we aim to build DSM, the first system to support data scientists across the entire data lifecycle, including data discovery, transformation, cleaning, exploration, and model building. More concretely, we will investigate:

(1) An incremental workflow management system to support the entire data science development process

(2) New techniques for non-intrusive versioning of the data processing pipeline, to support multiple explorations of possible next steps

(3) Novel ways to store all of the data associated with a workflow project

(4) Novel ways to visualize project data to gain insights

(5) New techniques for data discovery and cleaning

(6) Techniques to automate the standard steps of data science and automatically warn users about potential mistakes

As part of this project, we will build upon many existing projects by the Principal Investigators and extend and integrate the results into a single open-source system.

Links to projects:

PipelineDB

Kyrix

Data Civilizer

Northstar