Publications

Our researchers have been working for years on how to apply AI/ML to data systems and data systems to AI/ML. Here’s a list of some of the papers that they’ve published. Check back frequently for new publications coming out of the collaborative efforts of the DSAIL team.

 

2020

Philipp Eichmann, Emanuel Zgraggen, Carsten Binnig and Tim Kraska. 2020. FASTBench: A New Benchmark for Interactive Data Exploration. 2020 ACM SIGMOD/PODS.

 

Vikram Nathan, Jialin Ding, Tim Kraska and Mohammad Alizadeh. 2020. Learned Multi-Dimensional Indexing. 2020 ACM SIGMOD/PODS.

 

Matthias Jasny, Tobias Ziegler, Tim Kraska, Uwe Roehm and Carsten Binnig. 2020. DB4ML: An In-Memory Database Kernel with Machine Learning Support. 2020 ACM SIGMOD/PODS.

 

Matthew Perron, Raul Castro Fernandez, David DeWitt and Samuel Madden. 2020. Starling: A Scalable Query Engine on Cloud Function Services. 2020 ACM SIGMOD/PODS.

 

Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Hantian Zhang, Yinan Li, Jaeyoung Do, Donald Kossmann, Johannes Gehrke, David Lomet, Badrish Chandramouli and Tim Kraska. 2020. ALEX: An Updatable Adaptive Learned Index. 2020 ACM SIGMOD/PODS.

 

Ryan Marcus, Emily Zhang and Tim Kraska. 2020. CDFShop: Exploring and Optimizing Learned Index Structures. 2020 ACM SIGMOD/PODS. (demo paper)

 

Anil Shanbhag, Samuel Madden and Xiangyao Yu.  2020. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. 2020 ACM SIGMOD/PODS.

 

Favyen Bastani, Songtao He, Arjun Balasingam, Karthik Gopalakrishnan, Mohammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska and Sam Madden. 2020. MIRIS: Fast Object Track Queries in Video. 2020 ACM SIGMOD/PODS.

 

Chengliang Chai, Lei Cao, Guoliang Li, Jian Li, Yuyu Luo and Samuel Madden. 2020. Human-in-the-Loop Outlier Detection. 2020 ACM SIGMOD/PODS.

 

Erfan Zamanian, Julian Shun, Carsten Binnig and Tim Kraska. 2020. Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks. 2020 ACM SIGMOD/PODS.

 

Lei Cao, Huayi Zhang, Yizhou Yan, Samuel Madden and Elke A. Rundensteiner. 2020. Continuously Adaptive Similarity Search. 2020 ACM SIGMOD/PODS.

 

Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska and Mohammad Alizadeh. 2020. Cost-Guided Cardinality Estimation: Focus Where it Matters.  Self-Managing Database Systems 2020 (SMDB 2020).

 

Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka and Andreas Krause. 2020. Distributionally Robust Bayesian Optimization. International Conference on Artificial Intelligence and Statistics (AISTATS) 2020.

 

Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi and Stefanie Jegelka. 2020. What Can Neural Networks Reason About? International Conference on Learning Representations (ICLR) 2020. (Spotlight).

 

Vikram Nathan, Jialin Ding, Mohammad Alizadeh and Tim Kraska. 2020. Learning Multi-dimensional Indexes. Northeast Database Day 2020.

 

Ryan Marcus and Tim Kraska. 2020. Learning to Multiplex Simple Query Optimizers. Northeast Database Day 2020.

 

Matthew J. Perron, Raul Castro Fernandez, David DeWitt and Samuel Madden. 2020. Starling: How to Build a Query Engine on Cloud Functions. Northeast Database Day 2020.

 

Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md and Tim Kraska. 2020. LISA: Towards Learned DNA Sequence Search. Northeast Database Day 2020 (poster).

 

Mengyuan Sun, Joana M. F. da Trindade, Samuel Madden, Julian Shun and Nesime Tatbul. 2020. In-memory Graph Partitioning for Efficient Temporal Graph Analytics on NVRAM. Northeast Database Day 2020 (poster).

 

Erfan Zamanian, Xiangyao Yu, Michael Stonebraker and Tim Kraska. 2020. Rethinking Database High Availability with RDMA Networks. Northeast Database Day 2020 (poster).

 

El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Nan Tang, Mourad Ouzzani and Michael Stonebraker. 2020. Dagger: A Data (not code) Debugger. Conference on Innovative Data Systems Research (CIDR) 2020.

 

2019

Hongzi Mao, Malte Schwarzkopf, Hao He and Mohammad Alizadeh. 2019. Towards Safe Online Reinforcement Learning in Computer Systems.  Machine Learning for Systems Workshop at Neural Information Processing Systems (NeurIPS) 2019.   

 

Vikram Nathan. 2019. Learned Multi-dimensional Indexing. Machine Learning for Systems Workshop at Neural Information Processing Systems (NeurIPS) 2019. (Contributed talk).

 

Haonan Wang, Hao He, Mohammad Alizadeh and Hongzi Mao. 2019. Learning Caching Policies with Subsampling. Machine Learning for Systems Workshop at Neural Information Processing Systems (NeurIPS) 2019.   

 

Zeyuan Shang, Emanuel Zgraggen and Tim Kraska. 2019. Alpine Meadow: A System for Interactive AutoML. MLSys: Workshop on Systems for ML at Neural Information Processing Systems (NeurIPS) 2019.

 

Zeyuan Shang, Emanuel Zgraggen, Philipp Eichmann and Tim Kraska. 2019. Niseko: a Large-Scale Meta-Learning Dataset. Workshop on Meta-Learning, Workshop on Systems for ML at Neural Information Processing Systems (NeurIPS) 2019.

 

Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. MLSys: Workshop on Systems for ML at Neural Information Processing Systems (NeurIPS) 2019. [GitHub]

 

Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md and Tim Kraska. 2019. LISA: Towards Learned DNA Sequence Search. MLSys: Workshop on Systems for ML at Neural Information Processing Systems (NeurIPS) 2019. (selected for oral presentation)

 

Ravichandra Addanki, Shaileshh Bojja Venkatakrishnan, Shreyan Gupta, Hongzi Mao and Mohammad Alizadeh. 2019. Learning Generalizable Device Placement Algorithms for Distributed Machine Learning. Neural Information Processing Systems (NeurIPS) 2019.

 

Zhijian Liu, Haotian Tang, Yujun Lin and Song Han. 2019. Point-Voxel CNN for Efficient 3D Deep Learning. Neural Information Processing Systems (NeurIPS) 2019.

 

Ligeng Zhu, Zhijian Liu and Song Han. 2019. Deep Leakage from Gradients. Neural Information Processing Systems (NeurIPS) 2019.

 

Joshua Robinson, Suvrit Sra and Stefanie Jegelka. 2019. Flexible Modeling of Diversity with Strongly Log-Concave Distributions. Neural Information Processing Systems (NeurIPS) 2019.

 

Matthew Staib and Stefanie Jegelka. 2019. Distributionally Robust Optimization and Generalization in Kernel Methods. Neural Information Processing Systems (NeurIPS) 2019.

 

Hongzi Mao, Parimarjan Negi, Akshay Narayan, Hanrui Wang, Jiacheng Yang, Haonan Wang, Ryan Marcus, Ravichandra Addanki, Mehrdad Khani, Songtao He, Vikram Nathan, Frank Cangialosi, Shaileshh Bojja Venkatakrishnan, Wei-Hung Weng, Song Han, Tim Kraska and Mohammad Alizadeh. 2019. Park: An Open Platform for Learning-Augmented Computer Systems. Neural Information Processing Systems (NeurIPS) 2019. [code] [blog post]

 

Ji Lin, Chuang Gan, Song Han. 2019. TSM: Temporal Shift Module for Efficient Video Understanding. International Conference on Computer Vision (ICCV). [paper] [demo] [code] [industry integration]

 

Lei Cao, Wenbo Tao, Sungtae An, Jing Jin (Massachusetts General Hospital), Yizhou Yan, Xiaoyu Liu, Wendong Ge (Massachusetts General Hospital), Adam Sah, Leilani Battle, Jimeng Sun, Remco Chang, Brandon Westover (Massachusetts General Hospital), Samuel Madden, Michael Stonebraker. 2019. Smile: A System to Support Machine Learning on EEG Data at Scale. VLDB 2019. (Industry Track paper)

 

El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, Ahmed Elmagarmid. 2019. Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics. VLDB 2019. (Demo paper)

 

Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang,  Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul.  2019. “Neo: A Learned Query Optimizer.” VLDB 2019. [pdf]

 

Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann, Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal, Tim Kraska. 2019. Democratizing Data Science through Interactive Curation of ML Pipelines. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 1171-1188. DOI

 

Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska. 2019. FITing-Tree: A Data-aware Index Structure. In arXiv, 2019. [project page] [pdf]

 

Tobias Ziegler, Sumukha Tumkur Vani, Carsten Binnig, Rodrigo Fonseca, Tim Kraska. 2019. Designing Distributed Tree-based Index Structures for Fast RDMA-capable Networks. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 741-758. DOI

 

Stratos Idreos and Tim Kraska. 2019. From Auto Tuning One Size Fits All to Self-Designed and Learned Data Intensive Systems.  In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 2054-2059. DOI

 

El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref and Michael Stonebraker. 2019. Towards an End-to-End Human-Centric Data Cleaning Framework. HILDA@SIGMOD 2019: 1:1-1:7

 

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh. 2019. Learning Scheduling Algorithms for Data Processing Clusters. SIGCOMM 2019: 270-288. [project page]

 

Vikram Nathan, Vibhaalakshmi Sivaraman, Ravichandra Addanki, Mehrdad Khani, Prateesh Goyal, Mohammad Alizadeh. 2019. End-to-End Transport for Video QoE Fairness. SIGCOMM 2019.

 

Hongzi Mao, Akshay Narayan, Parimarjan Negi, Hanrui Wang, Jiacheng Yang, Haonan Wang, Mehrdad Khani, Songtao He, Ravichandra Addanki, Ryan Marcus, Frank Cangialosi, Wei-Hung Weng, Song Han, Tim Kraska, Mohammad Alizadeh. 2019. Park: An Open Platform for Learning Augmented Computer Systems. RL for Real Life ICML 2019 Workshop (Best paper award).

 

Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, Eytan Bakshy. 2019.  Real-world Video Adaptation with Reinforcement Learning. RL for Real Life ICML 2019 Workshop.

 

Amy Zhao, Guha Balakrishnan, Fredo Durand, John V. Guttag, Adrian V. Dalca. 2019. Data Augmentation Using Learned Transformations for One-Shot Medical Image Segmentation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8543-8553.

 

Fredrik D. Johansson, Rajesh Ranganath, David Sontag. 2019. Support and Invertibility in Domain-Invariant Representations. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (AISTATS) (To appear), 2019.

 

Omer Gottesman, Fredrik D. Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, Leo Anthony Celi. 2019. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1): 16–18. 2019.

 

Lei Cao, Yizhou Yan, Samuel Madden, Elke Rundensteiner, Mathan Gopalsamy. 2019. Efficient Discovery of Sequence Outlier Patterns. PVLDB, 12(8): 920-932, 2019

 

Charlotte Bunne, David Alvarez-Melis, Andreas Krause, Stefanie Jegelka. Learning Generative Models across Incomparable Spaces. 2019. International Conference on Machine Learning (ICML), 2019.

 

Wenbo Tao, Xiaoyu Liu, Yedi Wang, Leilani Battle, Cagatay Demiralp, Remco Chang, Michael Stonebraker. Kyrix: Interactive Pan/Zoom Visualizations at Scale. 2019. Eurographics Conference on Visualization (EuroVis) 2019.

 

Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. 2019. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. (Oral presentation)

 

Matthew Staib, Bryan Wilder, Stefanie Jegelka. Distributionally Robust Submodular Maximization. 2019. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

 

Kevin Hu, Michiel A. Bakker, Stephen Li, Tim Kraska, César Hidalgo. VizML: A Machine Learning Approach to Visualization Recommendation. 2019. In CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), 2019. Read a summary.

 

Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka. How Powerful are Graph Neural Networks? 2019. International Conference on Learning Representations (ICLR), 2019. (Oral Presentation)

 

Yeounoh Chung, Tim Kraska, Steven Euijong Whang, Neoklis Polyzotis. Slice Finder: Automated Data Slicing for Model Validation. 2019. IEEE International Conference on Data Engineering (ICDE), 2019.  (Short paper version to appear.)

 

Han Cai, Ligeng Zhu, Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. 2019. To appear in ICLR’19 (Seventh International Conference on Learning Representations).

 

Ji Lin, Chuang Gan, Song Han. Defensive Quantization: When Efficiency Meets Robustness. 2019. To appear in ICLR’19 (Seventh International Conference on Learning Representations).

 

Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. Variance Reduction for Reinforcement Learning in Input-Driven Environments. 2019. To appear in ICLR’19 (Seventh International Conference on Learning Representations).

 

Raul Castro Fernandez and Samuel Madden. Multi-Modal Data Exploration with a Learned Relational Embedding. North East Database Day 2019.

 

Ryan Marcus and Olga Papaemmanouil. Making ML-for-DB Easier: Operator Embeddings via Deep Learning. North East Database Day 2019.

 

Jialin Ding. Learned Multi-Dimensional Index for Data Warehouses. North East Database Day 2019. (Poster)

 

Leonhard Spiegelberg and Tim Kraska. Robust Data Centric Code Generation. North East Database Day 2019. (Poster)

 

Ani Kristo, Kapil Vaidya, Ugur Cetintemel, Tim Kraska. A Learned Sorting Algorithm. North East Database Day 2019. (Poster)

 

Manasi Vartak. Data Science is Growing Up: Building the DS/ML Infrastructure Backbone. North East Database Day 2019. (Poster)

 

Wenbo Tao, Xiaoyu Liu, Cagatay Demiralp, Remco Chang, Michael Stonebraker. Kyrix: Interactive Visual Data Exploration at Scale. 2019. Conference on Innovative Data Systems Research (CIDR) 2019. 

 

Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, Vikram Nathan. SageDB: A Learned Database System. 2019. Conference on Innovative Data Systems Research (CIDR) 2019.

 

2018

Songtao He, Favyen Bastani, Sofiane Abbar, Mohammad Alizadeh, Hari Balakrishnan, Sanjay Chawla, Samuel Madden. RoadRunner: Improving the Precision of Road Network Inference from GPS Trajectories. 2018. ACM SIGSPATIAL, Seattle, WA, November 2018. [PDF] [BibTex]

 

Favyen Bastani, Songtao He, Sofiane Abbar, Mohammad Alizadeh, Hari Balakrishnan, Sanjay Chawla, Samuel Madden. Machine-Assisted Map Editing. 2018. ACM SIGSPATIAL, Seattle, WA, November 2018. [PDF] [BibTex]

 

Irene Chen, Fredrik D. Johansson, David Sontag. Why is My Classifier Discriminatory? 2018. Neural Information Processing Systems (NeurIPS), 2018. Spotlight. [Watch video]

 

Hongzhou Lin and Stefanie Jegelka. ResNet with one-neuron hidden layers is a Universal Approximator. 2018. Neural Information Processing Systems (NeurIPS), 2018. Spotlight.

 

Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, Volkan Cevher. 2018. Adversarially Robust Optimization with Gaussian Processes. Neural Information Processing Systems (NeurIPS), 2018. Spotlight.

 

Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. International Conference on Machine Learning (ICML), 2018. Long talk.

 

Harini Suresh, Jen J. Gong, John V. Guttag. 2018. Learning Tasks for Multitask Learning: Heterogeneous Patient Populations in the ICU. Proceedings of the Knowledge Discovery and Data Mining Conference (KDD 2018), 2018.  

 

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18).

 

Manasi Vartak, Joana M. F. da Trindade, Samuel Madden, Matei Zaharia. 2018. MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis.  In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18).

 

Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han. 2018. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. ECCV’18 (European Conference on Computer Vision).

 

Michael L. Brodie, Michael Stonebraker, Ricardo Mayerhofer, Jialing Pei. The Case for the Co-evolution of Applications and Data, North East Database Day 2018 (NEDB 2018), January 19, 2018.

 

Yizhou Yan, Lei Cao, Samuel Madden, Elke Rundensteiner. 2018. SWIFT: Mining Representative Patterns from Large Event Streams. PVLDB, 12(3): 265-277, 2018.

 

Tim Kraska. Northstar: An Interactive Data Science System. 2018. PVLDB 11(12): 2150-2164 (2018)

 

Carsten Binnig, Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Tim Kraska, Zeyuan Shang, Eli Upfal, Robert C. Zeleznik, Emanuel Zgraggen. 2018. Towards Interactive Curation & Automatic Tuning of ML Pipelines. DEEM@SIGMOD 2018: 1:1-1:4

 

2017

Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Nan Tang. 2017. “The Data Civilizer System.” CIDR 2017.

 

Michael Stonebraker, Dong Deng, Michael L. Brodie. 2017. Application-Database Co-Evolution: A New Design and Development Paradigm. New England Database Day, (pp. 1–3). January 2017.

 

2016

Manasi Vartak, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, Matei Zaharia. 2016. MODELDB: A System for Machine Learning Model Management. HILDA 2016.

 

Michael Stonebraker, Dong Deng, Michael L. Brodie. 2016. Database Decay and How to Avoid It (pp. 1–10). Proceedings of the IEEE International Conference on Big Data, Washington, DC.  2016.

 

2015

Evan R. Sparks, Ameet Talwalkar, Michael J. Franklin, Michael I. Jordan, Tim Kraska. 2015. TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries. CoRR abs/1502.00068, 2015.

 

2013

Tim Kraska, Ameet Talwalkar, John C. Duchi, Rean Griffith, Michael J. Franklin, Michael I. Jordan. 2013. MLbase: A Distributed Machine-learning System. CIDR 2013.