Research Projects

AnHai Doan

Data integration (DI) has been a long-standing challenge in the data management community. So far, the vast majority of DI work has focused on developing DI algorithms. Going forward, we argue that far more effort should be devoted to building DI systems in order to advance the field. DI is engineering by nature; we cannot keep developing DI algorithms in a vacuum. At some point we must build end-to-end systems to evaluate the algorithms, to integrate research and development efforts, and to make practical impact. The question, then, is what kinds of DI systems we should build, and how. In this direction we focus on identifying problems with current DI systems, then developing a radically new agenda for building DI systems. These new kinds of DI systems have the following distinguishing characteristics:

1. They guide the user through the end-to-end DI workflow, step by step.
2. For each step, they provide automated or semi-automated tools to address the "pain points" of the step.
3. The tools seek to cover the entire DI workflow, rather than just a few steps as current DI systems often do.
4. The tools are being built on top of a data science and big data ecosystem. Today the two most popular such ecosystems are centered on R and Python; we currently target the Python data science and big data ecosystem. A minimal sketch of such a step-by-step workflow appears after this list.
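As a concrete illustration, here is a minimal sketch of a step-by-step entity matching workflow (load, block, match) on the Python data science ecosystem. The tables, the blocking rule, and the similarity threshold are all hypothetical; they stand in for the automated or semi-automated tools such a system would provide at each step.

```python
# Minimal sketch of a step-by-step DI (entity matching) workflow on the
# Python data science ecosystem. Tables and threshold are hypothetical.
import pandas as pd

# Step 1: load the two tables to be integrated.
a = pd.DataFrame({"id": [1, 2], "name": ["Apple iPhone 11", "Dell XPS 13"],
                  "price": [699, 999]})
b = pd.DataFrame({"id": [10, 20], "name": ["iPhone 11 by Apple", "XPS 13 Dell laptop"],
                  "price": [695, 1010]})

# Step 2: blocking -- keep only candidate pairs that are plausibly matches,
# here by requiring at least one shared word in the product names.
def shared_word(x, y):
    return bool(set(x.lower().split()) & set(y.lower().split()))

candidates = [(ra, rb) for _, ra in a.iterrows() for _, rb in b.iterrows()
              if shared_word(ra["name"], rb["name"])]

# Step 3: matching -- score each candidate pair with a simple similarity
# (Jaccard over name tokens) and accept pairs above a threshold.
def jaccard(x, y):
    sx, sy = set(x.lower().split()), set(y.lower().split())
    return len(sx & sy) / len(sx | sy)

matches = [(ra["id"], rb["id"]) for ra, rb in candidates
           if jaccard(ra["name"], rb["name"]) >= 0.5]
print(matches)  # pairs of ids judged to refer to the same entity
```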

Paris Koutris

Pricing Relational Data: Data is bought and sold everywhere on the web. This project aims to study the theoretical principles of pricing data, and to design and implement a practical and flexible framework that supports pricing relational queries as well as other tasks (graph analytics, learning ML models).
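To make the setting concrete, here is a toy sketch that prices a selection query by the amount of data it reveals. The relation, the per-tuple unit price, and the pricing rule are hypothetical; a real framework must additionally enforce properties such as arbitrage-freeness across queries.

```python
# Toy sketch of query-based pricing over a relation, using a hypothetical
# per-tuple price. Not a principled pricing scheme, only an illustration.
UNIT_PRICE = 0.05  # hypothetical price per revealed tuple

relation = [
    {"city": "Madison", "pop": 270000},
    {"city": "Milwaukee", "pop": 577000},
    {"city": "Green Bay", "pop": 107000},
]

def price_selection(predicate):
    """Price a selection query by how many tuples it reveals."""
    answer = [t for t in relation if predicate(t)]
    return answer, len(answer) * UNIT_PRICE

answer, price = price_selection(lambda t: t["pop"] > 200000)
print(answer, price)  # two tuples revealed, priced at 0.10
```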

Models for Large-scale Parallelism: There exists a wide spectrum of large-scale data management systems that exploit parallelism to speed up computation. The overarching goal of this project is to develop theoretical models that help us understand the limits and tradeoffs of parallelizing data processing tasks. Some of the key questions we study: How much synchronization is necessary to perform a certain task? What is the minimum amount of communication needed to compute queries over relational data? How does data skew affect the design of load-balanced algorithms?
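The sketch below simulates this last question: it hash-partitions a join key across p servers and compares the maximum per-server load on near-uniform versus skewed data. The data distributions and server count are hypothetical; the point is only that a single heavy hitter defeats plain hash partitioning.

```python
# Small simulation (hypothetical data) of how skew affects the maximum
# per-server load when hash-partitioning a join key across p servers,
# the load metric studied in massively parallel computation models.
from collections import Counter
import random

def max_load(values, p):
    """Max number of tuples any one of p servers receives under hash partitioning."""
    counts = Counter(hash(v) % p for v in values)
    return max(counts.values())

p, n = 10, 100_000
uniform = [random.randrange(n) for _ in range(n)]                        # little skew
skewed = [0] * (n // 2) + [random.randrange(n) for _ in range(n // 2)]   # one heavy hitter

print("ideal load  :", n // p)
print("uniform keys:", max_load(uniform, p))   # close to n / p
print("skewed keys :", max_load(skewed, p))    # one server gets >= n / 2
```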

Jignesh Patel

Quickstep: Quickstep is a next-generation data processing platform built around a core of relational kernels. The key design philosophy is to ensure that these kernels, and compositions of these kernels, can exploit the full potential of the underlying hardware; this design principle is often called running at bare-metal speed. The current roadmap is to produce a platform that can run relational database applications using SQL as the interface. The longer-term roadmap is to cover a broader class of applications, with language surfaces that include graph analytics, text analytics, and machine learning.
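The sketch below is not Quickstep code (Quickstep is a C++ engine); it only illustrates the kernel-composition idea with NumPy. A selection kernel and an aggregation kernel each operate on whole columns, so the hardware's vector units can be exploited instead of interpreting one tuple at a time.

```python
# Illustration of composing vectorized relational kernels (not Quickstep itself):
# a selection kernel feeding an aggregation kernel, both column-at-a-time.
import numpy as np

price = np.array([699.0, 999.0, 1299.0, 499.0])
qty = np.array([3, 1, 2, 5])

# Selection kernel: evaluate the predicate over the whole column at once.
mask = price > 600.0                         # SELECT ... WHERE price > 600

# Aggregation kernel: composed with the selection by consuming its bitmask.
revenue = np.sum(price[mask] * qty[mask])    # SUM(price * qty)
print(revenue)
```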

Theo Rekatsinas

ML-first Data Integration: We are exploring the fundamental connections between data cleaning and integration on the one hand, and statistical learning and inference on the other. For example, in SLimFast we proved that one can resolve noisy and conflicting information from heterogeneous data sources with formal guarantees. Our latest effort in ML-first data integration is HoloClean, which is built on the ideas of weak supervision and probabilistic inference (see our blog post). These systems transition from logic to probabilities, in a way similar to the AI revolution of the eighties.
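The toy sketch below is not the SLimFast or HoloClean algorithm; it only illustrates the underlying idea of resolving conflicting claims statistically, by alternating between accuracy-weighted voting on values and re-estimating source accuracies. The claims and prior accuracies are hypothetical.

```python
# Toy truth-discovery sketch (hypothetical data, not SLimFast/HoloClean):
# jointly estimate source accuracies and the most probable value per object.
claims = {  # object -> {source: claimed value}
    "obj1": {"s1": "A", "s2": "A", "s3": "B"},
    "obj2": {"s1": "X", "s2": "Y", "s3": "Y"},
    "obj3": {"s1": "M", "s2": "M", "s3": "M"},
}
accuracy = {"s1": 0.5, "s2": 0.5, "s3": 0.5}  # uniform prior over sources

for _ in range(5):  # alternate value inference and accuracy re-estimation
    truth = {}
    for obj, votes in claims.items():
        scores = {}
        for src, val in votes.items():
            scores[val] = scores.get(val, 0.0) + accuracy[src]
        truth[obj] = max(scores, key=scores.get)   # accuracy-weighted vote
    for src in accuracy:
        correct = sum(claims[obj][src] == truth[obj] for obj in claims)
        accuracy[src] = correct / len(claims)

print(truth, accuracy)
```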

Quality-aware Data Source Management: Our focus over the last few years has been to dramatically reduce the time domain experts spend exploring, identifying, and analyzing valuable data for analytics. We are developing SourceSight, a prototype data source management system that allows users to interactively explore a large number of heterogeneous data sources and discover valuable sets of sources for diverse integration tasks.
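The toy sketch below (hypothetical data, not the SourceSight system) illustrates one ingredient of source discovery: greedily selecting a small set of sources that together cover the most entities relevant to an integration task.

```python
# Toy sketch of greedy source selection by coverage (hypothetical data).
sources = {
    "src_a": {"e1", "e2", "e3"},
    "src_b": {"e3", "e4"},
    "src_c": {"e4", "e5", "e6"},
}

def greedy_select(sources, k):
    """Pick up to k sources, each time adding the one with the most new coverage."""
    chosen, covered = [], set()
    for _ in range(k):
        gains = {s: len(ents - covered) for s, ents in sources.items() if s not in chosen}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break
        chosen.append(best)
        covered |= sources[best]
    return chosen, covered

print(greedy_select(sources, 2))  # picks src_a and src_c, covering e1..e6
```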