
Research

AnHai Doan

Data integration (DI) has been a long-standing challenge in the data management community. So far the vast majority of DI work has focused on developing DI algorithms. Going forward, we argue that far more effort should be devoted to building DI systems, in order to advance the field. DI is engineering by nature. We cannot just keep developing DI algorithms in a vacuum. At some point we must build end-to-end systems to evaluate the algorithms, to integrate research and development efforts, and to make practical impact. The question then is what kind of DI systems we should build, and how? In this direction we focus on identifying problems with current DI systems, then developing a radically new agenda for building DI systems. These new kinds of DI systems have the following distinguishing characteristics:

1. They guide the user through the end-to-end DI workflow, step by step.
2. For each step, they provide automated or semi-automated tools to address the "pain points" of the step.
3. Tools seek to cover the entire DI workflow, not just a few steps as current DI systems often do.
4. Tools are built on top of a data science and big data ecosystem. Today the two most popular such ecosystems build on R and Python. We currently target the Python data science and big data ecosystem.

Paris Koutris

Pricing Relational Data: Data is sold and bought everywhere on the web. This project aims to study the theoretical principles of pricing data, and to design and implement a practical and flexible framework that supports pricing relational queries and other tasks (graph analytics, learning ML models).
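
A core principle in the query-pricing literature is that a price function should be arbitrage-free: roughly, if a bundle of cheaper views determines a query, a buyer should not be able to obtain the query for less by purchasing the bundle instead. The sketch below illustrates only this subadditivity check; the prices and view names are hypothetical, and this is not the project's actual framework.

```python
# Hypothetical price catalog: two views V1, V2 that together determine query Q.
prices = {"V1": 5.0, "V2": 4.0, "Q": 12.0}

def has_arbitrage(target, bundle, prices):
    """True if buying the bundle of views is cheaper than the target query,
    i.e., the price function violates subadditivity for this pair."""
    return sum(prices[v] for v in bundle) < prices[target]

print(has_arbitrage("Q", ["V1", "V2"], prices))  # True: 9.0 < 12.0, so Q is mispriced
```

A pricing framework must guarantee no such pair exists for any query the buyer can form, which is what makes the problem theoretically interesting.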

Models for Large-scale Parallelism: There exists a wide spectrum of large-scale data management systems that exploit parallelism to speed up computation. The overarching goal of this project is to develop theoretical models that help us understand the limits and tradeoffs of parallelizing data processing tasks. Some of the key questions we study: How much synchronization is necessary to perform a certain task? What is the minimum amount of communication needed to compute queries over relational data? How does data skew affect the design of load-balanced algorithms?
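
To make the communication question concrete, consider the standard one-round hash-partitioned join: every tuple is routed to the server responsible for its join key, so each of p servers receives roughly (|R| + |S|) / p tuples when keys are not skewed. The simulation below is a minimal illustration of that idea, not any particular system's implementation.

```python
from collections import defaultdict

def partition(relation, key_idx, p):
    """Route each tuple to server hash(key) % p; each routed tuple is
    one unit of communication in a one-round model."""
    shards = [[] for _ in range(p)]
    for t in relation:
        shards[hash(t[key_idx]) % p].append(t)
    return shards

def parallel_join(R, S, p=4):
    Rs, Ss = partition(R, 0, p), partition(S, 0, p)
    out = []
    for server in range(p):  # each server joins its shard locally, no further communication
        index = defaultdict(list)
        for t in Rs[server]:
            index[t[0]].append(t)
        for s in Ss[server]:
            for r in index[s[0]]:
                out.append(r + s[1:])
    return sorted(out)

R = [(1, "a"), (2, "b")]
S = [(1, "x"), (3, "y")]
print(parallel_join(R, S))  # [(1, 'a', 'x')]
```

Skew breaks this balance: if one key dominates, its server receives far more than its 1/p share, which is precisely why skew-resilient algorithms are a key question above.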

Jignesh Patel

Quickstep: Quickstep is a next-generation data processing platform that starts with an underlying relational kernel core. The key design philosophy is to ensure that these kernels - and compositions of these kernels - can exploit the full potential of the underlying hardware. This design principle is often called running at bare-metal speeds. The current roadmap is to produce a platform that can run relational database applications using SQL as the interface. The longer-term roadmap is to cover a broader class of applications, including graph analytics, text analytics, and machine learning.

Theo Rekatsinas

ML-first Data Integration: We are exploring the fundamental connections between data cleaning and integration and statistical learning and inference. For example, in SLimFast, we proved that one can resolve noisy and conflicting information from heterogeneous data sources with formal guarantees. Our latest effort in ML-first data integration is HoloClean, which is built on the ideas of weak supervision and probabilistic inference; see our blog post. These systems transition from logic to probabilities, much like the AI revolution of the eighties.
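
A minimal illustration of the statistical view of conflict resolution (this is a toy sketch, not SLimFast's or HoloClean's actual inference procedure): instead of a logical rule picking one value, each source's claim is weighted by an estimated source accuracy, and the highest-scoring value wins.

```python
from collections import defaultdict

def resolve(claims, accuracy):
    """claims: list of (source, claimed_value) for one data item;
    accuracy: estimated per-source accuracy. Returns the value with the
    highest accuracy-weighted vote."""
    score = defaultdict(float)
    for src, val in claims:
        score[val] += accuracy[src]
    return max(score, key=score.get)

# Hypothetical example: three sources disagree about a city field.
claims = [("s1", "Madison"), ("s2", "Milwaukee"), ("s3", "Madison")]
accuracy = {"s1": 0.9, "s2": 0.6, "s3": 0.7}
print(resolve(claims, accuracy))  # Madison (weighted vote 1.6 vs 0.6)
```

The research questions sit behind this sketch: how to estimate source accuracies without labels, and what formal guarantees the resulting estimates carry.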

Quality-aware data source management: Our focus over the last few years has been to dramatically reduce the time domain experts spend in exploring, identifying, and analyzing valuable data for analytics. We are developing SourceSight, a prototype data source management system that allows users to interactively explore a large number of heterogeneous data sources, and to discover valuable sets of sources for diverse integration tasks.

Xiangyao Yu

1000-Core Database: Computer architectures are moving towards many-core machines with dozens or even hundreds of cores on a single chip, which current database management systems (DBMSs) are not designed for. We performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. Our analysis (VLDB'14) shows that all evaluated algorithms fail to scale to this level of parallelism. For each algorithm, we identified artificial and fundamental bottlenecks. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from the ground up and is tightly coupled with the hardware. Our DBMS is open source on GitHub (DBx1000).

IMP: Indirect Memory Prefetcher: Important applications like machine learning, graph analytics, and sparse linear algebra are dominated by irregular memory accesses, which have little temporal or spatial locality and are difficult to prefetch using traditional techniques. A majority of these irregular accesses come from indirect patterns of the form A[B[i]]. We propose an efficient hardware indirect memory prefetcher (IMP) to hide the memory latency of this access pattern. We also propose a partial cacheline accessing mechanism to reduce the network and DRAM bandwidth pressure caused by the lack of spatial locality. Evaluated on seven applications, IMP showed a 56% speedup on average (up to 2.3x) over a baseline streaming prefetcher on a 64-core system.
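
The A[B[i]] pattern is easy to see in a few lines (the arrays below are just illustrative data): the index array B is read sequentially, so a conventional stride prefetcher handles it, but the resulting accesses into A are data-dependent and jump around memory, which is exactly what IMP targets.

```python
A = [10, 20, 30, 40, 50]  # data array: accessed indirectly, in irregular order
B = [4, 0, 3, 1]          # index array: read sequentially (prefetch-friendly)

# The gather A[B[i]]: the address of each A access is only known after
# loading B[i], so a stride prefetcher cannot predict it.
gathered = [A[B[i]] for i in range(len(B))]
print(gathered)  # [50, 10, 40, 20]
```

A hardware indirect prefetcher exploits the fact that once B is being streamed, future A addresses are computable ahead of the demand loads: prefetch A[B[i+d]] while the core works on A[B[i]].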

Near Cloud Storage Computing: Modern cloud platforms disaggregate computation and storage into separate services. In this project, we explored the idea of using the limited computation available inside the simple storage service (S3) offered by AWS to accelerate data analytics. We use the existing S3 Select feature to accelerate not only simple database operators like select and project, but also complex operators like join, group-by, and top-K. We proposed optimization techniques for each individual operator and demonstrated more than a 6x performance improvement over a set of representative queries.
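
The core idea of pushdown can be simulated in a few lines (the operator split below is a toy illustration, not the project's implementation): the storage side evaluates select and project before any bytes cross the network, and the compute side finishes the more complex operator, here a join.

```python
def storage_select(rows, predicate, columns):
    """Runs 'inside storage', in the spirit of S3 Select: filter and project
    so only qualifying rows and needed columns are shipped to compute."""
    return [tuple(r[c] for c in columns) for r in rows if predicate(r)]

# Hypothetical tables stored in the cloud object store.
orders = [{"id": 1, "cust": "a", "amt": 99}, {"id": 2, "cust": "b", "amt": 5}]
custs = [{"cust": "a", "region": "EU"}, {"cust": "b", "region": "US"}]

# Storage side: only ('cust', 'amt') pairs for qualifying orders are shipped.
shipped = storage_select(orders, lambda r: r["amt"] > 10, ("cust", "amt"))

# Compute side: the join runs over the much smaller shipped result.
by_cust = {c["cust"]: c["region"] for c in custs}
joined = [(cust, amt, by_cust[cust]) for cust, amt in shipped]
print(joined)  # [('a', 99, 'EU')]
```

The savings come from the same place as in the real system: network and scan cost scale with the filtered, projected data rather than with the full table.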