Database Affiliates 2023

The Wisconsin Database Affiliates Program

Disseminate research results in data management
Facilitate academic-industry communication
Enable technology transfer
Connect students to potential employers
Solicit research and education feedback from industry

The annual workshop features research talks from faculty and students, industry talks, posters and demos, and social activities.

Schedule

The workshop will take place over two days.
October 12, Thursday, 8:30-4:30. Full-day workshop at Union South in room Northwoods (3rd floor). A detailed schedule will be announced soon.
October 13, Friday, 9:00-noon. Sponsor talks and one-on-one meetings betweeen companies and students.

October 12 (Northwoods@Union South)

2023

8:30 am

9:00 am

Breakfast and Mingle

9:00 am

9:20 am

Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries

Yifei Yang

This paper presents predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join operation, to multi-table joins such that the filtering benefits can be significantly increased. Predicate transfer is inspired by the seminal theoretical results by Yannakakis, which uses semi-joins to pre-filter acyclic queries. Predicate transfer generalizes the theoretical results to any join graphs and use Bloom filters to replace semi-joins leading to significant speedup. Evaluation shows predicate transfer can outperform Bloom join by 3.1x on average on TPC-H benchmark.

9:20 am

9:40 am

Towards Large-Scale and Practical Data Analytics on GPUs

Bobbi Yogatama

Graphics Processing Units (GPUs) have shown great potential in accelerating data analytics queries due to their massive computational power and high memory bandwidth compared to CPUs. GPU databases have been extensively studied in both academia research and been deployed as commercial systems. Significant performance improvements ranging from ten times to several hundred times speedups have been reported. This massive speedup, however, is only applicable if the working sets fit in the GPU memory As the datasets grow, we will encounter two hardware bottlenecks in GPU databases: (1) limited GPU memory capacity and (2) limited CPU-GPU interconnect bandwidth. Our goal is to tackle these challenges and enable large-scale data analytics on GPUs through (1) data compression, (2) heterogeneous CPU-GPU databases, and (3) multi-GPU databases. In this talk, we will show how we explore the design space of each solution to get order-of-magnitude speedup compared to state-of-the-art approach. Finally, we will also discuss our ongoing effort to integrate the whole solutions into Camelot, a portable GPU-DBMS with DuckDB frontend.

9:40 am

10:00 am

F2: Designing a Key-Value Store for Large Skewed Workloads

Konstantinos Kanellis

Today’s key-value stores are either disk-optimized, focusing on large data and saturating device IOPS, or memory-optimized, focusing on high throughput with linear thread scaling assuming plenty of main memory. However, many practical workloads demand high performance for read and write working sets that are much larger than main memory, over a total data size that is even larger. They require judicious use of memory and disk, and today’s systems do not handle such workloads well. We present F2, a new key-value store design based on compartmentalization – it consists of five key components that work together in well-defined ways to achieve high throughput – saturating disk and memory bandwidths – while incurring low disk read and write amplification. A key design characteristic of F2 is that it separates the management of hot and cold data, across the read and write domains, and adapts the use of memory to optimize each case. Through a sequence of new latch-free system constructs, F2 solves the key challenge of maintaining high throughput with linear thread scalability in such a compartmentalized design. Detailed experiments on benchmark data validate our design’s superiority, in terms of throughput, over state-of-the-art key-value stores, when the available memory resources are scarce.

10:00 am

10:30 am

Keynote

10:30 am

10:45 am

Coffee Break

10:45 am

11:05 am

Return of the Antijoins

Hangdong Zhao

We study the complexity of evaluating Conjunctive Queries with negation (CQ¬). First, we present an algorithm with linear preprocessing time and constant delay enumeration for a class of CQs with negation called free-connex signed-acyclic queries. We show that no other queries admit such an algorithm subject to lower bound conjectures. Second, we extend our algorithm to Conjunctive Queries with negation and aggregation over a general semiring, which we call Functional Aggregate Queries with negation (FAQ¬). Such an algorithm achieves constant delay enumeration for the same class of queries, but with a slightly increased preprocessing time which includes an inverse Ackermann function. We show that this surprising appearance of the Ackermmann function is probably unavoidable for general semirings, but can be removed when the semiring has specific structure. Finally, we show an application of our results to computing the difference of CQs.

11:05 am

11:25 am

State-of-the-art Join Algorithms in Database (Theory) and Lower Bounds from Fine-grained Complexity

Austen Fan

In this talk, we will enjoy a tour to the state-of-the-art join algorithms for database queries, from Yannakakis algorithm, Worst-case Optimal Join, all the way to the state-of-the-art PANDA (Proof-Assisted eNtropic Degree-Aware) algorithm. We complement the algorithms with latest results on lower bounds for database queries, leveraging techniques from fine-grained complexity. The lower bounds are tight, or equivalently the PANDA algorithm is optimal, for several important classes of queries, including k-cycle, Loomis-Whitney Joins, Chordal Queries and etc..

11:25 am

11:45 am

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

Derek Paulsen

Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf block- top-k tf/idf blocking in a distributed share-nothing fashion on a ing in depth. We develop Sparkly, which uses Lucene to perform Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly com- pletely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we pro- vide an in-depth analysis of Sparkly’s performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.

11:45 am

12:00 pm

Lightning Talks (for posters)

12:00 pm

2:00 pm

2:20 pm

Practical Perfect Hashing for OLAP Systems

Kevin Gaffney

A perfect hash function (PHF) maps a set of keys to a range of integers with no collisions. Compared to conventional hash methods, PHFs are attractive for their low space overhead and reduced control flow. Despite their advantages, there has been little investigation into the use of PHFs for online analytical processing (OLAP). This talk presents our recent work on practical perfect hashing for OLAP. We identify several promising applications for PHFs in OLAP and survey their current use in systems and research prototypes. We then evaluate existing PHF approaches and quantify their impact on query performance. Our results are encouraging: in a real OLAP system, PHFs achieve end-to-end speedups of 1.7X and 3.1X for join and aggregate queries, respectively. Nevertheless, there is room for improvement. Future approaches that simultaneously achieve low build time and high probe throughput could offer additional performance increases.

2:20 pm

2:40 pm

Rethinking the Encoding of Integers for Scans on Skewed Data

Martin Prammer

Bit-parallel scanning techniques are characterized by their ability to accelerate compute through the process known as early pruning. Early pruning techniques iterate over the bits of each value, searching for opportunities to end compute early, before processing each data value in its entirety. However, because of this iterative evaluation, the effectiveness of early pruning depends on the relative position of bits that can be used for pruning within each value. Due to this behavior, bit-parallel techniques have faced significant challenges when processing skewed data, especially when values contain many leading zeroes. This problem is further amplified by the inherent trade-off that bit-parallel techniques make between columnar scan and fetch performance: a storage layer that supports early pruning requires multiple memory accesses to fetch a single value. Thus, in the case of skewed data, bit-parallel techniques increase fetch latency without significantly improving scan performance when compared to baseline columnar implementations. To remedy this weakness, we transform the values in bit-parallel columns using novel encodings. We propose the concept of forward encodings: a family of encodings that shift pruning-relevant bits closer to the most significant bit. Using this concept, we propose two particular encodings: the Data Forward Encoding and the Extended Data Forward Encoding. We demonstrate the impact of these encodings using multiple real-world datasets. Across these datasets, forward encodings improve the current state-of-the-art bit-parallel technique's scan and fetch performance in many cases by 1.4x and 1.3x, respectively.

2:40 pm

3:00 pm

Performance Roulette: How Cloud Weather Affects ML-Based System Optimization

Johannes Freischuetz

As system complexity, workload diversity, and cloud computing adoption continue to grow, both operators and developers are turning to ML-based approaches for optimizing systems to a particular deployment and workload scenario in order to achieve the best performance and/or least cost. Most approaches to DBMS autotuning do not currently account for noise during sampling. We show that ML-based optimizers are sensative to noise.

3:00 pm

3:20 pm

Industry Talks

3:20 pm

3:45 pm

Break

3:45 pm

4:30 pm

What's Next Session

October 13 (Auditorium 1240@CS Department)

2023

9:00 am

9:30 am

Generative AI research efforts at GSL

Avrilia Floratou (Microsoft)

In this talk, I will first give an overview of the mission of the Microsoft Gray Systems Lab and will provide an overview of the various research initiatives the team is undertaking. I will then discuss in more details our research portfolio in the space of generative AI and LLMs with a special focus on natural language interfaces for data management.

Bio: Avrilia is a Principal Scientist Manager at Microsoft’s Gray Systems Lab. Her research broadly lies in the area of data management with a recent focus on leveraging Large Language Models (LLMs) to improve user productivity in domains such as data integration, data exploration and code migration. Her research interests also include semantics-aware data science and large-scale stream processing. Previously, she worked on leveraging data semantics and domain knowledge to improve the productivity of data scientists leading to the Semantic Link offering in Microsoft Fabric, Microsoft’s end-to-end, unified analytics platform. She also co-developed the Dhalion project, a framework used to define and apply SLO-based policies on long-running applications that has been deployed in various Microsoft services. Prior to her current role, she was a research scientist at IBM Almaden Research Center working on SQL-on-Hadoop engines and natural language interfaces for databases. Avrilia received her Ph.D. and M.Sc. in Computer Science from University of Wisconsin-Madison working with Prof. Jignesh M. Patel and her B.Sc. from University of Athens in Greece.

9:30 am

10:00 am

Scaling way up with NoSQL and SQL

Mike Marty (Google)

Google launched the Big Data era out of the sheer necessity of building massive applications on unreliable commodity hardware. This talk will discuss the context of how Google created large-scale NoSQL infrastructure, but then evolved to large-scale SQL-based query processing systems. I will discuss some systems challenges often overlooked such as the management of resources in a complex system built from micro-services.

Bio: Mike Marty is a Principal Engineer and Engineering Director at Google-Madison. He focuses on evolving F1 Query-- Google’s federated query engine-- as well as building out table-based storage systems. Prior to entering the world of data management, he spent a decade at Google focusing on high-performance networking and systems. His PhD is in Computer Architecture from the University of Wisconsin-Madison.

10:00 am

10:30 am

Secure & Private Data Collaboration with Snowflake

Ian Connolly (Snowflake)

Data collaboration is fundamental that transforms Snowflake from a cloud-native database that excels at data warehousing workloads into the Data Cloud. But enabling data collaboration in a safe and secure way is a significant engineering challenge. We’ll discuss the engineering and theoretical challenges that we’re working through to enable our customers to share and derive value from their most sensitive data, all within Snowflake.

Bio: Ian Connolly runs the Differential Privacy team at Snowflake. He was previously Head of Engineering at LeapYear Technologies, a San Francisco-based company which built a differentially private data platform until it was acquired by Snowflake in February 2023.

Lodging and Parking

Union South Hotel

Fluno Center

Doubletree

Parking (without a permit) on campus is very difficult. We recommend taking a Uber/Lyft/Taxi instead of renting a car.

Sponsorship

While the event is currently free to all attendants, we greatly appreciate any form of sponsorship. The benefits of an active sponsor include (1) having access to CVs of our graduate students, (2) 30-min sponsor talk on Friday, and (3) one-on-one interview with students on Friday. Please contact Xiangyao (yxy@cs.wisc.edu) if you are interested in any type of sponsorship.

Active sponsors

BLARE: Exploiting Structure in Regular Expression Queries

Ling Zhang

We present BLARE, Blazingly Fast Regular Expression, a regular expression matching framework that is inspired by the optimizer mechanism in database engines. BLARE decomposes a regex into multiple regex and string components and then creates evaluation strategies in which the components can be evaluated in an order that is not strictly a left-to-right translation of the input regex query. Rather than using a cost-based optimization approach, BLARE uses an adaptive runtime strategy based on a multi-armed bandit approach to find an efficient execution plan. BLARE is also modular and can be built on top of any existing regex library. We implement BLARE on 4 regex engines, and it achieves 1.6x to 168x improvement over two production workloads and one open-source workload.

Towards Auto-Tuning in Noisy Environments

Johannes Freischuetz

Buffer Management with Tiered Main Memory

Xiangpeng Hao

The scaling of per-GB DRAM cost has slowed down in recent years. Recent research has suggested that adding remote memory to a system can further reduce the overall memory cost while maintaining good performance. Remote memory (i.e., tiered memory), connected to host servers via high-speed interconnect protocols such as RDMA and CXL, is expected to deliver $100\times$ (less than 1us) lower latency than SSD and be more cost-effective than local DRAM through pooling and adopting cheaper memory technologies. Tiered memory opens up a large potential use cases within database systems. But previous work has only explored limited ways of using tiered memory. Our study instead provides a systematic study for DBMS to build tiered memory buffer management with respect to a wide range of tiered memory performance characteristics. Specifically, we study five different buffer management designs that leverage remote memory in different ways and evaluate them through a wide range of metrics including performance, tiered-memory latency sensitivity and cost-effectiveness. In addition, we propose a new memory provisioning strategy that guides DBMS to provision an optimal amount of local and remote memory for a given workload. Our evaluations show that while some designs achieve higher performance than others, no design can win in all measured dimensions.

Marlin: Lightweight Cluster Management for Elastic Cloud Databases with Storage Disaggregation

Wenjie Hu

Modern OLTP DBMS are moving to the public cloud because of the high cost performance provided by pay-as-you-go model for dynamic workloads. To enable this benefit, databases are supposed to be autoscaling, the ability to elastically scale out or scale in cluster resources in response to dynamic workload while guaranteeing the users’ service level agreements (SLA). Current live reconfiguration solutions present limitations. First, they complicate storage and consistency mechanisms by segregating application and system states. Second, they either compromise performance, reduce isolation levels, or are specifically designed for deterministic databases. Motivated by the inherent properties of disaggregated storage(i.e., high availability and shared access), we propose Marlin, a transactional lightweight cluster management, which consolidates storage and employs a unified concurrency control protocol for both application and system states. We further propose an optimization(i.e., Hybrid-Heat) to mitigate performance degradation by eliminating transaction abortions and minimizing the service downtime while maintaining serializable isolation. Our evaluation shows that Marlin with Hybrid-Heat incurs a mere 4.93% average transaction throughput drops during reconfiguration and can finish reconfiguration for 4 GB data within 31 seconds.

Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries

Yifei Yang

We present predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join operation, to multi-table joins such that the filtering benefits can be significantly increased. Predicate transfer is inspired by the seminal theoretical results by Yannakakis, which uses semi-joins to pre-filter acyclic queries. Predicate transfer generalizes the theoretical results to any join graphs and use Bloom filters to replace semi-joins leading to significant speedup. Evaluation shows predicate transfer can outperform Bloom join by 3.1× on average on TPC-H benchmark.

Blink-hash: An Adaptive Hybrid Index for In-Memory Time-Series Databases

Hokeun Cha

"High-speed data ingestion is critical in time-series workloads that are driven by the growth of Internet of Things (IoT) applications. We observe that traditional tree-based indexes encounter severe scalability bottlenecks for time-series workloads that insert monotonically increasing timestamp keys into an index; all insertions go to a small memory region that sees extremely high contention. In this work, we present a new index design, Blink-hash, that enhances a tree-based index with hash leaf nodes to mitigate the contention of monotonic insertions — insertions go to random locations within a hash node (which is much larger than a B+-tree node) to reduce conflicts.We develop further optimizations (median approximation and lazy split) to accelerate hash node splits. We also develop structure adaptation optimizations to dynamically convert a hash node to B+-tree nodes for good scan performance. Our evaluation shows that Blink-hash achieves up to 91.3× higher throughput than conventional indexes in a time-series workload that monotonically inserts timestamps into an index, while showing comparable scan performance to a well-optimized B+-tree."

Conjunctive Queries with Negation and Aggregation: A Linear Time Characterization

Hangdong Zhao

Cornus: Atomic Commit for a Cloud DBMS with Storage Disaggregation

Zhihan Guo

Two-phase commit (2PC) is widely used in distributed databases to ensure atomicity of distributed transactions. Conventional 2PC was originally designed for the shared-nothing architecture and has two limitations: long latency due to two eager log writes on the critical path, and blocking of progress when a coordinator fails. Modern cloud-native databases are moving to a storage disaggregation architecture where storage is a shared highly-available service. Our key observation is that disaggregated storage enables protocol innovations that can address both the long-latency and blocking problems. We develop Cornus, an optimized 2PC protocol to achieve this goal. The only extra functionality Cornus requires is an atomic compare-and-swap capability in the storage layer, which many existing storage services already support. We present Cornus in detail and show how it addresses the two limitations. We also deploy it on real storage services including Azure Blob Storage and Redis. Empirical evaluations show that Cornus can achieve up to 1.9× latency reduction over conventional 2PC.

Practical Perfect Hashing for OLAP Systems

Kevin Gaffney

A perfect hash function (PHF) maps a set of keys to a range of integers with no collisions. Compared to conventional hash methods, PHFs are attractive for their low space overhead and reduced control flow. Despite their advantages, there has been little investigation into the use of PHFs for online analytical processing (OLAP). This poster presents our recent work on practical perfect hashing for OLAP. We identify several promising applications for PHFs in OLAP and survey their current use in systems and research prototypes. We then evaluate existing PHF approaches and quantify their impact on query performance. Our results are encouraging: in a real OLAP system, PHFs achieve end-to-end speedups of 1.7X and 3.1X for join and aggregate queries, respectively. Nevertheless, there is room for improvement. Future approaches that simultaneously achieve low build time and high probe throughput could offer additional performance increases.

Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis

Yunjia Zhang

There have been many decades of work on optimizing query processing in database management systems. Recently, modern machine learning (ML), and specifically reinforcement learning (RL), have gained increased attention as a means to develop a query optimizer (QO). In this work, we take a closer look at two recent state-of-the-art (SOTA) RL-based QO methods to better understand their behavior. We find that these RL-based methods do not generalize as well as it seems at first glance. Thus, we ask a simple question: How do SOTA RL-based QOs compare to a simple, modern, adaptive query processing approach? To answer this question, we choose two simple adaptive query processing techniques and implemented them in PostgreSQL. The first adapts an individual join operation on-the-fly and switches between a Nested Loop Join algorithm and a Hash Join algorithm to avoid sub-optimal join algorithm decisions. The second is a technique called Lookahead Information Passing (LIP), in which adaptive semijoin techniques are used to make a pipeline of join operations execute efficiently. To our surprise, we find that this simple adaptive query processing approach is not only competitive to the SOTA RL-based approaches but, in some cases, outperforms the RL-based approaches. The adaptive approach is also appealing because it does not require an expensive training step, and it is fully interpretable compared to the RL-based QO approaches. Further, the adaptive method works across complex query constructs that RL-based QO methods currently cannot optimize.

The Fine-Grained Complexity of Boolean Conjunctive Queries and Sum-Product Problems

Austen Fan

We study the fine-grained complexity of evaluating Boolean Conjunctive Queries and their generalization to sum-of-product problems over an arbitrary semiring. For these problems, we present a general semiring-oblivious reduction from the k-clique problem to any query structure (hypergraph). Our reduction uses the notion of embedding a graph to a hypergraph, first introduced by Marx. As a consequence of our reduction, we can show tight conditional lower bounds for many classes of hypergraphs, including cycles, Loomis-Whitney joins, some bipartite graphs, and chordal graphs. These lower bounds have a dependence on what we call the clique embedding power of a hypergraph H, which we believe is a quantity of independent interest. We show that the clique embedding power is always less than the submodular width of the hypergraph, and present a decidable algorithm for computing it. We conclude with many open problems for future research.

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

Derek Paulsen

"Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no atten- tion. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf block- ing in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly com- pletely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we pro- vide an in-depth analysis of Sparkly’s performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility."

F2: Designing a Key-Value Store for Large Skewed Workloads

Konstantinos Kanellis

Rethinking the Encoding of Integers for Scans on Skewed Data

Martin Prammer

"Bit-parallel scanning techniques are characterized by their ability to accelerate compute through the process known as early pruning. Early pruning techniques iterate over the bits of each value, searching for opportunities to end compute early, before processing each data value in its entirety. However, because of this iterative evaluation, the effectiveness of early pruning depends on the relative position of bits that can be used for pruning within each value. Due to this behavior, bit-parallel techniques have faced significant challenges when processing skewed data, especially when values contain many leading zeroes. This problem is further amplified by the inherent trade-off that bit-parallel techniques make between columnar scan and fetch performance: a storage layer that supports early pruning requires multiple memory accesses to fetch a single value. Thus, in the case of skewed data, bit-parallel techniques increase fetch latency without significantly improving scan performance when compared to baseline columnar implementations. To remedy this weakness, we transform the values in bit-parallel columns using novel encodings. We propose the concept of forward encodings: a family of encodings that shift pruning-relevant bits closer to the most significant bit. Using this concept, we propose two particular encodings: the Data Forward Encoding and the Extended Data Forward Encoding. We demonstrate the impact of these encodings using multiple real-world datasets. Across these datasets, forward encodings improve the current state-of-the-art bit-parallel technique's scan and fetch performance in many cases by 1.4x and 1.3x, respectively."

Camelot v0.1: A Portable and High Performance GPU DBMS with LLVM Code Generation

Bobbi Yogatama

In this demo session, we present our initial progress on Camelot. Camelot is an ongoing effort in our research group to tackle various challenges in GPU DBMS including (1) limited GPU memory capacity, (2) supporting arbitrary complex queries, and (3) portability with existing CPU DBMS. Camelot uses the query parser and optimizer of DuckDB, while adding a GPU execution engine as an alternative backend to the native CPU execution engine in DuckDB. To achieve high performance, Camelot is powered by LLVM to automatically generate query specific and highly optimized GPU machine code. Our preliminary experiment on Star Schema Benchmarks shows that Camelot can achieve the same performance as handwritten query-specific CUDA implementation on GPUs.