
Database Affiliates 2023


The goals of the Wisconsin Database Affiliates Program are to:

  • Disseminate research results in data management
  • Facilitate academic-industry communication
  • Enable technology transfer
  • Connect students to potential employers
  • Solicit research and education feedback from industry

The annual workshop features research talks from faculty and students, industry talks, posters and demos, and social activities.

Schedule

The workshop will take place over two days.
October 12, Thursday, 8:30-4:30. Full-day workshop at Union South in room Northwoods (3rd floor). A detailed schedule will be announced soon.
October 13, Friday, 9:00-noon. Sponsor talks and one-on-one meetings between companies and students.

October 12 (Northwoods@Union South)


8:30 am – 9:00 am: Breakfast and Mingle
9:00 am – 9:20 am
Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries
Yifei Yang
This paper presents predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join operation, to multi-table joins so that the filtering benefits can be significantly increased. Predicate transfer is inspired by the seminal theoretical results of Yannakakis, which use semi-joins to pre-filter acyclic queries. Predicate transfer generalizes these results to arbitrary join graphs and replaces semi-joins with Bloom filters, leading to significant speedups. Evaluation shows that predicate transfer outperforms Bloom join by 3.1x on average on the TPC-H benchmark.
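The Bloom-join building block that predicate transfer generalizes can be made concrete with a small sketch. The Python below is a toy illustration of the pre-filtering idea, not the paper's implementation: a Bloom filter built on the join keys of the smaller table prunes rows of the larger table before the actual join runs.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter (illustrative only)."""
    def __init__(self, m=1 << 16, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)

    def might_contain(self, key):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(key))

# Build a filter on the join keys of the smaller table, then prune the
# larger table before joining -- the essence of a Bloom join.
small = [(1, "a"), (2, "b")]                     # (key, payload)
large = [(1, "x"), (3, "y"), (2, "z"), (4, "w")]

bf = BloomFilter()
for key, _ in small:
    bf.add(key)

# Rows whose keys cannot appear in `small` are dropped before the join
# (Bloom filters allow rare false positives but never false negatives).
pruned = [row for row in large if bf.might_contain(row[0])]
```

Predicate transfer's contribution, per the abstract, is propagating such filters across an entire multi-table join graph rather than within one join.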
9:20 am – 9:40 am
Towards Large-Scale and Practical Data Analytics on GPUs
Bobbi Yogatama
Graphics Processing Units (GPUs) have shown great potential in accelerating data analytics queries due to their massive computational power and high memory bandwidth compared to CPUs. GPU databases have been extensively studied in academic research and deployed as commercial systems, with reported speedups ranging from ten times to several hundred times. This massive speedup, however, applies only when the working set fits in GPU memory. As datasets grow, GPU databases encounter two hardware bottlenecks: (1) limited GPU memory capacity and (2) limited CPU-GPU interconnect bandwidth. Our goal is to tackle these challenges and enable large-scale data analytics on GPUs through (1) data compression, (2) heterogeneous CPU-GPU databases, and (3) multi-GPU databases. In this talk, we will show how we explore the design space of each solution to achieve order-of-magnitude speedups over state-of-the-art approaches. Finally, we will discuss our ongoing effort to integrate these solutions into Camelot, a portable GPU-DBMS with a DuckDB frontend.
9:40 am – 10:00 am
F2: Designing a Key-Value Store for Large Skewed Workloads
Konstantinos Kanellis
Today’s key-value stores are either disk-optimized, focusing on large data and saturating device IOPS, or memory-optimized, focusing on high throughput with linear thread scaling assuming plenty of main memory. However, many practical workloads demand high performance for read and write working sets that are much larger than main memory, over a total data size that is even larger. They require judicious use of memory and disk, and today’s systems do not handle such workloads well. We present F2, a new key-value store design based on compartmentalization – it consists of five key components that work together in well-defined ways to achieve high throughput – saturating disk and memory bandwidths – while incurring low disk read and write amplification. A key design characteristic of F2 is that it separates the management of hot and cold data, across the read and write domains, and adapts the use of memory to optimize each case. Through a sequence of new latch-free system constructs, F2 solves the key challenge of maintaining high throughput with linear thread scalability in such a compartmentalized design. Detailed experiments on benchmark data validate our design’s superiority, in terms of throughput, over state-of-the-art key-value stores, when the available memory resources are scarce.
10:00 am – 10:30 am: Keynote
10:30 am – 10:45 am: Coffee Break
10:45 am – 11:05 am
Return of the Antijoins
Hangdong Zhao
We study the complexity of evaluating Conjunctive Queries with negation (CQ¬). First, we present an algorithm with linear preprocessing time and constant delay enumeration for a class of CQs with negation called free-connex signed-acyclic queries. We show that no other queries admit such an algorithm subject to lower bound conjectures. Second, we extend our algorithm to Conjunctive Queries with negation and aggregation over a general semiring, which we call Functional Aggregate Queries with negation (FAQ¬). Such an algorithm achieves constant delay enumeration for the same class of queries, but with a slightly increased preprocessing time which includes an inverse Ackermann function. We show that this surprising appearance of the Ackermann function is probably unavoidable for general semirings, but can be removed when the semiring has specific structure. Finally, we show an application of our results to computing the difference of CQs.
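As a concrete reference point, the simplest operator involving negation is the antijoin, which keeps the tuples of one relation that have no match in another. The Python below is a naive illustration only; the talk's results concern the enumeration complexity of such queries, not this evaluation strategy.

```python
# Antijoin R ▷ S: keep tuples of R whose key has no matching key in S.
R = [(1, "a"), (2, "b"), (3, "c")]   # (key, payload)
S_keys = {2, 3}                      # join keys present in S

# Naive evaluation: linear scan over R with a hash-set membership test.
antijoin = [r for r in R if r[0] not in S_keys]
print(antijoin)  # [(1, 'a')]
```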
11:05 am – 11:25 am
State-of-the-art Join Algorithms in Database (Theory) and Lower Bounds from Fine-grained Complexity
Austen Fan
In this talk, we will take a tour of state-of-the-art join algorithms for database queries, from Yannakakis' algorithm and Worst-case Optimal Join all the way to the state-of-the-art PANDA (Proof-Assisted eNtropic Degree-Aware) algorithm. We complement the algorithms with the latest lower-bound results for database queries, which leverage techniques from fine-grained complexity. The lower bounds are tight, or equivalently the PANDA algorithm is optimal, for several important classes of queries, including k-cycle queries, Loomis-Whitney joins, and chordal queries.
11:25 am – 11:45 am
Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Derek Paulsen
Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
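The top-k tf/idf blocking idea can be sketched in a few lines of plain Python. This is not Sparkly itself (which uses Lucene on a Spark cluster); the record values and the simple tf/idf weighting below are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple tf/idf vectors over tokenized documents, sharing one idf table."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def topk_block(left, right, k):
    """For each left record, keep only the k most similar right records."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    return {
        i: sorted(range(len(rv)), key=lambda j: cosine(u, rv[j]), reverse=True)[:k]
        for i, u in enumerate(lv)
    }

left = [["apple", "iphone", "13"], ["samsung", "galaxy", "s22"]]
right = [["iphone", "13", "pro"], ["galaxy", "s22", "ultra"], ["pixel", "7"]]
candidates = topk_block(left, right, k=1)
print(candidates)  # {0: [0], 1: [1]}
```

Keeping a fixed k candidates per record (rather than all pairs above a threshold) is what makes the output size predictable, one of the properties the abstract argues for.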
11:45 am – 12:00 pm: Lightning Talks (for posters)
2:00 pm – 2:20 pm
Practical Perfect Hashing for OLAP Systems
Kevin Gaffney
A perfect hash function (PHF) maps a set of keys to a range of integers with no collisions. Compared to conventional hash methods, PHFs are attractive for their low space overhead and reduced control flow. Despite their advantages, there has been little investigation into the use of PHFs for online analytical processing (OLAP). This talk presents our recent work on practical perfect hashing for OLAP. We identify several promising applications for PHFs in OLAP and survey their current use in systems and research prototypes. We then evaluate existing PHF approaches and quantify their impact on query performance. Our results are encouraging: in a real OLAP system, PHFs achieve end-to-end speedups of 1.7X and 3.1X for join and aggregate queries, respectively. Nevertheless, there is room for improvement. Future approaches that simultaneously achieve low build time and high probe throughput could offer additional performance increases.
2:20 pm – 2:40 pm
Rethinking the Encoding of Integers for Scans on Skewed Data
Martin Prammer
Bit-parallel scanning techniques are characterized by their ability to accelerate compute through the process known as early pruning. Early pruning techniques iterate over the bits of each value, searching for opportunities to end compute early, before processing each data value in its entirety. However, because of this iterative evaluation, the effectiveness of early pruning depends on the relative position of bits that can be used for pruning within each value. Due to this behavior, bit-parallel techniques have faced significant challenges when processing skewed data, especially when values contain many leading zeroes. This problem is further amplified by the inherent trade-off that bit-parallel techniques make between columnar scan and fetch performance: a storage layer that supports early pruning requires multiple memory accesses to fetch a single value. Thus, in the case of skewed data, bit-parallel techniques increase fetch latency without significantly improving scan performance when compared to baseline columnar implementations. To remedy this weakness, we transform the values in bit-parallel columns using novel encodings. We propose the concept of forward encodings: a family of encodings that shift pruning-relevant bits closer to the most significant bit. Using this concept, we propose two particular encodings: the Data Forward Encoding and the Extended Data Forward Encoding. We demonstrate the impact of these encodings using multiple real-world datasets. Across these datasets, forward encodings improve the current state-of-the-art bit-parallel technique's scan and fetch performance in many cases by 1.4x and 1.3x, respectively.
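The early-pruning behavior the abstract describes can be illustrated with a scalar sketch. Real bit-parallel techniques evaluate many values per machine word; this toy version only shows why values with many leading zeroes delay pruning, the problem the forward encodings address.

```python
def less_than_early_prune(value, constant, width=32):
    """Evaluate value < constant by scanning bits MSB-first, stopping at the
    first differing bit (the early-pruning idea, in scalar form).

    Returns (result, bits_examined)."""
    for i in range(width - 1, -1, -1):
        vb = (value >> i) & 1
        cb = (constant >> i) & 1
        if vb != cb:
            return (vb < cb, width - i)
    return (False, width)  # all bits equal: value == constant, not less-than

# Skewed data with many leading zeroes forces a deep scan before pruning:
res_skewed, depth_skewed = less_than_early_prune(3, 5)        # agree on 29 leading bits
res_spread, depth_spread = less_than_early_prune(1 << 30, 5)  # differ at bit 30
print(depth_skewed, depth_spread)  # 30 2
```

A forward encoding, as described above, shifts the pruning-relevant bits toward the most significant bit so that the comparison can terminate after examining far fewer bits.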
2:40 pm – 3:00 pm
Performance Roulette: How Cloud Weather Affects ML-Based System Optimization
Johannes Freischuetz
As system complexity, workload diversity, and cloud computing adoption continue to grow, both operators and developers are turning to ML-based approaches for optimizing systems for a particular deployment and workload scenario in order to achieve the best performance and/or least cost. However, most approaches to DBMS autotuning do not currently account for noise during sampling. We show that ML-based optimizers are sensitive to this noise.
3:00 pm – 3:20 pm: Industry Talks
3:20 pm – 3:45 pm: Break
3:45 pm – 4:30 pm: What's Next Session

October 13 (Auditorium 1240@CS Department)


9:00 am – 9:30 am
Generative AI research efforts at GSL
Avrilia Floratou (Microsoft)
In this talk, I will first give an overview of the mission of the Microsoft Gray Systems Lab and of the various research initiatives the team is undertaking. I will then discuss in more detail our research portfolio in the space of generative AI and LLMs, with a special focus on natural language interfaces for data management.
Bio: Avrilia is a Principal Scientist Manager at Microsoft’s Gray Systems Lab. Her research broadly lies in the area of data management with a recent focus on leveraging Large Language Models (LLMs) to improve user productivity in domains such as data integration, data exploration and code migration. Her research interests also include semantics-aware data science and large-scale stream processing. Previously, she worked on leveraging data semantics and domain knowledge to improve the productivity of data scientists leading to the Semantic Link offering in Microsoft Fabric, Microsoft’s end-to-end, unified analytics platform. She also co-developed the Dhalion project, a framework used to define and apply SLO-based policies on long-running applications that has been deployed in various Microsoft services. Prior to her current role, she was a research scientist at IBM Almaden Research Center working on SQL-on-Hadoop engines and natural language interfaces for databases. Avrilia received her Ph.D. and M.Sc. in Computer Science from University of Wisconsin-Madison working with Prof. Jignesh M. Patel and her B.Sc. from University of Athens in Greece.
9:30 am – 10:00 am
Scaling way up with NoSQL and SQL
Mike Marty (Google)
Google launched the Big Data era out of the sheer necessity of building massive applications on unreliable commodity hardware. This talk will discuss how Google created large-scale NoSQL infrastructure and then evolved toward large-scale SQL-based query processing systems. I will also discuss some often-overlooked systems challenges, such as resource management in a complex system built from micro-services.
Bio: Mike Marty is a Principal Engineer and Engineering Director at Google-Madison. He focuses on evolving F1 Query, Google's federated query engine, as well as building out table-based storage systems. Prior to entering the world of data management, he spent a decade at Google focusing on high-performance networking and systems. His PhD is in Computer Architecture from the University of Wisconsin-Madison.
10:00 am – 10:30 am
Secure & Private Data Collaboration with Snowflake
Ian Connolly (Snowflake)
Data collaboration is a fundamental capability that transforms Snowflake from a cloud-native database that excels at data warehousing workloads into the Data Cloud. But enabling data collaboration in a safe and secure way is a significant engineering challenge. We'll discuss the engineering and theoretical challenges that we're working through to enable our customers to share and derive value from their most sensitive data, all within Snowflake.
Bio: Ian Connolly runs the Differential Privacy team at Snowflake. He was previously Head of Engineering at LeapYear Technologies, a San Francisco-based company which built a differentially private data platform until it was acquired by Snowflake in February 2023.

Lodging and Parking

    Union South Hotel. The hotel is in Union South, the building where the workshop takes place. The building is right across from the CS department.
    Doubletree. On campus, 15 min walk to Union South.
Parking (without a permit) on campus is very difficult. We recommend taking an Uber/Lyft/taxi instead of renting a car.

Sponsorship

While the event is currently free to all attendees, we greatly appreciate any form of sponsorship. The benefits for an active sponsor include (1) access to the CVs of our graduate students, (2) a 30-minute sponsor talk on Friday, and (3) one-on-one interviews with students on Friday. Please contact Xiangyao (yxy@cs.wisc.edu) if you are interested in any type of sponsorship.

Active sponsors