# Database Seminar

The database seminar is a weekly gathering of students and faculty interested in database systems. Each meeting lasts for an hour and consists of presentations of recent research papers, practice talks, or invited talks. Local speakers may choose to present their own research work or recent publications from database conferences or related fields. Students interested in database systems are strongly encouraged to attend the seminar.

## Current Schedule

• September 17 2018
X-DB, the Globally-Distributed Database System at Alibaba
Demai Ni
2310 CS 11:00 AM
Abstract :

Supporting Alibaba's business running on hundreds of thousands of nodes, X-DB provides high availability, strong ACID guarantees, and horizontal scalability. To support one of the world's largest and fastest-growing e-commerce platforms, X-DB manages very large volumes of data with an innovative storage engine, Paxos-based deployment, and a distributed SQL engine. Fully utilizing new hardware in areas including storage, networking, multi-core, and parallel and heterogeneous computing, X-DB co-designs hardware and software and remains compatible with the MySQL ecosystem. This talk will give an overview of the architecture design and show how technical innovation and business growth drive each other within Alibaba.

Speaker Bio :

Demai Ni is a Sr. Staff Software Engineer at Alibaba, developing the globally-distributed OLTP database X-DB. Previously, Demai was a Principal Software Architect at Huawei American Research Lab, working on the distributed cloud-native database LibrA (an MPP DB system), and was the founding engineer for its SQL-on-Hadoop engine ELK. Demai started his professional career in the Db2 for z/OS group at IBM, where his responsibilities included query transformation and optimization, and pureXML technology. Demai is also active in Big Data projects such as HBase, Hadoop, and JanusGraph. Demai received an M.S. in Computer Sciences from UW-Madison and a Bachelor's degree from Tsinghua University.

## Past Schedule

• April 23 2018
Experience with SIGMOD Programming Contest 2018 - Techniques Used and Lessons Learned
Jianqiao Zhu, Zuyu Zhang and Dylan Bacon
2310 CS 12:00 PM
Abstract :

This year's SIGMOD Programming Contest was about evaluating, as fast as possible, batches of SPJA (Selection-Projection-Join-Aggregation) queries on a set of immutable relations, under a single-node multicore and in-memory setting. We built from scratch a system called Robin, which won 1st place in the contest and was faster by a large margin (40%) than the runner-up solution. In this talk, we will introduce the key ingredients that constitute our solution, including various indexing, query optimization and query execution techniques derived from key observations of the workload characteristics. We will also introduce several C++ programming techniques for improving both human efficiency (productivity) and system efficiency when building a high-performance engine like Robin. These techniques include the extensive usage of lambda functions, variadic templates and template meta-programming. Finally, we will summarize the lessons learned from the contest, and how these techniques could be applied to a general database system like Apache Quickstep.

Speaker Bio :

Jianqiao Zhu is a Ph.D. candidate in the database group at UW-Madison, working with Prof. Jignesh Patel. His research interests include query languages, query optimization, query processing kernels, and data visualization.

Zuyu Zhang is a Ph.D. candidate working with Prof. Jignesh Patel on the Quickstep project. His primary research is database design, implementation, and evaluation, with a focus on distributed query processing.

Dylan Bacon is a Ph.D. student at UW-Madison currently working in the database group under the direction of Professor Jignesh Patel. He completed his undergraduate degree in Applied Mathematics and Computer Science at UW-Stout. His primary research interests are in traditional relational database systems, focusing on improving the speed and efficiency of common mechanisms used by these systems, such as joins, filtering, data storage and retrieval, and query optimization.

• April 16 2018
Deep Learning for Entity Matching - A Design Space Exploration
Sidharth Mudgal
2310 CS 12:00 PM
Abstract :

Entity matching (EM) finds data instances that refer to the same real-world entity. In this talk I will discuss applying deep learning (DL) to EM, to understand DL’s benefits and limitations. Many DL solutions have been developed for related matching tasks in natural language processing. I will categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power. To investigate the types of EM problems for which DL can be helpful, I will present results from an extensive empirical comparison of several DL solutions with Magellan, a state-of-the-art learning-based EM solution. Finally, I will analyze DL’s performance and discuss future research directions.

Speaker Bio :

Sidharth Mudgal is currently pursuing his Master's in Computer Science at UW-Madison, advised by Prof. AnHai Doan. His research interests lie at the intersection of natural language processing and data science. Recently, his work has focused on improving entity matching techniques using deep learning.

• March 19 2018
MatchCatcher - A Debugger for Blocking in Entity Matching
Han Li
2310 CS 12:00 PM
Abstract :

Blocking is a fundamental step in entity matching. Much work has examined the design and runtime of blockers. However, very little if any work has examined the problem of debugging blocking accuracy. In practice, blockers' accuracy can vary drastically, and using an inaccurate blocker can seriously degrade the overall quality of the entity matching process. In this work we describe the MatchCatcher solution to this problem. We show that the topic of debugging blockers opens up a very rich problem space. Within this space, we focus on the important problem of finding matches "killed off" by a blocker so that the user can examine these matches to determine whether the current blocker is too "aggressive" and what can be done to make it less so. We show how to find such matches using string similarity joins (SSJs), iterative user engagement, rank aggregation, and active/online learning. We describe experiments showing the utility of MatchCatcher. MatchCatcher has been open sourced, used extensively in data science classes and at several organizations, and has proven to be a highly effective debugging tool "in the wild". Interestingly it has also been used for unanticipated purposes, such as data profiling for entity matching.

Speaker Bio :

Han Li is a graduate student in the database group working with Prof. AnHai Doan. His primary research interest is in entity matching, focusing on debugging the blocking stage and exploring deep learning techniques for the matching stage in entity matching.

• February 26 2018
Welcome Weekend for prospective graduate students
DB Group meeting
2310 CS 12:00 PM
Abstract :

Meeting to discuss our plan for the CS Welcome Weekend (Mar 9th-10th) for prospective students.

Speaker Bio :

Welcome Weekend Organization

• February 5 2018
Scaling database systems to high-performance computers
Spyros Blanas
2310 CS 12:00 PM
Abstract :

Processing massive datasets quickly requires warehouse-scale computers. These computers consist of thousands of compute nodes, offer petabytes of main memory and are interconnected with RDMA-capable networks. However, they have very limited I/O bandwidth to shared, cold storage compared to clusters of commodity servers. Parallel database systems have been designed for small shared-nothing clusters and cannot fully utilize a high-performance computer. In addition, many massive datasets, especially in science, are not naturally represented as tables to be queried using SQL. Despite decades of data management research, many users still write format-specific, imperative code to sift through data. In this talk, we will first present ArrayBridge, a bi-directional array view mechanism for the HDF5 array file format. ArrayBridge allows scientists to use SciDB, TensorFlow and HDF5-based code in the same file-centric analysis pipeline without converting between file formats. Under the hood, ArrayBridge manages I/O to leverage the massive concurrency of warehouse-scale parallel file systems without modifying the HDF5 API and breaking backwards compatibility with legacy applications. Once the data has been loaded in memory, the bottleneck in many array-centric queries becomes the speed of data repartitioning between different nodes. We will present an RDMA-aware data shuffling abstraction that directly converses with the network adapter using InfiniBand verbs and can repartition data up to 4X faster than MPI. We conclude by highlighting research opportunities that need to be overcome for database systems to scale to warehouse-scale computers.

Speaker Bio :

Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high-performance database systems, and his current research goal is to build a database system for high-end scientific computing facilities. He received his Ph.D. at the University of Wisconsin–Madison. Part of his Ph.D. dissertation was commercialized in Microsoft SQL Server 2014 as the Hekaton in-memory transaction processing engine.

• December 4 2017
The Journey From Faster to More Predictable - Using Statistics to Build Better Data-Intensive Systems
Barzan Mozafari
2310 CS 12:00 PM
Abstract :

Much of the database research over the past four decades has focused on improving the overall performance of data-intensive applications, in terms of throughput and mean latency. This race for “faster” has neglected an important aspect of performance: predictability. Modern data management systems have turned into massive codebases of sophisticated algorithms and data structures, rendering them highly unpredictable in their performance. This has not only made their provisioning, tuning, and diagnosis increasingly challenging, but has also raised their total cost of ownership. For example, to cope with tail latencies, highly-qualified personnel are constantly occupied with performance tuning of their database applications, overprovisioning their hardware resources, and even resorting to expensive redundancies in data and computation.

In this discussion, we begin our journey by vetting various notions of predictability that affect data-intensive applications, including variance and high-percentile latencies, graceful degradation, regression error, and user-specified deadlines. We then identify the internal sources of performance unpredictability. In particular, we propose a new type of performance profiler, called VProfiler. Unlike traditional profilers that focus on mean latency, VProfiler can pinpoint the root causes of performance variance in a complex codebase. Our findings lead us to develop a new scheduling algorithm, called VATS, which has now become the default policy in MySQL distributions. We then turn our attention to external sources of unpredictability, where the theory of Robust Optimization presents itself as a viable approach. Surprisingly, our journey from faster to more predictable comes full circle, as we discover that speed and predictability are not mutually exclusive; predictable performance is often faster too! We conclude by introducing a practical tool that uses past statistics to extract “workload intelligence” and predict future performance.

Speaker Bio :

Barzan Mozafari is a Morris Wellman Assistant Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Prior to that, he was a Postdoctoral Associate at MIT. He earned his Ph.D. in Computer Science from UCLA in 2011. His research career has led to several open-source projects, including DBSeer (an automated database diagnosis tool), BlinkDB (a massively parallel approximate query engine), and SnappyData (an HTAP engine that empowers Apache Spark with transactions and real-time analytics). He has won the National Science Foundation CAREER award, as well as several best paper awards in ACM SIGMOD and EuroSys. He is the founder of Michigan Software Experts, and a strategic advisor to SnappyData, a company that commercializes the ideas introduced by BlinkDB.

• November 13 2017
Lightweight Adaptive and Mixed Concurrency Control
Aaron Elmore
2310 CS 12:00 PM
Abstract :

New concurrency control protocols focus on enabling high throughput for large main-memory and many-core systems. These protocols are largely optimized for specific workloads, but due to unknown and shifting access patterns, no single algorithm fits all varied workloads. Thus, it is desirable to choose the right concurrency control protocol for a given workload. To address this issue, we present ACC, a database system that dynamically applies a concurrency control algorithm to partitions of the database according to workload characteristics and supports transaction execution across mixed concurrency control protocols. This talk will introduce a general framework for mixing multiple protocols without modifying them or introducing any overhead of coordinating conflicts across protocols, a set of simple and efficient features for selecting an ideal protocol, a method for reconfiguring protocols without stopping execution, and a prototype system. Our experiments show that, compared with static protocols, ACC improves throughput by up to 9.9x, 2.2x, and 1.9x over a partitioned single-threaded concurrency control, OCC, and 2PL, respectively.

Speaker Bio :

Aaron J. Elmore is an Assistant Professor in the Department of Computer Science, and the College of the University of Chicago. Aaron was previously a Postdoctoral Associate at MIT working with Mike Stonebraker on elastic and multitenant database systems, and Sam Madden on the DataHub project. Aaron's thesis on Elasticity Primitives for Database-as-a-Service was completed at the University of California, Santa Barbara under the supervision of Divy Agrawal and Amr El Abbadi. His research interests include elastic and adaptive database systems, collaborative data analytics, tools for data lake management, and data intensive systems.

• November 6 2017
Compressed Representation of Conjunctive Queries
Shaleen Deep
2310 CS 12:00 PM
Abstract :

Relational queries, and in particular join queries, often generate large output results when executed over huge datasets. Thus, it is often infeasible to store the materialized output if we plan to reuse it to answer further queries over the result. Motivated by this observation, we study the problem of constructing space-efficient compressed representations of the output of conjunctive queries, with the goal of supporting the efficient evaluation of queries over the intermediate compressed result, instead of the original input database. In particular, we examine an important tradeoff: minimizing the space necessary to store the compressed result, versus minimizing the answer time and delay when we answer queries over the result. Our main contribution is a novel data structure, which can be tuned to trade off space for answer time. The tradeoff allows us to control the space requirement of the data structure precisely, and depends both on the structure of the query and the type of queries we plan to ask over the query result. We also show how we can use the data structure in conjunction with query decomposition techniques, in order to efficiently represent the outputs for several classes of conjunctive queries.

Speaker Bio :

Shaleen Deep is a 3rd year graduate student working with Prof. Paris Koutris on problems related to query compression and query pricing.

• October 23 2017
Closing the loop on data analysis
Eugene Wu
2310 CS 12:00 PM
Abstract :

The rapid democratization of data has placed its access and analysis in the hands of the entire population. While the tools for rapid and large-scale data processing have continued to reduce the time to compute analysis results, the techniques to help users better and more easily visualize their data, clean and prepare their data, and understand what their results mean are still lacking. In this talk, I will provide an overview of our lab's recent work on addressing each stage of data analysis: data cleaning, data visualization, and explanation.

Speaker Bio :

Eugene Wu is broadly interested in technologies that help users play with their data. His goal is for users at all technical levels to effectively and quickly make sense of their information. His focus is in solutions that ultimately improve the interface between users and data, and borrows techniques from fields such as data management, systems, crowdsourcing, visualization, and HCI.

Eugene Wu received his Ph.D. from CSAIL at MIT, advised by the esteemed Sam Madden and Michael Stonebraker, in the database group. He spent the first half of 2015 at UC Berkeley before starting at Columbia University in Fall 2015.

• October 9 2017
HoloClean - Holistic Data Repairs with Probabilistic Inference
Theodoros Rekatsinas
2310 CS 12:00 PM
Abstract :

In this talk I will present HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean finds data repairs with an average precision of ∼90% and an average recall above ∼76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2× against state-of-the-art methods. At the end of the talk, I will also discuss a series of future research directions on the intersection of data cleaning and statistical learning and inference.

Speaker Bio :

Theodoros (Theo) Rekatsinas is an Assistant Professor in the Department of Computer Sciences at the University of Wisconsin-Madison. He is a member of the Database Group. He earned his Ph.D. in Computer Science from the University of Maryland and was a Moore Data Postdoctoral Fellow at Stanford University. His research interests are in data management, with a focus on data integration, data cleaning, and uncertain data. Theo's work on using quality-aware data integration techniques to forecast the emergence and progression of disease outbreaks received the Best Paper Award at SDM 2015. Theo was awarded the Larry S. Davis Doctoral Dissertation award in 2015.

• October 2 2017
Differential Privacy in Relational Databases
Ashwin Machanavajjhala
2310 CS 12:00 PM
Abstract :

An important constraint that algorithms must satisfy when analyzing sensitive data from individuals is privacy. Differential privacy has revolutionized the way we reason about privacy and has championed the need for data analysis algorithms with provable guarantees. Differential privacy and its variants have become the gold standard for exploring the tradeoffs between the privacy of individuals and the utility of the statistical insights mined from the data, and are in use by many commercial (e.g., Google and Apple) and government entities (e.g., US Census) for collecting and sharing sensitive user data. In today's talk, I will highlight key challenges in designing differentially private algorithms for relational databases, and present research from our group that tries to address these challenges. In particular, I will describe our recent work on modernizing the data publication process for a US Census Bureau data product that uses relational data, called LODES/OnTheMap. In this work, we identified legal statutes and their current interpretations that regulate the publication of LODES/OnTheMap data, formulated these regulations mathematically, and designed algorithms for releasing tabular summaries that provably satisfy these privacy regulations. Our solutions are able to release summaries of the data with error comparable or even better than that of current releases (which are not provably private), for reasonable settings of privacy parameters.

Speaker Bio :

Ashwin Machanavajjhala is an Assistant Professor in the Department of Computer Science, Duke University. Previously, he was a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in algorithms for ensuring privacy in statistical databases and augmented reality applications. He is a recipient of the ICDE 10 year Influential Paper award in 2017, a National Science Foundation Faculty Early CAREER award in 2013, and the 2008 ACM SIGMOD Jim Gray Dissertation Award Honorable Mention. Ashwin graduated with a Ph.D. from the Department of Computer Science, Cornell University and a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras.

• September 25 2017
Understanding the Hog in HogWild! - Analysis of Parallel SGD in Shared Memory Settings
Navneet Potti
2310 CS 12:00 PM
Abstract :

Stochastic Gradient Descent (SGD) is the workhorse algorithm for various machine learning tasks. SGD is inherently a serial algorithm and there have been numerous attempts to improve its performance for modern multicore processors. HogWild! is a commonly used scheme to parallelize SGD. While its lock-free parallelism and optimal convergence rate guarantees suggest that it might scale linearly in multi-core settings, our empirical studies show that the scalability of HogWild! varies widely, depending on the characteristics of the machine learning task and dataset. In this talk, I will present a thorough analysis of when and why HogWild! scales poorly in the shared memory setting. In particular, our study explains and analytically models the hidden synchronization cost imposed by the cache coherence protocols in ubiquitous modern x86 CPU hardware. The insights from our work allow us to step back and look at the bigger machine learning workflow in which HogWild! is run, allowing us to consider ways to use the hardware more effectively. Besides presenting a deeper understanding of the interactions between HogWild! and current multicore processors, our insights may also help designers of lock-free algorithms in general and designers of specialized hardware for such methods to better understand the costs associated with hidden synchronization, and thus develop more effective methods.

Speaker Bio :

Navneet Potti is a 5th year graduate student at UW-Madison, working with Prof. Jignesh Patel on the Quickstep and Ava projects. His research interests broadly lie in all aspects of making data analytics faster and easier to use. He graduated from IIT Madras in 2010 and then worked for Goldman Sachs until 2013. He spent summers doing internships at IBM Almaden, working on their SQL-on-Hadoop system Big SQL; at Pivotal, working on their Greenplum database; and at Google Research, working on information extraction from emails.

• September 18 2017
Organizational Meeting
None
2310 CS 12:15 PM
Abstract :

Decide the logistics of the Fall 2017 database seminar.

• May 8 2017
Falcon - Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services
Paul Suganthan
2310 CS 12:15 PM
Abstract :

Many works have applied crowdsourcing to entity matching (EM). While promising, these approaches are limited in that they often require a developer to be in the loop. As such, it is difficult for an organization to deploy multiple crowdsourced EM solutions, because there are simply not enough developers. To address this problem, a recent work has proposed Corleone, a solution that crowdsources the entire EM workflow, requiring no developers. While promising, Corleone is severely limited in that it does not scale to large tables. We propose Falcon, a solution that scales up the hands-off crowdsourced EM approach of Corleone, using RDBMS-style query execution and optimization over a Hadoop cluster. Specifically, we define a set of operators and develop efficient implementations. We translate a hands-off crowdsourced EM workflow into a plan consisting of these operators, optimize, then execute the plan. These plans involve both machine and crowd activities, giving rise to novel optimization techniques such as using crowd time to mask machine time. Extensive experiments show that Falcon can scale up to tables of millions of tuples, thus providing a practical solution for hands-off crowdsourced EM, to build cloud-based EM services.

Speaker Bio :

Paul Suganthan is a graduate student in the Database Systems Group working with Prof. AnHai Doan. His primary research interest is in entity matching, focusing on scaling up entity matching workflows.

• April 17 2017
Accelerating Linear Algebra over Normalized Data
Lingjiao Chen
2310 CS 12:15 PM
Abstract :

Providing machine learning (ML) over relational data is a mainstream requirement for data management systems. Recently, there is growing interest in optimizing the performance of ML by exploiting the redundancy generated by joins, which is called “factorized learning”. However, the existing work requires a manual rewrite of each ML implementation, which creates a massive development overhead when extending factorized learning to other ML algorithms.

In this talk, I will discuss how to mitigate this overhead by factorizing a common formal language to represent the computations of many ML algorithms: linear algebra. We introduce a new logical data type to represent normalized data and devise a framework of algebraic rewrite rules to convert a large set of linear algebra operations over denormalized data into operations over normalized data. This enables us to automatically "factorize" several popular ML algorithms, thus unifying and generalizing several prior works. We prototype our framework in the popular ML environment R and an industrial R-over-RDBMS tool. Experiments with both synthetic and real normalized data show that our framework also yields significant speed-ups, up to 36x on real data.
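
To give a flavor of such a rewrite rule (the notation here is illustrative, not taken verbatim from the talk), suppose the denormalized matrix $T$ joins a fact table with feature block $S$ to a dimension table with feature block $R$, with the join encoded by a 0/1 indicator matrix $K$ mapping each fact-table row to its joining dimension row. A matrix-vector multiplication then factorizes as:

```latex
T = \begin{bmatrix} S & K R \end{bmatrix},
\qquad
T w = S\, w_S + K \, (R\, w_R),
\quad \text{where } w = \begin{bmatrix} w_S \\ w_R \end{bmatrix}.
```

The right-hand side multiplies $R$ only once per dimension row instead of once per joined tuple, which is where the speed-ups over computing on the denormalized data come from.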

Joint work with Arun Kumar, Jeff Naughton and Jignesh M. Patel.

Speaker Bio :

Lingjiao is a graduate student at the University of Wisconsin-Madison. His research interests span data management (especially advanced data analytics), machine learning, and optimization. He is particularly interested in accelerating data analytics workloads from both theory and systems perspectives.

• April 10 2017
A Worst-Case Optimal Multi-Round Algorithm for Parallel Computation of Conjunctive Queries
Bas Ketsman
2310 CS 12:15 PM
Abstract :

This talk presents results about the worst-case optimal communication cost for parallel evaluation of conjunctive queries. Particularly, we describe a multi-round algorithm that computes any conjunctive query with load m/p^{1/\rho*} per server, in the case where all input relations are binary. Here, \rho* denotes the fractional edge covering number of the query hypergraph, m is the maximum size of relations, and p is the number of servers.

In this setting, two main prior results were known. First, for one-round algorithms over skew-free data, the optimal communication cost per server is m/p^{1/\tau*}, where m is the size of the largest input relation and \tau* is the fractional vertex covering number of the query hypergraph. Our algorithm exploits this result to obtain the mentioned bound when no skew is present. Second, it has been proven recently that for multi-round algorithms and unrestricted database instances, any algorithm requires at least m/p^{1/\rho*} communication cost per server. Hence, this algorithm is essentially optimal.
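
For intuition, consider the triangle query, a standard example (the numbers below follow directly from the definitions above):

```latex
Q(a,b,c) \;\leftarrow\; R(a,b),\, S(b,c),\, T(c,a).
% Assigning weight 1/2 to each of the three relations covers every
% variable with total weight 1, and this is optimal, so \rho^* = 3/2.
% The multi-round load per server is therefore
\frac{m}{p^{1/\rho^*}} \;=\; \frac{m}{p^{2/3}}.
```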

Joint work with Dan Suciu.

Speaker Bio :

Bas Ketsman is a Ph.D. candidate at Hasselt University, Belgium, advised by Prof. Frank Neven. He is a PhD Fellow of the Research Foundation Flanders (FWO) and a member of the research group Databases and Theoretical Computer Science. His research interests include database theory and cloud computing.

• April 3 2017
Dhalion - Self-Regulating Stream Processing in Heron
Avrilia Floratou
2310 CS 12:15 PM
Abstract :

In recent years, there has been an explosion of large-scale real-time analytics needs and a plethora of streaming systems have been developed to support such applications. These systems are able to continue stream processing even when faced with hardware and software failures. However, these systems do not address some crucial challenges facing their operators: the manual, time-consuming and error-prone tasks of tuning various configuration knobs to achieve service level objectives (SLOs) as well as the maintenance of SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation. In this talk, we introduce the notion of self-regulating streaming systems and the key properties that they must satisfy. We then present the design and evaluation of Dhalion, a system that provides self-regulation capabilities to underlying streaming systems. We describe our implementation of the Dhalion framework on top of Twitter Heron, as well as a number of policies that automatically reconfigure Heron topologies to meet throughput SLOs, scaling resource consumption up and down as needed. We are in the process of open-sourcing Dhalion as part of the Heron project.

Speaker Bio :

Avrilia Floratou is a Senior Scientist at Microsoft's Cloud and Information Services Lab (CISL). Her research interests broadly lie in the area of data management with a recent focus on large-scale stream processing and real-time analytics. She is also an active contributor to Heron, collaborating with Twitter. Prior to her current role, she was a research scientist at IBM Almaden Research Center working on SQL-on-Hadoop engines and incorporating her research in the IBM Big SQL product offering. She received her Ph.D. and M.Sc. in Computer Science from the University of Wisconsin-Madison and her B.S. from the University of Athens in Greece.

• March 27 2017
Detecting Errors in Labelled Data For Entity Matching
Haojun Zhang
2310 CS 12:15 PM
Abstract :

Labelled data are widely used in many areas, especially in tasks that use supervised machine learning. For example, in the entity matching pipeline, labelled data are used to debug the blocking step and to train and evaluate matchers and the whole pipeline. With the emergence of deep learning, we can expect more and more large labelled datasets. However, labelled data are usually collected by manual, automatic, or semi-automatic approaches. Such approaches are error-prone, which in turn affects the usage of labelled data for various tasks.

In this talk, I will present an ongoing research project about detecting label errors in such datasets for entity matching. We develop an iterative tool to help the analyst find and correct label errors. In each iteration, our tool searches for suspicious labels, ranks them, and returns the top-k to the analyst for manual verification. Our tool then uses the verified labels as feedback to update the searching and ranking for the next iteration. Our experiments on both real and synthetic datasets show that our tool can help the analyst find many label errors by verifying only a small subset of examples in the dataset. This is joint work with Fatemah Panahi, AnHai Doan, and Jeff Naughton.

Speaker Bio :

Haojun is a graduate student in the Database Systems Group working with Prof. AnHai Doan. His primary research interest is in entity matching and data cleaning, focusing on how to combine techniques such as machine learning, optimization and crowdsourcing to improve entity matching and data cleaning.

• March 13 2017
THEMIS - Fairness in Federated Stream Processing under Overload (SIGMOD 2016)
2310 CS 12:15 PM
Abstract :

Federated stream processing systems, which utilise nodes from multiple independent domains, can be found increasingly in multi-provider cloud deployments, internet-of-things systems, collaborative sensing applications and large-scale grid systems. To pool resources from several sites and take advantage of local processing, submitted queries are split into query fragments, which are executed collaboratively by different sites. When supporting many concurrent users, however, queries may exhaust available processing resources, thus requiring constant load shedding. Given that individual sites have autonomy over how they allocate query fragments on their nodes, ensuring global fairness in the processing quality experienced by queries remains an open challenge in a federated scenario.

We describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. It executes queries in a globally fair fashion and provides users with constant feedback on the experienced processing quality for their queries. THEMIS associates stream data with its source information content (SIC), a metric that quantifies the contribution of that data towards the query result, based on the amount of source data used to generate it. We provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Our evaluation shows that the BALANCE-SIC algorithm yields balanced SIC values across queries, as measured by Jain’s Fairness Index. Our approach also incurs a low execution time overhead.
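Jain's Fairness Index, the metric used in the evaluation above, has a standard closed form; a minimal implementation:

```python
def jain_fairness(xs):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2).

    Equals 1.0 when all n allocations are identical (perfectly fair)
    and degrades toward 1/n when one allocation dominates.
    """
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))
```

In THEMIS's setting, the `xs` would be the per-query SIC values of result data; BALANCE-SIC shedding aims to push this index toward 1.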

Speaker Bio :

Harshad is a PhD candidate in the Database Systems Group working with Prof. Jignesh M. Patel on the Quickstep project. His primary research interest is in resource management for in-memory database systems, specifically in designing query schedulers.

• February 6 2017
LIP - Looking Ahead Makes Query Plans Robust
Jianqiao Zhu, Navneet Potti
2310 CS 12:15 PM
Abstract :

Query optimizers and query execution engines cooperate to deliver high performance on complex analytic queries. Typically, the optimizer searches through the plan space and sends a selected plan to the execution engine. However, optimizers may at times miss the optimal plan, sometimes with disastrous impact on performance. In this paper, we develop the notion of robustness of a query evaluation strategy with respect to a space of query plans. We also propose a novel query execution strategy called Lookahead Information Passing (LIP) that is robust with respect to the space of (fully pipeline-able) left-deep query plan trees for in-memory star schema data warehouses. With LIP, the execution times of the best- and worst-case plans are far closer than without it; in fact, under certain assumptions of independent and uniform distributions, every plan in that space is theoretically guaranteed to execute in near-optimal time. We also evaluate these claims using workloads that include skew and correlation. With LIP we make an initial foray into a novel way of thinking about robustness from the perspective of query evaluation, where we develop strategies (like LIP) that collapse plan sub-spaces in the overall global plan space.
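As a rough sketch of the idea (not the paper's implementation): LIP builds a filter over the surviving keys of each dimension table and probes every filter during the fact-table scan, so rows that are doomed to fail some join are dropped early regardless of which join order the optimizer picked. Here Python sets stand in for the Bloom filters a real engine would use:

```python
def lip_filtered_scan(fact_rows, dim_filters):
    """Scan the fact table, probing every dimension filter before any join.

    fact_rows:   list of dicts, one per fact-table row
    dim_filters: maps a foreign-key column to the set of keys that survive
                 that dimension's predicate (stand-in for a Bloom filter)
    """
    survivors = []
    for row in fact_rows:
        # A row survives only if it passes *all* lookahead filters,
        # so join-order mistakes cannot inflate intermediate results.
        if all(row[col] in keys for col, keys in dim_filters.items()):
            survivors.append(row)
    return survivors
```

A production version would also adaptively reorder the filter probes by observed selectivity, which is part of what makes every plan in the space behave near-optimally.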

Speaker Bio :

Jianqiao Zhu is a 5th year Ph.D. student in the database group at UW-Madison, working with Prof Jignesh Patel. His research interests include query languages, query optimization, query processing system design and data visualization.

Navneet Potti is a 4th-year graduate student at UW-Madison, working with Prof Jignesh Patel on the Quickstep project. His primary research interests broadly lie in all aspects of developing high-performance data analytics platforms, ranging from main memory database engines to machine learning. He is currently affiliated with Microsoft Gray Systems Lab and has spent summers doing internships at IBM Almaden as well as at Pivotal.

• January 23 2017
Organizational Meeting
None
2310 CS 12:15 PM
Abstract :

Decide the logistics of the spring 2017 database seminar.

• November 28 2016
CyborgCleaner - Toward End-to-End Data Cleaning with Machine and Human
2310 CS 12:15 PM
Abstract :

Most current data cleaning solutions solve only certain steps of the cleaning pipeline, not the entire pipeline. For example, many cleaning pipelines consist of two parts - a machine part that employs a cleaning algorithm, and a human part that employs a user to manually clean up the output of the algorithm. Current work has addressed only the machine part. This raises three serious problems. First, without guidance, users often end up doing something ad-hoc, suboptimal, and incorrect. Second, the machine part may optimize for the wrong goal, causing more work in the human part. Finally, focusing only on the machine part makes it very difficult to address the brittleness of the cleaning algorithm.

To address these problems, we argue for more attention to end-to-end data cleaning, focusing not just on the machine, but also on the human part. We propose an RDBMS-style solution strategy, which defines basic human operations, combines them with cleaning algorithms to form hybrid plans, estimates plan costs (e.g., in terms of human effort), then selects the best plan. As a case study, we examine the important problem of value normalization and develop an effective solution. Our work thus advocates for end-to-end data cleaning and shows that it is possible to apply an RDBMS-style approach to build complex machine-human solutions to this problem.

• November 21 2016
Design of Arbitrage Free Data Pricing Schemes
Shaleen Deep
2310 CS 12:15 PM
Abstract :

Motivated by a growing market that involves buying and selling data over the web, we study pricing schemes that assign value to queries issued over a database. Previous work studied pricing mechanisms that compute the price of a query by extending a data seller's explicit prices on certain queries, or investigated the properties that a pricing function should exhibit without detailing a generic construction. In this work, we present a formal framework for pricing queries over data that allows the construction of general families of pricing functions, with the main goal of avoiding arbitrage. We consider two types of pricing schemes - instance-independent schemes, where the price depends only on the structure of the query, and answer-dependent schemes, where the price also depends on the query output. Our main result is a complete characterization of the structure of pricing functions in both settings, by relating it to properties of a function over a lattice.
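As a toy illustration of the kind of structural property such a characterization involves (an exposition-only assumption, not the paper's actual construction): pricing a query by a monotone, subadditive function of the information it reveals rules out the basic arbitrage where a buyer assembles a cheap bundle of queries that jointly answers an expensive one:

```python
def price(support, weight):
    """Price a query by the total weight of the (hypothetical) candidate
    database instances its answer rules out.

    A sum of non-negative weights is monotone (revealing more never costs
    less) and subadditive (a bundle never costs less than the query it
    jointly determines), the two basic arbitrage-freeness conditions.
    """
    return sum(weight[i] for i in support)
```

The paper's result generalizes far beyond this modular example, characterizing exactly which functions over the lattice of query information yield arbitrage-free prices.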

Speaker Bio :

Shaleen is a second year grad student working with Prof. Paris Koutris. His interest is in the area of database theory.

• November 14 2016
Artificial Intelligence and Machine Learning Innovation at Intuit
2310 CS 12:15 PM
Abstract :

Leaders from Intuit's Innovation & Advanced Technology Group look at the horizon in Artificial Intelligence and share what Intuit is doing in the space. With 37 million customers across personal finance, accounting, and tax products, Intuit is uniquely positioned to address this topic. Not only did Intuit use AI to create the "Tax Knowledge Engine", a system that understands the American tax code by analyzing the thirty million tax returns facilitated through TurboTax annually, but the company is also exploring a wide variety of new modes of artificial intelligence and conversational user interfaces.

Speaker Bio :

Bharath Kadaba leads Intuit's Innovation and Advanced Technology group and serves as Intuit's Chief Innovation Officer. During Bharath's eight years with Intuit, he has led various teams across Intuit in exploring and creating right-for-me offerings for Global Small Business, Accountants and Consumers. Prior to Intuit, he was the SVP of Media Engineering at Yahoo, VP/GM of Application Integration at Siebel Systems, and led engineering at two high profile Internet startups in Silicon Valley. Early in his career he led research and development on Gigabit Internet Routers at the IBM T.J. Watson Labs. Bharath earned a Ph.D. in Information Theory and Networks from the University of Hawaii.

Michael J. Radwin leads Intuit's Innovation & Advanced Technology Data Science team, where he turns forward-looking data ideas into working prototypes and products. Prior to Intuit, Radwin was VP Engineering of Anchor Intelligence, which used machine learning ensemble methods to fight online advertising fraud. Earlier, Radwin was Director of Engineering at Yahoo!, where he built ad-targeting and personalization algorithms with neural networks and naive Bayesian classifiers, and scaled web platform technologies Apache and PHP.

Pari Lingampally is an alumnus of UW-Madison, who graduated with a degree in Computer Engineering in 2014. At Intuit, he has worked on multiple projects that ranged from a novel identity management system to an in-production algorithm used in TurboTax to help catch user-typed errors. More recently, he has worked on a project that experimented with using Google's TensorFlow for user prediction based on TurboTax clickstream data.

• November 7 2016
Database Systems Meet Non-Volatile Memory (NVRAM)
Per-Ake (Paul) Larson
2310 CS 12:15 PM
Abstract :

Byte-addressable, non-volatile memory (NVRAM) with close to DRAM speed is becoming a reality. Low-capacity DIMMs (10s of MBs) are already available and high-capacity DIMMs (100s of GB) are expected in 2017. This talk is about how database systems, in particular main-memory databases, can benefit from NVRAM. It will begin with an outline of the characteristics of different types of NVRAM and how operating systems provide applications access to NVRAM. Ensuring that data structures such as indexes in NVRAM can be recovered to a consistent state without data or memory loss after a crash is challenging. The talk will discuss what causes the difficulties and how they can be overcome. It will then show how NVRAM can be used to greatly reduce the latency of commit processing and replication. By storing a main-memory database, including indexes, in NVRAM it is possible to achieve near-instant recovery after a crash. The final part of the talk will discuss how this can be achieved.

Speaker Bio :

Paul has conducted research in the database field for over 35 years. He served as a Professor in the Department of Computer Science at the University of Waterloo for 15 years and as a Principal Researcher at Microsoft Research for close to 20 years. Paul is a Fellow of the ACM. He has worked in a variety of areas - file structures, materialized views, query processing, query optimization, column stores, and main-memory databases among others. Paul collaborated closely with the SQL Server team to drastically improve SQL Server performance by adding column store indexes, a novel main-memory engine (Hekaton), and support for real-time analytics.

• October 28 2016
Data Systems that are Easy to Design, Tune and Use
Stratos Idreos
2310 CS 11:00 AM
Abstract :

How far away are we from a future where a data management system sits in the critical path of everything we do? Already today we often go through a data system in order to do several basic tasks, e.g., to pay at the grocery store, to book a flight, to find out where our friends are and even to get coffee. Businesses and sciences are increasingly recognizing the value of storing and analyzing vast amounts of data. Other than the expected path towards an exploding number of data-driven businesses and scientific scenarios in the next few years, in this talk we also envision a future where data becomes readily available and its power can be harnessed by everyone. What both scenarios have in common is a need for new kinds of data systems that "just work"; they are easy to design, easy to tune, and easy to use. We will discuss this vision and our recent efforts towards 1) adaptive data systems that can adapt to data and access patterns on-the-fly, 2) self-designing data systems that make it easy to spin off and test new data system architectures and 3) curious data systems that make it easy to explore data even if we do not know what queries to ask.

Speaker Bio :

Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory at Harvard SEAS. Stratos works on data system architectures with emphasis on how we can make it easy to design efficient data systems as applications and hardware keep evolving and on how we can make it easy to use these systems even for non-experts. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Early Career Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.

• October 24 2016
Learning-based Cost Management for Cloud Databases
Olga Papaemmanouil
2310 CS 12:15 PM
Abstract :

Cloud computing has become one of the most active areas of computer science research, in large part because it allows computing to behave like a general utility that is always available on demand. While existing cloud infrastructures and services significantly reduce application development time, cloud data management applications must still expend significant effort to manage their monetary cost, since that cost often depends on a number of decisions including, but not limited to, performance goals, resource provisioning and workload allocation. These tasks depend on application-specific workload characteristics and performance objectives, and today their implementation burden is left to application developers.
We argue for a substantial shift away from human-crafted solutions and towards leveraging machine learning algorithms to address the above challenges. These algorithms can be trained on application-specific properties and customized performance goals to automatically learn how to provision resources as well as schedule the execution of incoming query workloads with low cost. Towards this vision, we have developed WiSeDB, a learning-based cost management service for cloud-deployed data management applications. In this talk, I will discuss how WiSeDB uses (a) supervised learning to automatically learn cost-effective models for guiding query placement, scheduling, and resource provisioning decisions for batch processing, and (b) reinforcement learning to offer low-cost online processing solutions, while being adaptive to resource availability and decoupled from notoriously inaccurate performance prediction models.

Speaker Bio :

Olga Papaemmanouil has been an Assistant Professor in the Department of Computer Science at Brandeis University since January 2009. Her research interest lies in the area of data management with a recent focus on cloud databases, data exploration, query optimization and query performance prediction. She is the recipient of an NSF Career Award (2013), an ACM SIGMOD Best Demonstration Award (2005) and a Paris Kanellakis Fellowship from Brown University (2002). She received an undergraduate degree in Computer Engineering and Informatics from the University of Patras, Greece, in 1999, an Sc.M. in Information Systems from the University of Economics and Business in Athens, Greece, in 2002, and a Ph.D. in Computer Science from Brown University in 2008.

• October 17 2016
Self-diagnosing and self-healing indexes
Goetz Graefe
2310 CS 12:15 PM
Abstract :

Work done by Goetz Graefe and Harumi Kuno (HP) and Bernhard Seeger (U Marburg)
The integrity of B-tree structures can become compromised for many reasons. Even if all programming errors in a database management system could be avoided, failures in the various layers of hardware and software beneath the database kernel can result in inconsistencies. Since these inconsistencies manifest themselves in unpredictable and sometimes disastrous ways, all commercial database management systems include mechanisms to verify the integrity and trustworthiness of an individual index and of a set of related indexes, and all vendors recommend index verification as part of regular database maintenance similar to regular backups.
Instead, or in addition, we suggest verifying the B-tree structure in each traversal. With carefully designed tree structure and node contents, a root-to-leaf pass can verify all nodes along its path comprehensively, i.e., it can verify all B-tree invariants including its consistency constraints with respect to its siblings and neighbors. Thus, instead of verifying the B-tree structure and contents offline, i.e., instead of the traditional approach, normal application execution verifies the structure frequently and efficiently. Frequent verification and fast access to log records permit automatic, reliable, and efficient recovery of the correct, up-to-date page contents. Our contribution is an index structure that gives reliable and efficient access to all relevant log records and thus enables root cause analysis during testing and automatic recovery after deployment.
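A minimal sketch of in-traversal verification (hypothetical code, not the authors' implementation): the root-to-leaf search threads fence-key bounds down the tree and asserts that every node's keys respect them, so each ordinary lookup doubles as a partial integrity check:

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []
        self.leaf = not self.children

def verified_search(node, key, low=float("-inf"), high=float("inf")):
    """Search while verifying invariants along the root-to-leaf path:
    every key in a node must lie in the bounds (low, high] inherited
    from its ancestors (a simplified stand-in for fence keys)."""
    assert all(low < k <= high for k in node.keys), "B-tree invariant violated"
    if node.leaf:
        return key in node.keys
    # Descend into child i, whose key range is (keys[i-1], keys[i]],
    # tightening the bounds for the next level's check.
    i = bisect.bisect_left(node.keys, key)
    child_low = node.keys[i - 1] if i > 0 else low
    child_high = node.keys[i] if i < len(node.keys) else high
    return verified_search(node.children[i], key, child_low, child_high)
```

The design discussed in the talk goes much further (sibling/neighbor consistency, log-record access for recovery), but the principle is the same: verification piggybacks on traversals already being paid for.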

Speaker Bio :

Goetz Graefe recently joined Google's Madison office after almost 10 years with HP Labs, over 10 years with Microsoft SQL Server, and 7 years in academia. He graduated with an MS in 1984 and a PhD in 1987, both at UW-Madison. He has published on database query optimization, query execution, indexing, transactions, recovery, and concurrency control; and has meandered all over from the ivory tower to release management.
The work presented today was done at HP Labs.

• October 10 2016
Speeding up Machine Learning using Graphs and Codes
Dimitris Papailiopoulos
4310 CS 12:15 PM
Abstract :

I will talk about three simple combinatorial ideas to speed up parallel and distributed learning algorithms. We will start off with serial equivalence in asynchronous parallel machine learning, its significance, and how we can efficiently achieve it using a recent result on graph phase transitions. We will continue on the issue of stragglers (i.e., slow nodes) in distributed systems; here, we will use erasure codes to robustify gradient based algorithms against delays. In our last example, we will reduce the high communication cost of data-shuffling in distributed learning, using the seemingly unrelated notion of coded caching. For all the above, we will see real world experiments that indicate how these simple ideas can significantly improve the speed and accuracy of large-scale learning.
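One concrete instance of the erasure-coding idea is the classic gradient-coding construction; the exact scheme in the talk may differ. With three workers, each sends a single linear combination of the partial gradients, and the full gradient sum is recoverable from the responses of any two workers, so one straggler can simply be ignored:

```python
def encode(g1, g2, g3):
    """Each worker computes one coded combination of partial gradients."""
    return (g1 / 2 + g2, g2 - g3, g1 / 2 + g3)

# Decoding coefficients: any two survivors suffice to recover g1 + g2 + g3.
#   2*w0 - w1 = g1 + g2 + g3,  w0 + w2 = g1 + g2 + g3,  w1 + 2*w2 = g1 + g2 + g3
DECODE = {(0, 1): (2, -1), (0, 2): (1, 1), (1, 2): (1, 2)}

def recover(sent, survivors):
    """Combine the two responding workers' messages into the exact sum."""
    a, b = DECODE[survivors]
    return a * sent[survivors[0]] + b * sent[survivors[1]]
```

The price of straggler tolerance is redundancy: each worker touches two of the three data partitions instead of one.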

Speaker Bio :

I am an assistant professor of Electrical and Computer Engineering at the University of Wisconsin-Madison, and a faculty affiliate with the Optimization group at the Wisconsin Institute for Discovery. My research lies in the intersection of machine learning, coding theory, and distributed systems. I am particularly interested in coordination-avoiding algorithms for large-scale machine learning, and on using erasure codes to speed up distributed computation.

Before coming to Madison, I spent two wonderful years as a postdoc at UC Berkeley. At Berkeley, I was a member of the AMPLab and BLISS, and had the pleasure to collaborate with Kannan Ramchandran and Ben Recht, on several fun projects. I received my Ph.D. in 2014 from UT Austin, where I was fortunate to be advised by Alex Dimakis. Before UT, I spent 3.5 years as a grad student at USC. Before all that, I received my M.Sc. (2009) and ECE Diploma (2007) from the Technical University of Crete (TUC), located in the beautiful city of Chania.

• October 3 2016
Toward a System Building Agenda for Data Integration
AnHai Doan
2310 CS 12:15 PM
Abstract :

Data Integration (DI) has long been an important topic in data management, and will become even more so in the age of Big Data and data science. Most DI work so far has focused on developing algorithms. In this talk I argue that we should devote far more effort to building DI systems, in order to truly advance the field. Toward this goal, I begin by drawing on my recent industrial experience to discuss the limitations of current DI systems. Next, I propose an agenda to build a new kind of DI system to address these limitations. These systems guide users through the DI workflow, step by step. They provide tools to address the "pain points" of the steps, and these tools are built on top of the Python data science and big data ecosystem (PyData). I discuss how to foster an ecosystem of such tools within PyData, then use it to build DI systems for collaborative/cloud/crowd/lay user settings. As an example of such systems, I discuss "hands-off" DI systems, which solve the entire DI task using only crowdsourcing. Finally, I discuss ongoing work at Wisconsin, which suggests that these DI systems are highly promising and that building them raises many interesting research challenges. If time permits, I will discuss how this DI agenda can be extended into a broader R&D agenda for data science.

• September 26 2016
Ava - From data to insights through conversations
Rogers Jeffrey Leo John and Navneet Potti
2310 CS 12:15 PM
Speaker Bio :

Rogers is a second year grad student working with Dr. Jignesh Patel. His interests are broadly in the areas of Machine Learning, NLP, and Data Science. He hopes to add some more to this bio during his time at Wisconsin.

Navneet Potti is a 4th year graduate student at UW-Madison, working with Prof Jignesh Patel on the Quickstep project. His primary research interests broadly lie in all aspects of developing high performance data analytics platforms, ranging from algorithmic problems in approximate computation and robust query optimization to engineering challenges in storage and compute optimization. He also works on developing systems to simplify data management and democratize data analytics. He graduated from IIT Madras in 2010 and then worked for Goldman Sachs until 2013. He is currently affiliated with Microsoft Gray Systems Lab and has spent summers doing internships at IBM Almaden, working on their SQL-on-Hadoop system Big SQL, as well as at Pivotal, working on their Greenplum database.

• September 19 2016