Courses

Graduate Level

CS 774 - Data Exploration, Cleaning, and Integration for Data Science

Big Data is often said to deal with four Vs: volume, velocity, variety, and veracity. This course focuses on the variety and veracity challenges, which often arise in data science projects. In many such projects, data is incorrect, hard to understand, and drawn from a variety of sources. Data scientists often spend 80% of their effort exploring, cleaning, and integrating this data before analysis can be carried out to extract insights. As a result, managing variety and veracity has received significant attention. Students will study these topics, understand their challenges, and discuss solutions. These solutions often require data management, machine learning, big data scaling, cloud, crowdsourcing, and user interaction techniques.

Previous Offerings
Spring 2025 (AnHai Doan)

CS 764 - Topics in Database Management Systems

This course covers a number of advanced topics in the development of database management systems (DBMS) and the modern applications of databases. The topics discussed include advanced concurrency control and recovery, query processing and optimization, advanced access methods, parallel and distributed data systems, extensible data systems, implications of cloud computing for data platforms, and data analysis on large datasets.

Previous Offerings
Fall 2025 (Xiangyao Yu)
Fall 2023 (Goetz Graefe)
Spring 2020 (Goetz Graefe, Alan Halverson)
Fall 2019 (Jignesh Patel)
Fall 2018 (Alan Halverson)
Fall 2017 (Jignesh Patel)
Fall 2016 (Jignesh Patel)

CS 784 - Foundations of Database Management

The goal of this course is to cover foundational concepts in data management. We will study classic database theory as well as recent developments and new areas of research. The first part of the class will focus on query languages and their complexity, while the second part will focus on advanced topics in data management, such as provenance, stream processing, privacy, and uncertain data. Topics we will cover include conjunctive queries (query containment, query complexity, worst-case optimal algorithms), Datalog (semantics, evaluation, optimization techniques), parallel query evaluation, stream processing, uncertain/incomplete data (repairs, probabilistic databases), provenance, and differential privacy.

Previous Offerings
Fall 2024 (Paris Koutris)
Spring 2023 (Paris Koutris)
Spring 2022 (Paris Koutris)
Spring 2021 (Paris Koutris)
Spring 2017 (Paris Koutris)
Spring 2016 (AnHai Doan)
Fall 2015 (AnHai Doan)

Undergraduate Level

CS 564 - Database Management Systems - Design and Implementation

CS 564 is designed to give students a solid background in database management systems (DBMSs), particularly relational ones. We will examine such systems from two perspectives: that of a DBMS user and that of a DBMS implementor. Approximately half of the course material will focus on the use of a DBMS. We will introduce the concept of a data model, the entity-relationship (ER) model, and the relational model, and we will learn how to use the SQL query language. We will also cover logical and physical database design issues. The other half of the course will concentrate on DBMS implementation. We will cover file organization, various indexing methods, and techniques for external sorting. We will also learn how a DBMS implements relational operators, as well as the basics of query optimization.

Previous Offerings
Fall 2025 (Paris Koutris, AnHai Doan)
Spring 2023 (Xiangyao Yu, Kevin Gaffney)
Fall 2022 (Paris Koutris, AnHai Doan)
Spring 2022 (Yannis Chronis, Shaleen Deep)
Fall 2021 (Paris Koutris)
Spring 2021 (Xiangyao Yu)
Fall 2020 (Paris Koutris)
Spring 2020 (Paris Koutris)
Fall 2019 (Goetz Graefe)
Spring 2019 (Ambuj Shatdal)
Fall 2017 (Theo Rekatsinas, Adel Ardalan)
Spring 2017 (Jignesh Patel)
Fall 2016 (Paris Koutris)
Fall 2016 (Ambuj Shatdal)
Fall 2015 (Paris Koutris)
Spring 2015 (AnHai Doan)

CS 639 - Data Management for Data Science

Data science incorporates practices from a variety of fields including statistics, machine learning, databases, distributed systems, algorithms, data warehousing, high-performance computing, and visualization. Thus, at a minimum, today's data scientist needs to have familiarity with: data processing and management tools like relational databases and NoSQL for processing large volumes of data; scripting languages like Python for quickly writing programs to clean and transform messy raw data; basic machine learning and data mining algorithms for analyzing the data; statistical computing environments for writing analysis scripts; and visualization tools for presentation and communication of analysis results. This class will study techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Students will learn how to model and reason about data, and how to process and manipulate it in various ways. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, MapReduce, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.

Previous Offerings