Courses
Graduate Level
Big Data is often said to deal with four Vs: volume, velocity, variety, and veracity. This course focuses on the variety and veracity challenges, which often arise in data science projects. In many such projects, data is incorrect, hard to understand, and drawn from a variety of sources. Data scientists often spend 80% of their effort exploring, cleaning, and integrating this data before analysis can be carried out to extract insights. As a result, managing variety and veracity has received significant attention. We will study these topics, understand their challenges, and discuss solutions. These solutions often require techniques from data management, machine learning, big data scaling, cloud computing, crowdsourcing, and user interaction.
Previous Offerings
This course covers a number of advanced topics in the development of database management systems (DBMS) and the modern applications of databases. The topics discussed include advanced concurrency control and recovery, query processing and optimization, advanced access methods, parallel and distributed data systems, extensible data systems, implications of cloud computing for data platforms, and data analysis on large datasets.
Previous Offerings
The goal of this course is to cover foundational concepts in data management. We will study classic database theory as well as recent developments and new areas of research. The first part of the class will focus on query languages and their complexity, while the second part will focus on advanced topics in data management, such as provenance, stream processing, privacy, and uncertain data. Topics we will cover include conjunctive queries (query containment, query complexity, worst-case optimal algorithms), Datalog (semantics, evaluation, optimization techniques), parallel query evaluation, stream processing, uncertain/incomplete data (repairs, probabilistic databases), provenance, and differential privacy.
Previous Offerings
Undergraduate Level
CS 564 is designed to give students a solid background in database management systems (DBMSs), particularly relational ones. We will examine such systems from two perspectives: that of a DBMS user, and that of a DBMS implementor. Approximately half of the course material will focus on the use of a DBMS. We will introduce the concept of a data model, the entity-relationship (ER) model, and the relational model, and we will learn how to use the SQL query language. We will also cover logical and physical database design issues. The other half of the course will concentrate on DBMS implementation. We will cover file organization, various indexing methods, and techniques for external sorting. We will also learn how a DBMS implements relational operators, and the basics of query optimization.
Previous Offerings
Data science incorporates practices from a variety of fields including statistics, machine learning, databases, distributed systems, algorithms, data warehousing, high-performance computing, and visualization. Thus, at a minimum, today's data scientist needs to have familiarity with: data processing and management tools like relational databases and NoSQL for processing large volumes of data; scripting languages like Python for quickly writing programs to clean and transform messy raw data; basic machine learning and data mining algorithms for analyzing the data; statistical computing environments for writing analysis scripts; and visualization tools for presentation and communication of analysis results. This class will study techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Students will learn how to model and reason about data, and how to process and manipulate it in various ways. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, MapReduce, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.
Previous Offerings