---
layout: page
title: Syllabus
description: Course Syllabus
group: navigation
order: 2
---
{% include JB/setup %}
This schedule is still under development and is subject to change.
{% capture dates %} 01/17/2017 01/19/2017 01/24/2017 01/26/2017 01/31/2017 02/02/2017 02/07/2017 02/09/2017 02/14/2017 02/16/2017 02/21/2017 02/23/2017 02/28/2017 03/02/2017 03/07/2017 03/09/2017 03/14/2017 03/16/2017 03/21/2017 03/23/2017 03/28/2017 03/30/2017 04/04/2017 04/06/2017 04/11/2017 04/13/2017 04/18/2017 04/20/2017 04/25/2017 04/27/2017 05/02/2017 05/04/2017 05/11/2017 {% endcapture %} {% assign dates = dates | split: " " %}
{% include syllabus_entry dates=dates %}
In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class.
- Additional optional reading: Chapter 1 from Doing Data Science and from Data Science from Scratch
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we introduce the data-science life-cycle and explore each stage by analyzing tweets from the 2016 presidential election.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world.
- Slides (pptx, pdf, pdf 6up)
- Wrangler Software (optional)
- Additional reading for the curious:
- Quartz Bad Data Guide
- Bad Data Handbook (O'Reilly book, free on berkeley.edu networks)
- Research Directions in Data Wrangling, Heer et al. 2011.
- Quantitative Data Cleaning For Large Databases, Hellerstein 2008
- Exploratory Data Mining and Cleaning, Dasu and Johnson (book)
{% include syllabus_entry end=true %}
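Cleaning tasks like those above can be sketched in a few lines of Python. The example below is not course material; the messy values and accepted formats are invented to illustrate one common wrangling step, normalizing inconsistently formatted date strings:

```python
# A minimal data-cleaning sketch: normalize messy date strings to ISO 8601.
from datetime import datetime

# Hypothetical messy values of the kind a real dataset might contain.
RAW_DATES = ["01/17/2017", "2017-01-19", "Jan 24, 2017", "not a date"]

CANDIDATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%b %d, %Y"]

def normalize_date(raw):
    """Try each known format; return an ISO date or None if unparseable."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

cleaned = [normalize_date(d) for d in RAW_DATES]
print(cleaned)  # ['2017-01-17', '2017-01-19', '2017-01-24', None]
```

Returning `None` instead of guessing keeps bad records visible, which is usually preferable when assessing data quality.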
{% include syllabus_entry dates=dates %}
In this lecture we provide an overview of exploratory data analysis (EDA).
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This lecture covers how to effectively visualize and communicate complex results to a broader audience.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will introduce Pandas, dataframe manipulation, Python visualization, and some of the batch-oriented philosophy of scalable data processing.
{% include syllabus_entry end=true %}
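As a taste of the dataframe manipulation introduced here, the following sketch (assuming pandas is installed; the data is made up) shows the split-apply-combine pattern with `groupby`:

```python
# A small pandas sketch: aggregate invented per-tweet counts by candidate.
import pandas as pd

df = pd.DataFrame({
    "candidate": ["A", "A", "B", "B", "B"],
    "retweets":  [10, 30, 5, 15, 25],
})

# Split-apply-combine: one summary row per candidate.
summary = df.groupby("candidate")["retweets"].agg(["count", "mean"])
print(summary)  # A: count 2, mean 20.0; B: count 3, mean 15.0
```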
{% include syllabus_entry dates=dates %}
In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we introduce SQL and the relational model.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics.
- Slides (continued from last lecture)
- Extended Notebook: (html no output, ipynb no output, data)
- Additional resources for the curious
- CS186 Slides, 2016. PPTX, PDF, Lecture Video 1, Lecture Video 2, Lecture Video 3
- PostgreSQL Manual
- SQLfiddle
{% include syllabus_entry end=true %}
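The aggregation ideas from this lecture can be tried without a PostgreSQL server using Python's built-in `sqlite3` module; the table below is invented for illustration:

```python
# SQL aggregation sketched with the stdlib sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grades (student TEXT, assignment TEXT, score REAL);
    INSERT INTO grades VALUES
        ('ann', 'hw1', 90), ('ann', 'hw2', 80),
        ('bob', 'hw1', 70), ('bob', 'hw2', 100);
""")

# GROUP BY computes one summary row per student.
rows = conn.execute("""
    SELECT student, COUNT(*) AS n, AVG(score) AS avg_score
    FROM grades
    GROUP BY student
    ORDER BY student
""").fetchall()
print(rows)  # [('ann', 2, 85.0), ('bob', 2, 85.0)]
```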
{% include syllabus_entry dates=dates %}
In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions, and user-defined aggregates.
- Extended Notebook: (html no output, ipynb no output)
- Gonzalez follow-up notebook (html, ipynb)
{% include syllabus_entry end=true %}
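A join combined with a common table expression (CTE), as covered in this lecture, can be sketched the same way with the stdlib `sqlite3` module; the schema and rows here are invented:

```python
# Join + CTE sketched with sqlite3 on an invented enrollment schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (sid INTEGER, name TEXT);
    CREATE TABLE enrolled (sid INTEGER, course TEXT);
    INSERT INTO students VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO enrolled VALUES (1, 'ds100'), (1, 'cs186'), (2, 'ds100');
""")

# The CTE computes per-student course counts; the join attaches names.
rows = conn.execute("""
    WITH counts AS (
        SELECT sid, COUNT(*) AS n FROM enrolled GROUP BY sid
    )
    SELECT s.name, c.n
    FROM students s JOIN counts c ON s.sid = c.sid
    ORDER BY s.name
""").fetchall()
print(rows)  # [('ann', 2), ('bob', 1)]
```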
{% include syllabus_entry dates=dates %}
In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we fit basic models to data by applying the method of maximum likelihood estimation.
{% include syllabus_entry end=true %}
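For a concrete instance of maximum likelihood estimation, consider a Bernoulli parameter p, whose MLE works out to the sample mean; the flips below are invented:

```python
# MLE for a Bernoulli parameter p on invented coin-flip data.
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]  # 8 flips, 6 heads

# The log-likelihood is sum(x*log(p) + (1-x)*log(1-p)); setting its
# derivative to zero gives p_hat = (number of ones) / n.
p_hat = sum(data) / len(data)
print(p_hat)  # 0.75

# Sanity check: the log-likelihood at p_hat beats nearby values of p.
def log_lik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

assert log_lik(p_hat) > log_lik(0.5) and log_lik(p_hat) > log_lik(0.9)
```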
{% include syllabus_entry dates=dates %}
This lecture will continue the discussion of the method of maximum likelihood.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This may change in the weeks before class starts as we adjust the schedule.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.
{% include syllabus_entry end=true %}
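The empirical-risk-minimization view of least squares can be made concrete: minimizing the average squared loss over a dataset yields the familiar closed-form slope and intercept. A sketch on invented data:

```python
# Least squares as empirical risk minimization on invented (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]  # roughly y = 2x

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Setting the gradient of (1/n) * sum((y - a*x - b)^2) to zero yields:
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# The empirical risk (mean squared error) of the fitted line:
risk = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
```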
{% include syllabus_entry dates=dates %}
This lecture continues the discussion of least squares regression through the lens of empirical risk minimization and covers more of the key modeling assumptions.
- Reading: Chapter 3.1
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher dimensional spaces. We will address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to address it.
The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials.
- An archive (zip file) of all notebooks, data, and figures for the regression and subsequent over-fitting lectures.
- Optional reading: Chapter 3.1, 3.2.
{% include syllabus_entry end=true %}
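A minimal holdout-validation sketch on invented data illustrates the idea behind cross-validation: fit on a training split, then measure error on held-out points; cross-validation repeats this over several splits.

```python
# Holdout validation: fit a line on a training split, evaluate on
# held-out points. All data here is invented.
train = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.1)]
val   = [(5.0, 10.2), (6.0, 11.8)]

def fit_line(pairs):
    """Closed-form least squares slope and intercept."""
    n = len(pairs)
    xb = sum(x for x, _ in pairs) / n
    yb = sum(y for _, y in pairs) / n
    a = sum((x - xb) * (y - yb) for x, y in pairs) / \
        sum((x - xb) ** 2 for x, _ in pairs)
    return a, yb - a * xb

def mse(pairs, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in pairs) / len(pairs)

a, b = fit_line(train)
# Validation error on unseen points estimates generalization error and
# is typically larger than training error.
print(round(mse(train, a, b), 3), round(mse(val, a, b), 3))
```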
{% include syllabus_entry dates=dates %}
In this lecture we continue the discussion from the last lecture pushing further into feature engineering.
- An archive (zip file) of all notebooks, data, and figures for the regression and subsequent over-fitting lectures.
- Optional reading: Chapter 2.1, 2.2.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will continue our exploration of over-fitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.
- Interactive notebook on cross-validation and the bias-variance tradeoff: (html, ipynb)
- An archive (zip file) of all notebooks, data, and figures for the regression and subsequent over-fitting lectures.
- An alternative derivation of the bias-variance tradeoff provided by Professor Yu (pdf)
{% include syllabus_entry end=true %}
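L2 regularization can be seen in closed form in the simplest setting, one weight and no intercept: minimizing sum((y - w*x)^2) + lam * w^2 gives w = sum(x*y) / (sum(x^2) + lam), so larger lam shrinks the weight toward zero. A sketch on invented data:

```python
# Ridge (L2-regularized) regression with one weight and no intercept.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

def ridge_weight(lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# lam = 0 recovers the unregularized fit w = 2.0; increasing lam
# shrinks the weight toward zero.
for lam in [0.0, 1.0, 10.0]:
    print(lam, round(ridge_weight(lam), 3))
```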
{% include syllabus_entry dates=dates %}
In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression.
{% include syllabus_entry end=true %}
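A toy logistic regression, trained by gradient descent on one invented feature, illustrates where this discussion is headed (the data, step size, and iteration count here are arbitrary choices for illustration):

```python
# Logistic regression on a single feature via gradient descent.
import math

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]  # labels separable around x = 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.0
for _ in range(200):
    # Gradient of the average negative log-likelihood w.r.t. w.
    grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 1.0 * grad  # fixed step size

# Positive w means larger x pushes predictions toward class 1.
assert w > 0 and sigmoid(w * 2.0) > 0.9 > sigmoid(w * -2.0)
print(round(sigmoid(w * 1.0), 2))
```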
{% include syllabus_entry dates=dates %}
In this lecture we will finish our discussion on logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work towards the more general EM algorithm.
- We will continue to use the previous notebook on logistic regression.
- K-Means Clustering tutorial on scikit-learn
{% include syllabus_entry end=true %}
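The K-means loop (Lloyd's algorithm) alternates an assignment step and an update step; below is a minimal 1-D sketch on invented points with deliberately poor starting centers:

```python
# K-means (Lloyd's algorithm) on 1-D points with k = 2.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers = [0.0, 5.0]  # deliberately poor initial guesses

for _ in range(10):
    # Assignment step: attach each point to its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        clusters[nearest].append(p)
    # Update step: move each center to the mean of its cluster
    # (keeping the old center if a cluster ever comes up empty).
    centers = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]

print(centers)  # [1.5, 9.5]
```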
{% include syllabus_entry dates=dates %}
This lecture will continue to cover EM and more general mixed membership clustering techniques.
- Silhouette analysis tutorial on scikit-learn.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into Apache Spark, a Map-Reduce system developed at Berkeley. We will discuss how to use these computational frameworks to scale data processing.
- Notebook demonstrating distributed least squares linear regression in Apache Spark: Map-Reduce Cloud Notebook
- The Apache Spark programming guide provides a fairly detailed overview of how to use Spark. Be sure to switch the code examples to Python by selecting the Python tab above each code snippet.
- Python RDD API
- Python Dataframe API
- Databricks Cloud Apache Spark tutorial
- Information about using Databricks Cloud which we will be using for homework.
{% include syllabus_entry end=true %}
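The Map-Reduce model itself can be sketched in plain Python without Spark: map each record to (key, value) pairs, group by key, then reduce each group with an associative function. Word count, the canonical example, on invented lines:

```python
# The Map-Reduce model emulated in plain Python (no Spark required).
from functools import reduce
from collections import defaultdict

lines = ["the quick fox", "the lazy dog", "the fox"]

# Map phase: one (word, 1) pair per word, as a flat list.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: combine each key's values with an associative function.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```

In Spark the same pipeline is expressed with `flatMap`, `map`, and `reduceByKey`, with the shuffle handled by the framework across machines.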
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion on Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on deep learning as a generalization of the ideas in logistic regression.
- Slides: (pptx, pdf, handout)
- Notebooks: databricks cloud
- Slides on Gender Bias: (pptx, pdf, handout)
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.
- Slides: pdf
{% include syllabus_entry end=true %}
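PCA's core computation, finding the top eigenvector of the covariance matrix, can be sketched with power iteration; the 2-D data below is invented:

```python
# PCA's first principal direction via power iteration on 2-D toy data.
import math

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Entries of the 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly apply the matrix and renormalize; the
# iterate converges to the eigenvector of the largest eigenvalue.
v = (1.0, 0.0)
for _ in range(50):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

# The data lies roughly along y = x, so the first principal direction
# points roughly along the diagonal.
print(round(v[0], 2), round(v[1], 2))
```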
{% include syllabus_entry dates=dates %}
This will be part one of a two-part exam review lecture to be held during the regular lecture slot.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This will be part two of a two-part exam review lecture to be held during the regular lecture slot.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
The final exam will be from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling visit the Berkeley Exam Calendar. {% include syllabus_entry end=true %}
Week | Lecture | Date | Topic
--- | --- | --- | ---