---
layout: page
title: Syllabus
description: Course Syllabus
group: navigation
order: 2
---
{% include JB/setup %}
This schedule is still under development and is subject to change.
{% capture dates %} 01/17/2017 01/19/2017 01/24/2017 01/26/2017 01/31/2017 02/02/2017 02/07/2017 02/09/2017 02/14/2017 02/16/2017 02/21/2017 02/23/2017 02/28/2017 03/02/2017 03/07/2017 03/09/2017 03/14/2017 03/16/2017 03/21/2017 03/23/2017 03/28/2017 03/30/2017 04/04/2017 04/06/2017 04/11/2017 04/13/2017 04/18/2017 04/20/2017 04/25/2017 04/27/2017 05/02/2017 05/04/2017 05/11/2017 {% endcapture %} {% assign dates = dates | split: " " %}
{% include syllabus_entry dates=dates %}
In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class.
- Additional optional reading: Chapter 1 from Doing Data Science and from Data Science from Scratch
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we introduce the data-science life-cycle and explore each stage by analyzing tweets from the 2016 presidential election.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world.
- Slides (pptx, pdf, pdf 6up)
- Wrangler Software (optional)
- Additional reading for the curious:
- Quartz Bad Data Guide
- Bad Data Handbook (O'Reilly book, free on berkeley.edu networks)
- Research Directions in Data Wrangling, Heer et al. 2011.
- Quantitative Data Cleaning For Large Databases, Hellerstein 2008
- Exploratory Data Mining and Cleaning, Dasu and Johnson (book)
{% include syllabus_entry end=true %}
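Cleaning tasks like those above can be sketched in a few lines of Python. The example below is not course material; the messy values and accepted formats are invented to illustrate one common wrangling step, normalizing inconsistently formatted date strings:

```python
# A minimal data-cleaning sketch: normalize messy date strings to ISO 8601.
from datetime import datetime

# Hypothetical messy values of the kind a real dataset might contain.
RAW_DATES = ["01/17/2017", "2017-01-19", "Jan 24, 2017", "not a date"]

CANDIDATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%b %d, %Y"]

def normalize_date(raw):
    """Try each known format; return an ISO date or None if unparseable."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

cleaned = [normalize_date(d) for d in RAW_DATES]
print(cleaned)  # ['2017-01-17', '2017-01-19', '2017-01-24', None]
```

Returning `None` instead of guessing keeps bad records visible, which is usually preferable when assessing data quality.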
{% include syllabus_entry dates=dates %}
In this lecture we provide an overview of exploratory data analysis (EDA).
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This lecture covers how to effectively visualize and communicate complex results to a broader audience.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will introduce Pandas, dataframe manipulation, Python visualization, and some of the batch-oriented philosophy of scalable data processing.
{% include syllabus_entry end=true %}
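As a taste of the dataframe manipulation introduced here, the following sketch (assuming pandas is installed; the data is made up) shows the split-apply-combine pattern with `groupby`:

```python
# A small pandas sketch: aggregate invented per-tweet counts by candidate.
import pandas as pd

df = pd.DataFrame({
    "candidate": ["A", "A", "B", "B", "B"],
    "retweets":  [10, 30, 5, 15, 25],
})

# Split-apply-combine: one summary row per candidate.
summary = df.groupby("candidate")["retweets"].agg(["count", "mean"])
print(summary)  # A: count 2, mean 20.0; B: count 3, mean 15.0
```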
{% include syllabus_entry dates=dates %}
In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we introduce SQL and the relational model.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics.
- Slides (continued from last lecture)
- Extended Notebook: (html no output, ipynb no output, data)
- Additional resources for the curious
- CS186 Slides, 2016. PPTX, PDF, Lecture Video 1, Lecture Video 2, Lecture Video 3
- PostgreSQL Manual
- SQLfiddle
{% include syllabus_entry end=true %}
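The aggregation ideas from this lecture can be tried without a PostgreSQL server using Python's built-in `sqlite3` module; the table below is invented for illustration:

```python
# SQL aggregation sketched with the stdlib sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grades (student TEXT, assignment TEXT, score REAL);
    INSERT INTO grades VALUES
        ('ann', 'hw1', 90), ('ann', 'hw2', 80),
        ('bob', 'hw1', 70), ('bob', 'hw2', 100);
""")

# GROUP BY computes one summary row per student.
rows = conn.execute("""
    SELECT student, COUNT(*) AS n, AVG(score) AS avg_score
    FROM grades
    GROUP BY student
    ORDER BY student
""").fetchall()
print(rows)  # [('ann', 2, 85.0), ('bob', 2, 85.0)]
```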
{% include syllabus_entry dates=dates %}
In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions, and user-defined aggregates.
- Extended Notebook: (html no output, ipynb no output)
- Gonzalez follow-up notebook (html, ipynb)
{% include syllabus_entry end=true %}
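A join combined with a common table expression (CTE), as covered in this lecture, can be sketched the same way with the stdlib `sqlite3` module; the schema and rows here are invented:

```python
# Join + CTE sketched with sqlite3 on an invented enrollment schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (sid INTEGER, name TEXT);
    CREATE TABLE enrolled (sid INTEGER, course TEXT);
    INSERT INTO students VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO enrolled VALUES (1, 'ds100'), (1, 'cs186'), (2, 'ds100');
""")

# The CTE computes per-student course counts; the join attaches names.
rows = conn.execute("""
    WITH counts AS (
        SELECT sid, COUNT(*) AS n FROM enrolled GROUP BY sid
    )
    SELECT s.name, c.n
    FROM students s JOIN counts c ON s.sid = c.sid
    ORDER BY s.name
""").fetchall()
print(rows)  # [('ann', 2), ('bob', 1)]
```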
{% include syllabus_entry dates=dates %}
In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we fit basic models to data by applying the method of maximum likelihood estimation.
{% include syllabus_entry end=true %}
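For a concrete instance of maximum likelihood estimation, consider a Bernoulli parameter p, whose MLE works out to the sample mean; the flips below are invented:

```python
# MLE for a Bernoulli parameter p on invented coin-flip data.
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]  # 8 flips, 6 heads

# The log-likelihood is sum(x*log(p) + (1-x)*log(1-p)); setting its
# derivative to zero gives p_hat = (number of ones) / n.
p_hat = sum(data) / len(data)
print(p_hat)  # 0.75

# Sanity check: the log-likelihood at p_hat beats nearby values of p.
def log_lik(p):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

assert log_lik(p_hat) > log_lik(0.5) and log_lik(p_hat) > log_lik(0.9)
```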
{% include syllabus_entry dates=dates %}
This lecture will continue the discussion of the method of maximum likelihood.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This may change in the weeks before class starts as we adjust the schedule.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.
{% include syllabus_entry end=true %}
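The empirical-risk-minimization view of least squares can be made concrete: minimizing the average squared loss over a dataset yields the familiar closed-form slope and intercept. A sketch on invented data:

```python
# Least squares as empirical risk minimization on invented (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]  # roughly y = 2x

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Setting the gradient of (1/n) * sum((y - a*x - b)^2) to zero yields:
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# The empirical risk (mean squared error) of the fitted line:
risk = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
```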
{% include syllabus_entry dates=dates %}
This lecture continues the discussion of least squares regression through the lens of empirical risk minimization and covers more of the key modeling assumptions.
- Reading: Chapter 3.1
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher dimensional spaces. We will address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to address it.
The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials.
- An archive (zip file) of all notebooks, data, and figures for the regression and subsequent over-fitting lectures.
- Optional reading: Chapter 3.1, 3.2.
{% include syllabus_entry end=true %}
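A minimal holdout-validation sketch on invented data illustrates the idea behind cross-validation: fit on a training split, then measure error on held-out points; cross-validation repeats this over several splits.

```python
# Holdout validation: fit a line on a training split, evaluate on
# held-out points. All data here is invented.
train = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.1)]
val   = [(5.0, 10.2), (6.0, 11.8)]

def fit_line(pairs):
    """Closed-form least squares slope and intercept."""
    n = len(pairs)
    xb = sum(x for x, _ in pairs) / n
    yb = sum(y for _, y in pairs) / n
    a = sum((x - xb) * (y - yb) for x, y in pairs) / \
        sum((x - xb) ** 2 for x, _ in pairs)
    return a, yb - a * xb

def mse(pairs, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in pairs) / len(pairs)

a, b = fit_line(train)
# Validation error on unseen points estimates generalization error and
# is typically larger than training error.
print(round(mse(train, a, b), 3), round(mse(val, a, b), 3))
```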
{% include syllabus_entry dates=dates %}
In this lecture we continue the discussion from the last lecture pushing further into feature engineering.
- An archive (zip file) of all notebooks, data, and figures for the regression and subsequent over-fitting lectures.
- Optional reading: Chapter 2.1, 2.2.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will continue our exploration of over-fitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.
- Interactive notebook on cross-validation and the bias-variance tradeoff: (html, ipynb)
- An archive (zip file) of all notebooks, data, and figures for the regression and subsequent over-fitting lectures.
- An alternative derivation of the bias-variance tradeoff provided by Professor Yu (pdf)
{% include syllabus_entry end=true %}
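L2 regularization can be seen in closed form in the simplest setting, one weight and no intercept: minimizing sum((y - w*x)^2) + lam * w^2 gives w = sum(x*y) / (sum(x^2) + lam), so larger lam shrinks the weight toward zero. A sketch on invented data:

```python
# Ridge (L2-regularized) regression with one weight and no intercept.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

def ridge_weight(lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# lam = 0 recovers the unregularized fit w = 2.0; increasing lam
# shrinks the weight toward zero.
for lam in [0.0, 1.0, 10.0]:
    print(lam, round(ridge_weight(lam), 3))
```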
{% include syllabus_entry dates=dates %}
In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression.
{% include syllabus_entry end=true %}
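A toy logistic regression, trained by gradient descent on one invented feature, illustrates where this discussion is headed (the data, step size, and iteration count here are arbitrary choices for illustration):

```python
# Logistic regression on a single feature via gradient descent.
import math

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]  # labels separable around x = 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.0
for _ in range(200):
    # Gradient of the average negative log-likelihood w.r.t. w.
    grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 1.0 * grad  # fixed step size

# Positive w means larger x pushes predictions toward class 1.
assert w > 0 and sigmoid(w * 2.0) > 0.9 > sigmoid(w * -2.0)
print(round(sigmoid(w * 1.0), 2))
```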
{% include syllabus_entry dates=dates %}
In this lecture we will finish our discussion on logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work towards the more general EM algorithm.
- We will continue to use the previous notebook on logistic regression.
- K-Means Clustering tutorial on scikit-learn
{% include syllabus_entry end=true %}
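The K-means loop (Lloyd's algorithm) alternates an assignment step and an update step; below is a minimal 1-D sketch on invented points with deliberately poor starting centers:

```python
# K-means (Lloyd's algorithm) on 1-D points with k = 2.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers = [0.0, 5.0]  # deliberately poor initial guesses

for _ in range(10):
    # Assignment step: attach each point to its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        clusters[nearest].append(p)
    # Update step: move each center to the mean of its cluster
    # (keeping the old center if a cluster ever comes up empty).
    centers = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]

print(centers)  # [1.5, 9.5]
```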
{% include syllabus_entry dates=dates %}
This lecture will continue to cover EM and more general mixed membership clustering techniques.
- Silhouette analysis tutorial on scikit-learn.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into Apache Spark, a Map-Reduce system developed at Berkeley. We will discuss how to use these computational frameworks to scale data processing.
- Notebook demonstrating distributed least squares linear regression in Apache Spark: Map-Reduce Cloud Notebook
- The Apache Spark programming guide provides a fairly detailed overview of how to use Spark. Be sure to switch the code examples to Python by selecting the Python tab above each code snippet.
- Python RDD API
- Python Dataframe API
- Databricks Cloud Apache Spark tutorial
- Information about using Databricks Cloud which we will be using for homework.
{% include syllabus_entry end=true %}
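The Map-Reduce model itself can be sketched in plain Python without Spark: map each record to (key, value) pairs, group by key, then reduce each group with an associative function. Word count, the canonical example, on invented lines:

```python
# The Map-Reduce model emulated in plain Python (no Spark required).
from functools import reduce
from collections import defaultdict

lines = ["the quick fox", "the lazy dog", "the fox"]

# Map phase: one (word, 1) pair per word, as a flat list.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: combine each key's values with an associative function.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```

In Spark the same pipeline is expressed with `flatMap`, `map`, and `reduceByKey`, with the shuffle handled by the framework across machines.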
{% include syllabus_entry dates=dates %}
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion on Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on deep learning as a generalization of the ideas in logistic regression.
- Slides: (pptx, pdf, handout)
- Notebooks: databricks cloud
- Slides on Gender Bias: (pptx, pdf, handout)
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.
- Slides: pdf
{% include syllabus_entry end=true %}
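PCA's core computation, finding the top eigenvector of the covariance matrix, can be sketched with power iteration; the 2-D data below is invented:

```python
# PCA's first principal direction via power iteration on 2-D toy data.
import math

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Entries of the 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly apply the matrix and renormalize; the
# iterate converges to the eigenvector of the largest eigenvalue.
v = (1.0, 0.0)
for _ in range(50):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*w)
    v = (w[0] / norm, w[1] / norm)

# The data lies roughly along y = x, so the first principal direction
# points roughly along the diagonal.
print(round(v[0], 2), round(v[1], 2))
```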
{% include syllabus_entry dates=dates %}
This will be part one of a two-part exam review lecture to be held during the regular lecture slot.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
This will be part two of a two-part exam review lecture to be held during the regular lecture slot.
{% include syllabus_entry end=true %}
{% include syllabus_entry dates=dates %}
The final exam will be from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling visit the Berkeley Exam Calendar. {% include syllabus_entry end=true %}
Week | Lecture | Date | Topic
--- | --- | --- | ---