---
layout: page
title: Syllabus
description: Course Syllabus
group: navigation
order: 2
---

{% include JB/setup %}

This schedule is still under development and is subject to change.

{% capture dates %} 01/17/2017 01/19/2017 01/24/2017 01/26/2017 01/31/2017 02/02/2017 02/07/2017 02/09/2017 02/14/2017 02/16/2017 02/21/2017 02/23/2017 02/28/2017 03/02/2017 03/07/2017 03/09/2017 03/14/2017 03/16/2017 03/21/2017 03/23/2017 03/28/2017 03/30/2017 04/04/2017 04/06/2017 04/11/2017 04/13/2017 04/18/2017 04/20/2017 04/25/2017 04/27/2017 05/02/2017 05/04/2017 05/11/2017 {% endcapture %} {% assign dates = dates | split: " " %}

{% include syllabus_entry dates=dates %}

Course Overview [Gonzalez]

In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class.

Lecture Notes

Homework 1 Released

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

The Data Science Lifecycle [Gonzalez]

In this lecture we introduce the data science lifecycle and explore each stage by analyzing tweets from the 2016 presidential election.

Lecture Notes

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Problem Formulation and Experimental Design [Yu]

In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data.

Lecture Notes

Homework 2 Released

Homework 1 Due

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Data Wrangling [Hellerstein]

In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world.
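To give a flavor of these operations, here is a minimal pandas sketch of a few common cleaning steps; the table and its values are hypothetical:

```python
import pandas as pd

# Hypothetical messy data: inconsistent labels, missing values,
# and numbers stored as strings.
raw = pd.DataFrame({
    "city": ["Berkeley", "berkeley ", "Oakland", None],
    "temp": ["72", "68.5", "n/a", "70"],
})

clean = (
    raw.assign(
        # Normalize inconsistently formatted string labels.
        city=lambda df: df["city"].str.strip().str.title(),
        # Coerce numeric strings; unparseable entries become NaN.
        temp=lambda df: pd.to_numeric(df["temp"], errors="coerce"),
    )
    .dropna(subset=["city"])  # Drop rows with no usable key.
)
print(clean)
```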

Lecture Notes

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Exploratory Data Analysis [Nolan]

In this lecture we provide an overview of exploratory data analysis (EDA).

Lecture Notes

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Visualization and Communication [Nolan]

This lecture covers how to effectively visualize and communicate complex results to a broader audience.

Lecture Notes:

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Advanced Python Data Science Tools [Gonzalez]

In this lecture we will introduce Pandas, dataframe manipulation, Python visualization, and some of the batch-oriented philosophy of scalable data processing.
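As a preview, a minimal sketch of Pandas group-by manipulation on hypothetical data:

```python
import pandas as pd

# Hypothetical tweet counts per candidate per day.
df = pd.DataFrame({
    "candidate": ["A", "A", "B", "B"],
    "day":       [1, 2, 1, 2],
    "tweets":    [120, 90, 200, 150],
})

# Group-by aggregation: total and mean tweets per candidate.
summary = df.groupby("candidate")["tweets"].agg(["sum", "mean"])
print(summary)
```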

Lecture Notes:

Homework 3 Released

Homework 2 Due

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Prediction and Inference [Yu]

In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python.
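As a preview of the Python tooling, here is a minimal sketch (assuming scikit-learn is installed) that fits and evaluates a simple classifier on the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out a test set to estimate out-of-sample prediction accuracy.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```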

Lecture Notes:

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Relational Algebra and SQL [Hellerstein]

In this lecture we introduce SQL and the relational model.
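For a first taste, here is a minimal sketch using Python's built-in sqlite3 module; the relation and its rows are hypothetical:

```python
import sqlite3

# In-memory database with a single toy relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ann", "DS", 3.8), ("Bob", "CS", 3.2), ("Cy", "DS", 3.5)],
)

# Selection and projection: the core relational operators.
for row in conn.execute(
    "SELECT name, gpa FROM students WHERE major = 'DS' ORDER BY gpa DESC"
):
    print(row)
```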

Lecture Notes:

{% include syllabus_entry dates=dates %}

SQL Continued [Hellerstein]

In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics.
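A minimal sketch of grouped aggregation, again via sqlite3 on hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("west", 10.0), ("west", 30.0), ("east", 20.0)],
)

# GROUP BY computes summary statistics per group.
query = """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total, AVG(amount) AS mean
    FROM sales
    GROUP BY region
"""
for row in conn.execute(query):
    print(row)
```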

Lecture Notes:

{% include syllabus_entry dates=dates %}

Advanced SQL [Hellerstein]

In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions and user-defined aggregates.
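As an illustration of window functions, a minimal sqlite3 sketch on hypothetical data (this requires a SQLite build of version 3.25 or newer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, exam TEXT, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Ann", "mt", 90), ("Bob", "mt", 80),
     ("Ann", "final", 85), ("Bob", "final", 95)],
)

# A window function ranks rows within each exam without collapsing them.
query = """
    SELECT exam, student, score,
           RANK() OVER (PARTITION BY exam ORDER BY score DESC) AS exam_rank
    FROM scores
"""
for row in conn.execute(query):
    print(row)
```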

Homework 4 Released

Homework 3 Due

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Basic Modeling using Statistical Distributions [Nolan]

In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data.
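A minimal NumPy sketch of the skewness issue, drawing from a log-normal distribution, a common model for right-skewed data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-normal samples: heavy right tail.
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# The mean is pulled far above the median by the tail,
# while a log transform yields a roughly symmetric distribution.
print("mean:", x.mean(), "median:", np.median(x))
print("log-scale mean:", np.log(x).mean(),
      "log-scale median:", np.median(np.log(x)))
```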

Lecture Notes:

{% include syllabus_entry dates=dates %}

Maximum Likelihood Estimation [Nolan]

In this lecture we fit basic models to data by applying the method of maximum likelihood estimation.
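As a minimal sketch of the method (assuming NumPy and SciPy), we can fit the rate of an exponential model by minimizing the negative log-likelihood and compare against the closed-form answer 1/mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000)  # true rate lambda = 0.5

# Negative log-likelihood of an Exponential(lambda) model:
#   -n log(lambda) + lambda * sum(x_i)
def neg_log_lik(lam):
    return -len(x) * np.log(lam) + lam * x.sum()

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print("numeric MLE:", result.x)            # matches the closed form below
print("closed-form MLE 1/mean:", 1 / x.mean())
```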

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Maximum Likelihood Estimation Continued [Nolan]

This lecture will continue the discussion of the method of maximum likelihood.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Midterm Review [Gonzalez]

{% include syllabus_entry dates=dates %}

Midterm

This may change in the weeks before class starts as we adjust the schedule.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Least Squares Regression and Hypothesis Testing [Yu]

This lecture dives into the details of least squares regression through the lens of empirical risk minimization and discusses some of the key modeling assumptions.
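A minimal NumPy sketch of this view: with squared loss the empirical risk minimizer has the familiar closed form theta_hat = (X^T X)^{-1} X^T y, computed here via least squares on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)  # hypothetical linear model

# Design matrix with an intercept column; lstsq minimizes ||X theta - y||^2.
X = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated intercept, slope:", theta_hat)
```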

Homework 4 Due

Homework 5 Released

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Least Squares Regression and Hypothesis Testing [Yu]

This lecture continues our dive into least squares regression through the lens of empirical risk minimization and the associated modeling assumptions.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Feature Engineering, Over-fitting, and Cross Validation [Gonzalez]

In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher dimensional spaces. Finally, we will address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to combat it.
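As a minimal sketch of the idea, the snippet below (scikit-learn on synthetic data) compares polynomial models of increasing degree by cross-validated error rather than training error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (60, 1))
y = np.sin(3 * x).ravel() + rng.normal(0, 0.2, 60)

# Training error always falls with degree; CV error reveals over-fitting.
for degree in [1, 3, 12]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV MSE = {mse:.3f}")
```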

The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials.

- Least-Squares Linear Regression: (html, ipynb)

- Feature Engineering Part 1: (html, ipynb, data)

- An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures.

- Optional reading: Chapter 3.1, 3.2.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Feature Engineering, Over-fitting, and Cross Validation Continued [Gonzalez]

In this lecture we continue the previous discussion, pushing further into feature engineering.

- Feature Engineering Part 1: (html, ipynb)

- Feature Engineering Part 2: (html, ipynb, data)

- An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures.

- Optional reading: Chapter 2.1, 2.2.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Spring Break

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Spring Break

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Regularization and the Bias-Variance Tradeoff [Gonzalez]

In this lecture we will continue our exploration of over-fitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.
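A minimal scikit-learn sketch contrasting the two penalties on synthetic data: the L2 (ridge) penalty shrinks all coefficients, while the L1 (lasso) penalty can drive some exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
# Only the first two features matter; the rest are noise.
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 50)

print("ridge:", np.round(Ridge(alpha=1.0).fit(X, y).coef_, 2))
print("lasso:", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
```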

- Slides: (pptx, pdf, handout)

- Interactive Notebook on Cross Validation and the Bias Variance Tradeoff: (html, ipynb)

- An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures.

- An alternative derivation of the Bias Variance Trade-Off provided by Professor Yu (pdf)

Homework 5 Due

Homework 6 Released

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Logistic Regression [Gonzalez]

In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression.
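As a minimal sketch on synthetic data, scikit-learn's LogisticRegression fits the linear classifier P(y=1|x) = sigmoid(w^T x + b):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature binary classification problem.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           random_state=0)
clf = LogisticRegression().fit(X, y)
print("weights:", clf.coef_, "intercept:", clf.intercept_)
print("predicted probabilities:", clf.predict_proba(X[:3]))
```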

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Finish Logistic Regression and Start K-Means [Gonzalez and Yu]

In this lecture we will finish our discussion of logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work towards the more general EM algorithm.
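A minimal scikit-learn sketch of K-means on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical clusters in 2-D.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

# K-means alternates assigning points to the nearest centroid and
# recomputing centroids: a hard-assignment special case of EM.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
```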

- Part 2 of Logistic Regression Slides: (pptx, pdf, handout)

- We will continue to use the previous notebook on logistic regression.

- K-Means Slides: (pptx, pdf, handout)

Additional Reading:

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Clustering and Expectation Maximization (EM) [Yu]

This lecture will continue to cover EM and more general mixed membership clustering techniques.
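A minimal sketch of soft-assignment clustering via scikit-learn's GaussianMixture, which is fit with EM (the data here is synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300),
                    rng.normal(5, 0.5, 200)]).reshape(-1, 1)

# EM fits a Gaussian mixture with soft (probabilistic) assignments.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("means:", gmm.means_.ravel())
print("soft assignment of first point:", gmm.predict_proba(X[:1]))
```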

Optional Reading:

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Map-Reduce, Spark, and Big Data [Gonzalez]

In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into Apache Spark, a Map-Reduce system developed at Berkeley. We will talk about how to use these computational frameworks to scale data processing.
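As a minimal sketch, here is the classic map-reduce word count expressed in Spark's RDD API (this assumes a local pyspark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize(["the quick brown fox",
                                        "the lazy dog"])
counts = (lines.flatMap(lambda line: line.split())   # map: line -> words
               .map(lambda word: (word, 1))          # map: word -> (word, 1)
               .reduceByKey(lambda a, b: a + b))     # reduce: sum per key
print(counts.collect())
spark.stop()
```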

Additional Reading:

Homework 6 Due

Homework 7 Released

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Guest Lecture on Data Science and Ethics [Charis Thompson]

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Finish Discussion on Spark and Classification

In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion of Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on Deep Learning as a generalization of the ideas in logistic regression.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

PCA and the Berkeley Data Science Major [Nolan and Cathryn Carson]

In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.
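A minimal scikit-learn sketch of PCA on synthetic 5-dimensional data that varies mostly along two directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Data generated from a 2-D latent space plus a little noise.
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + rng.normal(0, 0.05, (200, 5))

pca = PCA(n_components=2).fit(X)
print("variance explained:", pca.explained_variance_ratio_)
X_reduced = pca.transform(X)  # project onto the top two components
```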

Homework 7 Due

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

RRR Review [Hellerstein and Yu]

This will be part one of a two part exam review lecture to be held during the regular lecture slot.

Homework 7 Due (optional extension)

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

RRR Review [Gonzalez and Nolan]

This will be part two of a two part exam review lecture to be held during the regular lecture slot.

{% include syllabus_entry end=true %}

{% include syllabus_entry dates=dates %}

Final Exam

The final exam will be from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling visit the Berkeley Exam Calendar.

{% include syllabus_entry end=true %}

<script type="text/javascript">
  // Highlight the row for the next upcoming lecture and its week cell.
  var current_date = new Date();
  var rows = document.getElementsByTagName("th");
  var finished = false;
  for (var i = 1; i < rows.length && !finished; i++) {
    var r = rows[i];
    if (r.id.startsWith("counter_")) {
      var fields = r.id.split("_");
      var week_div_id = "week_" + fields[2];
      var lecture_date = new Date(fields[1] + " 23:59:00");
      if (current_date <= lecture_date) {
        finished = true;
        r.style.background = "orange";
        r.style.color = "black";
        var week_td = document.getElementById(week_div_id);
        week_td.style.background = "#043361";
        week_td.style.color = "white";
      }
    }
  }
</script>