2015.08.13: Titanic tutorial in Python
-
recap last time: 2015.07.30 Meeting notes
-
introductions: name, department, programming experience, especially Python and R
-
Morgan walking us through interactive Titanic tutorial with Python from Dataquest
-
Ben started in on interactive Titanic tutorial with R from DataCamp. Saved R code to kaggle_titanic/titanic_R.R.
-
discussion on next meetup:
-
Random Forest. continue with Titanic tutorials, eg Mission 75: Improving your submission | Dataquest, moving onto random forest and discussing machine learning / predictive regression techniques in Python and R. See also Machine Learning by Andrew Ng | Stanford's Open Classroom.
-
Visualization. touch base on packages for visualization in Python (matplotlib) and R (ggplot2, ggvis, shiny)
-
-
Morgan walking us through interactive Titanic tutorial with Python from Dataquest
-
Dataquest web interface using Python 3 (vs 2.7 installed by default on Mac). Biggest transition quirk:
# new Python 3 way
print(titanic.head(5))
# old lazy Python 2.7 way
print titanic.head(5)
-
What's pandas? The Python Data Analysis Library. It adds an R-style data frame that allows different variable types in different columns, plus aggregation features, etc.
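To make the data frame idea concrete, here is a minimal sketch with a made-up four-row table (the names and values are invented for illustration, not taken from the Titanic file). It shows columns of different types side by side and one aggregate, the mean of Survived per Sex.

```python
import pandas as pd

# Toy rows, invented for illustration only (not the real Titanic data).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],        # text column
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],     # float column
    "Survived": [0, 1, 1, 0],            # integer column
})

# Each column keeps its own type, as in an R data frame.
print(toy.dtypes)

# Aggregate feature: mean of Survived within each Sex group.
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```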
-
3: Missing data. Why fill in the NAs in Age with the median value? So that rows with a missing age can still contribute their other predictor data. Could instead try to build the model using only rows with no NAs.
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
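A small sketch of the two options discussed here, on an invented four-row table (the values are assumptions for illustration): filling missing ages with the median keeps every row, while dropping rows with a missing age loses their other columns too.

```python
import pandas as pd

# Toy data, invented for illustration; one Age value is missing.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 26.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Option 1: fill the missing age with the median of the known ages.
filled_age = toy["Age"].fillna(toy["Age"].median())

# Option 2: drop any row whose Age is missing (its Fare is lost as well).
dropped = toy.dropna(subset=["Age"])

print(filled_age.tolist())
print(len(toy), "rows before,", len(dropped), "rows after dropping")
```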
-
4: Non-numeric columns
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
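One possible alternative to assigning each code with .loc, sketched on invented rows: Series.map builds the whole numeric column in one step from a dictionary. The male = 0 code is an assumption here (the notes only record female = 1), and the variable names are hypothetical.

```python
import pandas as pd

# Toy column, invented for illustration.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Assumed coding: female = 1 as in the notes, male = 0 as a guess.
sex_codes = {"male": 0, "female": 1}

# map() looks each value up in the dictionary and returns a new column.
toy["SexCode"] = toy["Sex"].map(sex_codes)
print(toy)
```

Writing to a new column keeps the original text values around for checking.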
-
5: Converting the Sex column
-
6: Converting the Embarked column. Port codes in the titanic data: C = Cherbourg; Q = Queenstown; S = Southampton. Note that there are many ways to slice and dice a pandas object. See the .at, .iat, .loc, .iloc and .ix methods: pandas indexing.
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
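A quick sketch of how the indexers differ, on an invented three-row table with a deliberately non-zero index so labels and positions do not coincide. Only .at, .iat, .loc and .iloc are shown, since .ix was deprecated and then removed in pandas 1.0.

```python
import pandas as pd

# Toy table, invented for illustration; row labels are 10, 11, 12.
toy = pd.DataFrame(
    {"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]},
    index=[10, 11, 12],
)

by_label = toy.loc[11, "Fare"]         # row label 11, column label "Fare"
by_position = toy.iloc[1, 1]           # second row, second column
scalar_label = toy.at[10, "Embarked"]  # fast single-value access by label
scalar_pos = toy.iat[2, 0]             # fast single-value access by position

# Boolean mask with .loc, the pattern used in the notes.
cherbourg_fares = toy.loc[toy["Embarked"] == "C", "Fare"]
print(by_label, by_position, scalar_label, scalar_pos)
print(cherbourg_fares.tolist())
```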
-
7: On to machine learning!
-
titanic.shape[0]
is the number of rows in the titanic dataset; shape holds the dimensions as (rows, columns)
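As a tiny illustration on an invented table (not the real data), shape unpacks into a row count and a column count:

```python
import pandas as pd

# Toy table, invented for illustration: 3 rows, 2 columns.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Survived": [0, 1, 1]})

n_rows, n_cols = toy.shape
print(n_rows, "rows and", n_cols, "columns")

# shape[0] is the same number len() reports for a data frame.
same_as_len = toy.shape[0] == len(toy)
```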
-
-
8: Cross validation. Use KFold() to split the data into three parts, train on 2/3 and predict on the other 1/3, rotating so that every row gets a prediction:
- Combine the first two parts, train a model, make predictions on the third.
- Combine the first and third parts, train a model, make predictions on the second.
- Combine the second and third parts, train a model, make predictions on the first.
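The rotation above can be sketched by hand in plain Python. This is not the KFold() implementation itself, just an illustration of the same idea without shuffling; fold_splits is a hypothetical helper name and the nine-row dataset is made up.

```python
def fold_splits(n_rows, n_folds=3):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(fold_splits(9))
for train, test in splits:
    print("train on", train, "-> predict on", test)

# Across the three folds, every row is predicted exactly once.
predicted_rows = sorted(i for _, test in splits for i in test)
```

In a real run the model would be fit on the train rows of each fold and its predictions for the test rows concatenated.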
-
10: Evaluating error
import numpy as np
- How do you get modules and know if they're there? See installing python modules and especially pip.
- Could change the amount of prediction error allowed.
count = 0
accurate = 0
accuracy = 0
for p in predictions:
    if p == titanic["Survived"][count]:
        accurate += 1
    count += 1
# count equals the number of predictions at this point
accuracy = accurate / count
print(accuracy)
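The same accuracy calculation can be written without a loop using numpy. This sketch assumes the predictions are continuous scores that get cut at a threshold (0.5 here, which is the knob to turn when changing the error allowed); the scores and outcomes are invented for illustration.

```python
import numpy as np

# Invented scores and true outcomes, for illustration only.
predictions = np.array([0.1, 0.7, 0.4, 0.9, 0.5])
actual = np.array([0, 1, 1, 1, 0])

# Assumed cutoff: scores strictly above the threshold count as survived (1).
threshold = 0.5
predicted_labels = (predictions > threshold).astype(int)

# Fraction of rows where the predicted label matches the true outcome.
accuracy = np.mean(predicted_labels == actual)
print(accuracy)
```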
-
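On the question of knowing whether a module is there: one way to check from inside Python, sketched with the standard library only. is_installed is a hypothetical helper name; installing is done from the shell, for example with python -m pip install numpy.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# json ships with Python; the second name is deliberately made up.
for name in ["json", "no_such_module_hopefully_12345"]:
    status = "installed" if is_installed(name) else "missing"
    print(name, "->", status)
```

From the shell, python -m pip list shows everything pip knows about.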