2015.08.13: Titanic tutorial in Python

Ben Best edited this page Sep 30, 2015 · 1 revision

Agenda

Titanic Tutorial with Python

  • Morgan walking us through the interactive Titanic tutorial with Python from Dataquest

    • The Dataquest web interface uses Python 3 (vs. 2.7, installed by default on Mac). Biggest transition quirk: print is now a function and needs parentheses.

      # new Python 3 way
      print(titanic.head(5))
      
      # old lazy Python 2.7 way
      print titanic.head(5)
    • What's pandas? The Python Data Analysis Library. It adds an R-style data frame that allows different variable types in different columns, plus aggregation features, etc.
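
      As a rough illustration of the data frame idea, the sketch below builds a tiny frame with mixed column types and runs one aggregate. The rows are made up for the example and are not taken from the Titanic data.

```python
import pandas as pd

# Illustrative rows only, not the real Titanic data.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Survived": [0, 1, 1, 0],
})

# Each column keeps its own type, as in an R data frame.
print(df.dtypes)

# Aggregate: mean age for each value of Sex.
mean_age_by_sex = df.groupby("Sex")["Age"].mean()
print(mean_age_by_sex)
```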

    • 3: Missing data. Why fill in missing (NA) ages with the median value? So that those rows, and their other predictor columns, stay in the data set. An alternative would be to build the model using only the rows that have no NAs.

      titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
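
      To make the trade-off concrete, here is a small sketch comparing the two options on made-up rows (the values and the Fare column are assumptions for illustration). Filling keeps every row; dropping discards the rows with a missing age along with their other columns.

```python
import pandas as pd

# Illustrative rows only; None marks a missing age.
sample = pd.DataFrame({
    "Age": [22.0, None, 30.0, None, 40.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Option 1: fill missing ages with the median of the known ages.
filled = sample.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Option 2: drop every row that has a missing value.
dropped = sample.dropna()

# Compare how many rows survive each approach.
print(len(filled), len(dropped))
```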
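    • 4: Non-numeric columns. Columns such as Sex and Embarked hold strings and need numeric codes before they can be used as predictors.
    • 5: Converting the Sex column. Select the matching rows with .loc and assign a numeric code to each value:

      titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
      titanic.loc[titanic["Sex"] == "female", "Sex"] = 1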
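    • 6: Converting the Embarked column. Per the Titanic data description, Embarked is the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton). Missing values are filled with "S" before the letters are converted to numeric codes. Note that there are many ways to slice and dice a pandas object: see the .at, .iat, .loc, .iloc and .ix indexers in the pandas indexing documentation (.ix was removed in pandas 1.0).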

      titanic["Embarked"] = titanic["Embarked"].fillna("S")
      titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
      titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
      titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
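
      The same encoding can also be written with a dictionary and Series.map, which is one alternative to repeating .loc for each port. This is a sketch on made-up values, not the tutorial's own code; the name port_codes is hypothetical.

```python
import pandas as pd

# Illustrative values only; None marks a missing port.
embarked = pd.Series(["S", "C", None, "Q", "S"])

# Same codes as in the notes above (hypothetical variable name).
port_codes = {"S": 0, "C": 1, "Q": 2}

# Fill missing values with "S" first, then translate letters to codes.
encoded = embarked.fillna("S").map(port_codes)
print(encoded.tolist())
```

      Any letter missing from the dictionary would come out as NaN, so the fillna step has to happen before the mapping.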
    • 7: On to machine learning!

      • titanic.shape is the (rows, columns) tuple for the titanic data set, so titanic.shape[0] is the number of rows
    • 8: Cross validation. Use KFold() to split the rows into three parts, train on 2/3 and predict on the other 1/3, and rotate so that every row gets a prediction:

      1. Combine the first two parts, train a model, make predictions on the third.
      2. Combine the first and third parts, train a model, make predictions on the second.
      3. Combine the second and third parts, train a model, make predictions on the first.
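
      The rotation in the three steps above can be sketched in plain Python. The helper kfold_indices is hypothetical and only mimics contiguous, unshuffled folds; it is not scikit-learn's KFold, and it visits the held-out parts in first-to-last order rather than the order listed above.

```python
def kfold_indices(n_rows, n_folds=3):
    """Yield (train_indices, test_indices) pairs for contiguous folds."""
    # Earlier folds absorb any remainder when n_rows is not divisible.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        # The held-out part for this round.
        test = list(range(start, start + size))
        # Everything before and after the held-out part is training data.
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


# Nine rows split into three parts of three rows each.
for train, test in kfold_indices(9):
    print("train:", train, "test:", test)
```

      Each row lands in exactly one test part, which is why the procedure ends up with a prediction for every row.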
    • 10: Evaluating error

      • How do you get modules and know if they're there? See the Python documentation on installing modules, and especially pip.
      • Could change the amount of prediction error allowed.

      import numpy as np

      # Count how many predictions match the recorded outcome
      count = 0
      accurate = 0
      for p in predictions:
          if p == titanic["Survived"][count]:
              accurate += 1
          count += 1
      # count now equals the number of predictions
      accuracy = accurate / count
      print(accuracy)
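
      The same accuracy calculation can be written without a loop using NumPy. The sketch below uses made-up numbers, and the 0.5 cutoff for turning raw model outputs into 0/1 calls is an assumption for illustration; changing it is one way to vary how predictions are scored.

```python
import numpy as np

# Illustrative values only: raw model outputs and the true outcomes.
raw_predictions = np.array([0.2, 0.7, 0.55, 0.1, 0.9])
actual = np.array([0, 1, 0, 0, 1])

# Turn raw outputs into 0/1 calls; 0.5 is an assumed cutoff.
threshold = 0.5
predicted = (raw_predictions > threshold).astype(int)

# Accuracy is the fraction of matches, i.e. the mean of a boolean array.
accuracy = np.mean(predicted == actual)
print(accuracy)
```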