diff --git a/tools-appendix/modules/python/pages/filtering-and-selecting.adoc b/tools-appendix/modules/python/pages/filtering-and-selecting.adoc
index a83e23622..39c81252e 100644
--- a/tools-appendix/modules/python/pages/filtering-and-selecting.adoc
+++ b/tools-appendix/modules/python/pages/filtering-and-selecting.adoc
@@ -296,3 +296,282 @@ Age                                 87
 AgeSubstitutionFlag                  0
 Name: 0, dtype: object
 ----
+
+We can also use `iloc[]` to select the first row (index 0) and all columns, using `:` as the column selector:
+[source,python]
+----
+myDF.iloc[0, :]
+----
+
+----
+Id                                   1
+ResidentStatus                       1
+Education1989Revision                0
+Education2003Revision                2
+EducationReportingFlag               1
+MonthOfDeath                         1
+Sex                                  M
+AgeType                              1
+Age                                 87
+AgeSubstitutionFlag                  0
+AgeRecode52                         43
+AgeRecode27                         23
+AgeRecode12                         11
+InfantAgeRecode22                    0
+PlaceOfDeathAndDecedentsStatus       4
+MaritalStatus                        M
+DayOfWeekOfDeath                     4
+CurrentDataYear                   2014
+InjuryAtWork                         U
+MannerOfDeath                        7
+MethodOfDisposition                  C
+Autopsy                              N
+ActivityCode                        99
+PlaceOfInjury                       99
+Icd10Code                          I64
+CauseRecode358                     238
+CauseRecode113                      70
+InfantCauseRecode130                 0
+CauseRecode39                       24
+NumberOfEntityAxisConditions         1
+NumberOfRecordAxisConditions         1
+Race                                 1
+BridgedRaceFlag                      0
+RaceImputationFlag                   0
+RaceRecode3                          1
+RaceRecode5                          1
+HispanicOrigin                     100
+HispanicOriginRaceRecode             6
+Name: 0, dtype: object
+----
+
+Next, we can use `iloc[]` to select multiple rows from a single column. The code below returns the first 7 observations in the dataset from the column at position 10 (the 11th column, `AgeRecode52`).
+
+
+[source,python]
+----
+myDF.iloc[0:7, 10]
+----
+
+
+----
+0    43
+1    37
+2    41
+3    40
+4    38
+5    44
+6    42
+Name: AgeRecode52, dtype: int64
+
+
+----
+
+Now, let's keep only a few columns of our dataset and store the result in a new DataFrame.
+
+[source,python]
+----
+filtered_columns = ['Id', 'ResidentStatus', 'Sex', 'Age', 'Race', 'MaritalStatus']
+filtered_myDF = myDF[filtered_columns].copy()  # copy so later assignments do not trigger a SettingWithCopyWarning
+filtered_myDF
+
+----
+
+
+----
+      Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+0      1               1    M   87     1              M
+1      2               1    M   58     1              D
+2      3               1    F   75     1              W
+3      4               1    M   74     1              D
+4      5               1    M   64     1              D
+...  ...             ...  ...  ...   ...            ...
+----
+
+Finally, let's try selecting multiple rows and multiple columns at the same time. When we pass a list of row positions and a column selector to `iloc[]`, the output is a subset of the DataFrame that contains only the specified rows and columns. In this example, `filtered_myDF.iloc[[0, 7, 9, 10], :]` selects rows 0, 7, 9, and 10 and all columns:
+
+[source,python]
+----
+filtered_myDF.iloc[[0, 7, 9, 10], :]
+----
+
+
+----
+    Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+0    1               1    M   87     1              M
+7    8               1    M   55     2              S
+9   10               1    M   23     1              S
+10  11               1    F   79     1              W
+----
+
+== The loc function
+DataFrames in pandas allow for label-based indexing and integer-based indexing. The `loc` function is for **label-based indexing**.
+
+In our dataset, the row labels are integers, so we can use integers to refer to rows. Let's extract the first observation for the column `Age`:
+
+[source,python]
+----
+filtered_myDF.loc[0, 'Age']
+----
+
+----
+87
+----
+
+Now, let's select all columns except one specific column using the `loc[]` function. Let's exclude the column `Race`:
+
+[source,python]
+----
+filtered_myDF.loc[:, filtered_myDF.columns != 'Race']
+----
+
+----
+      Id  ResidentStatus  Sex  Age  MaritalStatus
+0      1               1    M   87              M
+1      2               1    M   58              D
+2      3               1    F   75              W
+3      4               1    M   74              D
+4      5               1    M   64              D
+...  ...             ...  ...  ...            ...
+----
+
+
+Select the observations labeled 0 through 10 for the column `Age` using `loc[]`. Unlike `iloc[]`, a label-based slice includes its end label, so this returns 11 rows:
+
+
+[source,python]
+----
+filtered_myDF.loc[:10, 'Age']
+----
+
+----
+0     87
+1     58
+2     75
+3     74
+4     64
+5     93
+6     82
+7     55
+8     86
+9     23
+10    79
+Name: Age, dtype: int64
+----
+
+== Filtering our Dataset
+
+Let's filter for one category. Let's try the `==` operator to see which death records are for females.
+
+[source,python]
+----
+filtered_myDF['Sex'] == "F"
+----
+
+----
+0          False
+1          False
+2           True
+3          False
+4          False
+           ...
+2631166    False
+2631167     True
+2631168    False
+2631169    False
+2631170    False
+Name: Sex, Length: 2631171, dtype: bool
+----
+
+When we evaluated the condition `filtered_myDF['Sex'] == "F"`, it produced a Boolean Series in which each value indicates whether the condition was True or False for the corresponding row of the DataFrame. This Boolean Series can be used to filter the DataFrame.
+
+If we want to see only the rows where the `Sex` column is "F" (females), we can use this condition directly to subset the DataFrame as shown below:
+
+[source,python]
+----
+filtered_myDF[filtered_myDF['Sex'] == "F"]
+----
+
+----
+     Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+2     3               1    F   75     1              W
+5     6               1    F   93     1              W
+8     9               1    F   86     1              W
+10   11               1    F   79     1              W
+12   13               1    F   85     1              W
+    ...             ...  ...  ...   ...            ...
+
+1299710 rows × 6 columns
+----
+
+We can also use `.loc[]` to filter for females.
+
+[source,python]
+----
+filtered_myDF.loc[filtered_myDF['Sex'] == "F"]
+----
+
+----
+     Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+2     3               1    F   75     1              W
+5     6               1    F   93     1              W
+8     9               1    F   86     1              W
+10   11               1    F   79     1              W
+12   13               1    F   85     1              W
+    ...             ...  ...  ...   ...            ...
+----
+
+Now let's filter on two conditions at once: females who are 114 years old. Each condition goes in its own parentheses, and `&` combines them element-wise. Surprisingly, according to our dataset, some people do live that long!
+
+[source,python]
+----
+filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
+----
+
+----
+              Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+265482    265483               1    F  114     1              W
+1304830  1304831               1    F  114     1              W
+1372655  1372656               1    F  114     2              W
+1981235  1981236               1    F  114     2              W
+2407245  2407246               1    F  114     4              M
+----
+
+Using `.loc[]` with the same condition gives the same result:
+
+
+[source,python]
+----
+filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
+----
+
+----
+              Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+265482    265483               1    F  114     1              W
+1304830  1304831               1    F  114     1              W
+1372655  1372656               1    F  114     2              W
+1981235  1981236               1    F  114     2              W
+2407245  2407246               1    F  114     4              M
+----
+
+=== Filtering and Modifying the Dataset
+
+Suppose there was a data entry mistake, and every female recorded as 114 years old should actually be 100. We can fix this by passing the filter as the row selector and `'Age'` as the column selector to `.loc[]`, then assigning the new value.
+
+[source,python]
+----
+filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114), 'Age'] = 100
+
+----
+
+
+We can check that it worked by filtering for females who are 114 again. The result should contain 0 observations, because we set those ages to 100.
+
+[source,python]
+----
+filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
+----
+
+----
+Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+----
diff --git a/tools-appendix/modules/python/pages/index.adoc b/tools-appendix/modules/python/pages/index.adoc
index ffc62716a..46d2e7791 100644
--- a/tools-appendix/modules/python/pages/index.adoc
+++ b/tools-appendix/modules/python/pages/index.adoc
@@ -18,7 +18,6 @@ Python is largely known for its readability and versatility. Its design philosop
 * xref:plotly-examples.adoc[Data Visualization with plotly]
 * xref:writing-functions.adoc[Writing Functions in Python]
 * xref:writing-scripts.adoc[Writing Scripts in Python]
-* xref:pandas-series.adoc[Pandas Series]
 * xref:pandas-dates-and-times.adoc[Handling Dates and Times in pandas]
 * xref:pandas-aggregate-functions.adoc[Applying Aggregate Functions in pandas]
 * xref:pandas-reshaping.adoc[Reshaping Data in pandas]