Revert "Small Additional Edits for Formatting"
This reverts commit 997d60b.
arroyo38 committed Jan 2, 2025
1 parent 997d60b commit d4a52af
Showing 2 changed files with 279 additions and 1 deletion.
279 changes: 279 additions & 0 deletions tools-appendix/modules/python/pages/filtering-and-selecting.adoc
@@ -296,3 +296,282 @@ Age 87
AgeSubstitutionFlag 0
Name: 0, dtype: object
----

We can also use `iloc[]` to select the first row (index 0) and all columns by passing `:` as the column selector:
[source,python]
----
myDF.iloc[0, :]
----

----
Id 1
ResidentStatus 1
Education1989Revision 0
Education2003Revision 2
EducationReportingFlag 1
MonthOfDeath 1
Sex M
AgeType 1
Age 87
AgeSubstitutionFlag 0
AgeRecode52 43
AgeRecode27 23
AgeRecode12 11
InfantAgeRecode22 0
PlaceOfDeathAndDecedentsStatus 4
MaritalStatus M
DayOfWeekOfDeath 4
CurrentDataYear 2014
InjuryAtWork U
MannerOfDeath 7
MethodOfDisposition C
Autopsy N
ActivityCode 99
PlaceOfInjury 99
Icd10Code I64
CauseRecode358 238
CauseRecode113 70
InfantCauseRecode130 0
CauseRecode39 24
NumberOfEntityAxisConditions 1
NumberOfRecordAxisConditions 1
Race 1
BridgedRaceFlag 0
RaceImputationFlag 0
RaceRecode3 1
RaceRecode5 1
HispanicOrigin 100
HispanicOriginRaceRecode 6
Name: 0, dtype: object
----

Next, we can use `iloc[]` to select multiple rows from a single column. The code below returns the first 7 observations in the dataset from the 11th column (position 10, `AgeRecode52`).


[source,python]
----
myDF.iloc[0:7, 10]
----


----
0 43
1 37
2 41
3 40
4 38
5 44
6 42
Name: AgeRecode52, dtype: int64
----
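
Counting column positions by hand is easy to get wrong. The following is a minimal sketch, assuming a small made-up DataFrame named `toy` (not the dataset used on this page), of looking up a column's position by name with `columns.get_loc()` and then passing that position to `iloc[]`:

```python
import pandas as pd

# Hypothetical toy data: the column names mimic the dataset, the values are illustrative.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Sex": ["M", "M", "F", "M"],
    "Age": [87, 58, 75, 74],
})

# Look up the integer position of a column from its name.
age_pos = toy.columns.get_loc("Age")

# Select the first three rows of that column by position.
first_ages = toy.iloc[0:3, age_pos]

print(age_pos)
print(first_ages)
```

Here `age_pos` is a plain integer, so it can be used anywhere `iloc[]` expects a column position.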

Now, let's keep only a few columns of our dataset and store the result in a new DataFrame, `filtered_myDF`.

[source,python]
----
filtered_columns = ['Id', 'ResidentStatus', 'Sex', 'Age', 'Race', 'MaritalStatus']
filtered_myDF = myDF[filtered_columns]
filtered_myDF
----


----
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
1 2 1 M 58 1 D
2 3 1 F 75 1 W
3 4 1 M 74 1 D
4 5 1 M 64 1 D
... ... ... ... ... ... ...
----

Finally, let's try selecting multiple rows and multiple columns at the same time. When both rows and columns are given to `iloc[]`, the output is the subset of the DataFrame at the specified row and column positions. In this example, `filtered_myDF.iloc[[0, 7, 9, 10], :]` selects the rows at positions 0, 7, 9, and 10 and, because of the `:`, all columns:

[source,python]
----
filtered_myDF.iloc[[0, 7, 9, 10], :]
----


----
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
7 8 1 M 55 2 S
9 10 1 M 23 1 S
10 11 1 F 79 1 W
----
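
The column selector can also be a list of positions instead of `:`. As a minimal sketch, assuming a made-up DataFrame `toy` with illustrative values, the code below picks two rows and two columns by position:

```python
import pandas as pd

# Hypothetical toy data, not the dataset used on this page.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Sex": ["M", "M", "F", "M", "F"],
    "Age": [87, 58, 75, 74, 64],
})

# Rows at positions 0 and 3; columns at positions 0 and 2 (Id and Age).
subset = toy.iloc[[0, 3], [0, 2]]

print(subset)
```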

== The loc function
DataFrames in pandas allow for label-based indexing and integer-based indexing. `loc[]` is for **label-based indexing**, while `iloc[]` is for integer position-based indexing.

In our dataset, the row labels are integers, so we can use integers to refer to rows with `loc[]`. Let's extract the first observation for the column `Age`:

[source,python]
----
filtered_myDF.loc[0, 'Age']
----

----
87
----

Now, let's select all columns except one specific column using the `loc[]` function. Let's exclude the column `Race`:

[source,python]
----
filtered_myDF.loc[:, filtered_myDF.columns != 'Race']
----

----
Id ResidentStatus Sex Age MaritalStatus
0 1 1 M 87 M
1 2 1 M 58 D
2 3 1 F 75 W
3 4 1 M 74 D
4 5 1 M 64 D
... ... ... ... ... ...
----


Select the first 11 observations (row labels 0 through 10) for the column `Age` using `loc[]`. Unlike an `iloc[]` slice, a `loc[]` slice includes its end label:


[source,python]
----
filtered_myDF.loc[:10, 'Age']
----

----
0 87
1 58
2 75
3 74
4 64
5 93
6 82
7 55
8 86
9 23
10 79
Name: Age, dtype: int64
----
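
The difference between label slices and position slices is easy to miss. A minimal sketch, assuming a made-up single-column DataFrame `toy`, comparing the same slice bound in `loc[]` and `iloc[]`:

```python
import pandas as pd

# Hypothetical toy data with the default integer row labels 0..4.
toy = pd.DataFrame({"Age": [87, 58, 75, 74, 64]})

# Label-based slice: the end label 2 is included (labels 0, 1, 2).
by_label = toy.loc[:2, "Age"]

# Position-based slice: the end position 2 is excluded (positions 0, 1).
by_position = toy.iloc[:2, 0]

print(len(by_label), len(by_position))
```

The two slices only look alike because the row labels happen to be the integers 0, 1, 2, and so on. With other labels, such as strings or dates, only `loc[]` would accept them.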

== Filtering our Dataset

Let's filter for one category. Let's use the `==` operator to see which death records are for females.

[source,python]
----
filtered_myDF['Sex'] == "F"
----

----
0 False
1 False
2 True
3 False
4 False
...
2631166 False
2631167 True
2631168 False
2631169 False
2631170 False
Name: Sex, Length: 2631171, dtype: bool
----

When we evaluated the condition `filtered_myDF['Sex'] == "F"`, it produced a Boolean Series in which each value records whether the condition was True or False for the corresponding row of the DataFrame. This Boolean Series can be used to filter the DataFrame.

If we want to see only the rows where the Sex column is "F" (females), we can use this condition directly to subset the DataFrame as shown below:

[source,python]
----
filtered_myDF[filtered_myDF['Sex'] == "F"]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ...
1299710 rows × 6 columns
----
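
Because `True` counts as 1 and `False` as 0, a Boolean Series can also be summed to count the matching rows before filtering. A minimal sketch, assuming a made-up DataFrame `toy` and a hypothetical variable name `is_female`:

```python
import pandas as pd

# Hypothetical toy data, not the dataset used on this page.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F"],
    "Age": [87, 58, 75, 74, 64],
})

# Boolean Series: True where the row is for a female.
is_female = toy["Sex"] == "F"

# Summing the mask counts the True values.
n_female = int(is_female.sum())

# Using the mask as a filter keeps only the True rows.
females = toy[is_female]

print(n_female)
print(females)
```

Storing the mask in a variable also avoids retyping the condition when it is reused.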

We can also use `.loc` for filtering for females.

[source,python]
----
filtered_myDF.loc[filtered_myDF['Sex'] == "F"]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ...
----

Now let's filter on two conditions at once. Let's filter for females who are 114 years old. Each condition goes in its own parentheses, and the two are combined with `&`. Surprisingly, some people do live that long based on our dataset!

[source,python]
----
filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
----

Another method that would get us the same results:


[source,python]
----
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
----
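
Conditions can be combined in other ways too: `&` means "and", `|` means "or", and `~` negates a condition. The parentheses matter because these operators bind more tightly than comparisons such as `==`. A minimal sketch, assuming a made-up DataFrame `toy` with illustrative values:

```python
import pandas as pd

# Hypothetical toy data, not the dataset used on this page.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F"],
    "Age": [87, 58, 75, 74, 64],
})

# Both conditions must hold: female AND at least 75 years old.
both = toy[(toy["Sex"] == "F") & (toy["Age"] >= 75)]

# At least one condition must hold: female OR at least 75 years old.
either = toy[(toy["Sex"] == "F") | (toy["Age"] >= 75)]

# Negation: every row that is NOT female.
not_female = toy[~(toy["Sex"] == "F")]

print(list(both.index), list(either.index), list(not_female.index))
```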

=== Filtering and Modifying the Dataset

Let's say there was a data entry mistake, and all females who are 114 should actually be 100 years old! Let's fix this data entry error.

[source,python]
----
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114), 'Age'] = 100
----


We can check whether it worked by filtering for females who are 114 again. The result should have 0 observations because we set those ages to 100.

[source,python]
----
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
----
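
Depending on the pandas version, assigning into a DataFrame that was itself selected from another DataFrame may raise a `SettingWithCopyWarning`. One common way to avoid this is to take an explicit copy with `.copy()` before modifying it. A minimal sketch, assuming a made-up DataFrame `toy` and hypothetical names `subset` and `mask`:

```python
import pandas as pd

# Hypothetical toy data, not the dataset used on this page.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Sex": ["F", "M", "F"],
    "Age": [114, 114, 75],
})

# Take an explicit copy so later assignments do not touch `toy`.
subset = toy[["Sex", "Age"]].copy()

# Rows to fix: females recorded as 114 years old.
mask = (subset["Sex"] == "F") & (subset["Age"] == 114)
n_before = int(mask.sum())

# Overwrite Age only in the masked rows.
subset.loc[mask, "Age"] = 100

# Re-evaluate the condition to confirm that no rows match any more.
n_after = int(((subset["Sex"] == "F") & (subset["Age"] == 114)).sum())

print(n_before, n_after)
```

Counting the matches before and after the assignment gives a quick check that only the intended rows changed.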
1 change: 0 additions & 1 deletion tools-appendix/modules/python/pages/index.adoc
@@ -18,7 +18,6 @@ Python is largely known for its readability and versatility. Its design philosop
* xref:plotly-examples.adoc[Data Visualization with plotly]
* xref:writing-functions.adoc[Writing Functions in Python]
* xref:writing-scripts.adoc[Writing Scripts in Python]
* xref:pandas-series.adoc[Pandas Series]
* xref:pandas-dates-and-times.adoc[Handling Dates and Times in pandas]
* xref:pandas-aggregate-functions.adoc[Applying Aggregate Functions in pandas]
* xref:pandas-reshaping.adoc[Reshaping Data in pandas]