Additional changes to new tool document
arroyo38 committed Dec 19, 2024
1 parent 87869d3 commit 2b0afcb
Showing 3 changed files with 109 additions and 83 deletions.
@@ -1,6 +1,6 @@
= Basics of Programming

In this lesson, we will learn how to:
In this document, we will learn how to:

* Understand the basics of Python programming, including variables and data types.
178 changes: 96 additions & 82 deletions tools-appendix/modules/python/pages/filtering-and-selecting.adoc
@@ -1,5 +1,12 @@
= Filtering and Selecting

In this document we will show how to:

* Select and filter rows and columns using `[]`, `loc`, and `iloc`.
* Use `isin` for filtering specific values.
* Modify DataFrame values based on conditions.
* Summarize datasets with `describe()`.

== Basics and Simple Accessing Methods
You will rarely need an entire dataset for an analysis, so being able to efficiently pick subsets of data is an important skill. `pandas` provides indexing tools (`[]`, `loc`, and `iloc`) that select data by label, by position, or by logical condition when working with a `DataFrame` or `Series`.

@@ -358,25 +365,43 @@ myDF.iloc[0:7, 10]
Name: AgeRecode52, dtype: int64
----

Now, let's keep only a few columns of our dataset and store the result in a new DataFrame named `filtered_myDF`.

[source,python]
----
filtered_columns = ['Id', 'ResidentStatus', 'Sex', 'Age', 'Race', 'MaritalStatus']
filtered_myDF = myDF[filtered_columns]
filtered_myDF
----


----
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
1 2 1 M 58 1 D
2 3 1 F 75 1 W
3 4 1 M 74 1 D
4 5 1 M 64 1 D
... ... ... ... ... ... ...
----

Finally, let's try selecting multiple rows and multiple columns at the same time. With `iloc`, a list of row positions selects those rows, and `:` in the column position keeps every column, so the output is a subset of the DataFrame. In this example, `iloc[[0, 7, 9, 10], :]` selects rows 0, 7, 9, and 10 and all columns:

[source,python]
----
myDF.iloc[[0, 7, 9, 10], :]
filtered_myDF.iloc[[0, 7, 9, 10], :]
----


----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath
0 1 1 0 2 1 1 ...
7 8 1 0 4 1 1 ...
9 10 1 0 3 1 1 ...
10 11 1 0 3 1 1 ...
4 rows × 38 columns
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
7 8 1 M 55 2 S
9 10 1 M 23 1 S
10 11 1 F 79 1 W
----
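
As a minimal, self-contained sketch of the two steps above (keeping a list of columns, then picking rows by position), the example below uses a hypothetical `toy_df` built from the first five rows printed earlier, retyped by hand. The names `toy_df`, `toy_subset`, `picked_rows`, and `picked_cells` are illustrative and are not part of the original document.

```python
import pandas as pd

# Hypothetical stand-in for myDF: the first five rows shown above, retyped by hand.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "ResidentStatus": [1, 1, 1, 1, 1],
    "Sex": ["M", "M", "F", "M", "M"],
    "Age": [87, 58, 75, 74, 64],
    "Race": [1, 1, 1, 1, 1],
    "MaritalStatus": ["M", "D", "W", "D", "D"],
})

# A list of column names inside [] keeps only those columns, in that order.
toy_subset = toy_df[["Id", "Sex", "Age"]]

# iloc is purely positional: a list of row positions, and ":" for every column.
picked_rows = toy_subset.iloc[[0, 2, 4], :]

# Positions work for columns too: rows 0 and 2, columns 0 and 2 (Id and Age).
picked_cells = toy_subset.iloc[[0, 2], [0, 2]]

print(picked_rows)
print(picked_cells)
```

Replacing `:` with a list of column positions, as in the last selection, is how `iloc` narrows rows and columns in the same call.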

== The loc function
@@ -386,54 +411,52 @@ In our dataset, our rows are integers, so we can use integers as our row labels.

[source,python]
----
myDF.loc[0, 'Age']
filtered_myDF.loc[0, 'Age']
----

----
87
----

Now, let's select all columns except one specific column using the `loc[]` function. Let's exclude the column `Education1989Revision`:
Now, let's select all columns except one specific column using the `loc[]` function. Let's exclude the column `Race`:

[source,python]
----
myDF.loc[:, myDF.columns != 'Education1989Revision']
filtered_myDF.loc[:, filtered_myDF.columns != 'Race']
----

----
Id ResidentStatus Education2003Revision EducationReportingFlag MonthOfDeath Sex ...
0 1 1 2 1 1 M ...
1 2 1 2 1 1 M ...
2 3 1 7 1 1 F ...
3 4 1 6 1 1 M ...
4 5 1 3 1 1 M ...
... ... ... ...
2631171 rows × 37 columns
Id ResidentStatus Sex Age MaritalStatus
0 1 1 M 87 M
1 2 1 M 58 D
2 3 1 F 75 W
3 4 1 M 74 D
4 5 1 M 64 D
... ... ... ... ... ...
----
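
To see why the column exclusion above works, it can help to look at the comparison on its own. The sketch below is a small illustration on a hypothetical `toy_df` (values retyped from the first five rows shown earlier); the comparison with `drop(columns=...)` is an alternative added here for illustration and does not appear in the original document.

```python
import pandas as pd

# Hypothetical stand-in: a few columns from the first five rows shown above.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Sex": ["M", "M", "F", "M", "M"],
    "Age": [87, 58, 75, 74, 64],
    "Race": [1, 1, 1, 1, 1],
})

# Comparing the column index to a name yields one True/False per column.
keep_mask = toy_df.columns != "Race"
print(keep_mask)

# loc accepts that Boolean array in the column position.
without_race = toy_df.loc[:, keep_mask]

# drop(columns=...) is another way to express the same selection.
dropped = toy_df.drop(columns="Race")

print(without_race.equals(dropped))
```

Neither form changes `toy_df` itself; both return a new DataFrame without the excluded column.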


Select MonthOfDeath for the first 10 observations using `loc[]`:
Select the observations labeled 0 through 10 for the column `Age` using `loc[]`. Because `loc` slices by label and includes the end label, `:10` returns 11 rows:


[source,python]
----
myDF.loc[:10, 'MonthOfDeath']
filtered_myDF.loc[:10, 'Age']
----

----
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
Name: MonthOfDeath, dtype: int64
0 87
1 58
2 75
3 74
4 64
5 93
6 82
7 55
8 86
9 23
10 79
Name: Age, dtype: int64
----
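
The output above has 11 entries because `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. The sketch below illustrates the difference on a hypothetical `toy_df` holding only the `Age` values printed above; the variable names are illustrative.

```python
import pandas as pd

# Age values for rows 0 through 10, retyped from the output above.
toy_df = pd.DataFrame({"Age": [87, 58, 75, 74, 64, 93, 82, 55, 86, 23, 79]})

# Label-based slice: the end label 10 is included.
by_label = toy_df.loc[:10, "Age"]

# Position-based slice: the end position 10 is excluded.
by_position = toy_df.iloc[:10, 0]

print(len(by_label))     # 11
print(len(by_position))  # 10
```

With the default integer index, labels and positions happen to coincide, which is why this difference is easy to overlook.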

== Filtering our Dataset
@@ -442,7 +465,7 @@ Let's filter for one category. Let's try the `==` to see which death records are

[source,python]
----
myDF['Sex'] == "F"
filtered_myDF['Sex'] == "F"
----

----
@@ -458,8 +481,6 @@ myDF['Sex'] == "F"
2631169 False
2631170 False
Name: Sex, Length: 2631171, dtype: bool
----

When we evaluated the condition `myDF['Sex'] == "F"`, it produced a Boolean series where each value corresponds to whether the condition was True or False for each row in the DataFrame. This Boolean series can be used to filter the DataFrame.
@@ -468,96 +489,89 @@ If we want to see only the rows where the Sex column is "F" (females), we can us

[source,python]
----
myDF[myDF['Sex'] == "F"]
filtered_myDF[filtered_myDF['Sex'] == "F"]
----

----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath Sex
2 3 1 0 7 1 1 F ...
5 6 1 0 5 1 1 F ...
8 9 1 0 3 1 1 F ...
10 11 1 0 3 1 1 F ...
12 13 1 0 4 1 1 F ...
... ... ... ... ...
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ...
1299710 rows × 6 columns
----

We can also use `.loc` for filtering for females.

[source,python]
----
myDF.loc[myDF['Sex'] == "F"]
filtered_myDF.loc[filtered_myDF['Sex'] == "F"]
----

----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath Sex
2 3 1 0 7 1 1 F ...
5 6 1 0 5 1 1 F ...
8 9 1 0 3 1 1 F ...
10 11 1 0 3 1 1 F ...
12 13 1 0 4 1 1 F ...
... ... ... ... ...
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ...
----
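
The two filters above can be compared directly. This is a minimal sketch on a hypothetical `toy_df` (values retyped from the first five rows shown earlier) that stores the Boolean series in a variable, here called `is_female`, and passes it to both `[]` and `.loc`.

```python
import pandas as pd

# Hypothetical stand-in: a few columns from the first five rows shown above.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Sex": ["M", "M", "F", "M", "M"],
    "Age": [87, 58, 75, 74, 64],
})

# One True/False value per row.
is_female = toy_df["Sex"] == "F"

# Both forms keep only the rows where the mask is True.
females_brackets = toy_df[is_female]
females_loc = toy_df.loc[is_female]

# Summing a Boolean series counts the True values.
print(int(is_female.sum()))
print(females_brackets.equals(females_loc))
```

Keeping the mask in a named variable makes it easy to reuse the same condition for counting, filtering, and assignment.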

Now let's filter for two things. Let's filter for females who are 114 years old. Surprisingly, some people do live that long based on our dataset!

[source,python]
----
myDF[(myDF['Sex'] == "F") & (myDF['Age'] == 114)]
filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath Sex Age
265482 265483 1 0 1 1 7 F 114 ...
1304830 1304831 1 0 2 1 12 F 114 ...
1372655 1372656 1 0 1 1 3 F 114 ...
1981235 1981236 1 0 1 1 7 F 114 ...
2407245 2407246 1 0 0 0 5 F 114 ...
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
----

Another method that would get us the same results:


[source,python]
----
myDF.loc[(myDF['Sex'] == "F") & (myDF['Age'] == 114)]
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath Sex Age
265482 265483 1 0 1 1 7 F 114 ...
1304830 1304831 1 0 2 1 12 F 114 ...
1372655 1372656 1 0 1 1 3 F 114 ...
1981235 1981236 1 0 1 1 7 F 114 ...
2407245 2407246 1 0 0 0 5 F 114 ...
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
----
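
Each condition above sits in its own parentheses because `&` binds more tightly than `==` in Python. The sketch below shows this and a few related operators on a hypothetical `toy_df` whose values are made up for illustration (they are not rows from the dataset). The `|`, `~`, and `isin` lines are additions that go slightly beyond the example above.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame({
    "Sex": ["F", "M", "F", "F", "M"],
    "Age": [114, 114, 75, 114, 64],
    "MaritalStatus": ["W", "M", "W", "D", "S"],
})

# AND: both conditions must hold; note the parentheses around each one.
both = toy_df[(toy_df["Sex"] == "F") & (toy_df["Age"] == 114)]

# OR: at least one condition must hold.
either = toy_df[(toy_df["Sex"] == "F") | (toy_df["Age"] == 114)]

# NOT: invert a condition with ~.
not_female = toy_df[~(toy_df["Sex"] == "F")]

# isin: keep rows whose value appears in a list of allowed values.
widowed_or_divorced = toy_df[toy_df["MaritalStatus"].isin(["W", "D"])]

print(both)
print(widowed_or_divorced)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why the symbol operators are used here.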

=== Filtering and Modiying the Dataset
=== Filtering and Modifying the Dataset

Let's say there was a data entry mistake, and all females who are 114 should actually be 100 years old! Let's fix this data entry error.

[source,python]
----
myDF.loc[(myDF['Sex'] == "F") & (myDF['Age'] == 114), 'Age'] = 100
----
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114), 'Age'] = 100
----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath Sex Age
265482 265483 1 0 1 1 7 F 114 ...
1304830 1304831 1 0 2 1 12 F 114 ...
1372655 1372656 1 0 1 1 3 F 114 ...
1981235 1981236 1 0 1 1 7 F 114 ...
2407245 2407246 1 0 0 0 5 F 114 ...
----

We can check whether it worked by trying to filter for females who are 114 again. The results should be 0 observations.

We can check whether it worked by trying to filter for females who are 114 again. The results should be 0 observations because we set them equal to 100.

[source,python]
----
myDF.loc[(myDF['Sex'] == "F") & (myDF['Age'] == 114)]
print(myDF)
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Education1989Revision Education2003Revision EducationReportingFlag MonthOfDeath Sex Age
0 rows × 38 columns
Id ResidentStatus Sex Age Race MaritalStatus
----
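
The fix-then-verify pattern above can be sketched end to end on a small example. The `toy_df` values below are made up for illustration. Taking an explicit `.copy()` of the column subset is an addition that is not in the original; in some pandas versions it helps avoid a `SettingWithCopyWarning` when assigning into a subset, and it keeps the source DataFrame unchanged.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame({
    "Sex": ["F", "M", "F", "F", "M"],
    "Age": [114, 114, 75, 114, 64],
})

# Work on an independent copy so the assignment below does not touch toy_df.
toy_subset = toy_df[["Sex", "Age"]].copy()

# Build the condition once and count the matches before changing anything.
mask = (toy_subset["Sex"] == "F") & (toy_subset["Age"] == 114)
n_to_fix = int(mask.sum())

# loc with a row mask and a column label assigns only to the matching cells.
toy_subset.loc[mask, "Age"] = 100

# Re-run the filter: no rows should remain.
remaining = toy_subset.loc[(toy_subset["Sex"] == "F") & (toy_subset["Age"] == 114)]

print(n_to_fix)
print(len(remaining))
```

Counting the matches first gives a quick sanity check that the assignment touched as many rows as expected.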
12 changes: 12 additions & 0 deletions tools-appendix/modules/python/pages/intital-eda.adoc
@@ -4,6 +4,18 @@ Exploratory Data Analysis or EDA is one of the most important steps when underst

The most common operations start with reading data into a DataFrame, accessing the DataFrame's attributes, and using the DataFrame's methods to perform operations on the underlying data or with other DataFrames.



In this document we will:

* Show how to load and inspect data using pandas (`read_csv`, `head`, `len`, `shape`).
* Explore data attributes (`columns`, `unique`, `value_counts`, `isin`).
* Perform data transformations (`rename`, `iloc`).
* Summarize datasets with `describe()`.

== Basic Functions in EDA

Here we list some functions commonly used for EDA with DataFrames. You can explore how they are used in the code examples below.
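
As a quick, hedged preview of several of these functions, the sketch below runs them on a hypothetical in-memory `toy_df` rather than a file loaded with `read_csv`. The values are a handful of rows typed in by hand for illustration, and the variable names are not part of the original document.

```python
import pandas as pd

# Hypothetical toy data typed in by hand; a real analysis would use read_csv.
toy_df = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "M"],
    "Age": [87, 58, 75, 74, 64],
    "MaritalStatus": ["M", "D", "W", "D", "D"],
})

first_rows = toy_df.head(3)                # first three rows
n_rows = len(toy_df)                       # number of rows
n_rows_cols = toy_df.shape                 # (rows, columns)
column_names = list(toy_df.columns)        # column labels
sex_values = toy_df["Sex"].unique()        # distinct values in one column
sex_counts = toy_df["Sex"].value_counts()  # frequency of each value
summary = toy_df.describe()                # summary statistics for numeric columns

print(n_rows_cols)
print(sex_counts)
print(summary)
```

By default `describe()` summarizes only the numeric columns, so here it reports on `Age` alone.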