Additional changes to new tool document

TheDataMine · Dec 19, 2024 · 2b0afcb
1 parent 87869d3
commit 2b0afcb
Showing 3 changed files with 109 additions and 83 deletions.
diff --git a/tools-appendix/modules/python/pages/basics-programming.adoc b/tools-appendix/modules/python/pages/basics-programming.adoc
@@ -1,6 +1,6 @@
 = Basics of Programming 
 
-In this lesson, we will learn how to: 
+In this document, we will learn how to: 
 
 * Understand the basics of Python programming, including variables and data types.
 

diff --git a/tools-appendix/modules/python/pages/filtering-and-selecting.adoc b/tools-appendix/modules/python/pages/filtering-and-selecting.adoc
@@ -1,5 +1,12 @@
 = Filtering and Selecting
 
+In this document we will show how to:
+
+* Select and filter rows and columns using `[]`, `loc`, and `iloc`.
+* Use `isin` for filtering specific values.
+* Modify DataFrame values based on conditions.
+* Summarize datasets with `describe()`.
+
 == Basics and Simple Accessing Methods
 You will rarely need an entire dataset for an analysis, so being able to efficiently pick subsets of data is an important skill. `pandas` provides indexing tools, such as `[]`, `loc`, and `iloc`, that select data by label, by position, or by a logical condition when working with a `DataFrame` or `Series`.
 
@@ -358,25 +365,43 @@ myDF.iloc[0:7, 10]
 Name: AgeRecode52, dtype: int64
 
 
+----
+
+Now, let's keep only a few columns of our dataset and store the result in a new DataFrame named `filtered_myDF`.
+
+[source,python]
+----
+filtered_columns = ['Id', 'ResidentStatus', 'Sex', 'Age', 'Race', 'MaritalStatus']
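+# .copy() makes filtered_myDF independent of myDF, so later assignments
+# with .loc do not risk triggering pandas' SettingWithCopyWarning.
+filtered_myDF = myDF[filtered_columns].copy()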
+filtered_myDF
+
 ----
 
 
+----
+      Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+0      1               1    M   87     1              M
+1      2               1    M   58     1              D
+2      3               1    F   75     1              W
+3      4               1    M   74     1              D
+4      5               1    M   64     1              D
+...  ...             ...  ...  ...   ...            ...
+----
+
 Finally, let's try selecting multiple rows and multiple columns at the same time. When you pass a list of row positions to `iloc` and use `:` for the columns, the output is a subset of the DataFrame that contains the specified rows and all of the columns. In this example, `filtered_myDF.iloc[[0, 7, 9, 10], :]` selects rows 0, 7, 9, and 10 and all columns:
 
 [source,python]
 ----
-myDF.iloc[[0, 7, 9, 10], :]
+filtered_myDF.iloc[[0, 7, 9, 10], :]
 ----
 
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 
-0 	1 	1 	0 	2 	1 	1  	... 	
-7 	8 	1 	0 	4 	1 	1  	... 	
-9 	10 	1 	0 	3 	1 	1 	... 	
-10 	11 	1 	0 	3 	1 	1  	...
-
-4 rows × 38 columns
+    Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+0    1               1    M   87     1              M
+7    8               1    M   55     2              S
+9   10               1    M   23     1              S
+10  11               1    F   79     1              W
 ----
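
As a minimal, self-contained sketch of positional selection, the snippet below builds a small toy DataFrame (the name `toy` and its values are invented for illustration and are not taken from the dataset) and uses `iloc` to pick rows, and then rows and columns, by position.

```python
import pandas as pd

# Toy data invented for illustration; not the real dataset.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Sex": ["M", "M", "F", "M", "F"],
    "Age": [87, 58, 75, 74, 64],
})

# Rows at positions 0, 2, and 4, with every column (":" means "all").
rows_only = toy.iloc[[0, 2, 4], :]

# The same rows, restricted to the columns at positions 0 and 2 (Id and Age).
rows_and_cols = toy.iloc[[0, 2, 4], [0, 2]]

print(rows_only)
print(rows_and_cols)
```

Note that `iloc` only accepts integer positions; to select by column name, use `loc` or `[]` instead.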
 
 == The loc function 
@@ -386,54 +411,52 @@ In our dataset, our rows are integers, so we can use integers as our row labels.
 
 [source,python]
 ----
-myDF.loc[0, 'Age']
+filtered_myDF.loc[0, 'Age']
 ----
 
 ----
 87
 ----
 
-Now, let's select all columns except one specific column using the `loc[]` function. Let's exclude the column `Education1989Revision`:
+Now, let's select all columns except one specific column using the `loc[]` function. Let's exclude the column `Race`:
 
 [source,python]
 ----
-myDF.loc[:, myDF.columns != 'Education1989Revision']
+filtered_myDF.loc[:, filtered_myDF.columns != 'Race']
 ----
 
 ----
- 	Id 	ResidentStatus 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	... 
-0 	1 	1 	2 	1 	1 	M  	... 	
-1 	2 	1 	2 	1 	1 	M  	... 	
-2 	3 	1 	7 	1 	1 	F  	... 	
-3 	4 	1 	6 	1 	1 	M  	... 
-4 	5 	1 	3 	1 	1 	M  	... 
-    ... 	... 	... 	... 	
-
-2631171 rows × 37 columns
+      Id  ResidentStatus  Sex  Age  MaritalStatus
+0      1               1    M   87              M
+1      2               1    M   58              D
+2      3               1    F   75              W
+3      4               1    M   74              D
+4      5               1    M   64              D
+...  ...             ...  ...  ...            ...
 ----
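
To see why the comparison on `columns` works, here is a small sketch on a toy DataFrame (hypothetical names and values, not the dataset). The comparison produces one True/False value per column, and `loc` keeps only the columns marked True. The `drop` call is shown as an alternative that should give the same result; it is not the approach used in this document.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Sex": ["M", "M", "F"],
    "Race": [1, 1, 2],
})

# Boolean mask over the column labels: True for every column except "Race".
keep = toy.columns != "Race"
without_race = toy.loc[:, keep]

# An alternative: drop() returns a new DataFrame without the named column.
dropped = toy.drop(columns="Race")

print(list(without_race.columns))
print(without_race.equals(dropped))
```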
 
 
-Select MonthOfDeath for the first 10 observations using `loc[]`:
+Select the `Age` column for the rows labeled 0 through 10 using `loc[]`. Because `loc[]` slices by label and includes the end label, `:10` returns 11 observations:
 
 
 [source,python]
 ----
-myDF.loc[:10, 'MonthOfDeath']
+filtered_myDF.loc[:10, 'Age']
 ----
 
 ----
-0     1
-1     1
-2     1
-3     1
-4     1
-5     1
-6     1
-7     1
-8     1
-9     1
-10    1
-Name: MonthOfDeath, dtype: int64
+0     87
+1     58
+2     75
+3     74
+4     64
+5     93
+6     82
+7     55
+8     86
+9     23
+10    79
+Name: Age, dtype: int64
 ----
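
The output above has 11 rows because `loc` slices are label-based and include the end label, while `iloc` slices are position-based and exclude the end position. A short sketch on toy data (the name `ages` and its values are illustrative assumptions) makes the difference visible:

```python
import pandas as pd

# Toy data with the default integer index 0..5.
ages = pd.DataFrame({"Age": [87, 58, 75, 74, 64, 93]})

by_label = ages.loc[:3, "Age"]    # labels 0, 1, 2, 3 -> four values
by_position = ages.iloc[:3, 0]    # positions 0, 1, 2 -> three values

print(len(by_label), len(by_position))
```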
 
 == Filtering our Dataset 
@@ -442,7 +465,7 @@ Let's filter for one category. Let's try the `==` to see which death records are
 
 [source,python]
 ----
-myDF['Sex'] == "F"
+filtered_myDF['Sex'] == "F"
 ----
 
 ----
@@ -458,8 +481,6 @@ myDF['Sex'] == "F"
 2631169    False
 2631170    False
 Name: Sex, Length: 2631171, dtype: bool
-
-
 ----
 
 When we evaluated the condition `filtered_myDF['Sex'] == "F"`, it produced a Boolean Series in which each value records whether the condition was True or False for the corresponding row of the DataFrame. This Boolean Series can be used to filter the DataFrame.
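
As a minimal sketch of this idea on toy data (the names `toy`, `mask`, and `females` are hypothetical), we can store the Boolean Series in a variable, count how many rows match, and then use it to filter:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({"Sex": ["M", "F", "F", "M"], "Age": [87, 75, 93, 64]})

mask = toy["Sex"] == "F"   # Boolean Series: False, True, True, False
females = toy[mask]        # keeps only the rows where mask is True

print(mask.sum())          # True counts as 1, so this counts matching rows
print(females)
```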
@@ -468,96 +489,89 @@ If we want to see only the rows where the Sex column is "F" (females), we can us
 
 [source,python]
 ----
-myDF[myDF['Sex'] == "F"]
+filtered_myDF[filtered_myDF['Sex'] == "F"]
 ----
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	
-2 	3 	1 	0 	7 	1 	1 	F  	... 	
-5 	6 	1 	0 	5 	1 	1 	F 	... 	
-8 	9 	1 	0 	3 	1 	1 	F  	... 	
-10 	11 	1 	0 	3 	1 	1 	F  	... 	
-12 	13 	1 	0 	4 	1 	1 	F  	... 	
-... 	... 	... 	... 	... 	
+      Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+2      3               1    F   75     1              W
+5      6               1    F   93     1              W
+8      9               1    F   86     1              W
+10    11               1    F   79     1              W
+12    13               1    F   85     1              W
+...  ...             ...  ...  ...   ...            ...
+
+1299710 rows × 6 columns
 ----
 
 We can also use `.loc` for filtering for females. 
 
 [source,python]
 ----
-myDF.loc[myDF['Sex'] == "F"]
+filtered_myDF.loc[filtered_myDF['Sex'] == "F"]
 ----
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	
-2 	3 	1 	0 	7 	1 	1 	F  	... 	
-5 	6 	1 	0 	5 	1 	1 	F 	... 	
-8 	9 	1 	0 	3 	1 	1 	F  	... 	
-10 	11 	1 	0 	3 	1 	1 	F  	... 	
-12 	13 	1 	0 	4 	1 	1 	F  	... 	
-... 	... 	... 	... 	... 	
+      Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+2      3               1    F   75     1              W
+5      6               1    F   93     1              W
+8      9               1    F   86     1              W
+10    11               1    F   79     1              W
+12    13               1    F   85     1              W
+...  ...             ...  ...  ...   ...            ...
 ----
 
 Now let's filter on two conditions. Let's filter for females who are 114 years old. Surprisingly, some people do live that long based on our dataset!
 
 [source,python]
 ----
-myDF[(myDF['Sex'] == "F") & (myDF['Age'] == 114)]
+filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
 ----
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	Age 	
-265482  	265483 	    1 	0 	1 	1 	7 	F  	114  	... 	
-1304830 	1304831 	1 	0 	2 	1 	12 	F 	114 	... 	
-1372655 	1372656 	1 	0 	1 	1 	3 	F  	114  	... 	
-1981235 	1981236 	1 	0 	1 	1 	7 	F 	114 	... 	
-2407245 	2407246 	1 	0 	0 	0 	5 	F  	114  	... 		
+              Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+265482    265483               1    F  114     1              W
+1304830  1304831               1    F  114     1              W
+1372655  1372656               1    F  114     2              W
+1981235  1981236               1    F  114     2              W
+2407245  2407246               1    F  114     4              M
 ----
 
 Another method that would get us the same results: 
 
 
 [source,python]
 ----
-myDF.loc[(myDF['Sex'] == "F") & (myDF['Age'] == 114)]
+filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
 ----
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	Age 	
-265482  	265483 	    1 	0 	1 	1 	7 	F  	114  	... 	
-1304830 	1304831 	1 	0 	2 	1 	12 	F 	114 	... 	
-1372655 	1372656 	1 	0 	1 	1 	3 	F  	114  	... 	
-1981235 	1981236 	1 	0 	1 	1 	7 	F 	114 	... 	
-2407245 	2407246 	1 	0 	0 	0 	5 	F  	114  	... 		
+              Id  ResidentStatus  Sex  Age  Race  MaritalStatus
+265482    265483               1    F  114     1              W
+1304830  1304831               1    F  114     1              W
+1372655  1372656               1    F  114     2              W
+1981235  1981236               1    F  114     2              W
+2407245  2407246               1    F  114     4              M
 ----
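
When combining conditions, each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python. The sketch below uses toy data (hypothetical names and values) to show this, along with `isin`, which filters for membership in a list of values:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "F"],
    "Age": [114, 114, 79, 114],
    "MaritalStatus": ["W", "M", "W", "S"],
})

# Both conditions must hold; note the parentheses around each comparison.
both = toy[(toy["Sex"] == "F") & (toy["Age"] == 114)]

# isin() keeps rows whose value appears in the given list.
widowed_or_single = toy[toy["MaritalStatus"].isin(["W", "S"])]

print(both)
print(widowed_or_single)
```

Use `|` in place of `&` when either condition is enough, and `~` to negate a condition.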
 
-=== Filtering and Modiying the Dataset
+=== Filtering and Modifying the Dataset
 
 Let's say there was a data entry mistake, and all females who are 114 should actually be 100 years old! Let's fix this data entry error. 
 
 [source,python]
 ----
-myDF.loc[(myDF['Sex'] == "F") & (myDF['Age'] == 114), 'Age'] = 100
-----
+filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114), 'Age'] = 100
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	Age 	
-265482  	265483 	    1 	0 	1 	1 	7 	F  	114  	... 	
-1304830 	1304831 	1 	0 	2 	1 	12 	F 	114 	... 	
-1372655 	1372656 	1 	0 	1 	1 	3 	F  	114  	... 	
-1981235 	1981236 	1 	0 	1 	1 	7 	F 	114 	... 	
-2407245 	2407246 	1 	0 	0 	0 	5 	F  	114  	... 		
-----
 
-We can check whether it worked by trying to filter for females who are 114 again. The results should be 0 observations. 
+
+We can check whether it worked by filtering for females who are 114 again. The result should contain 0 observations because we set those ages to 100.
 
 [source,python]
 ----
-myDF.loc[(myDF['Sex'] == "F") & (myDF['Age'] == 114)]
-print(myDF)
+filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
 ----
 
 ----
-Id 	ResidentStatus 	Education1989Revision 	Education2003Revision 	EducationReportingFlag 	MonthOfDeath 	Sex  	Age 	
-0 rows × 38 columns
+Id 	ResidentStatus 	Sex 	Age 	Race 	MaritalStatus
 ----
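
The whole modify-then-verify pattern can be sketched on toy data as follows. The names `full`, `subset`, and `condition` are assumptions made for this example. The `.copy()` call is also an assumption about good practice rather than something this walkthrough requires: it makes the subset independent of the original DataFrame, so the assignment cannot affect `full`.

```python
import pandas as pd

# Toy data invented for illustration.
full = pd.DataFrame({
    "Id": [1, 2, 3],
    "Sex": ["F", "M", "F"],
    "Age": [114, 114, 79],
    "Extra": [0, 0, 0],
})

# Keep a few columns; .copy() gives an independent DataFrame.
subset = full[["Id", "Sex", "Age"]].copy()

# Assign 100 to Age only in the rows that satisfy the condition.
condition = (subset["Sex"] == "F") & (subset["Age"] == 114)
subset.loc[condition, "Age"] = 100

# Re-evaluate the condition after the update; no rows should match now.
remaining = subset[(subset["Sex"] == "F") & (subset["Age"] == 114)]

print(len(remaining))
print(full["Age"].tolist())   # the original DataFrame is unchanged
```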
diff --git a/tools-appendix/modules/python/pages/intital-eda.adoc b/tools-appendix/modules/python/pages/intital-eda.adoc
@@ -4,6 +4,18 @@ Exploratory Data Analysis or EDA is one of the most important steps when underst
 
 The most common operations start with reading data into a DataFrame, accessing the DataFrame’s attributes, and using the DataFrame’s methods to perform operations on the underlying data or with other DataFrames.
 
+
+
+In this document we will:
+
+* Show how to load and inspect data using pandas (`read_csv`, `head`, `len`, `shape`).
+
+* Explore data attributes (`columns`, `unique`, `value_counts`, `isin`).
+
+* Perform data transformations (`rename`, `iloc`).
+
+* Summarize datasets with `describe()`.
+
 == Basic Functions in EDA
 
 Here we list some functions commonly used for EDA with DataFrames. You can explore how they are used in the code examples below.
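
As a quick, self-contained sketch of several of these functions, the snippet below reads a tiny CSV from an in-memory string (the data and the names `csv_text`, `df`, and `renamed` are invented for illustration; in practice you would pass a file path to `read_csv`).

```python
import io
import pandas as pd

# A tiny CSV held in memory, standing in for a file on disk.
csv_text = "Id,Sex,Age\n1,M,87\n2,M,58\n3,F,75\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))                # first two rows
print(len(df), df.shape)         # number of rows, then (rows, columns)
print(list(df.columns))          # column names
print(df["Sex"].unique())        # distinct values in a column
print(df["Sex"].value_counts())  # how often each value occurs

# rename() returns a new DataFrame; df itself is left unchanged.
renamed = df.rename(columns={"Sex": "sex"})

print(df["Age"].describe())      # count, mean, std, min, quartiles, max
```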