Making changes to Matplotlib document
arroyo38 committed Jan 2, 2025
1 parent d4a52af commit 5789d3f
Showing 5 changed files with 129 additions and 67 deletions.
133 changes: 71 additions & 62 deletions tools-appendix/modules/python/pages/filtering-and-selecting.adoc
@@ -246,20 +246,21 @@ myDF[['ResidentStatus', 'Age']]
The output of selecting multiple columns using the double brackets is a pandas `DataFrame`:

----
ResidentStatus Age
0 1 87
1 1 58
2 1 75
3 1 74
4 1 64
... ... ...
2631166 3 84
2631167 3 74
2631168 3 7
2631169 4 49
2631170 3 39
2631171 rows × 2 columns
ResidentStatus Age
0 1 87
1 1 58
2 1 75
3 1 74
4 1 64
... ... ...
2631166 3 84
2631167 3 74
2631168 3 7
2631169 4 49
2631170 3 39
[2631171 rows x 2 columns]
----
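
As a minimal sketch of the difference between single and double brackets, the example below builds a small stand-in DataFrame (`toy_df` and its rows are hypothetical, not the real `myDF`) and checks what type each selection returns:

```python
import pandas as pd

# Hypothetical stand-in for myDF with a few made-up rows.
toy_df = pd.DataFrame({
    "ResidentStatus": [1, 1, 3],
    "Age": [87, 58, 84],
    "Sex": ["M", "M", "F"],
})

one_col = toy_df["Age"]                       # single brackets: a pandas Series
two_cols = toy_df[["ResidentStatus", "Age"]]  # double brackets: a pandas DataFrame
still_df = toy_df[["Age"]]                    # a one-item list still returns a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(two_cols.shape)
```

The inner brackets are simply a Python list of column names, which is why a one-item list still produces a `DataFrame`.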

== The iloc function
@@ -295,9 +296,10 @@ AgeType 1
Age 87
AgeSubstitutionFlag 0
Name: 0, dtype: object
----

We can also use `iloc[]` to select the first row (index 0) and all columns using (:):
We can also use `iloc[]` to select the first row (index 0) and all columns by passing `:` as the column selector:
[source,python]
----
myDF.iloc[0, :]
@@ -379,16 +381,17 @@ filtered_myDF


----
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
1 2 1 M 58 1 D
2 3 1 F 75 1 W
3 4 1 M 74 1 D
4 5 1 M 64 1 D
... ... ... ... ... ... ...
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
1 2 1 M 58 1 D
2 3 1 F 75 1 W
3 4 1 M 74 1 D
4 5 1 M 64 1 D
... ... ... ... ... ... ...
----

Finally, let's try selecting multiple rows and multiple columns at the same time. When selecting multiple rows and multiple columns using iloc, the output is a subset of the DataFrame that contains the specified rows and all the columns. In this example, myDF.iloc[[0, 7, 9, 10], :] specifies the selection of rows 0, 7, 9, and 10 and all columns:
Finally, let's try selecting multiple rows and multiple columns at the same time. When we pass a list of row positions and a column selector to `iloc`, the output is the subset of the DataFrame that contains those rows and columns. In this example, `filtered_myDF.iloc[[0, 7, 9, 10], :]` selects rows 0, 7, 9, and 10 and, because of the `:`, all columns:

[source,python]
----
@@ -397,11 +400,12 @@ filtered_myDF.iloc[[0, 7, 9, 10], :]


----
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
7 8 1 M 55 2 S
9 10 1 M 23 1 S
10 11 1 F 79 1 W
Id ResidentStatus Sex Age Race MaritalStatus
0 1 1 M 87 1 M
7 8 1 M 55 2 S
9 10 1 M 23 1 S
10 11 1 F 79 1 W
----
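
To make the position-based behavior of `iloc` concrete, here is a small sketch. It assumes a hypothetical `toy_df` whose index labels are deliberately not 0, 1, 2, 3, so that positions and labels cannot be confused:

```python
import pandas as pd

# Hypothetical stand-in data; the index labels are chosen to differ from positions.
toy_df = pd.DataFrame(
    {"Id": [1, 2, 3, 4], "Sex": ["M", "M", "F", "M"], "Age": [87, 58, 75, 74]},
    index=[10, 20, 30, 40],
)

first_row = toy_df.iloc[0]           # first row by position, returned as a Series
same_row = toy_df.iloc[0, :]         # same row, with all columns spelled out
some_rows = toy_df.iloc[[0, 2], :]   # rows at positions 0 and 2, all columns
block = toy_df.iloc[[0, 2], [0, 2]]  # rows 0 and 2, columns at positions 0 and 2

print(some_rows)
print(block)
```

Note that `iloc` ignores the index labels entirely: position 0 here is the row labeled 10.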

== The loc function
@@ -426,13 +430,14 @@ filtered_myDF.loc[:, filtered_myDF.columns != 'Race']
----

----
Id ResidentStatus Sex Age MaritalStatus
0 1 1 M 87 M
1 2 1 M 58 D
2 3 1 F 75 W
3 4 1 M 74 D
4 5 1 M 64 D
... ... ... ... ... ...
Id ResidentStatus Sex Age MaritalStatus
0 1 1 M 87 M
1 2 1 M 58 D
2 3 1 F 75 W
3 4 1 M 74 D
4 5 1 M 64 D
... ... ... ... ... ...
----
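
The column filter above works because comparing the column names with `!=` produces an array of booleans, one per column, which `loc` then uses as a mask. A minimal sketch with a hypothetical `toy_df`:

```python
import pandas as pd

# Hypothetical stand-in data with the same kind of columns as filtered_myDF.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Sex": ["M", "M", "F"],
    "Race": [1, 1, 1],
    "Age": [87, 58, 75],
})

keep = toy_df.columns != "Race"        # boolean array: True for every column except 'Race'
no_race = toy_df.loc[:, keep]          # all rows, only the columns marked True
dropped = toy_df.drop(columns="Race")  # an alternative that gives the same result here

print(keep)
print(no_race)
```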


@@ -493,15 +498,15 @@ filtered_myDF[filtered_myDF['Sex'] == "F"]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ...
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ... ...
[1299710 rows × 6 columns]
1299710 rows × 6 columns
----

We can also use `.loc` to filter for females.
@@ -512,13 +517,15 @@ filtered_myDF.loc[filtered_myDF['Sex'] == "F"]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ...
Id ResidentStatus Sex Age Race MaritalStatus
2 3 1 F 75 1 W
5 6 1 F 93 1 W
8 9 1 F 86 1 W
10 11 1 F 79 1 W
12 13 1 F 85 1 W
... ... ... ... ... ... ...
[1299710 rows × 6 columns]
----
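
Both forms rely on the same boolean mask. The sketch below, using a hypothetical `toy_df`, builds the mask once and shows that plain brackets and `.loc` return the same rows; `.loc` additionally lets us pick columns in the same step:

```python
import pandas as pd

# Hypothetical stand-in data.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Sex": ["M", "M", "F", "F"],
    "Age": [87, 58, 75, 93],
})

mask = toy_df["Sex"] == "F"                      # Series of True/False, one per row
with_brackets = toy_df[mask]                     # rows where the mask is True
with_loc = toy_df.loc[mask]                      # same rows via .loc
with_loc_cols = toy_df.loc[mask, ["Id", "Age"]]  # same rows, only two columns

print(mask.tolist())
print(with_loc_cols)
```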

Now let's filter on two conditions at once: females who are 114 years old. Surprisingly, our dataset shows that some people do live that long!
@@ -529,12 +536,13 @@ filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
----

Another method that would get us the same results:
@@ -546,12 +554,13 @@ filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
----

----
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
Id ResidentStatus Sex Age Race MaritalStatus
265482 265483 1 F 114 1 W
1304830 1304831 1 F 114 1 W
1372655 1372656 1 F 114 2 W
1981235 1981236 1 F 114 2 W
2407245 2407246 1 F 114 4 M
----
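
When combining conditions, each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python. A short sketch with a hypothetical `toy_df` shows `&` (and), `|` (or), and `~` (not) on named masks:

```python
import pandas as pd

# Hypothetical stand-in data.
toy_df = pd.DataFrame({
    "Sex": ["M", "F", "F", "M"],
    "Age": [87, 114, 75, 114],
})

is_female = toy_df["Sex"] == "F"
is_114 = toy_df["Age"] == 114

both = toy_df[is_female & is_114]    # female AND 114 years old
either = toy_df[is_female | is_114]  # female OR 114 years old
not_female = toy_df[~is_female]      # everyone who is not female

print(both)
print(either)
print(not_female)
```

Naming the masks first is one way to avoid the parentheses issue altogether; it is a style choice, not a requirement.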

=== Filtering and Modifying the Dataset
1 change: 1 addition & 0 deletions tools-appendix/modules/python/pages/index.adoc
@@ -18,6 +18,7 @@ Python is largely known for its readability and versatility. Its design philosop
* xref:plotly-examples.adoc[Data Visualization with plotly]
* xref:writing-functions.adoc[Writing Functions in Python]
* xref:writing-scripts.adoc[Writing Scripts in Python]
* xref:pandas-series.adoc[Pandas Series]
* xref:pandas-dates-and-times.adoc[Handling Dates and Times in pandas]
* xref:pandas-aggregate-functions.adoc[Applying Aggregate Functions in pandas]
* xref:pandas-reshaping.adoc[Reshaping Data in pandas]
62 changes: 57 additions & 5 deletions tools-appendix/modules/python/pages/matplotlib.adoc
@@ -1,11 +1,13 @@
= matplotlib
= Matplotlib

When starting with Python, the most common plotting package is often `matplotlib`. It is an easy and straightforward plotting tool, with a surprising amount of depth. Like any package, it also has pluses and minuses.

Importing `matplotlib` for use in a project is pretty straightforward:

* <<barplot, barplot>>
* <<boxplot, boxplot>>
* <<Barplots Using Matplotlib, Barplots Using Matplotlib>>
* <<Boxplots Using Matplotlib, Boxplots Using Matplotlib>>
* <<Histograms Using Matplotlib, Histograms Using Matplotlib>>
* <<Scatterplots Using Matplotlib, Scatterplots Using Matplotlib>>
[source,python]
----
@@ -26,7 +28,7 @@ For those of us who aren't familiar with MATLAB the `pyplot` functionality creat

{sp}+

== barplot
== Barplots Using Matplotlib

Barplots can take many forms. They are most often used to compare change over time or to compare categories within a data set. As with many of the plotting types, `matplotlib` has a built-in function, `bar`, to create the visualizations.
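
As a minimal sketch of a bar chart, the example below uses matplotlib's `bar` function with made-up categories and counts (the data and labels are hypothetical). The `Agg` backend is selected only so that the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
categories = ["A", "B", "C"]
counts = [5, 3, 8]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts, color="steelblue")  # one bar per category
ax.set_title("Counts by Category")
ax.set_xlabel("Category")
ax.set_ylabel("Count")

# Each bar is a rectangle whose height is the value we passed in.
heights = [bar.get_height() for bar in bars]
print(heights)
plt.close(fig)
```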

@@ -289,7 +291,7 @@ plt.close()

This just starts to scratch the surface of what is possible with `matplotlib` but it does show the deep customization that is possible via the package.

== boxplot
== Boxplots Using Matplotlib

`boxplot` is a function that creates a https://en.wikipedia.org/wiki/Box_plot[boxplot]. While that may not be very surprising, it is surprising how helpful boxplots can be in summarizing your data. Boxplots show a number of different measures related to the data such as quartiles, upper and lower bounds, and potential outliers. They can also be helpful to identify general trends between groups or over time. However, it should be noted there may be better plots for specific use cases.
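
To connect the picture to the numbers, here is a small sketch that computes the quartiles with NumPy and then reads the median and the outliers back from the objects that `boxplot` returns. The values are hypothetical, and the default whisker length of 1.5 times the interquartile range is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data with one unusually large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles computed directly from the data.
q1, median, q3 = np.percentile(values, [25, 50, 75])
upper_fence = q3 + 1.5 * (q3 - q1)  # points above this are drawn as outliers

fig, ax = plt.subplots()
parts = ax.boxplot(values)  # returns a dict of the drawn line objects

plotted_median = parts["medians"][0].get_ydata()[0]  # height of the median line
outliers = parts["fliers"][0].get_ydata()            # values drawn as individual points

print(q1, median, q3, upper_fence)
print(plotted_median, outliers)
plt.close(fig)
```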

@@ -466,3 +468,53 @@ plt.close()
image::box_6.png[Boxplot with better color, width=792, height=500, loading=lazy, title="Boxplot with better color"]

Now we have a good looking boxplot! Hopefully this demonstration showed how helpful boxplots can be when interpreting data. It also shows how `matplotlib` plots can be further customized, to fit the needs of the visualization!

== Histograms Using Matplotlib

A histogram visualizes the distribution of numerical data. It groups the data points into intervals (called bins) and draws one bar per bin; the height of each bar shows how many data points fall within that range.

Let's visualize the precipitation data in our dataset by plotting a histogram with Matplotlib.


[source,python]
----
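import pandas as pd
import matplotlib.pyplot as plt

# Read the precipitation data and plot the 'precip' column in 10 bins
myDF = pd.read_csv("/anvil/projects/tdm/data/precip/precip.csv")
plt.hist(myDF['precip'], bins=10, edgecolor='black')
plt.title('Histogram of Precipitation')
plt.xlabel('Precipitation (inches)')
plt.ylabel('Frequency')
plt.show()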
----


image::matplot-histogram-aa.png[Plotting a histogram, width=792, height=500, loading=lazy, title="Histogram in Matplotlib"]
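
To see the binning itself, the sketch below uses a handful of hypothetical values instead of the precipitation file and inspects what `hist` returns: the count in each bin and the bin edges. The `range` argument is set here only to get round bin edges:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical values for illustration only.
values = [0.5, 1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 4.7]

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(values, bins=5, range=(0, 5), edgecolor="black")

print(counts)     # how many values fell into each bin
print(bin_edges)  # 5 bins are described by 6 edges
plt.close(fig)
```

Every value lands in exactly one bin, so the counts always add up to the number of data points.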



== Scatterplots Using Matplotlib

A scatter plot visualizes the relationship between two variables. Each observation is drawn as a point on a Cartesian plane, positioned by its values for the two variables. Scatter plots are useful for identifying patterns, trends, or correlations in the data.

Let's visualize the precipitation data in our dataset by plotting a scatter plot with Matplotlib.

[source,python]
----
import pandas as pd
import matplotlib.pyplot as plt
myDF = pd.read_csv("/anvil/projects/tdm/data/precip/precip.csv")
plt.scatter(myDF['place'].iloc[:10], myDF['precip'].iloc[:10], color='blue')
plt.title("Scatter Plot of Precipitation (First 10 Places)")
plt.xlabel("Place")
plt.ylabel("Precipitation (inches)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
----

When creating plots, it's important to try to understand the overall trends they reveal. From the plot, we observe that among the first 10 places, Mobile, Phoenix, and Little Rock have the highest precipitation levels.

image::matplot-scatterplot-aa.png[Plotting a scatterplot, width=792, height=500, loading=lazy, title="Scatterplot in Matplotlib"]
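
Scatter plots are most informative when both axes are numeric. As a sketch under that assumption, the example below plots two hypothetical numeric variables (`hours` and `scores` are made up and unrelated to the precipitation data) and computes their correlation with NumPy to back up what the plot suggests:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical paired measurements for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

fig, ax = plt.subplots()
points = ax.scatter(hours, scores, color="blue")  # one point per (hours, scores) pair
ax.set_title("Scores vs. Hours")
ax.set_xlabel("Hours")
ax.set_ylabel("Score")

# Pearson correlation: values near 1 indicate a strong upward trend.
r = np.corrcoef(hours, scores)[0, 1]
print(round(r, 3))
plt.close(fig)
```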
