diff --git a/docs/pandas_3/images/groupby_demo.png b/docs/pandas_3/images/groupby_demo.png
index f87b62e8..5a0a8b68 100644
Binary files a/docs/pandas_3/images/groupby_demo.png and b/docs/pandas_3/images/groupby_demo.png differ
diff --git a/docs/pandas_3/pandas_3.html b/docs/pandas_3/pandas_3.html
index 7e85f1e2..281e4355 100644
--- a/docs/pandas_3/pandas_3.html
+++ b/docs/pandas_3/pandas_3.html
@@ -341,7 +341,7 @@

Pandas III

4.1 Custom Sorts

First, let’s finish our discussion of sorting by solving one sorting problem with several different approaches. Assume we want to find the longest baby names and sort our data accordingly.

We’ll start by loading the babynames dataset. Note that this dataset is filtered to only contain data from California.

-
+
Code
# This code pulls census data and loads it into a DataFrame
@@ -472,7 +472,7 @@ 

4.1.1 Approach 1: Create a Temporary Column

One way to do this is to first create a column that contains the length of each name.

-
+
# Create a Series of the length of each name
 babyname_lengths = babynames["Name"].str.len()
 
@@ -548,7 +548,7 @@ 

+
# Sort by the temporary column
 babynames = babynames.sort_values(by="name_lengths", ascending=False)
 babynames.head(5)
@@ -621,7 +621,7 @@

+
# Drop the 'name_lengths' column
 babynames = babynames.drop("name_lengths", axis='columns')
 babynames.head(5)
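
The three steps of this approach (add a temporary column, sort by it, drop it) can be sketched end to end as follows. This is a minimal illustration on a small hypothetical table named `toy`, not the real babynames data; only the column name "name_lengths" is taken from the notes above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({
    "Name": ["Al", "Christopher", "Maria", "Bob"],
    "Count": [5, 12, 9, 7],
})

# 1. Add a temporary column holding the length of each name
toy["name_lengths"] = toy["Name"].str.len()

# 2. Sort by the temporary column, longest names first
toy = toy.sort_values(by="name_lengths", ascending=False)

# 3. Drop the temporary column once the sort is done
toy = toy.drop("name_lengths", axis="columns")

longest_first = toy["Name"].tolist()
print(longest_first)
```

The names in the toy table all have different lengths, so the resulting order does not depend on how ties are broken.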
@@ -691,7 +691,7 @@

4.1.2 Approach 2: Sorting using the key Argument

Another way to approach this is to use the key argument of .sort_values(). Here we can specify that we want to sort "Name" values by their length.

-
+
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False).head()
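
As a small sketch of how the key argument behaves, the example below sorts a hypothetical table `toy` (an assumption, not the real babynames data). The function passed as key receives the whole "Name" Series and returns a Series of the same length, which pandas then sorts by in place of the original values.

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({
    "Name": ["Al", "Christopher", "Maria", "Bob"],
    "Count": [5, 12, 9, 7],
})

# The key function maps the "Name" Series to a Series of name lengths;
# rows are ordered by those lengths, but no new column is created
by_length = toy.sort_values("Name", key=lambda s: s.str.len(), ascending=False)

names_by_length = by_length["Name"].tolist()
print(names_by_length)
```

Compared with the temporary-column approach, nothing has to be added to or dropped from the DataFrame.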
@@ -759,7 +759,7 @@

4.1.3 Approach 3: Sorting using the map Function

We can also use the map function on a Series to solve this. Say we want to sort the babynames table by the number of times "dr" and "ea" appear in each "Name". We’ll define the function dr_ea_count to help us out.

-
+
# First, define a function to count the number of times "dr" or "ea" appear in each name
 def dr_ea_count(string):
     return string.count('dr') + string.count('ea')
@@ -839,7 +839,7 @@ 

+
# Drop the `dr_ea_count` column
 babynames = babynames.drop("dr_ea_count", axis = 'columns')
 babynames.head(5)
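
The full map-based workflow might look like the sketch below. It reuses the dr_ea_count function and column name from the notes, but the table `toy` and its names are hypothetical. Note that str.count is case-sensitive, so a leading "Dr" is not counted.

```python
import pandas as pd

def dr_ea_count(string):
    # Count lowercase occurrences of "dr" and "ea" in one name
    return string.count("dr") + string.count("ea")

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({"Name": ["Andrea", "Drew", "Deandrea", "Sean"]})

# Apply the function to every name and store the result in a temporary column
toy["dr_ea_count"] = toy["Name"].map(dr_ea_count)

# Sort by the temporary column, highest counts first
toy = toy.sort_values(by="dr_ea_count", ascending=False)
counts = toy["dr_ea_count"].tolist()
names = toy["Name"].tolist()

# Drop the temporary column once the sort is done
toy = toy.drop("dr_ea_count", axis="columns")
print(names, counts)
```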
@@ -911,10 +911,10 @@

4.2 Aggregating Data with .groupby

Up until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects. Our goal is to group together rows that fall under the same category and perform an operation that aggregates across all rows in the category.

Let’s say we wanted to aggregate all rows in babynames for a given year.

-
+
babynames.groupby("Year")
-
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10fbc0920>
+
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ded1a90>

What does this strange output mean? Calling .groupby (documentation) has generated a GroupBy object. You can imagine this as a set of “mini” sub-DataFrames, where each subframe contains all of the rows from babynames that correspond to a particular year.
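
One way to make the "mini sub-DataFrames" picture concrete is to iterate over a GroupBy object. The sketch below uses a small hypothetical table `toy` (its values are illustrative only); each iteration yields a group key and the subframe of rows sharing that key.

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911],
    "Name": ["Mary", "Helen", "Mary"],
    "Count": [295, 239, 390],
})

grouped = toy.groupby("Year")

# Iterating yields (group key, sub-DataFrame) pairs
group_sizes = {}
for year, subframe in grouped:
    group_sizes[int(year)] = len(subframe)

# A single subframe can also be pulled out by its key
rows_1910 = grouped.get_group(1910)
print(group_sizes)
print(rows_1910)
```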

@@ -924,7 +924,7 @@

+
babynames[["Year", "Count"]].groupby("Year").agg("sum").head(5)
@@ -977,7 +977,7 @@

, with a single row for each unique year in the original babynames DataFrame.

There are many different aggregation functions we can use, all of which are useful in different applications.

-
+
babynames[["Year", "Count"]].groupby("Year").agg("min").head(5)
@@ -1021,7 +1021,7 @@

+
babynames[["Year", "Count"]].groupby("Year").agg("max").head(5)
@@ -1065,7 +1065,7 @@

+
# Same result, but now we explicitly tell pandas to only consider the "Count" column when summing
 babynames.groupby("Year")[["Count"]].agg("sum").head(5)
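
To check that selecting columns before grouping and selecting them after grouping produce the same table, one can compare the two results directly. The sketch below does this on a hypothetical table `toy`; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911],
    "Name": ["Mary", "Helen", "Mary"],
    "Count": [295, 239, 390],
})

# Option 1: keep only the needed columns, then group and aggregate
select_then_group = toy[["Year", "Count"]].groupby("Year").agg("sum")

# Option 2: group first, then select the column to aggregate
group_then_select = toy.groupby("Year")[["Count"]].agg("sum")

same = select_then_group.equals(group_then_select)
print(same)
```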
@@ -1119,7 +1119,7 @@

4.2.1 Aggregation Functions

Because of this fairly broad requirement, pandas offers many ways of computing an aggregation.

Built-in Python operations – such as sum, max, and min – are automatically recognized by pandas.

-
+
# What is the minimum count for each name in any year?
 babynames.groupby("Name")[["Count"]].agg("min").head()
@@ -1164,7 +1164,7 @@

-
+
# What is the largest single-year count of each name?
 babynames.groupby("Name")[["Count"]].agg("max").head()
@@ -1210,7 +1210,7 @@

As mentioned previously, functions from the NumPy library, such as np.mean, np.max, np.min, and np.sum, are also fair game in pandas.

-
+
# What is the average count for each name across all years?
 babynames.groupby("Name")[["Count"]].agg("mean").head()
@@ -1266,7 +1266,7 @@

The latter two entries in this list – "first" and "last" – are unique to pandas. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where multiple columns in a group share identical information. To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.

Let’s illustrate this with an example. Say we add a new column to babynames that contains the first letter of each name.

-
+
# Imagine we had an additional column, "First Letter". We'll explain this code next week
 babynames["First Letter"] = babynames["Name"].str[0]
 
@@ -1331,7 +1331,7 @@ 

Aggregating using “first”

-
+
babynames_new.groupby("Name").agg({"First Letter":"first", "Year":"max"}).head()
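
The same pattern can be sketched on a small hypothetical table `toy` (not the babynames_new table itself). Passing a dictionary to .agg applies a different aggregation to each column: "first" for the letter, which is identical within each name group, and "max" for the year.

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Mary"],
    "Year": [1910, 1910, 1911],
    "Count": [295, 239, 390],
})
toy["First Letter"] = toy["Name"].str[0]

# Every row in a name's group has the same first letter, so "first" is enough;
# "max" picks the most recent year in which the name appears
result = toy.groupby("Name").agg({"First Letter": "first", "Year": "max"})
print(result)
```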
@@ -1386,7 +1386,7 @@

4.2.2 Plotting Birth Counts

Let’s use .agg to find the total number of babies born in each year. Recall that using .agg with .groupby() follows the format df.groupby(column_name).agg(aggregation_function).

-
+
Code
babynames.groupby("Year")[["Count"]].agg(sum).head(5)
@@ -1396,7 +1396,7 @@ 

# babynames.groupby("Year").sum(numeric_only=True)

-
/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_22808/390646742.py:1: FutureWarning:
+
/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_93014/390646742.py:1: FutureWarning:
 
 The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
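
As the warning suggests, passing the string "sum" instead of the built-in function keeps the current behavior. The sketch below, on a hypothetical table `toy`, shows the string form alongside the equivalent .sum() method.

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911],
    "Name": ["Mary", "Helen", "Mary"],
    "Count": [295, 239, 390],
})

# Passing the aggregation by name, as the warning recommends
total_by_string = toy.groupby("Year")[["Count"]].agg("sum")

# Equivalent shorthand method on the GroupBy object
total_by_method = toy.groupby("Year")[["Count"]].sum()

totals_match = total_by_string.equals(total_by_method)
print(total_by_string)
```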
 
@@ -1446,7 +1446,7 @@

Here’s an illustration of the process:

aggregation

Plotting the DataFrame we obtain tells an interesting story.

-
+
Code
import plotly.express as px
@@ -1454,9 +1454,9 @@ 

px.line(puzzle2, y = "Count")

-