Code
# This code pulls census data and loads it into a DataFrame
@@ -472,7 +472,7 @@
4.1.1 Approach 1: Create a Temporary Column
One approach is to first create a column that contains the length of each name.
-
+
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()
@@ -548,7 +548,7 @@
+
# Sort by the temporary column
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
@@ -621,7 +621,7 @@
+
# Drop the 'name_lengths' column
babynames = babynames.drop("name_lengths", axis='columns')
babynames.head(5)
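The three steps of this approach can be sketched end to end on a small stand-in table. The `toy` DataFrame and its values below are hypothetical and are not taken from `babynames`; only the column name `name_lengths` follows the text.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Name": ["Al", "Christopher", "Maya", "Joe"],
    "Count": [10, 25, 40, 5],
})

# Step 1: add a temporary column holding each name's length
toy["name_lengths"] = toy["Name"].str.len()

# Step 2: sort by the temporary column, longest names first
toy = toy.sort_values(by="name_lengths", ascending=False)

# Step 3: drop the temporary column now that the rows are ordered
toy = toy.drop("name_lengths", axis="columns")

print(toy)
```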
@@ -691,7 +691,7 @@
4.1.2 Approach 2: Sorting using the key Argument
Another way to approach this is to use the key argument of .sort_values(). Here we can specify that we want to sort "Name" values by their length.
-
+
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False).head()
@@ -759,7 +759,7 @@
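As a rough sketch of how the key argument behaves, the example below sorts a hypothetical `toy` table (made-up values, not from `babynames`). The function passed to key receives the entire "Name" Series and should return a Series of the same length, which pandas then uses to order the rows.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Name": ["Al", "Christopher", "Maya", "Joe"],
    "Count": [10, 25, 40, 5],
})

# key is applied to the whole "Name" column at once, not to one value at a time
by_length = toy.sort_values("Name", key=lambda names: names.str.len(), ascending=False)

print(by_length)
```

Unlike the temporary-column approach, no extra column is added, so nothing has to be dropped afterwards.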
4.1.3 Approach 3: Sorting using the map Function
We can also use the map function on a Series to solve this. Say we want to sort the babynames table by the number of "dr"’s and "ea"’s in each "Name". We’ll define the function dr_ea_count to help us out.
-
+
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')
@@ -839,7 +839,7 @@
+
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis = 'columns')
babynames.head(5)
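The whole map-based workflow might look like the sketch below. The `toy` table and its names are hypothetical; `dr_ea_count` is the function defined in the text. Note that `str.count` is case-sensitive, so an uppercase "D" does not start a "dr" match.

```python
import pandas as pd

def dr_ea_count(string):
    # Count occurrences of "dr" and "ea" in a single name
    return string.count("dr") + string.count("ea")

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({"Name": ["Andrea", "Dean", "Maya", "Deandrea"]})

# map applies dr_ea_count to each value of the "Name" Series
toy["dr_ea_count"] = toy["Name"].map(dr_ea_count)

# Sort by the computed counts, largest first
toy = toy.sort_values(by="dr_ea_count", ascending=False)
counts = toy["dr_ea_count"].tolist()

# Drop the helper column once the rows are ordered
toy = toy.drop("dr_ea_count", axis="columns")

print(toy)
```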
@@ -911,10 +911,10 @@ 4.2 Aggregating Data with .groupby
Up until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects. Our goal is to group together rows that fall under the same category and perform an operation that aggregates across all rows in the category.
Let’s say we wanted to aggregate all rows in babynames for a given year.
-
+
babynames.groupby("Year")
-<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10fbc0920>
+<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ded1a90>
What does this strange output mean? Calling .groupby (documentation) has generated a GroupBy object. You can imagine this as a set of “mini” sub-DataFrames, where each subframe contains all of the rows from babynames that correspond to a particular year.
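One way to make the "mini sub-DataFrame" picture concrete is to iterate over a GroupBy object. The sketch below uses a hypothetical `toy` table with made-up counts rather than the real `babynames` data.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Helen", "Mary", "Ruth", "Helen"],
    "Count": [295, 239, 390, 210, 100],
})

grouped = toy.groupby("Year")

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs
for year, subframe in grouped:
    print(year)
    print(subframe)

# Number of rows in each subframe
sizes = grouped.size()

# Pull out a single subframe by its key
rows_1910 = grouped.get_group(1910)
```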
@@ -924,7 +924,7 @@
+
babynames[["Year", "Count"]].groupby("Year").agg("sum").head(5)
@@ -977,7 +977,7 @@ , with a single row for each unique year in the original babynames
DataFrame.
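As a minimal sketch of this aggregation, the example below sums counts per year on a hypothetical `toy` table (the rows are made up). It also shows that selecting the "Count" column after grouping, rather than before, should give the same result.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Helen", "Mary", "Ruth", "Helen"],
    "Count": [295, 239, 390, 210, 100],
})

# Select the relevant columns first, then group and sum
totals = toy[["Year", "Count"]].groupby("Year").agg("sum")

# Group first, then select the column to aggregate
same_totals = toy.groupby("Year")[["Count"]].agg("sum")

print(totals)
```

The output has one row per unique year, indexed by "Year".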
There are many different aggregation functions we can use, all of which are useful in different applications.
-
+
babynames[["Year", "Count"]].groupby("Year").agg("min").head(5)
@@ -1021,7 +1021,7 @@
+
babynames[["Year", "Count"]].groupby("Year").agg("max").head(5)
@@ -1065,7 +1065,7 @@
+
# Same result, but now we explicitly tell pandas to only consider the "Count" column when summing
babynames.groupby("Year")[["Count"]].agg("sum").head(5)
@@ -1119,7 +1119,7 @@ 4.2.1 Aggregation Functions
Because of this fairly broad requirement, pandas offers many ways of computing an aggregation.
Built-in Python operations – such as sum, max, and min – are automatically recognized by pandas.
-
+
# What is the minimum count for each name in any year?
babynames.groupby("Name")[["Count"]].agg("min").head()
@@ -1164,7 +1164,7 @@
-
+
# What is the largest single-year count of each name?
babynames.groupby("Name")[["Count"]].agg("max").head()
@@ -1210,7 +1210,7 @@
As mentioned previously, functions from the NumPy library, such as np.mean, np.max, np.min, and np.sum, are also fair game in pandas.
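To compare several of these aggregations side by side, `.agg` also accepts a list of aggregation names. The sketch below uses string names, as the code in this section does, on a hypothetical `toy` table with made-up counts.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Helen", "Mary", "Ruth", "Helen"],
    "Count": [295, 239, 390, 210, 100],
})

# One output column per aggregation, one row per name
summary = toy.groupby("Name")["Count"].agg(["min", "max", "mean"])

print(summary)
```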
-
+
# What is the average count for each name across all years?
babynames.groupby("Name")[["Count"]].agg("mean").head()
@@ -1266,7 +1266,7 @@
The latter two entries in this list – "first" and "last" – are unique to pandas. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where multiple columns in a group share identical information. To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.
Let’s illustrate this with an example. Say we add a new column to babynames that contains the first letter of each name.
-
+
# Imagine we had an additional column, "First Letter". We'll explain this code next week
babynames["First Letter"] = babynames["Name"].str[0]
@@ -1331,7 +1331,7 @@
Aggregating using “first”
-
+
babynames_new.groupby("Name").agg({"First Letter":"first", "Year":"max"}).head()
@@ -1386,7 +1386,7 @@
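The "first" aggregation can be tried on a small scale as sketched below. The `toy` table is hypothetical; passing a dictionary to `.agg` assigns a different aggregation to each column, as in the code above.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames_new
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Helen", "Mary", "Ruth", "Helen"],
})

# Every row for a given name shares the same first letter
toy["First Letter"] = toy["Name"].str[0]

# "first" grabs the shared letter; "max" finds the latest year each name appears
result = toy.groupby("Name").agg({"First Letter": "first", "Year": "max"})

print(result)
```

Because the first letter is identical within each group, "first" and "last" would return the same value here.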
4.2.2 Plotting Birth Counts
Let’s use .agg to find the total number of babies born in each year. Recall that using .agg with .groupby() follows the format: df.groupby(column_name).agg(aggregation_function). The line of code below gives us the total number of babies born in each year.
-
+
Code
babynames.groupby("Year")[["Count"]].agg(sum).head(5)
@@ -1396,7 +1396,7 @@ babynames.groupby(
# babynames.groupby("Year").sum(numeric_only=True)
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_22808/390646742.py:1: FutureWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_93014/390646742.py:1: FutureWarning:
The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
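As the warning suggests, passing the string "sum" instead of the built-in `sum` keeps the current behavior. The sketch below, on a hypothetical `toy` table with made-up counts, compares the string form with the `.sum(numeric_only=True)` shortcut from the commented-out line.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Helen", "Mary", "Ruth", "Helen"],
    "Count": [295, 239, 390, 210, 100],
})

# String name: the form recommended by the FutureWarning
by_string = toy.groupby("Year")[["Count"]].agg("sum")

# Shortcut method; numeric_only=True skips the non-numeric "Name" column
by_method = toy.groupby("Year").sum(numeric_only=True)

print(by_string)
```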
@@ -1446,7 +1446,7 @@
Here’s an illustration of the process:
Plotting the DataFrame we obtain tells an interesting story.
-
+
Code
import plotly.express as px
@@ -1454,9 +1454,9 @@
px.line(puzzle2, y = "Count")
-