seaborn-documentation

TheDataMine · Jan 3, 2025 · 87dcdf5 · 87dcdf5
1 parent 717bd5c
commit 87dcdf5
Show file tree

Hide file tree

Showing 8 changed files with 305 additions and 13 deletions.
diff --git a/tools-appendix/modules/python/images/corr-matrix-seaborn.png b/tools-appendix/modules/python/images/corr-matrix-seaborn.png
diff --git a/tools-appendix/modules/python/images/histogram-seaborn-aa.png b/tools-appendix/modules/python/images/histogram-seaborn-aa.png
diff --git a/tools-appendix/modules/python/images/histograms-seaborn-aa.png b/tools-appendix/modules/python/images/histograms-seaborn-aa.png
diff --git a/tools-appendix/modules/python/images/scatterplot-seaborn-aa.png b/tools-appendix/modules/python/images/scatterplot-seaborn-aa.png
diff --git a/tools-appendix/modules/python/images/seaborn-barchart-aa.png b/tools-appendix/modules/python/images/seaborn-barchart-aa.png
diff --git a/tools-appendix/modules/python/images/seaborn-boxplot-aa.png b/tools-appendix/modules/python/images/seaborn-boxplot-aa.png
diff --git a/tools-appendix/modules/python/nav.adoc b/tools-appendix/modules/python/nav.adoc
@@ -1,14 +1,14 @@
 * xref:index.adoc[Python]
-* xref:introduction-to-jupyter-lab.adoc[Introduction to Jupyter]
-* xref:basics-programming.adoc[Basics of Programming]
-* xref:lists-dictionaries-tuples-loops.adoc[Introduction: Lists, Tuples, Dictionaries]
-* xref:eda.adoc[Basics of Exploratory Data Analysis (EDA)]
-* xref:control-flow.adoc[Control Flow]
-* xref:filtering-and-selecting.adoc[Filtering and Selecting]
-* xref:matplotlib.adoc[Data Visualization with matplotlib]
-* xref:plotly-examples.adoc[Data Visualization with plotly]
-* xref:writing-functions.adoc[Writing Functions in Python]
-* xref:writing-scripts.adoc[Writing Scripts in Python]
-* xref:pandas-dates-and-times.adoc[Handling Dates and Times in pandas]
-* xref:pandas-aggregate-functions.adoc[Applying Aggregate Functions in pandas]
-* xref:pandas-reshaping.adoc[Reshaping Data in pandas]
+** xref:introduction-to-jupyter-lab.adoc[Introduction to Jupyter]
+** xref:basics-programming.adoc[Basics of Programming]
+** xref:lists-dictionaries-tuples-loops.adoc[Introduction: Lists, Tuples, Dictionaries]
+** xref:eda.adoc[Basics of Exploratory Data Analysis (EDA)]
+** xref:control-flow.adoc[Control Flow]
+** xref:filtering-and-selecting.adoc[Filtering and Selecting]
+** xref:matplotlib.adoc[Data Visualization with matplotlib]
+** xref:plotly-examples.adoc[Data Visualization with plotly]
+** xref:writing-functions.adoc[Writing Functions in Python]
+** xref:writing-scripts.adoc[Writing Scripts in Python]
+** xref:pandas-dates-and-times.adoc[Handling Dates and Times in pandas]
+** xref:pandas-aggregate-functions.adoc[Applying Aggregate Functions in pandas]
+** xref:pandas-reshaping.adoc[Reshaping Data in pandas]
diff --git a/tools-appendix/modules/python/pages/seaborn-examples.adoc b/tools-appendix/modules/python/pages/seaborn-examples.adoc
@@ -0,0 +1,292 @@
+= Seaborn
+
+Seaborn is a Python library built on top of Matplotlib. It’s great for exploring data because it works well with pandas DataFrames and includes built-in themes and statistical plotting options like scatterplots, boxplots, and heatmaps!
+
+
+In this document, we will:
+
+* Create different types of visualizations, including bar plots, scatterplots, boxplots histograms, and heatmaps.
+
+* Modify axis labels, titles, and other graph elements to improve readability.
+
+* Explore dataset preparation techniques, including handling missing values and filtering data.
+
+* Analyze trends and patterns in the data by applying various Seaborn visualization techniques.
+
+
+== Understanding the Data Prior to Plotting
+
+Just as we did when learning `plotly`, we will use the following dataset(s) to understand seaborn:
+
+`/anvil/projects/tdm/data/zillow/Metro_time_series.csv`
+
+Let's review the dataset's columns to identify which ones might be suitable for creating a plot. We will use the `.columns` function again.
+
+[source,python]
+----
+import plotly.express as px
+import pandas as pd
+
+myDF = pd.read_csv("/anvil/projects/tdm/data/zillow/Metro_time_series.csv")
+myDF.columns
+----
+
+----
+Index(['Date', 'RegionName', 'AgeOfInventory', 'DaysOnZillow_AllHomes',
+       'InventorySeasonallyAdjusted_AllHomes', 'InventoryRaw_AllHomes',
+       'InventorySeasonallyAdjusted_BottomTier',
+       'InventorySeasonallyAdjusted_MiddleTier',
+       'InventorySeasonallyAdjusted_TopTier',
+       'MedianListingPricePerSqft_1Bedroom',
+       'MedianListingPricePerSqft_2Bedroom',
+       'MedianListingPricePerSqft_3Bedroom',
+       'MedianListingPricePerSqft_4Bedroom',
+       'MedianListingPricePerSqft_5BedroomOrMore',
+       'MedianListingPricePerSqft_AllHomes',
+       'MedianListingPricePerSqft_CondoCoop',
+       'MedianListingPricePerSqft_DuplexTriplex',
+       'MedianListingPricePerSqft_SingleFamilyResidence',
+       'MedianListingPrice_1Bedroom', 'MedianListingPrice_2Bedroom',
+       'MedianListingPrice_3Bedroom', 'MedianListingPrice_4Bedroom',
+       'MedianListingPrice_5BedroomOrMore', 'MedianListingPrice_AllHomes',
+       'MedianListingPrice_CondoCoop', 'MedianListingPrice_DuplexTriplex',
+       'MedianListingPrice_SingleFamilyResidence',
+       'MedianPctOfPriceReduction_AllHomes',
+       'MedianPctOfPriceReduction_CondoCoop',
+       'MedianPctOfPriceReduction_SingleFamilyResidence',
+       'MedianPriceCutDollar_AllHomes', 'MedianPriceCutDollar_CondoCoop',
+       'MedianPriceCutDollar_SingleFamilyResidence',
+       'MedianRentalPricePerSqft_1Bedroom',
+       'MedianRentalPricePerSqft_2Bedroom',
+       'MedianRentalPricePerSqft_3Bedroom',
+       'MedianRentalPricePerSqft_4Bedroom',
+       'MedianRentalPricePerSqft_5BedroomOrMore',
+       'MedianRentalPricePerSqft_AllHomes',
+       'MedianRentalPricePerSqft_CondoCoop',
+       'MedianRentalPricePerSqft_DuplexTriplex',
+       'MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits',
+       'MedianRentalPricePerSqft_SingleFamilyResidence',
+       'MedianRentalPricePerSqft_Studio', 'MedianRentalPrice_1Bedroom',
+       'MedianRentalPrice_2Bedroom', 'MedianRentalPrice_3Bedroom',
+       'MedianRentalPrice_4Bedroom', 'MedianRentalPrice_5BedroomOrMore',
+       'MedianRentalPrice_AllHomes', 'MedianRentalPrice_CondoCoop',
+       'MedianRentalPrice_DuplexTriplex',
+       'MedianRentalPrice_MultiFamilyResidence5PlusUnits',
+       'MedianRentalPrice_SingleFamilyResidence', 'MedianRentalPrice_Studio',
+       'ZHVIPerSqft_AllHomes', 'PctOfHomesDecreasingInValues_AllHomes',
+       'PctOfHomesIncreasingInValues_AllHomes',
+       'PctOfHomesSellingForGain_AllHomes',
+       'PctOfHomesSellingForLoss_AllHomes',
+       'PctOfListingsWithPriceReductionsSeasAdj_AllHomes',
+       'PctOfListingsWithPriceReductionsSeasAdj_BottomTier',
+       'PctOfListingsWithPriceReductionsSeasAdj_CondoCoop',
+       'PctOfListingsWithPriceReductionsSeasAdj_MiddleTier',
+       'PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence',
+       'PctOfListingsWithPriceReductionsSeasAdj_TopTier',
+       'PctOfListingsWithPriceReductions_AllHomes',
+       'PctOfListingsWithPriceReductions_BottomTier',
+       'PctOfListingsWithPriceReductions_CondoCoop',
+       'PctOfListingsWithPriceReductions_MiddleTier',
+       'PctOfListingsWithPriceReductions_SingleFamilyResidence',
+       'PctOfListingsWithPriceReductions_TopTier', 'PriceToRentRatio_AllHomes',
+       'Sale_Counts_Msa', 'Sale_Counts_Seas_Adj_Msa', 'Sale_Prices_Msa',
+       'InventoryTierShare_BottomTier_AllHomes',
+       'InventoryTierShare_MiddleTier_AllHomes',
+       'InventoryTierShare_TopTier_AllHomes', 'ZHVI_1bedroom', 'ZHVI_2bedroom',
+       'ZHVI_3bedroom', 'ZHVI_4bedroom', 'ZHVI_5BedroomOrMore',
+       'ZHVI_AllHomes', 'ZHVI_BottomTier', 'ZHVI_CondoCoop', 'ZHVI_MiddleTier',
+       'ZHVI_SingleFamilyResidence', 'ZHVI_TopTier', 'ZRI_AllHomes',
+       'ZRI_AllHomesPlusMultifamily', 'ZriPerSqft_AllHomes',
+       'Zri_MultiFamilyResidenceRental', 'Zri_SingleFamilyResidenceRental'],
+      dtype='object')
+----
+
+As we can see, there are many columns in the dataset! It's essential to understand the available columns before creating any visualizations.
+
+You can use the .head() function and the .tail() function to get a further preview of the data. 
+
+
+
+== Barplots Using Seaborn
+
+Let's plot a barplot of `MedianListingPrice_1Bedroom` and the first 10 regions that are not null using `Seaborn` to show how the code syntax is a little bit different than `plotly` and `matplotlib`. There are some missing values in our dataset, so we will again, filter for when they are not null as we did when using `plotly`. 
+
+[source,python]
+----
+import pandas as pd
+myDF = pd.read_csv("/anvil/projects/tdm/data/zillow/Metro_time_series.csv")
+----
+
+[source,python]
+----
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+# Filter the dataset
+filteredDF = myDF[myDF['MedianListingPrice_1Bedroom'].notnull()][['RegionName', 'MedianListingPrice_1Bedroom']].head(10)
+
+plt.figure(figsize=(10, 6))
+sns.barplot(data=filteredDF, x='RegionName', y='MedianListingPrice_1Bedroom')
+plt.title('Median Listing Price for 1-Bedroom Homes by Region (First 10)')
+plt.xlabel('Region')
+plt.ylabel('Median Listing Price ($)')
+plt.xticks(rotation=45)
+plt.tight_layout()
+plt.show()
+----
+image::seaborn-barchart-aa.png[Initial Bar Plot using Seaborn, width=792, height=500, loading=lazy, title="Bar Chart Seaborn"]
+
+
+Notice how we had to import `seaborn` and `matplotlib` since they are two libraries that complement each other. Seaborn's `barplot` function was used to create the bar chart with aggregation and it's default aesthetics.
+
+`Seaborn` leverages `matplotlib` for its back-end plotting engine.
+We import matplotlib's pyplot to customize the plot further, such as adding titles, labels, or adjusting layouts. 
+
+For example: 
+
+* `plt.title`, `plt.xlabel`, and `plt.ylabel`  add titles and axis labels.
+
+* `plt.xticks` rotates the x-axis labels for better readability.
+
+
+
+== Boxplots Using Seaborn
+Now let's use Seaborn to plot a box-plot. 
+
+A box plot (also known as a box-and-whisker plot) is a way of displaying the distribution of data based on the statistics:
+
+*  Minimum
+* First quartile (Q1)
+* Median (Q2)
+* Third quartile (Q3)
+* Maximum
+
+Remember, a box plot requires numerical data to display the distribution and summarize its key statistical properties.  Let's use the column `DaysOnZillow_AllHomes` and create a new column called `Year` to create a box-plot in `seaborn`. Note, we will need to clean up the `Date` column to be able to create the `Year` column. 
+
+[source,python]
+----
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+# Clean up year
+myDF['Year'] = pd.to_datetime(myDF['Date']).dt.year
+
+filtered_days_zillow = myDF[myDF['DaysOnZillow_AllHomes'].notnull()][['Year', 'DaysOnZillow_AllHomes']]
+
+plt.figure(figsize=(12, 6))
+sns.boxplot(data=filtered_days_zillow, x='Year', y='DaysOnZillow_AllHomes')
+plt.title('Box Plot of Days on Zillow by Year')
+plt.xlabel('Year')
+plt.ylabel('Days on Zillow (All Homes)')
+plt.xticks(rotation=45)
+plt.tight_layout()
+plt.show()
+----
+image::seaborn-boxplot-aa.png[Initial Box Plot using Seaborn, width=792, height=500, loading=lazy, title="Box Plot Seaborn"]
+
+
+== Histograms with Seaborn
+
+This example demonstrates how to create histograms using the Seaborn. The code snippet plots a histogram to analyze the distribution of the number of days properties remain listed on Zillow.
+
+[source,python]
+----
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+myDF = pd.read_csv("/anvil/projects/tdm/data/zillow/Metro_time_series.csv")
+filtered_data = myDF[myDF['DaysOnZillow_AllHomes'].notnull()]
+
+plt.figure(figsize=(10, 6))
+sns.histplot(data=filtered_data, x='DaysOnZillow_AllHomes', bins=100, kde=False)
+plt.title('Histogram of Days on Zillow')
+plt.xlabel('Days on Zillow')
+plt.ylabel('Frequency')
+plt.tight_layout()
+plt.show()
+----
+
+image::histograms-plotly-aa.png[Histograms using Seaborn, width=792, height=500, loading=lazy, title="Histogram using Seaborn"]
+
+Similar to the plotly example, the histogram helps us understand the distribution of the number of days the houses are on zillow. It seems that some houses genuinely stay on Zillow for much longer. It may be due to location, pricing, or other factors.
+
+
+== Scatterplots with Seaborn
+
+Let's plot a scatterplot of `MedianListingPrice_1Bedroom` vs `DaysOnZillow_AllHomes` using Seaborn. 
+
+[source,python]
+----
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+myDF = pd.read_csv("/anvil/projects/tdm/data/zillow/Metro_time_series.csv")
+filteredDF_scatter = myDF.dropna(subset=['DaysOnZillow_AllHomes', 'MedianListingPrice_1Bedroom'])
+
+plt.figure(figsize=(10, 6))
+sns.scatterplot(
+    data=filteredDF_scatter,
+    x='DaysOnZillow_AllHomes',
+    y='MedianListingPrice_1Bedroom'
+)
+
+plt.title('Days on Zillow vs Median Listing Price for 1-Bedroom Homes')
+plt.xlabel('Days on Zillow (All Homes)')
+plt.ylabel('Median Listing Price (1-Bedroom)')
+plt.tight_layout()
+plt.show()
+----
+image::scatterplot-seaborn-aa.png[Scattterplots using Seaborn, width=792, height=500, loading=lazy, title="Scatterplots using Seaborn"]
+
+Similar to the `plotly` example, we can see that there appears to be a general downward trend between the number of days a home is listed on Zillow `DaysOnZillow_AllHomes` and the median listing price for 1-bedroom homes `MedianListingPrice_1Bedroom`. This suggests that higher-priced 1-bedroom homes tend to spend fewer days on Zillow. Also, we can see that a significant concentration of points is clustered around lower listing prices (below $400K) and shorter listing durations (less than 150 days).
+
+
+== Heatmap with Seaborn
+
+A heatmap is a chart that shows data as colors, making it easy to see patterns, trends, or relationships in a table. Each cell in the heatmap represents a value, and the color shows how big or small the value is.
+
+* In a correlation heatmap, you can quickly see how two variables are related.
+* Darker colors might mean stronger relationships, while lighter ones mean weaker.
+
+
+Correlation measures the statistical relationship between two variables, such as:
+
+* Positive correlation: As one variable increases, so does the other.
+* Negative correlation: As one variable increases, the other decreases.
+
+The example below is a heatmap we create using the `seaborn` library. The graph focuses on numeric variables because a correlation matrix requires numerical data to calculate the relationships between variables. Below, we will filter for columns that seem relevant for understanding real estate trends. 
+
+[source,python]
+----
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+myDF = pd.read_csv("/anvil/projects/tdm/data/zillow/Metro_time_series.csv")
+
+relevant_columns = [
+    'AgeOfInventory', 
+    'DaysOnZillow_AllHomes', 
+    'InventoryRaw_AllHomes', 
+    'ZHVI_AllHomes', 
+    'ZHVI_BottomTier', 
+    'ZHVI_TopTier', 
+    'ZRI_AllHomes', 
+    'ZriPerSqft_AllHomes'
+]
+
+filtered_relevant_data = myDF[relevant_columns].dropna()
+correlation_matrix = filtered_relevant_data.corr()
+
+plt.figure(figsize=(10, 8))
+sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
+plt.title('Correlation Heatmap (Relevant Columns)')
+plt.show()
+----
+
+image::corr-matrix-seaborn.png[Corr Matrix using Seaborn, width=792, height=500, loading=lazy, title="Heatmap using Seaborn"]
+
+If we look at the heatmap, we can gain a better understanding of some of the real estate metrics in the data. It seems that time-related metrics `AgeOfInventory`, `DaysOnZillow` have weaker, generally negative correlations with price-related metrics (ZHVI metrics), hinting that homes staying on the market longer might reflect less competitive pricing or lower demand.