-
Notifications
You must be signed in to change notification settings - Fork 884
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Update doc page on using entity sets with Woodwork (#1532)
* refactor to jupyter notebook * use dataframes * use dataframe in comments * remove rst file * use EntitySet * add link to Woodwork * use target_dataframe_name * update comment on adding relationships
- Loading branch information
1 parent
af195aa
commit fbd56b7
Showing
2 changed files
with
342 additions
and
168 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,342 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Representing Data with EntitySets\n", | ||
"\n", | ||
"An ``EntitySet`` is a collection of dataframes and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take ``dataframes`` and ``relationships`` as separate arguments, it is recommended to create an ``EntitySet``, so you can more easily manipulate your data as needed.\n", | ||
"\n", | ||
"## The Raw Data\n", | ||
"\n", | ||
"Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import featuretools as ft\n", | ||
"data = ft.demo.load_mock_customer()\n", | ||
"transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n", | ||
"\n", | ||
"transactions_df.sample(10)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"And the second dataframe is a list of products involved in those transactions." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"products_df = data[\"products\"]\n", | ||
"products_df" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Creating an EntitySet\n", | ||
"\n", | ||
"First, we initialize an ``EntitySet``. If you'd like to give it a name, you can optionally provide an ``id`` to the constructor." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es = ft.EntitySet(id=\"customer_data\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Adding dataframes\n", | ||
"\n", | ||
"To get started, we add the transactions dataframe to the `EntitySet`. In the call to ``add_dataframe``, we specify three important parameters:\n", | ||
"\n", | ||
"* The ``index`` parameter specifies the column that uniquely identifies rows in the dataframe.\n", | ||
"* The ``time_index`` parameter tells Featuretools when the data was created.\n", | ||
"* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from woodwork.logical_types import Categorical, PostalCode\n", | ||
"\n", | ||
"es = es.add_dataframe(\n", | ||
" dataframe_name=\"transactions\",\n", | ||
" dataframe=transactions_df,\n", | ||
" index=\"transaction_id\",\n", | ||
" time_index=\"transaction_time\",\n", | ||
" logical_types={\n", | ||
" \"product_id\": Categorical,\n", | ||
" \"zip_code\": PostalCode,\n", | ||
" },\n", | ||
")\n", | ||
"\n", | ||
"es" | ||
] | ||
}, | ||
{ | ||
"cell_type": "raw", | ||
"metadata": { | ||
"raw_mimetype": "text/restructuredtext" | ||
}, | ||
"source": [ | ||
".. currentmodule:: featuretools\n", | ||
"\n", | ||
".. note ::\n", | ||
"\n", | ||
" You can also display your `EntitySet` structure graphically by calling :meth:`.EntitySet.plot`.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This method associates each column in the dataframe to a [Woodwork](https://woodwork.alteryx.com/) logical type. Each logical type can have an associated standard semantic tag that helps define the column data type. If you don't specify the logical type for a column, it gets inferred based on the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the [Woodwork documention](https://woodwork.alteryx.com/)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es[\"transactions\"].ww.schema" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now, we can do that same thing with our products dataframe." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es = es.add_dataframe(\n", | ||
" dataframe_name=\"products\",\n", | ||
" dataframe=products_df,\n", | ||
" index=\"product_id\")\n", | ||
"\n", | ||
"es" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"With two dataframes in our `EntitySet`, we can add a relationship between them.\n", | ||
"\n", | ||
"## Adding a Relationship\n", | ||
"\n", | ||
"We want to relate these two dataframes by the columns called \"product_id\" in each dataframe. Each product has multiple transactions associated with it, so it is called the **parent dataframe**, while the transactions dataframe is known as the **child dataframe**. When specifying relationships, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es = es.add_relationship(\"products\", \"product_id\", \"transactions\", \"product_id\")\n", | ||
"es" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now, we see the relationship has been added to our `EntitySet`.\n", | ||
"\n", | ||
"## Creating a dataframe from an existing table\n", | ||
"\n", | ||
"When working with raw data, it is common to have sufficient information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we \"normalize\" the transaction dataframe." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es = es.normalize_dataframe(\n", | ||
" base_dataframe_name=\"transactions\",\n", | ||
" new_dataframe_name=\"sessions\",\n", | ||
" index=\"session_id\",\n", | ||
" make_time_index=\"session_start\",\n", | ||
" additional_columns=[\n", | ||
" \"device\",\n", | ||
" \"customer_id\",\n", | ||
" \"zip_code\",\n", | ||
" \"session_start\",\n", | ||
" \"join_date\",\n", | ||
" ],\n", | ||
")\n", | ||
"es" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Looking at the output above, we see this method did two operations:\n", | ||
"\n", | ||
"1. It created a new dataframe called \"sessions\" based on the \"session_id\" and \"session_start\" columns in \"transactions\"\n", | ||
"2. It added a relationship connecting \"transactions\" and \"sessions\"\n", | ||
"\n", | ||
"If we look at the schema from the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es[\"transactions\"].ww.schema" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es[\"sessions\"].ww.schema" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"1. It removed \"device\", \"customer_id\", \"zip_code\" and \"join_date\" from \"transactions\" and created a new columns in the sessions dataframe. This reduces redundant information as the those properties of a session don't change between transactions.\n", | ||
"2. It copied and marked \"session_start\" as a time index column into the new sessions dataframe to indicate the beginning of a session. If the base dataframe has a time index and ``make_time_index`` is not set, ``normalize_dataframe`` will create a time index for the new dataframe. In this case it would create a new time index called \"first_transactions_time\" using the time of the first transaction of each session. If we don't want this time index to be created, we can set ``make_time_index=False``.\n", | ||
"\n", | ||
"If we look at the dataframes, we can see what ``normalize_dataframe`` did to the actual data." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es[\"sessions\"].head(5)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es[\"transactions\"].head(5)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To finish preparing this dataset, create a \"customers\" dataframe using the same method call." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"es = es.normalize_dataframe(\n", | ||
" base_dataframe_name=\"sessions\",\n", | ||
" new_dataframe_name=\"customers\",\n", | ||
" index=\"customer_id\",\n", | ||
" make_time_index=\"join_date\",\n", | ||
" additional_columns=[\"zip_code\", \"join_date\"],\n", | ||
")\n", | ||
"\n", | ||
"es" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Using the EntitySet\n", | ||
"\n", | ||
"Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name=\"products\")\n", | ||
"\n", | ||
"feature_matrix" | ||
] | ||
}, | ||
{ | ||
"cell_type": "raw", | ||
"metadata": { | ||
"raw_mimetype": "text/restructuredtext" | ||
}, | ||
"source": [ | ||
"As we can see, the features from DFS use the relational structure of our `EntitySet`. Therefore it is important to think carefully about the dataframes that we create.\n", | ||
"\n", | ||
"Dask and Koalas EntitySets\n", | ||
"~~~~~~~~~~~~~~~~~~~~~~~~~~\n", | ||
"\n", | ||
"EntitySets can also be created using Dask dataframes or Koalas dataframes. For more information refer to :doc:`../guides/using_dask_entitysets` and :doc:`../guides/using_koalas_entitysets`." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"celltoolbar": "Raw Cell Format", | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
Oops, something went wrong.