---
layout: default
title: Feature Generation and Materialization
parent: Feathr Concepts
---

# Feature Generation and Materialization

Feature generation (also known as feature materialization) is the process of creating features from raw source data and persisting them into storage: either an offline store (for further reuse) or an online store (for online inference).

Users can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in an offline setting). Feature generation is also useful for generating embedding features, where the embeddings distill information from large data and are usually more compact.

## Generating Features to Online Store

When models are served in an online environment, the corresponding features need to be served from the same online environment as well. Feathr provides APIs to generate features to online storage for future consumption. For example:

```python
client = FeathrClient()
redisSink = RedisSink(table_name="nycTaxiDemoFeature")
settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[redisSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings)
```

More references on the APIs:

- [MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
- [RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)

In the above example, we define a Redis table called `nycTaxiDemoFeature` and materialize two features called `f_location_avg_fare` and `f_location_max_fare` to Redis.

## Feature Backfill

It is also possible to backfill the features for a particular time range, as shown below. If the `BackfillTime` part is not specified, it defaults to `now()` (i.e. it is equivalent to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).

```python
client = FeathrClient()
backfill_time = BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))
redisSink = RedisSink(table_name="nycTaxiDemoFeature")
settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[redisSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"],
                                   backfill_time=backfill_time)
client.materialize_features(settings)
```

Note that if you don't have features available at `now`, you should specify a `BackfillTime` range in which features are available.

Also note that, for performance reasons, Feathr will submit one materialization job per step. For example, if you specify `BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.

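The job count is simple date arithmetic over the backfill window. Here is a quick sketch in plain Python (`num_backfill_jobs` is a hypothetical helper for illustration, not a Feathr API):

```python
from datetime import datetime, timedelta

def num_backfill_jobs(start: datetime, end: datetime, step: timedelta) -> int:
    """Count the materialization jobs Feathr would submit: one per step,
    with both the start and end dates included."""
    return (end - start) // step + 1

# 20 daily steps from 2022/02/01 through 2022/02/20
print(num_backfill_jobs(datetime(2022, 2, 1), datetime(2022, 2, 20), timedelta(days=1)))  # 20
```

With the default `BackfillTime` (start = end = now, daily step), this arithmetic yields a single job.
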
More references on the APIs:

- [BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime)
- [client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features)

## Consuming features in online environment

After the materialization job is finished, we can get the online features by querying the `feature table`, the corresponding `entity key` and a list of `feature names`. You can wait for the materialization job to finish first:

```python
client.wait_job_to_finish(timeout_sec=600)
```

In the example below, we query the online features called `f_location_avg_fare` and `f_location_max_fare` with the key `265` (which is the location ID):

```python
res = client.get_online_features('nycTaxiDemoFeature', '265', ['f_location_avg_fare', 'f_location_max_fare'])
```

More references on the APIs:

- [client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features)

## Materializing Features to Offline Store

Materializing features to an offline store is useful when the feature transformation is computation intensive and the features can be re-used. For example, if you have a feature that needs more than 24 hours to compute and can be reused by more than one model training pipeline, you should consider materializing it to the offline store.

The API call is very similar to materializing features to the online store, and here is an API example:

```python
client = FeathrClient()
offlineSink = HdfsSink(output_path="abfss://[email protected]/materialize_offline_test_data/")
settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[offlineSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings)
```

This will generate features for the latest date (assuming it's `2022/05/21`) and output data to the following path:
`abfss://[email protected]/materialize_offline_test_data/df0/daily/2022/05/21`

You can also specify a `BackfillTime` so that the features will be generated only for those dates. For example:

```python
backfill_time = BackfillTime(start=datetime(2020, 5, 10), end=datetime(2020, 5, 20), step=timedelta(days=1))
offline_sink = HdfsSink(output_path="abfss://[email protected]/materialize_offline_test_data/")
settings = MaterializationSettings("nycTaxiTable",
                                   sinks=[offline_sink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"],
                                   backfill_time=backfill_time)
```

This will generate features from `2020/05/10` to `2020/05/20`, and the output will have 11 folders, from `abfss://[email protected]/materialize_offline_test_data/df0/daily/2020/05/10` to `abfss://[email protected]/materialize_offline_test_data/df0/daily/2020/05/20`. Note that currently Feathr only supports materializing data in daily steps (i.e. even if you specify an hourly step, the generated features in the offline store will still be presented in a daily hierarchy).
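
The folder count follows directly from the backfill window. Here is a small sketch of the layout in plain Python (the `df0/daily` hierarchy is taken from the paths above; `daily_output_paths` is a hypothetical helper, not part of Feathr):

```python
from datetime import date, timedelta

def daily_output_paths(base: str, start: date, end: date) -> list:
    """List one output folder per day between start and end (inclusive),
    following the df0/daily/YYYY/MM/DD hierarchy shown above."""
    days = (end - start).days + 1
    return [f"{base}df0/daily/{start + timedelta(days=i):%Y/%m/%d}" for i in range(days)]

paths = daily_output_paths(
    "abfss://[email protected]/materialize_offline_test_data/",
    date(2020, 5, 10), date(2020, 5, 20))
print(len(paths))  # 11 folders, 2020/05/10 through 2020/05/20
```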

You can also specify the format of the materialized features in the offline store by using `execution_configurations`, as shown below. Please refer to the [documentation](../how-to-guides/feathr-job-configuration.md) for the configuration details.

```python
from feathr import HdfsSink

offlineSink = HdfsSink(output_path="abfss://[email protected]/materialize_offline_data/")
# Materialize two features into an offline store
settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[offlineSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings, execution_configurations={"spark.feathr.outputFormat": "parquet"})
```

For reading those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read the materialized result from the offline store:

```python
from feathr.utils.job_utils import get_result_df

path = "abfss://[email protected]/materialize_offline_test_data/df0/daily/2020/05/20/"
res = get_result_df(client=client, format="parquet", res_url=path)
```

More references on the APIs:

- [MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
- [HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSink)