Commit: Merge branch 'master' into mariadb_offline_store

50 changed files, with 2,536 additions and 1,552 deletions.
# [Alpha] Vector Database

**Warning**: This is an _experimental_ feature. To our knowledge it is stable, but there are still rough edges in the experience. Contributions are welcome!

## Overview

A vector database allows users to store and retrieve embeddings. Feast provides general APIs for storing and retrieving embeddings.

## Integration

Below are the supported vector databases and the features implemented for each:

| Vector Database | Retrieval | Indexing |
|-----------------|-----------|----------|
| Pgvector        | [x]       | [ ]      |
| Elasticsearch   | [x]       | [x]      |
| Milvus          | [ ]       | [ ]      |
| Faiss           | [ ]       | [ ]      |
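To make the "Retrieval" column concrete, here is a minimal stdlib-only sketch of what a vector store does at query time: score stored embeddings against a query embedding and return the keys of the nearest ones. The data and function names are hypothetical; real stores such as Pgvector or Elasticsearch do this server-side, typically with an index rather than a brute-force scan.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more similar
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, rows, k):
    # rows: list of (key, embedding); brute-force scan over all rows
    scored = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [key for key, _ in scored[:k]]

docs = [("austin", [0.9, 0.1]), ("houston", [1.0, 0.0]), ("nyc", [0.0, 1.0])]
print(top_k([0.95, 0.05], docs, 2))
```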
## Example

See [https://github.com/feast-dev/feast-workshop/blob/rag/module_4_rag](https://github.com/feast-dev/feast-workshop/blob/rag/module_4_rag) for an example of how to use a vector database.

### **Prepare offline embedding dataset**

Run the following commands to prepare the embedding dataset:

```shell
python pull_states.py
python batch_score_documents.py
```

The output will be stored in `data/city_wikipedia_summaries.csv`.
### **Initialize Feast feature store and materialize the data to the online store**

Use the `feature_store.yaml` file to initialize the feature store. This configuration uses local files as the offline store and Pgvector as the online store.

```yaml
project: feast_demo_local
provider: local
registry:
  registry_type: sql
  path: postgresql://@localhost:5432/feast
online_store:
  type: postgres
  pgvector_enabled: true
  vector_len: 384
  host: 127.0.0.1
  port: 5432
  database: feast
  user: ""
  password: ""
offline_store:
  type: file
entity_key_serialization_version: 2
```

Run the following command in the terminal to apply the feature store configuration:

```shell
feast apply
```
Note that when you run `feast apply`, you register the following feature view, which we will use for retrieval later:

```python
city_embeddings_feature_view = FeatureView(
    name="city_embeddings",
    entities=[item],
    schema=[
        Field(name="Embeddings", dtype=Array(Float32)),
    ],
    source=source,
    ttl=timedelta(hours=2),
)
```

Then run the following command in the terminal to materialize the data to the online store:

```shell
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```
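If you prefer to compute the end timestamp in Python rather than in the shell, the `strftime` format below mirrors `date -u +"%Y-%m-%dT%H:%M:%S"` (a small stdlib sketch; how you pass the value to Feast is up to your workflow):

```python
from datetime import datetime, timezone

# UTC timestamp in the same shape as the shell example, e.g. 2024-01-31T12:00:00
current_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
print(current_time)
```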
### **Prepare a query embedding**

```python
from batch_score_documents import run_model, TOKENIZER, MODEL
from transformers import AutoTokenizer, AutoModel

question = "the most populous city in the U.S. state of Texas?"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = AutoModel.from_pretrained(MODEL)
query_embedding = run_model(question, tokenizer, model)
query = query_embedding.detach().cpu().numpy().tolist()[0]
```
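`run_model` is defined in the workshop's `batch_score_documents.py`. A common approach for sentence embeddings (and an assumption about what that script does) is to mean-pool the model's token embeddings into one vector. A stdlib-only sketch of mean pooling over hypothetical token vectors, ignoring attention masks:

```python
def mean_pool(token_embeddings):
    # token_embeddings: list of per-token vectors; average them element-wise
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

tokens = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(tokens))  # element-wise average of the two token vectors
```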
### **Retrieve the top 5 similar documents**

First create a feature store instance, then use the `retrieve_online_documents` API to retrieve the top 5 documents most similar to the specified query:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")
features = store.retrieve_online_documents(
    feature="city_embeddings:Embeddings",
    query=query,
    top_k=5,
).to_dict()

def print_online_features(features):
    for key, value in sorted(features.items()):
        print(key, " : ", value)

print_online_features(features)
```
# DuckDB offline store

## Description

The DuckDB offline store provides support for reading [FileSources](../data-sources/file.md). It can read both Parquet and Delta formats. The DuckDB offline store uses [ibis](https://ibis-project.org/) under the hood to convert offline store operations into DuckDB queries.

* Entity dataframes can be provided as a Pandas dataframe.

## Getting started

In order to use this offline store, you'll need to run `pip install 'feast[duckdb]'`.

## Example

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: duckdb
online_store:
  path: data/online_store.db
```
{% endcode %}
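To illustrate what `get_historical_features` computes: for each entity row, it joins the latest feature value whose timestamp is at or before the row's event timestamp (a point-in-time correct join). A simplified stdlib sketch with hypothetical data; the real implementation compiles this logic to DuckDB queries via ibis:

```python
def point_in_time_join(entity_rows, feature_rows):
    """entity_rows: (entity_id, event_ts); feature_rows: (entity_id, ts, value)."""
    joined = []
    for eid, event_ts in entity_rows:
        # feature values that were already known at the event timestamp
        candidates = [(ts, value) for fid, ts, value in feature_rows
                      if fid == eid and ts <= event_ts]
        latest = max(candidates)[1] if candidates else None
        joined.append((eid, event_ts, latest))
    return joined

entities = [("u1", 10), ("u1", 5), ("u2", 4)]
features = [("u1", 3, "a"), ("u1", 7, "b"), ("u2", 9, "c")]
print(point_in_time_join(entities, features))
```

Note the `("u2", 4)` row gets no value: its only feature arrives at timestamp 9, after the event, so a point-in-time correct join must not see it.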
## Functionality Matrix

The set of functionality supported by offline stores is described in detail [here](overview.md#functionality).
Below is a matrix indicating which functionality is supported by the DuckDB offline store.

|                                                                    | DuckDB |
| :----------------------------------------------------------------- | :----- |
| `get_historical_features` (point-in-time correct join)             | yes    |
| `pull_latest_from_table_or_query` (retrieve latest feature values) | yes    |
| `pull_all_from_table_or_query` (retrieve a saved dataset)          | yes    |
| `offline_write_batch` (persist dataframes to offline store)        | yes    |
| `write_logged_features` (persist logged features to offline store) | yes    |

Below is a matrix indicating which functionality is supported by `IbisRetrievalJob`.

|                                                       | DuckDB |
| ----------------------------------------------------- | ------ |
| export to dataframe                                   | yes    |
| export to arrow table                                 | yes    |
| export to arrow batches                               | no     |
| export to SQL                                         | no     |
| export to data lake (S3, GCS, etc.)                   | no     |
| export to data warehouse                              | no     |
| export as Spark dataframe                             | no     |
| local execution of Python-based on-demand transforms  | yes    |
| remote execution of Python-based on-demand transforms | no     |
| persist results in the offline store                  | yes    |
| preview the query plan before execution               | no     |
| read partitioned data                                 | yes    |

To compare this set of functionality against other offline stores, please see the full [functionality matrix](overview.md#functionality-matrix).