Support Embeddings in mltransform (#29564)
* Make base.py framework agnostic and add helper transforms

* Add tests for base.py

* Add sentence-transformers

* Add tensorflow hub

* Add vertex_ai

* Make TFTProcessHandler a PTransform

* raise RuntimeError in ArtifactsFetcher when it is used for embeddings

* Add JsonPickle to requirements

* Add tox tests

* Mock frameworks in pydocs

Fix tox.ini

Fix pydoc

Fix indent in pydoc

* Add Row type check

* Remove requires_chaining

* change name of PTransformProvider to MLTransformProvider

* remove batch_len in utility fun

* Change type annotation and redundant comments

* Remove get_transforms method

* remove requires_chaining from tft

* add tests to sentence-transformers

* Pass inference_args to RunInference

* Add TODO GH issue

* refactor variables in vertex_ai embeddings

* remove try/catch and throw error if options is empty for GCS artifact location

* Refactor NotImplementedError message

* remove tensorflow hub from this PR

* Add _validate_transform  method

* add more tests

* fix test

* Fix test

* Add more tests in sentence-transformer

* use np.max instead of max

* round to 2 decimals

* Remove gradle command action

* Refactor throwing dataflow client exception

* skip the test if gcp is not installed

* remove toxTests for hub

* remove toxTests for hub

* Fix values in assert for sentence_transformer_test

* rename sentence_transformers to huggingface

* fix pydocs

* Change the model name for tests since it is getting different results on different machines

* Fix pydoc in vertexai

* add suffix to artifact_location

* Revert "add suffix to artifact_location"

This reverts commit cfb1883.

* add no_xdist

* Try fixing pydoc for vertexai

* change tox.ini to use pytest directly

* raise FileExistsError if Attribute file is already present

* modify build.gradle to match tox task names

* Add note to CHANGES.md

* change gcs bucket to gs://temp-storage-for-perf-tests

* Add TODO GH links

* Update CHANGES.md

Co-authored-by: Danny McCormick <[email protected]>

AnandInguva and damccorm authored Dec 11, 2023
1 parent 218af9d commit aacf1ee
Showing 22 changed files with 1,636 additions and 72 deletions.
1 change: 1 addition & 0 deletions CHANGES.md
@@ -67,6 +67,7 @@
* Python GCSIO is now implemented with GCP GCS Client instead of apitools ([#25676](https://github.com/apache/beam/issues/25676))
* Adding support for LowCardinality DataType in ClickHouse (Java) ([#29533](https://github.com/apache/beam/pull/29533)).
* Added support for handling bad records to KafkaIO (Java) ([#29546](https://github.com/apache/beam/pull/29546))
* Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models.([#29564](https://github.com/apache/beam/pull/29564))

## New Features / Improvements
