Support dataset identifier with schema name #18

imtaos · 2022-08-04T20:35:41Z

Support schema.table dataset identifiers in the custom Affirm artifact ingestion source.

For example, logs below is a schema, and it can contain several datasets.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

grokvoid · 2022-08-25T16:45:16Z

metadata-ingestion/src/datahub/ingestion/source/affirm/artifact.py



 @dataclass
 class AffirmArtifact:
+    schema_name: str


What is this supposed to be for?

Looking later in the PR I get that this is to display the path in the repo for the corresponding artifact. Are we going to do the same for the datasets?

Hmm, MySQL does not have a schema concept, so we don't have it there.
After thinking it over, I feel it's better not to use a schema to represent the nested directory structure of artifacts. For example, with logs/application_log.yml we would be treating logs like a database schema, which isn't very accurate. We can simply name the dataset logs.application_log.
WDYT?
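A minimal sketch of the naming proposed above, assuming artifact paths are given relative to the artifact root and use forward slashes. The helper name `dataset_name_from_path` is hypothetical and is not part of this PR.

```python
from pathlib import PurePosixPath


def dataset_name_from_path(relative_path: str) -> str:
    """Turn a relative artifact path into a dotted dataset name (hypothetical helper)."""
    path = PurePosixPath(relative_path)
    # Drop the final file extension, then join the directory parts and the
    # file stem with dots.
    return ".".join(path.with_suffix("").parts)


print(dataset_name_from_path("logs/application_log.yml"))  # logs.application_log
print(dataset_name_from_path("application_log.yml"))       # application_log
```

Note that a file stem that itself contains a dot would make the resulting name ambiguous, so a real implementation would probably need to reject or escape such names.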

What is the purpose/value for the end-user we plan to get out of this? Can we just provide the file path in the datahub-metadata that corresponds to the dataset/artifact in DataHub? That may be useful to cross-compare, update, etc? I suppose it is not possible to have links in properties, but we could also consider adding a link to the GitHub file from elsewhere in DataHub.

Btw, this is all going away as we will be moving it to the export-to-datahub script, correct?

It won't go away; it is part of the dataset URN, for example urn:li:dataset:(urn:li:dataPlatform:affirmInfra,development.bluejays,PROD).
The difference is whether we treat the whole development.bluejays as a schema.table name, or treat development as a folder and bluejays as the table.
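To make the point concrete, here is a small sketch that builds the URN by plain string formatting. The function names are hypothetical, and the defaults (`affirmInfra`, `PROD`) are taken from the example URN above. Either reading of `development.bluejays` yields the same URN string, so only the interpretation of the name differs.

```python
from typing import Tuple


def make_affirm_dataset_urn(
    name: str, platform: str = "affirmInfra", env: str = "PROD"
) -> str:
    """Format a dataset URN in the shape quoted in this thread (hypothetical helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def split_dataset_name(name: str) -> Tuple[str, str]:
    """Split a dotted name into (prefix, table); the prefix is '' if there is no dot."""
    prefix, _, table = name.rpartition(".")
    return prefix, table


urn = make_affirm_dataset_urn("development.bluejays")
print(urn)
# The prefix can be read as a schema or as a folder; the URN does not change.
print(split_dataset_name("development.bluejays"))
```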

grokvoid · 2022-08-25T16:48:42Z

metadata-ingestion/src/datahub/ingestion/source/affirm/artifact.py

+    def get_schema(dir: str):
+        relative_dir = dir.replace(os.path.abspath(directory), '').strip()
+        relative_dir = relative_dir if not relative_dir.startswith(os.sep) else relative_dir[len(os.sep):]
+        relative_dir = relative_dir if not relative_dir.endswith(os.sep) else relative_dir[:-len(os.sep)]
+        return '.'.join(relative_dir.split(os.sep))


Can we expect a git directory and get the git root to get the relative path instead?
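One possible shape for the suggestion above, as a sketch only: walk up from the given directory until a `.git` entry is found, then compute the path relative to that root. The function names are hypothetical, and this assumes the ingestion runs inside a git checkout. Shelling out to `git rev-parse --show-toplevel` would be an alternative if the git CLI is available.

```python
import os
from pathlib import Path
from typing import Optional


def find_git_root(start: str) -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains a .git entry."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        # .git is a directory in a normal clone and a file in worktrees/submodules.
        if (candidate / ".git").exists():
            return candidate
    return None


def relative_to_git_root(directory: str) -> str:
    """Return `directory` as a path relative to its git root (hypothetical helper)."""
    root = find_git_root(directory)
    if root is None:
        raise ValueError(f"{directory} is not inside a git checkout")
    return os.path.relpath(Path(directory).resolve(), root)
```

The relative path returned here could then be split on `os.sep` and joined with dots, as `get_schema` does in the diff, without the manual prefix and suffix stripping.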

Support dataset identifier with schema name, e.g. metrics.application…

a96f293

…_metrics

imtaos requested review from nickwu241 and grokvoid August 4, 2022 20:36

Add artifact example recipes

5e16046

imtaos force-pushed the custom-ingestion-schema branch from f4b5158 to 5e16046 Compare August 25, 2022 04:06

grokvoid reviewed Aug 25, 2022

imtaos added 2 commits August 25, 2022 11:58

Address review comments

29f3d48

Simplify artifact ingestion source

f87ae52
