Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dataset identifier with schema name #18

Open
wants to merge 4 commits into
base: v0.8.33.affirm
Choose a base branch
from

Conversation

imtaos
Copy link

@imtaos imtaos commented Aug 4, 2022

Support schema.table in custom affirm artifacts ingestion source.

e.g. logs below is a schema and it can contain a few datasets inside.

Screen Shot 2022-08-04 at 1 35 10 PM

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub



@dataclass
class AffirmArtifact:
schema_name: str

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this supposed to be for?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking later in the PR I get that this is to display the path in the repo for the corresponding artifact. Are we going to do the same for the datasets?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm mysql does not have schema concept so we don't have it.
after re-think it over, I feel it's better not to use schema to represent the nested directory structure of artifacts. e.g. logs/application_log.yml, over here we represent logs as database shema-like - and it isn't very much accurate. we can simply name it as logs.application_log
wdyt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose/value for the end-user we plan to get out of this? Can we just provide the file path in the datahub-metadata that corresponds to the dataset/artifact in DataHub? That may be useful to cross-compare, update, etc? I suppose it is not possible to have links in properties, but we could also consider adding a link to the GitHub file from elsewhere in DataHub.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, this is all going away as we will be moving it to the export-to-datahub script, correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wont' go away. it is part of the dataset urn. say urn:li:dataset:(urn:li:dataPlatform:affirmInfra,development.bluejays,PROD).
the difference is either consider the whole development.bluejays as the schema.table format, or it as development folder, bluejays table.

Comment on lines +56 to +60
def get_schema(dir: str):
relative_dir = dir.replace(os.path.abspath(directory), '').strip()
relative_dir = relative_dir if not relative_dir.startswith(os.sep) else relative_dir[len(os.sep):]
relative_dir = relative_dir if not relative_dir.endswith(os.sep) else relative_dir[len(os.sep):]
return '.'.join(relative_dir.split(os.sep))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we expect a git directory and get the git root to get the relative path instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants