Onboard ESSD dataset using Open Metadata #183
Comments
Dependency on issue #202 in order to configure the dbt pipeline for metadata ingestion
Issue nearly complete - pending unusual behavior where files are created in the incorrect directory - c.FileCheckpoints.checkpoint_dir = '' https://stackoverflow.com/questions/51887758/is-there-a-way-to-disable-saving-to-checkpoints-for-jupyter-notebooks
restart mid November upon MT return
I've updated the task list with some substantial items that should be discussed and prioritized.
@MightyNerdEric and @rynofinn - I will create Jira issues for the first two items above; not sure if you can help with the 3rd item - the rest look like @caldeirav is needed
Related LF Jira tickets:
in backlog - behind Iris
@MightyNerdEric working on upgrade of Open Metadata - learning helm (extract manifests directly)
@MightyNerdEric needs access to Quay to update base image, also needs source of container code in operate first; @redmikhail to help with access issues for Eric and @ryanaslett
prefer higher version than 317
have access to operate first, but need access to os-climate; @caldeirav do you have access? @erikerlandson please grant access to rest of LF team and Mikhail
@erikerlandson to provide Quay access
creator privileges added
@MichaelTiemannOSC Regarding these two issues: On the second item, what repo are we talking about here? When does dbt create these files? If we don't want them covered by the .gitignore, I'm guessing these are files that we want to commit to a repo, so I'll need to know when/how they're being generated in order to find a solution.
credentials.env is meant to be stored far, far away from GitHub, but where a user can still load the file from a home directory. This is the library that our data users are supposed to use to read that file: https://github.com/os-climate/osc-ingest-tools. dbt is part of the new pipeline that Vincent rolled out in August, and which I've been trying to replicate since October (when I was interacting heavily with trino, trino-client, OpenMetadata, and dbt developers). It's listed in the requirements for #234, and #234 is intended to provide the larger context of what we need. This particular issue was filed because of the great surprise (and potential security leakage) caused by dbt's default way of handling credentials. Just this morning, Bryon Baker suggested putting credentials.env into all .gitignore files for OS-Climate, to reduce the risk of credentials leakage. But that doesn't solve this problem, because dbt wants to read from its own files, which is a leak waiting to happen. Vincent's open metadata demo gives the larger context of how dbt fits into our world. This branch (https://github.com/os-climate/essd-ingest-pipeline/tree/iceberg-dbt) of the ESSD pipeline also gives examples of dbt usage. Vincent is just finishing up the delivery of some major training; hopefully his materials spell this out better. I've only been trying to replicate what I see him doing in another context, documenting as I go, which means that I by no means have the larger picture of what "should be". But this issue raises "what should not be", and that is a file, necessary for the operation of dbt, that would wind up leaking credentials because nothing about "~/.dbt/profiles.yml" makes it look like it contains secrets.
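One possible way to keep secrets out of "~/.dbt/profiles.yml" is to have the profile reference environment variables through dbt's `env_var` function, with the values loaded from credentials.env before dbt is launched. The following is only a sketch under stated assumptions: the variable names (`TRINO_HOST`, `TRINO_USER`, `TRINO_PASSWD`), the profile name `essd_pipeline`, and the helpers `load_env_file` and `export_credentials` are hypothetical, the profile fragment is illustrative and incomplete, and the small parser stands in for whatever loader the pipeline actually uses.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical variable and profile names; use the keys your credentials.env
# actually defines. The profile is illustrative and incomplete: the point is
# that it holds references to environment variables, not the secrets.
PROFILE_TEMPLATE = """\
essd_pipeline:
  target: dev
  outputs:
    dev:
      type: trino
      host: "{{ env_var('TRINO_HOST') }}"
      user: "{{ env_var('TRINO_USER') }}"
      password: "{{ env_var('TRINO_PASSWD') }}"
"""


def load_env_file(path):
    """Parse KEY=VALUE lines from a dotenv-style file into a dict."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not a KEY=VALUE pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def export_credentials(path, environ=os.environ):
    """Copy the file's values into `environ`; return the key names only."""
    values = load_env_file(path)
    environ.update(values)
    return sorted(values)


if __name__ == "__main__":
    # Demo in a throwaway directory standing in for the user's home directory.
    with tempfile.TemporaryDirectory() as home:
        cred = Path(home) / "credentials.env"
        cred.write_text(
            "TRINO_HOST=trino.example.org\nTRINO_USER=alice\nTRINO_PASSWD='s3cret'\n"
        )
        demo_env = {}  # a plain dict, so the real environment is left untouched
        print(export_credentials(cred, environ=demo_env))

        profile = Path(home) / "profiles.yml"
        profile.write_text(PROFILE_TEMPLATE)
        # The profile on disk contains no secret values.
        assert "s3cret" not in profile.read_text()
```

dbt resolves `env_var` when it reads the profile, so the variables have to be set in the environment of the process that launches dbt. With this layout, committing or sharing profiles.yml by accident would expose variable names rather than credentials.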
Vincent has recently updated the Data Commons documentation. While written at a high level and aimed mostly at developers, its granularity and completeness encourage the platform team to add information relevant to the ops side of the platform: components, recipes, smoke tests, admin / dev / user roles, provisioning, health checks, etc. See https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-developer-guide.md for the guide. I encourage all who are responsible for keeping these systems running to read this documentation as a way to understand what developers (and ultimately users) are expecting, and to write such documentation so that it's easy for existing and new platform maintainers to find what they need to find and do what they need to do.
blocked pending OpenMetadata 13.1 upgrade
OM 13.1 is available. The evil and insidious profiles.yml file, which requires secrets but should not contain secrets, remains unaddressed. Filing a new issue about that.
Now that we are unblocked on OpenMetadata, there are several other questions that need to be answered, i.e., whether or not to store dbt intermediate files in GitHub, various .gitignore problems that may or may not be solved by the latest template, etc. Please consider this a bump to addressing those questions.
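If the answer to the first question turns out to be that dbt's generated files stay out of GitHub, a small repository check could flag .gitignore gaps. This is a sketch only: it assumes the directories dbt writes by default (`target/`, `dbt_packages/`, `logs/`) plus the credentials.env file discussed earlier, `missing_ignores` is a hypothetical helper, and it does literal line matching rather than evaluating gitignore patterns.

```python
# Assumed entries: dbt's default output directories plus the credentials file.
REQUIRED_IGNORES = ("target/", "dbt_packages/", "logs/", "credentials.env")


def missing_ignores(gitignore_text, required=REQUIRED_IGNORES):
    """Return the required entries that do not appear as a line in the text.

    Literal comparison only: a broader pattern that happens to cover an
    entry (for example "*.env") is not recognized by this check.
    """
    present = set()
    for raw in gitignore_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            present.add(line)
    return [entry for entry in required if entry not in present]


if __name__ == "__main__":
    sample = "# dbt artifacts\ntarget/\nlogs/\n"
    print(missing_ignores(sample))
```

Running a check like this against each repository's .gitignore would give a quick list of entries to add, or to consciously leave out if the team decides some of these files should be committed.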
@MightyNerdEric will work on this issue for week of 13-Feb
@caldeirav can you address: |
The ESSD dataset (s3://redhat-osc-physical-landing-647521352890/ESSD/) can be an exemplar for Open Metadata onboarding. The dataset comes with data dictionaries, and there is an Iceberg data pipeline notebook at https://github.com/os-climate/essd-ingest-pipeline/blob/iceberg/notebooks/osc-essd-ingest.ipynb.
As a data pipeline implementor, I want to work from a common template that describes the metadata needed to connect this dataset to a data catalog browser and to understand the various levels of data interoperability that can be achieved/advertised by properly instantiating all the metadata reasonable for this dataset:
@HeatherAck for visibility