Onboard ESSD dataset using Open Metadata #183

Open
1 of 6 tasks
MichaelTiemannOSC opened this issue Jun 29, 2022 · 21 comments

@MichaelTiemannOSC
Contributor

MichaelTiemannOSC commented Jun 29, 2022

The ESSD dataset (s3://redhat-osc-physical-landing-647521352890/ESSD/) can be an exemplar for Open Metadata onboarding. The dataset comes with data dictionaries, and there is an Iceberg data pipeline notebook at https://github.com/os-climate/essd-ingest-pipeline/blob/iceberg/notebooks/osc-essd-ingest.ipynb.

As a data pipeline implementor, I want to work from a common template that describes the metadata needed to connect this dataset to a data catalog browser and to understand the various levels of data interoperability that can be achieved/advertised by properly instantiating all the metadata reasonable for this dataset:

  • Still need to update OpenMetadata to 0.12.2 or later. Given that we are very much in the development stage, it might make sense to install 0.13.0 preview, released yesterday: https://github.com/open-metadata/OpenMetadata/releases/tag/0.13.0-preview
  • The current dbt implementation requires copying sensitive credential information from credentials.env to ~/.dbt/profiles.yml or some such. This is bad and ugly and should be fixed so that dbt can get that information from env variables, like everything else.
  • dbt creates meaningful files in a target/ subdirectory. The aicoe .gitignore file ignores any and all directories named target/ (due to PyBuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore.
  • Should SQL files generated in the dbt process be preserved in GitHub (as part of data reproducibility), or should they be ignored as purely derived files? What rules should apply to the other files that dbt generates or uses?
  • The sample WRI README.md file (https://github.com/os-climate/wri-gppd-ingestion-pipeline/blob/master/README.md) is still just a project template file and does not describe the full theory of all the steps and components needed to implement a proper ingestion pipeline (making it difficult for the ESSD dataset to further exemplify and elaborate what it should be doing).
  • We should aim to make ESSD easily comparable with other global CO2 data (such as ClimateTrace) and demonstrate how OM details facilitate both comparability and consequences of data updates.
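
For the credential item in the list above, one possible direction is dbt's built-in `env_var()` Jinja function, which lets `profiles.yml` refer to environment variables instead of holding literal secrets. The following is only a sketch: the profile name `essd`, the `TRINO_*` variable names, the `ldap` method, and the catalog and schema values are assumptions made for illustration, not this project's actual configuration.

```python
from pathlib import Path

# Hypothetical dbt profile. Every name and value here is an illustrative
# assumption; only the env_var() mechanism itself comes from dbt.
PROFILE_TEMPLATE = """\
essd:
  target: dev
  outputs:
    dev:
      type: trino
      method: ldap
      host: "{{ env_var('TRINO_HOST') }}"
      port: 443
      user: "{{ env_var('TRINO_USER') }}"
      password: "{{ env_var('TRINO_PASSWORD') }}"
      database: osc_datacommons_dev
      schema: essd
      threads: 1
"""


def write_profile(path: Path) -> Path:
    """Write a profiles.yml that contains placeholders, never secrets."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(PROFILE_TEMPLATE)
    return path
```

A file written this way could be committed safely, since dbt resolves the placeholders from the environment only at run time, e.g. `write_profile(Path.home() / ".dbt" / "profiles.yml")`.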

@HeatherAck for visibility

@caldeirav
Contributor

Dependency on issue #202 in order to configure DBT pipeline for metadata ingestion

@HeatherAck
Contributor

Issue nearly complete - pending unusual behavior where files are created in incorrect directory - c.FileCheckpoints.checkpoint_dir = '' https://stackoverflow.com/questions/51887758/is-there-a-way-to-disable-saving-to-checkpoints-for-jupyter-notebooks

@HeatherAck
Contributor

restart mid November upon MT return

@MichaelTiemannOSC
Contributor Author

I've updated the task list with some substantial items that should be discussed and prioritized.

@HeatherAck
Contributor

@MightyNerdEric and @rynofinn - I will create Jira issues for the first two items above; not sure if you can help with the 3rd item. The rest look like they need @caldeirav.

@HeatherAck
Contributor

Related LF Jira tickets:
(1) upgrade version of OpenMetadata: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24850
(2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851
(3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852

@HeatherAck
Contributor

in backlog - behind Iris

@HeatherAck
Contributor

@MightyNerdEric working on upgrade of open metadata - learning helm (extract manifests directly)

@HeatherAck
Contributor

@MightyNerdEric needs access to Quay to update base image, also needs source of container code in operate first; @redmikhail to help with access issues for Eric and @ryanaslett

@HeatherAck
Contributor

prefer higher version than 317

@HeatherAck
Contributor

have access to operate first, but need access to os-climate; @caldeirav do you have access? @erikerlandson please grant access to rest of LF team and Mikhail

@HeatherAck
Contributor

@erikerlandson to provide quay access

@HeatherAck
Contributor

creator privileges added

@eb-oss
Contributor

eb-oss commented Jan 26, 2023

  • The current dbt implementation requires copying sensitive credential information from credentials.env to ~/.dbt/profiles.yml or some such. This is bad and ugly and should be fixed so that dbt can get that information from env variables, like everything else.
  • dbt creates meaningful files in a target/ subdirectory. The aicoe .gitignore file ignores any and all directories named target/ (due to Pybuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore

@MichaelTiemannOSC Regarding these two issues:
For the first one, could you give some additional information? Where is credentials.env stored? I'm guessing that ~/.dbt/profiles.yml is a file that's on the OpenMetadata server? We are certainly capable of pulling secrets from Vault into configs on the cluster, but I need a bit more info. I tried looking into the dbt setup, but I couldn't find anything in our current configuration that interacts with it.

On the second item, what repo are we talking about here? When does dbt create these files? If we don't want them covered by the .gitignore, I'm guessing these are files that we want to commit to a repo, so I'll need to know when/how they're being generated in order to find a solution.

@MichaelTiemannOSC
Contributor Author

credentials.env is meant to be stored far, far away from GitHub, but within a user's ability to load the file from a home directory. This is the library that our data users are supposed to use to read that file: https://github.com/os-climate/osc-ingest-tools
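
To make the mechanics concrete, here is a rough standard-library sketch of what loading such a file into the process environment might look like. It is a stand-in for illustration, not the osc-ingest-tools API; the function names, the default file location in the home directory, and the accepted line syntax (`KEY=VALUE`, optional `export`, optional quotes) are all assumptions.

```python
import os
from pathlib import Path
from typing import Optional


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_credentials(path: Optional[Path] = None) -> dict:
    """Load credentials into os.environ (hypothetical default location)."""
    path = path or Path.home() / "credentials.env"
    values = parse_env_file(path.read_text())
    os.environ.update(values)
    return values
```

Any child process started afterwards, such as a dbt run launched with `subprocess`, inherits these variables, so nothing secret needs to be copied into another file.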

dbt is part of the new pipeline that Vincent rolled out in August, and which I've been trying to replicate since October (when I was interacting heavily with trino, trino-client, OpenMetadata, and dbt developers). It's listed in the requirements for #234, and #234 is intended to provide the larger context of what we need. This particular issue was filed because of the great surprise (and potential security leakage) due to dbt's default way of handling credentials.

Just this morning, Bryon Baker made a suggestion about putting credentials.env into all .gitignore files for OS-Climate, to reduce the risk of credentials leakage. But it doesn't solve this problem, because dbt wants to read from its own files--a leak waiting to happen.

Vincent's open metadata demo gives the larger context of how dbt fits into our world. This branch (https://github.com/os-climate/essd-ingest-pipeline/tree/iceberg-dbt) of the ESSD pipeline also gives examples of dbt usage. Vincent is just finishing up the delivery of some major training--hopefully his materials spell this out better. I've only been trying to replicate what I see him doing in another context, documenting as I go. Which means that by no means do I have the larger picture of what "should be". But this issue raises "what should not be", and that is a file, necessary for the operation of dbt, that would wind up leaking credentials because nothing about "~/.dbt/profiles.yml" makes it look like it contains secrets.
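
Since nothing about the file name signals that it holds secrets, one cheap safeguard could be a check that flags literal values under sensitive-looking keys before a profiles.yml is ever committed. This is a sketch only: the list of sensitive key fragments and the line-based matching are assumptions, and a real check would probably parse the YAML properly.

```python
import re

# Assumed key fragments that suggest a secret; extend as needed.
SENSITIVE_KEYS = ("password", "token", "secret")

KEY_VALUE = re.compile(r"\s*([\w-]+)\s*:\s*(.+?)\s*$")


def find_literal_secrets(profile_text: str) -> list:
    """Report sensitive keys whose value is not an env_var() lookup."""
    findings = []
    for number, line in enumerate(profile_text.splitlines(), start=1):
        match = KEY_VALUE.match(line)
        if not match:
            continue  # comments, section headers, blank lines
        key, value = match.groups()
        if not any(word in key.lower() for word in SENSITIVE_KEYS):
            continue
        if "env_var(" in value:
            continue  # value is resolved from the environment at run time
        findings.append(f"line {number}: '{key}' has a literal value")
    return findings
```

Run from a pre-commit hook or CI job, a non-empty result would block the commit; an empty list means every sensitive-looking key goes through `env_var()`.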

@MichaelTiemannOSC
Contributor Author

Vincent has recently updated the Data Commons documentation. While written at a high level and aimed mostly at developers, the granularity and completeness encourage the platform team to add information relevant to the ops side of the platform: components, recipes, smoke tests, admin / dev / user roles, provisioning, health checks, etc. See https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-developer-guide.md

I encourage everyone responsible for keeping these systems running to read this documentation to understand what developers (and ultimately users) expect, and to write ops documentation in the same spirit, so that existing and new platform maintainers can easily find what they need and do what they need to do.

@HeatherAck
Contributor

blocked pending OpenMetadata 0.13.1 upgrade

@MichaelTiemannOSC
Contributor Author

OM 0.13.1 is available. The evil and insidious profiles.yml file, which requires secrets but should not contain secrets, remains unaddressed. Filing a new issue about that.

@MichaelTiemannOSC
Contributor Author

Now that we are unblocked on OpenMetadata, there are several other questions that need to be answered, i.e., whether to store dbt intermediate files in GitHub, various .gitignore problems that may or may not be solved by the latest template, etc. Please consider this a bump to addressing those questions.

@HeatherAck
Contributor

@MightyNerdEric will work on this issue for week of 13-Feb

@HeatherAck
Contributor

@caldeirav can you address:
(2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851
(3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852

Status: In Progress