Onboard ESSD dataset using Open Metadata #183

MichaelTiemannOSC · 2022-06-29T11:03:43Z

The ESSD dataset (s3://redhat-osc-physical-landing-647521352890/ESSD/) can be an exemplar for Open Metadata onboarding. The dataset comes with data dictionaries, and there is an Iceberg data pipeline notebook at https://github.com/os-climate/essd-ingest-pipeline/blob/iceberg/notebooks/osc-essd-ingest.ipynb.

As a data pipeline implementor, I want to work from a common template that describes the metadata needed to connect this dataset to a data catalog browser and to understand the various levels of data interoperability that can be achieved/advertised by properly instantiating all the metadata reasonable for this dataset:

Still need to update OpenMetadata to 0.12.2 or later. Given that we are very much in the development stage, it might make sense to install 0.13.0 preview, released yesterday: https://github.com/open-metadata/OpenMetadata/releases/tag/0.13.0-preview
The current dbt implementation requires copying sensitive credential information from credentials.env to ~/.dbt/profiles.yml or some such. This is bad and ugly and should be fixed so that dbt can get that information from env variables, like everything else.
dbt creates meaningful files in a target/ subdirectory. The aicoe .gitignore file ignores any and all directories named target/ (due to Pybuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore.
Should SQL files generated in the dbt process be preserved in GitHub (as part of data reproducibility), or should they be ignored as purely derived files? What other rules should apply to the other files that dbt generates or uses?
The sample WRI README.md file (https://github.com/os-climate/wri-gppd-ingestion-pipeline/blob/master/README.md) is still just a project template file and does not describe the full theory of all the steps and components needed to fully implement a proper ingestion pipeline, making it difficult for the ESSD dataset to further exemplify and elaborate what it should be doing.
We should aim to make ESSD easily comparable with other global CO2 data (such as ClimateTrace) and demonstrate how OM details facilitate both comparability and consequences of data updates.
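
For the target/ item above, one possible direction is to keep the blanket Pybuilder rule and add a later negation rule that re-includes only the dbt output directory. The sketch below is a minimal illustration, not the template's actual fix: the function name `allow_dbt_target` and the path `dbt/target/` are hypothetical and would need to match the real repository layout. It relies on the documented gitignore behavior that a later `!pattern` re-includes a path matched by an earlier pattern, provided no parent directory of that path is itself ignored.

```python
# Blanket rules that hide every directory named target (as added for Pybuilder).
BLANKET_RULES = {"target", "target/"}


def allow_dbt_target(gitignore_text: str, dbt_target: str = "dbt/target/") -> str:
    """Return .gitignore text with a negation rule for the dbt output directory.

    `dbt_target` is a hypothetical path; adjust it to the real project layout.
    The text is returned unchanged when there is no blanket target rule or
    when the negation rule is already present.
    """
    lines = gitignore_text.splitlines()
    stripped = [line.strip() for line in lines]
    negation = "!" + dbt_target

    has_blanket = any(line in BLANKET_RULES for line in stripped)
    if not has_blanket or negation in stripped:
        return gitignore_text

    # The negation must come after the blanket rule, so append it at the end.
    lines.append("# keep dbt artifacts (assumed project layout)")
    lines.append(negation)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "# Pybuilder\n.pybuilder/\ntarget/\n"
    print(allow_dbt_target(sample))
```

Whether the files under that directory should actually be committed is the separate question raised in the next item; this only makes committing them possible.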

@HeatherAck for visibility

caldeirav · 2022-09-12T14:24:32Z

Dependency on issue #202 in order to configure DBT pipeline for metadata ingestion

HeatherAck · 2022-10-04T00:23:17Z

Issue nearly complete, pending resolution of unusual behavior where files are created in the incorrect directory. See c.FileCheckpoints.checkpoint_dir = '' and https://stackoverflow.com/questions/51887758/is-there-a-way-to-disable-saving-to-checkpoints-for-jupyter-notebooks

HeatherAck · 2022-10-18T00:32:52Z

restart mid November upon MT return

MichaelTiemannOSC · 2022-11-16T16:28:41Z

I've updated the task list with some substantial items that should be discussed and prioritized.

HeatherAck · 2022-11-16T19:30:58Z

@MightyNerdEric and @rynofinn - I will create Jira issues for the first two items above; not sure if you can help with the third item. The rest look like they need @caldeirav.

HeatherAck · 2022-11-16T19:54:26Z

Related LF Jira tickets:
(1) upgrade version of OpenMetadata: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24850
(2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851
(3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852

HeatherAck · 2022-11-21T18:05:55Z

in backlog - behind Iris

HeatherAck · 2022-12-05T18:31:12Z

@MightyNerdEric working on upgrade of open metadata - learning helm (extract manifests directly)

HeatherAck · 2022-12-12T18:11:10Z

@MightyNerdEric needs access to Quay to update base image, also needs source of container code in operate first; @redmikhail to help with access issues for Eric and @ryanaslett

HeatherAck · 2022-12-12T18:13:34Z

prefer higher version than 317

HeatherAck · 2022-12-19T18:09:59Z

have access to operate first, but need access to os-climate; @caldeirav do you have access? @erikerlandson please grant access to rest of LF team and Mikhail

HeatherAck · 2023-01-09T18:30:45Z

@erikerlandson to provide quay access

HeatherAck · 2023-01-09T18:35:18Z

creator privileges added

eb-oss · 2023-01-26T23:38:27Z

> The current dbt implementation requires copying sensitive credential information from credentials.env to ~/.dbt/profiles.yml or some such. This is bad and ugly and should be fixed so that dbt can get that information from env variables, like everything else.
>
> dbt creates meaningful files in a target/ subdirectory. The aicoe .gitignore file ignores any and all directories named target/ (due to Pybuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore

@MichaelTiemannOSC Regarding these two issues:
For the first one, could you give some additional information? Where is credentials.env stored? I'm guessing that ~/.dbt/profiles.yml is a file that's on the openmetadata server? We certainly are capable of pulling secrets from Vault into configs on the cluster, but I need a bit more info. I tried looking into dbt setup, but I couldn't find anything in our current configuration that interacts with it.

On the second item, what repo are we talking about here? When does dbt create these files? If we don't want them covered by the .gitignore, I'm guessing these are files that we want to commit to a repo, so I'll need to know when/how they're being generated in order to find a solution.

MichaelTiemannOSC · 2023-01-26T23:51:51Z

credentials.env is meant to be stored far, far away from GitHub, but within a user's ability to load the file from a home directory. This is the library that our data users are supposed to use to read that file: https://github.com/os-climate/osc-ingest-tools

dbt is part of the new pipeline that Vincent rolled out in August, and which I've been trying to replicate since October (when I was interacting heavily with trino, trino-client, OpenMetadata, and dbt developers). It's listed in the requirements for #234, and #234 is intended to provide the larger context of what we need. This particular issue was filed because of the great surprise (and potential security leakage) due to dbt's default way of handling credentials.

Just this morning, Bryon Baker made a suggestion about putting credentials.env into all .gitignore files for OS-Climate, to reduce the risk of credentials leakage. But it doesn't solve this problem, because dbt wants to read from its own files--a leak waiting to happen.

Vincent's open metadata demo gives the larger context of how dbt fits into our world. This branch (https://github.com/os-climate/essd-ingest-pipeline/tree/iceberg-dbt) of the ESSD pipeline also gives examples of dbt usage. Vincent is just finishing up the delivery of some major training--hopefully his materials spell this out better. I've only been trying to replicate what I see him doing in another context, documenting as I go. Which means that by no means do I have the larger picture of what "should be". But this issue raises "what should not be", and that is a file, necessary for the operation of dbt, that would wind up leaking credentials because nothing about "~/.dbt/profiles.yml" makes it look like it contains secrets.
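
As a rough illustration of the "get that information from env variables" approach, the sketch below loads a credentials.env file into the process environment and writes a profiles.yml that contains only dbt `env_var()` references, so the file itself holds no secrets. This is a minimal sketch under stated assumptions, not the pipeline's actual setup: the helper names, the profile name `essd_ingest`, the variable names (`TRINO_HOST`, `TRINO_USER`, `TRINO_PASSWD`, and so on), and the adapter keys in the template are all hypothetical and should be checked against the dbt and dbt-trino documentation.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def export_credentials(env_path):
    """Copy the file's values into os.environ; return the variable names only."""
    values = load_env_file(env_path)
    os.environ.update(values)
    return sorted(values)


# Hypothetical profile: every sensitive field is an env_var() reference that
# dbt resolves at run time, so no secret is ever written to disk here.
PROFILE_TEMPLATE = """\
essd_ingest:
  target: dev
  outputs:
    dev:
      type: trino
      method: ldap
      host: "{{ env_var('TRINO_HOST') }}"
      port: "{{ env_var('TRINO_PORT') | as_number }}"
      user: "{{ env_var('TRINO_USER') }}"
      password: "{{ env_var('TRINO_PASSWD') }}"
      database: "{{ env_var('TRINO_CATALOG') }}"
      schema: "{{ env_var('TRINO_SCHEMA') }}"
      threads: 1
"""


def write_profile(profiles_dir):
    """Write the secret-free template to <profiles_dir>/profiles.yml."""
    profiles_dir = Path(profiles_dir)
    profiles_dir.mkdir(parents=True, exist_ok=True)
    target = profiles_dir / "profiles.yml"
    target.write_text(PROFILE_TEMPLATE)
    return target
```

With something like this, a profiles.yml could live in the repository next to the dbt project and be selected with dbt's `--profiles-dir` option, while credentials.env stays outside GitHub as intended.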

MichaelTiemannOSC · 2023-01-27T19:47:08Z

Vincent has recently updated the Data Commons documentation. While written at a high level and aimed mostly at developers, its granularity and completeness encourage the platform team to add information relevant to the ops side of the platform: components, recipes, smoke tests, admin / dev / user roles, provisioning, health checks, etc. See https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-developer-guide.md

I encourage all who are responsible for keeping these systems running to read this documentation as a way to understand what developers (and ultimately users) are expecting, and to write documentation that makes it easy for existing and new platform maintainers to find what they need to find and do what they need to do.

HeatherAck · 2023-01-30T18:52:26Z

blocked pending OpenMetadata 0.13.1 upgrade

MichaelTiemannOSC · 2023-02-04T01:44:39Z

OM 0.13.1 is available. The evil and insidious profiles.yml file, which requires secrets but should not contain secrets, remains unaddressed. Filing a new issue about that.

MichaelTiemannOSC · 2023-02-08T18:50:09Z

Now that we are unblocked on OpenMetadata, there are several other questions that need to be answered, e.g., whether to store dbt intermediate files in GitHub, and various .gitignore problems that may or may not be solved by the latest template. Please consider this a bump to addressing those questions.

HeatherAck · 2023-02-13T18:13:07Z

@MightyNerdEric will work on this issue for week of 13-Feb

HeatherAck · 2023-02-27T19:05:17Z

@caldeirav can you address:
(2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851
(3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852

MichaelTiemannOSC added the metadata label Jun 29, 2022

MichaelTiemannOSC assigned caldeirav and toki8 Jun 29, 2022

MichaelTiemannOSC added this to Data Commons Platform Jun 29, 2022

caldeirav moved this to In Progress in Data Commons Platform Aug 18, 2022

MichaelTiemannOSC removed the blocked label Nov 16, 2022

MichaelTiemannOSC mentioned this issue Nov 16, 2022

OpenMetadata 0.12.1.2 available; fixes data profiling with Trino operate-first/support#1114

Closed

HeatherAck assigned rynofinn and eb-oss Nov 16, 2022

MichaelTiemannOSC assigned ryanaslett and unassigned rynofinn Jan 27, 2023

HeatherAck added the blocked label Jan 30, 2023

MichaelTiemannOSC removed the blocked label Feb 4, 2023

MichaelTiemannOSC mentioned this issue Feb 4, 2023

dbt profiles.yml file contains secrets #259

Closed
