
Rebuild Data Commons DEV on separate cluster #118

Closed
caldeirav opened this issue Jan 16, 2022 · 25 comments

@caldeirav
Contributor

Going forward we want proper isolation between DEV and PROD environments for Data Commons, which means:

  • Configure two separate clusters and Trino instances for DEV and PROD. This also means having two sets of configuration-as-code under the os-climate/os_c_data_commons repository to drive deployment separately. Note: since Iceberg is not a component of ODH, we should drive the Iceberg deployment fully from os-c.
  • Have a single catalog for each of DEV and PROD, based on an Iceberg volume.
  • Re-build and re-run all data pipelines on the new DEV environment
  • Develop and implement a proper promotion process to PROD for data pipelines
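As a rough illustration of the per-environment configuration-as-code, each Trino instance would carry its own Iceberg catalog file along these lines (a hypothetical sketch; the file name, metastore endpoint, and S3 values are placeholders, not the actual deployment config):

```properties
# osc_datacommons_dev.properties — hypothetical per-environment catalog
connector.name=iceberg
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=https://s3.example.com
# Credentials injected via Trino's secrets support rather than committed:
hive.s3.aws-access-key=${ENV:S3_ACCESS_KEY}
hive.s3.aws-secret-key=${ENV:S3_SECRET_KEY}
```

The PROD instance would get a parallel `osc_datacommons_prod.properties`, so the two environments can be upgraded and configured independently.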

@MichaelTiemannOSC @erikerlandson

@MichaelTiemannOSC
Contributor

Since everything I've done has followed the "data as code" approach (more or less), it's no problem for the pipelines I've created to rerun from scratch. Of course, each of the pipelines needs a bit of scrubbing to use Iceberg properly and to implement its metadata and other best practices correctly. It would be great to do that with help from members as a next step in onboarding resources.

@HeatherAck @toki8

@HumairAK
Contributor

This also means having two sets of configuration-as-code under os-climate/os_c_data_commons repository to drive deployment separately.

FYI, current configurations are stored under the operate-first org so we can manage them next to our other services (this makes it easier to share configurations between deployments).

Regardless, extending the configurations to multiple environments should be straightforward. We should identify the number of users for both clusters so that we can determine namespace quotas, resource requirements, etc.
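Once user counts are known, the quotas mentioned above would typically land in the configuration-as-code as namespace `ResourceQuota` objects. A minimal sketch (names and limits are placeholder assumptions, not agreed figures):

```yaml
# Hypothetical per-namespace quota for the DEV cluster
apiVersion: v1
kind: ResourceQuota
metadata:
  name: osc-team-quota
  namespace: osc-datacommons-dev
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
```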

@erikerlandson
Contributor

I would expect the number of users to be essentially the same, at least initially.

The main difference is that in the new "prod" cluster, only certain Trino groups will have privileges beyond "select": the pipeline processes that run workloads like data ingest will have table write privileges and, if needed, access to the underlying S3 buckets. All other users will have only "select" and will not have S3 credentials.

Eventually, as the OS-Climate community grows, most community members will use the prod cluster only, so the prod cluster will grow larger; but to start with, the two clusters should have the same set of users.

@HumairAK
Contributor

To clarify: by two clusters, are we talking about OCP clusters?

@MichaelTiemannOSC
Contributor

CC @HeatherAck to keep abreast of usage and cost.

@caldeirav
Contributor Author

@HumairAK Yes, we are talking about two separate OCP clusters. The data layers (from storage volumes all the way up to Trino) should also be totally separate. As for keeping the configurations in an operate-first org: this would mean that ultimately, when we move to a full GitOps model, the OS-C owners of all our repositories will also be contributing to / managing their config from operate-first. If this is fine and understood, then it would be good to document the process (unless it already exists?).

On capacity planning: Erik is right if we exclude the Airbus onboarding stream. I am assuming we will use a separate cluster for SOSTrades / WITNESS when we onboard them, in which case we don't need to cater for much more capacity.

@HumairAK
Contributor

HumairAK commented Jan 20, 2022

If this is fine and understood then it would be good to document the process (unless it already exists?)

This workflow already exists to some degree; we don't have the documentation yet. An example of what that looks like: operate-first/apps#1418

We would just need to add these members to an OWNERS file like this within the repo where these OSC configs are held, so they can then use the /approve and /lgtm commands on PRs that change these specific configs.

It would be helpful to identify the configs that we expect to be changed by various users; we can then separate them out within the directory structure and provide an OWNERS file that specifically lists the members who should only touch those files.
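For illustration, an OWNERS file scoped to the OS-C config directory could look like this (a hypothetical sketch in the usual Prow OWNERS format; the usernames listed are examples drawn from this thread, not a decided list):

```yaml
# Hypothetical OWNERS file placed in the directory holding the OSC configs
approvers:
  - caldeirav
  - erikerlandson
reviewers:
  - MichaelTiemannOSC
```

Members listed here could then gate changes to these files with /approve and /lgtm on PRs.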

@caldeirav
Contributor Author

Alright, let's go ahead with this, and I will bring the repo info / management of OWNERS into our OS-C Data Commons doc.

@eoriorda eoriorda moved this from Todo to In Progress in Data Commons Platform Jan 31, 2022
@caldeirav
Contributor Author

@redmikhail With the access to our partner portal to get OpenShift subs, can you confirm there is no showstopper now to creating the required clusters, and when they could be available for @HumairAK to do the platform deployment?

@redmikhail

@caldeirav I have added entitlements to the account, so we should now be able to build the cluster. I will create a vanilla cluster setup (control plane and infrastructure nodes) so we can add ArgoCD and proceed with configuration using the operate-first repo. I will also start adding sub-tasks to this issue so we can track actionable items.

@caldeirav
Contributor Author

With the creation of these two clusters, I expect we will need to inform the community of the new links for the various environments. Therefore I have brought in issue #44 again, as it may make sense to have one dashboard where community members can access Trino, Jupyter, CloudBeaver, the token generator, etc. for both dev and prod in one place.

@erikerlandson
Contributor

Just to put this on the record: we'll need to confirm that these new clusters have access to GPUs with >= 16 GB of memory (and access to larger ones is almost certainly going to be desirable).

@redmikhail

@erikerlandson For now we are planning to use a combination of p3.2xlarge nodes (https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/), which have 16 GB of GPU memory, and g4dn instances for the other type of GPU workload (these will need to be tainted differently).
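The differing taints for the two GPU node pools could look something like this (a hypothetical sketch; the taint key/value and image are placeholders, not the actual cluster config):

```yaml
# Taint the g4dn nodes so only workloads that opt in land there, e.g.:
#   oc adm taint nodes <node> gpu-type=g4dn:NoSchedule
# A pod opting in carries a matching toleration plus a GPU request:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  tolerations:
    - key: gpu-type
      operator: Equal
      value: g4dn
      effect: NoSchedule
  containers:
    - name: workload
      image: quay.io/example/gpu-workload:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

The p3.2xlarge pool would get a different taint value so the two GPU types can be targeted independently.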

@redmikhail

redmikhail commented Feb 8, 2022

Based on the discussion above, and given the somewhat large scope of some of the tasks, I am adding a task list here with references to the individual sub-tasks:

@caldeirav
Contributor Author

@redmikhail As I go through the tasks, and in order to avoid misunderstanding: we want separate Trino instances for DEV / PROD (with a different single catalog for each instance). The reason is that we will upgrade Trino separately in DEV and PROD.

@caldeirav
Contributor Author

OS-Climate - Configuration of Dev/Prod servers meeting (11/02/2022)
Attendees: Mikhail, Erik, Vincent, Michael

  • Create two new buckets for osc_datacommons_prod and osc_datacommons_dev, with one Trino catalog per bucket
  • In osc_datacommons_dev, we will give developers the ability to create / maintain their own schema (based on their GitHub ID) so they have a dedicated sandbox with isolation
  • Functional Tekton CI pipelines: when is this required? We will move the teams to the new environment one by one, likely starting with the Physical Risk team, then NLP, ITR, WITNESS. We have a dependency on Operate First's work on Tekton pipelines; we need to find out when they plan to fix this so we can assess dependencies on the OS-Climate streams.
  • Rules and access control for Trino: start with the raw rule.json file, and remove Michael from the Admin list. Manage every access requirement with a new issue and a change.
  • Add a rule for inactivity in JupyterHub: 3 days of idle time will result in the session being killed
  • Change the default logout time in Superset from 5 minutes to 1 hour
  • There is a need for better monitoring and management of GPU usage, for the two sets of GPUs we are planning to use. Look into building a pipeline from the AWS monitoring data. Raise an issue to deploy Kafka on the cluster at a later stage (post-meeting note: this may also be required for some use cases with LSEG)
  • We need to define what the production cluster will do and document it. In prod we expect data pipelines to be fully automated, with no user interaction with Jupyter notebooks; users access data through Trino SQL / Superset (and, in the future, an API gateway)
  • We will select the first pipeline to be promoted to prod and use it to document the process; this part can be discussed with community users in the event-storming process

I will raise separate issues for the changes to the environment summarised above so we can track them separately.
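Several of the bullets above concern Trino access control, including the per-GitHub-ID sandbox schemas in osc_datacommons_dev. A hypothetical sketch of what the file-based rules could look like (the exact syntax, including `${USER}` substitution, should be verified against the Trino file-based access control documentation; group names are placeholders):

```json
{
  "catalogs": [
    { "group": "pipeline-admins", "catalog": "osc_datacommons_dev", "allow": "all" },
    { "catalog": "osc_datacommons_dev", "allow": "read-only" }
  ],
  "schemas": [
    { "group": "pipeline-admins", "schema": ".*", "owner": true },
    { "schema": "${USER}", "owner": true }
  ],
  "tables": [
    { "schema": "${USER}", "privileges": ["SELECT", "INSERT", "DELETE", "OWNERSHIP"] },
    { "privileges": ["SELECT"] }
  ]
}
```

The intent of the sketch: pipeline processes get full access, each developer owns a schema named after their own user, and everyone else defaults to SELECT only.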

@caldeirav
Contributor Author

  • So if the cluster being stood up now is the dev cluster, then do I only include the osc_datacommons_dev catalog on the dev Trino instance?
    Yes, this should be a new bucket with a single, new catalog

@caldeirav
Contributor Author

  • Considering that we decided we won't be using catalogs across environments, would it make sense to just call it osc_datacommons and remove the environment from the naming convention (S3 bucket names will still have the environment in the name to indicate its importance)? It would simplify configuration as well.
    We should keep separate names, as people may be connecting to both environments at the same time; also, we don't want code to be executed in the wrong environment (it is good practice to have the environment name as a parameter in scripts, as an additional check).
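The "environment name as a parameter in scripts" check mentioned above could be sketched like this (a minimal, hypothetical helper; the catalog names follow the convention agreed in this thread):

```python
# Hypothetical guard: scripts take an explicit environment parameter and
# derive the catalog from it, so a typo fails fast instead of running
# against the wrong environment.
EXPECTED_CATALOGS = {
    "dev": "osc_datacommons_dev",
    "prod": "osc_datacommons_prod",
}

def resolve_catalog(environment: str) -> str:
    """Return the Trino catalog for the given environment, or fail loudly."""
    try:
        return EXPECTED_CATALOGS[environment]
    except KeyError:
        raise ValueError(
            f"unknown environment {environment!r}; "
            f"expected one of {sorted(EXPECTED_CATALOGS)}"
        )

print(resolve_catalog("dev"))  # → osc_datacommons_dev
```

Pipelines would call `resolve_catalog` once at startup, making the environment an explicit, validated input rather than an implicit default.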

@caldeirav
Contributor Author

  • Do we want to use encryption on the S3 buckets?
    Yes
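The bucket-encryption decision could be captured as configuration rather than a manual console setting. A minimal sketch (the bucket names and the choice of aws:kms over AES256 are assumptions, not decisions recorded here; the dict is in the shape accepted by S3's PutBucketEncryption API, e.g. boto3's `put_bucket_encryption`):

```python
# Default server-side encryption configuration for the new buckets (sketch).
SSE_CONFIG = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"},
            "BucketKeyEnabled": True,
        }
    ]
}

BUCKETS = ["osc-datacommons-dev", "osc-datacommons-prod"]

# With boto3 this would be applied as:
#   s3 = boto3.client("s3")
#   for bucket in BUCKETS:
#       s3.put_bucket_encryption(
#           Bucket=bucket, ServerSideEncryptionConfiguration=SSE_CONFIG)
for bucket in BUCKETS:
    rule = SSE_CONFIG["Rules"][0]["ApplyServerSideEncryptionByDefault"]
    print(f"{bucket}: {rule['SSEAlgorithm']}")
```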

@erikerlandson
Contributor

erikerlandson commented Feb 16, 2022

trino token based authentication (this is a personal implementation of mine, we might want to have a more robust method, but I can deploy it on this dev environment as well if needed)

We definitely need this; our entire Trino authentication story is based on JWT. That, or some other solution for generating JWTs.

I think we should stand yours up - it gives us JWT, and it authenticates via GitHub, both of which are important to how we designed the OSC platform.

I do not want to block this by getting into a long discussion, but we might also run it as an Open Services Group service, if there is a reasonable OSG-affiliated cluster to run it on.
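Since the authentication story rests on JWTs, it may help to recall what such a token actually carries: a base64url-encoded header, payload, and signature joined by dots. A stdlib-only illustration (decoding without signature verification, for inspection only; the claim values are made up):

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode a JWT's payload WITHOUT verifying the signature (debug/illustration only)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def _b64(obj: dict) -> str:
    """base64url-encode a JSON object, JWT-style (no padding)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode().rstrip("=")

# Build a toy header.payload.signature token to demonstrate the structure:
token = ".".join([_b64({"alg": "HS256"}), _b64({"sub": "github-user"}), "sig"])
print(jwt_payload(token)["sub"])  # → github-user
```

In the real deployment the token service would issue a properly signed token, and Trino would verify the signature; the unverified decode above is only useful for inspecting claims while debugging.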

@HumairAK
Contributor

Okay, sure. I can set it up. But I think we might need a more robust solution for a prod environment in the future.

@erikerlandson
Contributor

But I think we might need a more robust solution for a prod environment in the future.

I agree, but I am not currently sure what that solution should be, and your tool does the job effectively

@HumairAK
Contributor

trino token service added: https://das-odh-trino.apps.odh-cl2.apps.os-climate.org

@erikerlandson erikerlandson changed the title Rebuild Data Commons PROD / DEV on separate clusters Rebuild Data Commons DEV on separate cluster Feb 25, 2022
@erikerlandson
Contributor

As the eventual PROD cluster is going to have some significant differences from the new DEV cluster, I am going to close this issue out now that DEV has been created.

Future discussion and progress on PROD will be tracked on #136

Repository owner moved this from In Progress to Done in Data Commons Platform Feb 25, 2022