This repo provides a starting point for automating a Dask data engineering workload using Coiled and Prefect.
New York City recently started open-sourcing Uber-Lyft trip data at: s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet
This is a rich dataset for exploration. Unfortunately, while it is provided as a parquet dataset, the row groups are quite large, which makes processing very unwieldy. Additionally, inconsistent datatypes between partitions can lead to unexpected errors. We can leverage this as a motivating example for creating a pure Python data engineering pipeline using Coiled, Dask, and Prefect.
The first flow (Flow 1):

- Selects a list of parquet files from s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet
- Checks whether any of those files were written in the last day, using the `LastModified` time as a proxy for when the file was written. This allows incremental processing of the files.
- If any files meet this criterion, kicks off Flow 2 (see the sketch after this list).
The second flow (Flow 2):

- Spins up a Dask cluster with Coiled
- Passes the list of files to be processed to the Dask cluster. The files are loaded into the cluster, datatypes are declared, the data is repartitioned into `128MB` partitions, and the result is written to a private S3 bucket (see the sketch after this list).
- Tears down the Dask cluster
This repo assumes you have a Coiled account and a Prefect account, are familiar with the following concepts, and have the following credentials:
- Coiled Users, Teams, and Tokens
- Prefect Cloud
- Prefect Flows and Tasks
- Prefect Blocks
- AWS S3 Access Key and Secret
- Clone this repository
- In Prefect, store your credentials for creating a `Coiled` cluster as `Prefect Blocks`. Here we store them as: `coiled-user`, `coiled-account` (your team), and `coiled-token`.
- In Prefect, store your credentials for authenticating to your private S3 bucket as a Prefect `AwsCredentials` block. Here, we label it `prefectexample0` (see the sketch after this list).
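As an illustration of how the flows might consume these blocks at run time: the block names match the setup above, but storing the Coiled values as Secret blocks and passing AWS keys to Dask as storage options are assumptions, not necessarily how this repo does it.

```python
# Sketch of loading the blocks inside a flow; assumes the Coiled values were
# stored as Prefect Secret blocks.
from prefect.blocks.system import Secret
from prefect_aws import AwsCredentials

coiled_user = Secret.load("coiled-user").get()
coiled_account = Secret.load("coiled-account").get()
coiled_token = Secret.load("coiled-token").get()

aws_creds = AwsCredentials.load("prefectexample0")
storage_options = {
    "key": aws_creds.aws_access_key_id,
    "secret": aws_creds.aws_secret_access_key.get_secret_value(),
}
# storage_options can then be passed to dd.read_parquet / to_parquet for the
# private bucket; the Coiled values can be used to authenticate the cluster.
```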
Relevant documentation:

- Coiled
- Dask
- Prefect
- Working with Dask collections
- Prefect-Dask
- Workflow Automation with Prefect
- Prefect HTTPStatus Errors