A boilerplate for creating data pipelines using Dagster, Docker, and Poetry. To use this repo, clone it or click "Use this template" and follow the instructions below. A detailed explanation of how this repo is structured can be found in the companion blog post.
- Picks up code changes immediately (just hit Reload in dagit; no need to restart the container!)
- Unified Dockerfile for development & deployment; easily integrates with CI/CD processes
- Packages the source code according to PEP517 & PEP518
- Tractable package management using `poetry`. No more hideous `pip freeze > requirements.txt`!
docker compose down # if already running
docker compose build
docker compose up
Done! At this point, you should be able to successfully navigate to the Dagit UI and launch the job.
The `top_hacker_news` job will run out of the box and simply log its results to the console, but if you configure a Slack Webhook, the job will send its output to the corresponding channel, which is much more fun :)
After creating the Slack Webhook, copy the Slack Webhook URL and uncomment the environment variable lines in `docker-compose.yml`, then restart the container.
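The "log to console unless a webhook is configured" behaviour described above can be sketched roughly as follows. This is not the repo's actual implementation: the variable name `SLACK_WEBHOOK_URL` and the helper names are assumptions made for illustration, so check `docker-compose.yml` for the real variable name. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import os
import urllib.request
from typing import Optional

# Hypothetical variable name; use whatever docker-compose.yml actually defines.
WEBHOOK_ENV_VAR = "SLACK_WEBHOOK_URL"


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def publish(text: str, webhook_url: Optional[str] = None) -> str:
    """Send text to Slack when a webhook is configured, otherwise print it."""
    webhook_url = webhook_url or os.environ.get(WEBHOOK_ENV_VAR)
    if not webhook_url:
        # No webhook configured: fall back to console output.
        print(text)
        return "console"
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
    return "slack"
```

With the environment variable unset, `publish("hello")` prints the message and returns `"console"`; once the variable is exported and the container restarted, the same call would post to the channel instead.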
When using containerization, installing poetry locally is not necessary, but it is recommended; the venv it creates can be used for code completion, simple interactive debugging, and more
- Install python 3.9
- Install poetry
The alternative setup runs locally without any containerization
Note: It's recommended that the application be run using the Docker approach
Running locally is very similar to using the container
- Install poetry (not optional in this case)
- Export the environment variable(s)
- Open a terminal in the project root and run the following commands
# First command optional. creates `.venv` in the project root; very useful when using VSCode!
poetry config virtualenvs.in-project true
poetry install
# To use poetry (i.e. activate the virtualenv):
poetry shell
dagit -w workspace.yaml
I'll be honest, I haven't focused on testing with this repo. Suggestions for improvement are welcome :)
Assuming poetry is installed and the environment created, run the following from the project root:
poetry shell
pytest
If you change any env vars or files that are outside of `job_configs` or `src`, then you'll want to rebuild the docker container, e.g. when...
- adding new packages to `pyproject.toml`
- modifying `Dockerfile`
- adding a volume mount for `DAGSTER_HOME`
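As a rough illustration of that rule, the check below flags a rebuild whenever a changed file sits outside `job_configs` or `src`. It is only a sketch: the function name is made up, paths are assumed to be relative to the project root, and it does not cover env var changes, which also call for a rebuild.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Directories the README names as picked up without a rebuild.
LIVE_RELOAD_DIRS = ("job_configs", "src")


def needs_rebuild(changed_paths: Iterable[str]) -> bool:
    """Return True if any changed path lies outside the live-reloaded directories."""
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue  # ignore empty entries
        if parts[0] not in LIVE_RELOAD_DIRS:
            return True
    return False
```

For example, `needs_rebuild(["src/jobs/a.py"])` is `False`, while `needs_rebuild(["pyproject.toml"])` is `True`.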
Just add it to `[tool.poetry.dependencies]` in `pyproject.toml` (or `[tool.poetry.dev-dependencies]`) and rebuild the container. If using poetry locally without containerization, also run `poetry update` to update the lockfile.
Don't worry! Delete `poetry.lock` and run `poetry install` locally to recreate it.
Yes! If you're developing sensors, partitions, or schedules and want to test them in your container, simply uncomment the following line in the `dev` stage of the Dockerfile:
# RUN echo "poetry run dagster-daemon run &" >> /usr/bin/dev_command.sh
I leave this as an exercise for the reader and/or the reader's DevOps team :) Though here are some tips:
- Use semantic versioning to version-bump `pyproject.toml` and associate this with the container version
- You don't need to target a specific stage in the Dockerfile; the end result is a Dagster User Code Deployment in a ready-to-use container
- If using helm, make sure you've added the correct container version to the list of User Code Deployments; don't forget to apply any secrets/env vars as needed
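One way to tie the package version to the container version, sketched under a few assumptions: the version is the first `version = "X.Y.Z"` line in `pyproject.toml`, it is a plain three-part semantic version, and the image name `my-pipelines` is purely hypothetical.

```python
import re
from typing import Tuple

Version = Tuple[int, int, int]

# Matches the first `version = "MAJOR.MINOR.PATCH"` line in the file text.
_VERSION_RE = re.compile(r'^version\s*=\s*"(\d+)\.(\d+)\.(\d+)"', re.MULTILINE)


def read_version(pyproject_text: str) -> Version:
    """Extract the package version from the text of pyproject.toml."""
    match = _VERSION_RE.search(pyproject_text)
    if match is None:
        raise ValueError("no MAJOR.MINOR.PATCH version found")
    major, minor, patch = (int(group) for group in match.groups())
    return (major, minor, patch)


def bump(version: Version, part: str) -> Version:
    """Return the next version, resetting the lower-order parts to zero."""
    major, minor, patch = version
    if part == "major":
        return (major + 1, 0, 0)
    if part == "minor":
        return (major, minor + 1, 0)
    if part == "patch":
        return (major, minor, patch + 1)
    raise ValueError(f"unknown part: {part!r}")


def image_tag(image_name: str, version: Version) -> str:
    """Build a container tag that mirrors the package version."""
    return f"{image_name}:{version[0]}.{version[1]}.{version[2]}"
```

In a CI job, the tag produced by `image_tag("my-pipelines", read_version(text))` could then be passed to the image build step. Poetry's own `poetry version patch` (or `minor` / `major`) command can perform the bump in `pyproject.toml` itself, so `bump` here is mainly illustrative.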
Use `debugpy` (already installed). In `docker-compose.yml`, add `- "5678:5678"` to the list of ports. In the actual op you'd like to debug, add the following lines:
import debugpy

# It's very important that we specify both address and port!
debugpy.listen(('0.0.0.0', 5678))
# Block until you can attach the debugger in VSCode
debugpy.wait_for_client()
# Add this final line wherever you'd like within the op
debugpy.breakpoint()
Finally, you’ll need to create a `launch.json` for Python remote attach. In VSCode, click “Run and Debug” -> “Create a launch.json file” and follow the prompts (Python -> Remote Attach -> localhost -> 5678).
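Because `debugpy.wait_for_client()` blocks until a debugger attaches, it may be convenient to guard the calls above behind an opt-in flag so that ordinary runs are unaffected. The sketch below assumes a hypothetical environment variable, `ENABLE_DEBUGPY`, which is not defined anywhere in this repo; the helper name is likewise made up.

```python
import os

# Hypothetical opt-in flag; not part of this repo's configuration.
DEBUG_ENV_VAR = "ENABLE_DEBUGPY"


def maybe_attach_debugger(host: str = "0.0.0.0", port: int = 5678) -> bool:
    """Start debugpy and wait for a client only when the opt-in flag is set."""
    if os.environ.get(DEBUG_ENV_VAR) != "1":
        return False  # normal run: do not block waiting for a debugger

    # Imported lazily so runs without the flag never load the debugger.
    import debugpy

    # Both address and port are given, as in the snippet above.
    debugpy.listen((host, port))
    debugpy.wait_for_client()
    return True
```

Inside an op, one might call `maybe_attach_debugger()` first and then place `debugpy.breakpoint()` wherever execution should pause, taking care to reach the breakpoint only when the helper returned `True`.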