-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Incorporate SDM in CDK and add publish workflow #58
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe pull request introduces several updates across multiple files, primarily enhancing the connector testing workflow and introducing a new Dockerfile for building a Python-based Airbyte connector. New connectors have been added to the testing matrix, and the workflow now includes conditional job aborts and improved logging. Additionally, a new JSON schema for a low-code source is defined, and several functions and classes related to handling local YAML manifests are introduced. A new GitHub Actions workflow is also added to automate the testing and publishing of Docker images. Changes
Possibly related PRs
Suggested labels
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (7)
airbyte_cdk/cli/source_declarative_manifest/__init__.py (1)
1-6
: Clean and well-structured module initialization! 🎉The code follows Python best practices by explicitly declaring the public API through
__all__
and providing a clean interface to the implementation details.Hey, what do you think about adding a module docstring to describe the purpose of this package and maybe a quick example of how to use the
run
function? This could help other developers understand its role in the declarative source handling. wdyt? 😊If you're interested, here's a suggestion:
""" Provides functionality for handling declarative source manifests in Airbyte CDK. Example: from airbyte_cdk.cli.source_declarative_manifest import run # Example usage of the run function run(args) """airbyte_cdk/cli/source_declarative_manifest/spec.json (1)
4-8
: Schema structure looks good! Quick thought about namingThe schema structure is well-defined with appropriate version and type constraints. I noticed we're using
__injected_declarative_manifest
as a required property - while the double underscore prefix is a common convention for special properties, would it make sense to use a more user-friendly name without the prefix since this is part of the public API? wdyt? 🤔.github/workflows/docker-build.yml (2)
3-16
: Consider enhancing workflow triggers for better CI/CD coverageThe current triggers look good, but what do you think about these suggestions to make it even better? 🤔
- Add PR trigger to test builds before merging:
pull_request: paths: - 'airbyte_cdk/**' - '.github/workflows/docker-build.yml' - 'Dockerfile'
- Add CODEOWNERS validation to protect workflow changes:
push: paths: - '.github/workflows/docker-build.yml' branches: - main required_reviewers: - '@airbytehq/workflow-maintainers'wdyt?
46-63
: Improve publishing step maintainability and fix formattingThe publishing logic works well, but here are some suggestions to make it even better:
Fix trailing spaces on lines 52 and 58 🧹
How about adding version tag validation?
- name: Validate version tag if: ${{ github.event.inputs.version-tag != '' }} run: | if ! [[ ${{ github.event.inputs.version-tag }} =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then echo "Invalid version format. Must be semantic (e.g., 1.0.0)" exit 1 fi
- Consider using a composite action to reduce script complexity:
- name: Push to Docker Hub uses: ./.github/actions/docker-push with: image: airbyte/source-declarative-manifest build-tag: build-test version: ${{ github.event.inputs.version-tag }} sha: ${{ github.sha }}What do you think about these improvements? 🤔
🧰 Tools
🪛 yamllint
[error] 52-52: trailing spaces
(trailing-spaces)
[error] 58-58: trailing spaces
(trailing-spaces)
.github/workflows/connector-tests.yml (1)
Line range hint
3-11
: Consider implementing the workflow optimization.There's a TODO about moving these tests to run after "PyTest (Fast)". Since we're making changes to the workflow, would this be a good time to implement this optimization? It could help reduce CI load by running expensive connector tests only after basic tests pass, wdyt? 🤔
airbyte_cdk/cli/source_declarative_manifest/_run.py (2)
34-59
: Parameterize the 'manifest.yaml' path for greater flexibilityIn the
SourceLocalYaml
class, we have hardcoded thepath_to_yaml
as"manifest.yaml"
. Would it be beneficial to allow this path to be configurable? This way, users could specify a different manifest file if needed, enhancing the flexibility of the class. WDYT?
78-96
: Refactor duplicated error handling into a utility functionThe error handling code in
_get_local_yaml_source
andcreate_declarative_source
is quite similar. Would it make sense to extract this logic into a separate utility function to avoid duplication and simplify maintenance? What are your thoughts?Also applies to: 146-164
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (8)
.github/workflows/connector-tests.yml
(1 hunks).github/workflows/docker-build.yml
(1 hunks)Dockerfile
(1 hunks)airbyte_cdk/cli/__init__.py
(1 hunks)airbyte_cdk/cli/source_declarative_manifest/__init__.py
(1 hunks)airbyte_cdk/cli/source_declarative_manifest/_run.py
(1 hunks)airbyte_cdk/cli/source_declarative_manifest/spec.json
(1 hunks)pyproject.toml
(2 hunks)
✅ Files skipped from review due to trivial changes (2)
- Dockerfile
- airbyte_cdk/cli/init.py
🧰 Additional context used
🪛 yamllint
.github/workflows/docker-build.yml
[error] 52-52: trailing spaces
(trailing-spaces)
[error] 58-58: trailing spaces
(trailing-spaces)
🔇 Additional comments (4)
airbyte_cdk/cli/source_declarative_manifest/spec.json (1)
2-2
: Verify documentation URL accessibility
Hey! Let's make sure this documentation URL is live and contains the relevant content for users. 🔍
✅ Verification successful
Documentation URL is accessible and working ✅
The documentation URL https://docs.airbyte.com/integrations/sources/low-code
returns a successful HTTP 200 response, confirming it's live and accessible to users.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the documentation URL is accessible and returns a valid response
curl -s -I "https://docs.airbyte.com/integrations/sources/low-code" | head -n 1
Length of output: 91
.github/workflows/docker-build.yml (1)
1-63
: Verify workflow security configuration
The workflow handles sensitive operations. Let's verify the security settings:
✅ Verification successful
The workflow's security configuration looks solid! 🔒
The permissions are properly scoped with clear justifications:
id-token: write
is correctly set for trusted publishingcontents: write
is appropriately configured for artifact uploads- The workflow runs only on
main
branch pushes or manual triggers - Docker Hub credentials are only used when needed (on main or manual dispatch)
The security settings align well with similar workflows in the repo that handle artifacts and publishing. What do you think about keeping it as is? 😊
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for potential security issues in the workflow
# Check if GITHUB_TOKEN is used with minimal permissions
rg -l 'GITHUB_TOKEN' .github/workflows/
# Look for other workflow files that might trigger this one
fd -e yml -e yaml . .github/workflows/ -x grep -l "workflow_dispatch"
# Check for workflow_dispatch events in other files that might need protection
fd -e yml -e yaml . .github/workflows/ -x grep -l "workflow_dispatch" {} \; -exec grep -l "permissions:" {} \;
Length of output: 920
Script:
#!/bin/bash
# Let's check the workflow files that use workflow_dispatch for their permissions and security settings
echo "=== docker-build.yml ==="
cat .github/workflows/docker-build.yml | grep -A 5 "permissions:"
echo -e "\n=== test-command.yml ==="
cat .github/workflows/test-command.yml | grep -A 5 "permissions:"
echo -e "\n=== pypi_publish.yml ==="
cat .github/workflows/pypi_publish.yml | grep -A 5 "permissions:"
echo -e "\n=== autofix-command.yml ==="
cat .github/workflows/autofix-command.yml | grep -A 5 "permissions:"
echo -e "\n=== poetry-lock-command.yml ==="
cat .github/workflows/poetry-lock-command.yml | grep -A 5 "permissions:"
echo -e "\n=== pdoc_publish.yml ==="
cat .github/workflows/pdoc_publish.yml | grep -A 5 "permissions:"
echo -e "\n=== connector-tests.yml ==="
cat .github/workflows/connector-tests.yml | grep -A 5 "permissions:"
# Check for write permissions in docker-build.yml
echo -e "\n=== Write permissions in docker-build.yml ==="
cat .github/workflows/docker-build.yml | grep -B 5 -A 5 "write"
Length of output: 2178
🧰 Tools
🪛 yamllint
[error] 52-52: trailing spaces
(trailing-spaces)
[error] 58-58: trailing spaces
(trailing-spaces)
.github/workflows/connector-tests.yml (1)
84-89
: Consider enhancing the TODO comment with more context and tracking.
The TODO comment explains the blocker, but could we make it more actionable? Maybe add a link to an issue tracking the --use-local-cdk
support for manifest connectors? Also, should we add a condition to skip these tests until the support is added to prevent false failures, wdyt? 🤔
Let's check if there's an existing issue tracking this:
pyproject.toml (1)
52-52
: Could you clarify the intended use case for rapidfuzz?
I notice we're adding rapidfuzz, which is typically used for fuzzy string matching. Would love to understand how this fits into the SDM integration - are we planning to use it for manifest parsing or something else? Also, the version constraint ^3.10.1
looks good as it allows minor updates while preventing breaking changes, wdyt? 🤔
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ChristoGrab, I'd like to pushback on the Dockerfile introduction and the push
with vanilla docker commands in the GHA workflow.
Here's why:
- We originally stepped away from Dockerfile and bash scripts to publish because it makes it hard to get consistency and reproductability in the way we build / publish our images. You can read more about that in this blog post (the Legacy CI: A Maze of YAML and Shell Scripts section)
- We decided to use dagger instead so that we can build and publish images programatically. Which allows us to write tests for them and also build abstractions which fosters consistency.
- Taking your approach could indeed be the fastest immediate path but it breaks build consistency which is valuable in the long run.
It's why I want to suggest you to maintain the build logic of SDM in our base_images
package. The release could happen from the CDK repo - but I think the build logic should be kept in base_images
.
Given that:
- SDM is the base image for our manifest connectors
- SDM is based on our
airbyte/python-connector-base
base image. - This python base image is currently managed in our
base_images
package. We declare, build and publish it with a CLI declared in tis package. (The process is currently local and manual... that would have to change to support your usecase).
My recommendations would be to tweak the base_images
package so that the build logic of SDM is stored there. And you can reuse the existing logic to publish multi arch images (which you don't get with docker build
).
The benefits in doing this is that:
- We keep a central place in which our base images are declared. It makes it easier to refactor globally any docker build related thing.
- You can write "sanity checks" which run against the image before publish. which can boost confidence if we ever change the build logic.
- You can automate the changelog management
The challenge in this move is not the declaration of SDM in this package per se but rather that:
- Publishing base image is currently only done locally
- Publishing updates a version controlled changelog which is in the
airbyte
repo. This changelog act as a version controlled state which determines which is the latest published version. We should make the package repo agnostic and provide the changelog to update via a CLI input. - The system / python dependency versions we install in images is are hardcoded. We could pass a
env
file to the CLI that the build logic could use to modify the CDK version that gets installed in the image. - CDK version and SDM version must be decoupled. If we change the
airbyte/python-connector-base
image version SDM uses we'll want to cut a new SDM version but not a new CDK version.
So here's the SDM high level release flow I would suggest (in GHA, in airbyte-python-cdk-repo
):
- Trigger it any time a new CDK version is published to Pypi / the
airbyte/python-connector-base
version changes. - Install the
base_images
package withpoetry
, with a git URL targetting theairbyte
repo. (base_images
could eventually be in its own repo / be published to Pypi). - Checkout
airbyte-python-cdk-repo
- Run a command like:
poetry run generate-release --image=sdm --base-image=<airbyte-python-base-image-version-from-a-version-controlled-file> --bump-type=patch --changelog-path=repo-specific-changelog-path.json --changelog-entry="Update CDK to v.X.X.X"
--env-vars=` - Commit and push the updated changelog.
I know that's a lot to process 😄 . My recommandation would be that you get familiar with our base_images
package and understand how we currently build and publish airbyte/python-connector-base
. And we can of course talk about this live.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
.github/workflows/cdk-publish.yml (4)
2-7
: Consider enhancing the workflow triggers?A few suggestions to make the workflow more robust:
- Maybe add
pull_request
trigger to catch issues early? This would help validate changes before they hit main.- The path filter for
airbyte_cdk/pyproject.toml
could be more specific, wdyt?on: push: paths: - - 'airbyte_cdk/pyproject.toml' # To only publish on CDK version change + - 'airbyte_cdk/pyproject.toml' # Version changes + - '.github/workflows/cdk-publish.yml' # Workflow changes - 'Dockerfile' + pull_request: + paths: + - 'airbyte_cdk/pyproject.toml' + - 'Dockerfile'
32-34
: How about adding a timeout to the test step?To prevent hanging builds, maybe we could add a timeout? Also, we could save the test output as an artifact for debugging, wdyt?
- name: Test image + timeout-minutes: 10 run: | docker run airbyte/source-declarative-manifest:build-test spec + mkdir -p test-results + docker run airbyte/source-declarative-manifest:build-test spec > test-results/spec-output.txt + - name: Upload test results + if: always() + uses: actions/upload-artifact@v3 + with: + name: test-results + path: test-results/
75-81
: How about enhancing the tag check?The tag check is great! Maybe we could make it even better with a timeout and more informative error message? Something like:
- name: Check if tag already exists + timeout-minutes: 5 run: | tag="airbyte/source-declarative-manifest:${{ env.CDK_VERSION}}-${{ github.run_number }}" if docker manifest inspect $tag > /dev/null 2>&1; then - echo "The tag $tag already exists on Dockerhub. Skipping publish to prevent overwrite." + echo "::error::Tag $tag already exists on Dockerhub. This usually means the version in pyproject.toml needs to be bumped." exit 1 fi
83-92
: Consider optimizing the build configuration?The multi-platform build looks good! Maybe we could optimize it further with build caching and compression? This could speed up builds and reduce storage costs. Wdyt about:
- name: Build and push uses: docker/build-push-action@v5 with: context: . platforms: linux/amd64,linux/arm64 push: true + cache-from: type=gha + cache-to: type=gha,mode=max + compression: zstd tags: | airbyte/source-declarative-manifest:latest airbyte/source-declarative-manifest:${{ env.CDK_VERSION }} airbyte/source-declarative-manifest:${{ env.CDK_VERSION }}-${{ github.run_number }}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
.github/workflows/cdk-publish.yml
(1 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/cdk-publish.yml
71-71: shellcheck reported issue in this script: SC2086:info:2:36: Double quote to prevent globbing and word splitting
(shellcheck)
🔇 Additional comments (1)
.github/workflows/cdk-publish.yml (1)
36-44
: Enhance vulnerability scanning configuration
The current Trivy configuration might miss important vulnerabilities. How about:
- Including
HIGH
severity vulnerabilities? - Adding a timeout to prevent long-running scans?
- Generating a SARIF report for better visibility?
- name: Scan for vulnerabilities
uses: aquasecurity/trivy-action@master
continue-on-error: true
with:
image-ref: airbyte/source-declarative-manifest:build-test
- format: table
+ format: 'table,sarif'
+ output: 'trivy-results.sarif'
exit-code: 1
- severity: CRITICAL
+ severity: 'CRITICAL,HIGH'
+ timeout: '5m'
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
.github/workflows/cdk-publish.yml (2)
2-7
: Consider expanding path triggers for completeness?The current path triggers look good, but we might want to rebuild when SDM-related files change too. What do you think about adding these paths? 🤔
paths: - 'airbyte_cdk/pyproject.toml' - 'Dockerfile' + - 'airbyte_cdk/sources/**' + - 'airbyte_cdk/models/**'
32-35
: Enhance test coverage with additional connector commands?Currently, we're only testing the
spec
command. Should we add checks for other standard connector commands too? Here's a thought:- name: Test image run: | docker run airbyte/source-declarative-manifest:build-test spec + docker run airbyte/source-declarative-manifest:build-test check --config tests/test_config.json + docker run airbyte/source-declarative-manifest:build-test discover --config tests/test_config.jsonwdyt? 🤔
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
.github/workflows/cdk-publish.yml
(1 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/cdk-publish.yml
73-73: shellcheck reported issue in this script: SC2086:info:2:36: Double quote to prevent globbing and word splitting
(shellcheck)
🔇 Additional comments (1)
.github/workflows/cdk-publish.yml (1)
66-71
: 🛠️ Refactor suggestion
Consider using OIDC for Docker Hub authentication?
Instead of using Docker Hub credentials directly, we could leverage OIDC for more secure authentication. The workflow already has the necessary permissions. How about:
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
- username: ${{ secrets.DOCKER_HUB_USERNAME }}
- password: ${{ secrets.DOCKER_HUB_PASSWORD }}
+ registry: docker.io
+ token-format: 'docker-hub'
This would eliminate the need to store credentials in GitHub secrets. What are your thoughts? 🔐
Likely invalid or redundant comment.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
.github/workflows/cdk-publish.yml (3)
3-7
: Consider expanding the path triggers?Hey there! 👋 The current path triggers might miss some important changes. How about including:
push: paths: - 'airbyte_cdk/pyproject.toml' - 'Dockerfile' + - '.github/workflows/cdk-publish.yml' + - 'airbyte_cdk/version.py' # If you have a version fileThis would ensure the workflow runs when the workflow file itself is modified or when version-related files change. WDYT? 🤔
24-31
: Add Docker layer caching?Hey! To speed up builds, we could leverage Docker layer caching. Something like:
- name: Build test image uses: docker/build-push-action@v5 with: context: . platforms: linux/amd64 load: true tags: airbyte/source-declarative-manifest:build-test + cache-from: type=gha + cache-to: type=gha,mode=maxThis could significantly reduce build times. Would you like to try this out? ⚡
36-46
: Upload Trivy results as artifacts?Since we're generating a SARIF report, shall we upload it as an artifact for better visibility? Maybe add:
- name: Scan for vulnerabilities uses: aquasecurity/trivy-action@master continue-on-error: true with: image-ref: airbyte/source-declarative-manifest:build-test format: 'table,sarif' output: 'trivy-results.sarif' exit-code: 1 severity: 'CRITICAL,HIGH' timeout: '5m' + - name: Upload Trivy scan results + uses: github/codeql-action/upload-sarif@v2 + if: always() + with: + sarif_file: 'trivy-results.sarif'This would make the security scan results available in the Security tab. Sound good? 🔒
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
.github/workflows/cdk-publish.yml
(1 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/cdk-publish.yml
73-73: shellcheck reported issue in this script: SC2086:info:2:36: Double quote to prevent globbing and word splitting
(shellcheck)
🔇 Additional comments (1)
.github/workflows/cdk-publish.yml (1)
47-93
: Publish job configuration looks solid! 👍
Great job incorporating the security improvements and best practices:
- Proper version retrieval with whitespace handling
- Secure tag checking with proper quoting
- Multi-platform build support
- Appropriate permissions configuration
🧰 Tools
🪛 actionlint
73-73: shellcheck reported issue in this script: SC2086:info:2:36: Double quote to prevent globbing and word splitting
(shellcheck)
Release Notes
New Features
Enhancements
rapidfuzz
(required to build SDM image).Summary by CodeRabbit
Release Notes
New Features
source-the-guardian-api
andsource-pokeapi
.Improvements
run
function for improved accessibility within the source declarative manifest.Updates
pyproject.toml
to include a new dependency,rapidfuzz
, and a new script for the source declarative manifest.