Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue #2590] Replace gh in analytics ETL #3393

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

widal001
Copy link
Collaborator

@widal001 widal001 commented Jan 5, 2025

Summary

Replaces the sub-process call to the gh CLI by replacing it with a GitHubGraphqlClient class that can make calls to the GitHub GraphQL library directly from python.

Fixes #2590

Time to review: 10 mins

Changes proposed

What was added, updated, or removed in this PR.

  • Adds a GitHubGraphqlClient class that can make paginated calls to the GitHub GraphQL API
  • Replaces the calls to a separate sub-process in src/analytics/etl/github/main.py with the GitHubGraphqlClient
  • Removes the make-graphql-call.sh script that previously invoked the gh CLI
  • Removes the `

Context for reviewers

Testing instructions, background context, more in-depth details of the implementation, and anything else you'd like to call out or ask reviewers. Explain how the changes were verified.

Instructions to test

  1. Checkout the PR locally
  2. Re-build the docker image: make build
  3. Run the GitHub data export and sprint reports end-to-end: make sprint-reports-with-latest-data

Notes

We'll want to refactor the src/analytics/integrations/github/ sub-package a little bit further pulling most of the code in the main.py file in that sub-package into src/analytics/etl/github.py instead.

I didn't include that in this PR to try to minimize the amount of code I was changing, but we can/should tackle that refactor in #3203 because some of the functions in main.py still write to the local file system, but can easily be updated to pass the exported data as a python dictionary.

Additional information

Screenshots, GIF demos, code examples or output to help show the changes working as expected.

The local run of sprint reports with the new code matches the output of the last run triggered by AWS step functions (using code in main) posted to slack:

Sprint report for HHS/13

In Slack (based on main)

Screenshot 2025-01-04 at 7 46 28 PM

Locally, based on this feature branch:

Screenshot 2025-01-04 at 7 47 22 PM

Sprint burndown for HHS/17

In Slack (based on main)

Screenshot 2025-01-04 at 7 48 06 PM

Locally, based on this feature branch:

Screenshot 2025-01-04 at 7 47 50 PM

Deliverable percent complete

In Slack (based on main)

Screenshot 2025-01-04 at 7 49 02 PM

Locally, based on this feature branch:

Screenshot 2025-01-04 at 7 48 45 PM

@widal001 widal001 requested review from DavidDudas-Intuitial and coilysiren and removed request for acouch and coilysiren January 5, 2025 00:50
@@ -18,7 +18,6 @@ RUN apt-get update \
libpq-dev \
postgresql \
wget \
jq \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing jq because we no longer need it for transformations

Comment on lines -37 to -38
# Install gh CLI
# docs: https://github.com/cli/cli/blob/trunk/docs/install_linux.md
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this script because we no longer need the gh CLI

@@ -19,6 +19,7 @@ class DBSettings(PydanticBaseEnvConfig):
ssl_mode: str = Field("require", alias="DB_SSL_MODE")
db_schema: str = Field ("app", alias="DB_SCHEMA")
slack_bot_token: str = Field(alias="ANALYTICS_SLACK_BOT_TOKEN")
github_token: str = Field(alias="GH_TOKEN")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this because we now need to reference it directly within the codebase, instead of indirectly like we did previously with the gh CLI

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are in this file, can we rename DBSettings to something more accurate

###########################
# Do not add these values to this file
# to avoid mistakenly committing them.
# Set these in your shell
# by doing `export ANALYTICS_REPORTING_CHANNEL_ID=whatever`
ANALYTICS_REPORTING_CHANNEL_ID=DO_NOT_SET_HERE
ANALYTICS_SLACK_BOT_TOKEN=DO_NOT_SET_HERE
GH_TOKEN=DO_NOT_SET_HERE
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prevents tests from failing if someone hasn't set their GitHub token locally.

Comment on lines -71 to -72
"ANN101", # missing type annotation for self
"ANN102", # missing type annotation for cls
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these because they've been removed in the latest version of ruff

@@ -78,7 +76,6 @@ ignore = [
"PTH123", # `open()` should be replaced by `Path.open()`
"RUF012", # Mutable class attributes should be annotated with `typing.ClassVar`
"TD003", # missing an issue link on TODO
"PT004", # pytest fixture leading underscore - is marked deprecated
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This files is basically a complete refactor, but preserves the existing helper functions for the export to prevent this PR from getting bigger than it already is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removes this because we no longer need it

@@ -40,7 +40,7 @@ def test_init(
records = caplog.records
assert len(records) == 2
assert re.match(
r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\(\w+\)$",
r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\([\w\.]+\)",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this because the tests were failing locally if there was a period in the username, e.g. billy.daly

Copy link
Collaborator

@coilysiren coilysiren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have (1) significant question about the data formatting, everything else looks fine

@@ -31,14 +31,15 @@ MB_DB_PASS=secret123
MB_DB_HOST=grants-analytics-db

###########################
# Slack Configuration #
# Token Configuration #
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Token Configuration #
# Secret Configuration #

{
"project_owner": owner,
"project_number": project,
"issue_title": safe_pluck(item, "content.title"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need safe_pluck. If there's a bunch of fields missing, I would rather the code raise a keyerror, instead of getting us bad (eg. mostly null) data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Call GitHub graphql API directly from python
2 participants