Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance HexRunProjectOperator and HexHook #21

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jacobcbeaudin
Copy link

@jacobcbeaudin jacobcbeaudin commented Sep 7, 2024

Enhance HexRunProjectOperator and HexHook

This PR addresses issues #19 and #20, significantly improving the functionality and reliability of the HexRunProjectOperator and HexHook. Most importantly, it mitigates the issue where jobs were incorrectly marked as failed due to Hex API failures, not actual job failures.

Key Improvements

  1. Prevent false job failures: Jobs are no longer marked as failed when only the Hex status API fails, not the job itself. This greatly reduces false negatives and improves reliability.
  2. Removed deprecated apply_default decorator from HexRunProjectOperator (Fixes Remove deprecated apply_default decorator from HexOperator #19)
  3. Implemented robust retry mechanism for API polling in HexHook (Addresses Add Retry Mechanism for Failed Polls in HexRunProjectOperator #20)
  4. Decoupled polling logic for better maintainability and testability
  5. Correct make init command

Detailed Changes

1. Prevent false job failures

  • Implemented logic to distinguish between Hex API failures and actual job failures
  • Jobs now only fail if the Hex job itself fails, not if there are API communication issues
  • This significantly improves the reliability and accuracy of job status reporting

2. Remove deprecated apply_default decorator

  • Removed the @apply_default decorator from the HexRunProjectOperator class
  • This eliminates the deprecation warning and aligns with current Airflow best practices

3. Implement retry mechanism for API polling

  • Added retry logic to handle temporary API failures during polling
  • New parameters max_poll_retries and poll_retry_delay control retry behavior
  • Improves robustness and reduces false failure reports

4. Decouple polling logic

  • Separated polling logic into its own method for better separation of concerns
  • Enhances testability and maintainability of the code

5. New parameters

  • max_poll_retries: Maximum number of retries for a failed poll (default: 3)
  • poll_retry_delay: Delay between retries in seconds (default: 5)

6. Update pre-commit hooks

  • Updated pre-commit configuration to use https://github.com/pycqa/flake8 instead of the GitLab repository
  • Resolved issues with pre-commit hook initialization

Testing

  • Added unit tests to verify the new retry behavior and correct job failure reporting
  • Manually tested the operator with various configurations, including simulated API failures
  • Ran sync command in docker compose dev environment successfully

Test Results
image

Checklist

  • Code follows the project's coding standards
  • Tests have been added or updated to cover the changes
  • Documentation has been updated to reflect new behavior
  • All tests pass locally
  • Pre-commit hooks pass

This update significantly improves the reliability of the Hex Airflow provider by ensuring that jobs are not incorrectly marked as failed due to API communication issues. Users will now have a more accurate representation of their job statuses, reducing false alarms and improving overall system dependability.

@jacobcbeaudin jacobcbeaudin force-pushed the jacobcbeaudin/enhance-hex-operator branch 5 times, most recently from fc65f1e to 7e2a80b Compare September 8, 2024 17:27
- Remove deprecated "apply_default" decorator from HexRunProjectOperator
- Implement retry mechanism for API polling in HexHook
- Decouple polling logic from run_and_poll method
- Add max_poll_retries and poll_retry_delay parameters
- Update flake8 pre-commit hook to use GitHub repository
- Update pre-commit hooks to latest versions
@jacobcbeaudin jacobcbeaudin force-pushed the jacobcbeaudin/enhance-hex-operator branch from 7e2a80b to 853e83b Compare September 8, 2024 17:38
@jacobcbeaudin jacobcbeaudin marked this pull request as ready for review September 8, 2024 17:39
@jacobcbeaudin jacobcbeaudin marked this pull request as draft September 9, 2024 05:40
@jacobcbeaudin jacobcbeaudin marked this pull request as ready for review September 9, 2024 05:46
Comment on lines -190 to -191
if project_status not in VALID_STATUSES:
raise AirflowException(f"Unhandled status: {project_status}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to remove this exception

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this back

Comment on lines -121 to +120
json=mock_run,
json={"projectId": "abc-123", "runId": "1"},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use defined mock_run object

endpoint = f"api/v1/project/{project_id}/run/{run_id}"
method = "DELETE"

self.run(method=method, endpoint=endpoint)
return run_id

def run_and_poll(
def run_status_with_retries(
self, project_id: str, run_id: str, max_retries: int = 3, retry_delay: int = 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A retry limit should be enforced server-side to prevent the caller from specifying an excessive number of retries for project runs, which could lead to resource exhaustion or unintended behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is implementing a server-side retry limit currently prioritized? As this PR focuses on client-side Airflow hooks, I can hardcode a conservative number of retries and delay to prevent configurability. This would address immediate concerns while keeping the implementation on the client side. Let me know if you'd like me to proceed with this approach.

Copy link
Collaborator

@clrcrl clrcrl Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay here! (Also this is a product-manager translation from an engineer, so apologies if I lose something in translation!)

We rate limit requests, so at this time don't intend to implement a server-side retry limit. Also, our API is fairly non-flakey, so we don't think users will need to retry that often.

That being said, it would still be great to hardcode a conservative number of retries juuuust in case — 3 seems like a good spot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove deprecated apply_default decorator from HexOperator
3 participants