[source-google-ads] Timeout issue with large datasets in shopping_performance_view #49254

Open

alenoir opened this issue Dec 12, 2024 · 3 comments

alenoir commented Dec 12, 2024

Connector Name

source-google-ads

Connector Version

3.7.9

What step the error happened?

During the sync

Relevant information

The job fails due to a timeout in the request_records_job method, which is limited to 5 minutes via the @detached(timeout_minutes=5) decorator. Despite sufficient resources allocated to the pod, the operation does not complete within the timeout, likely due to the large volume of data or slow API responses.
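
For context, here is a rough sketch of how a timeout decorator along the lines of @detached can behave. This is illustrative only, not the connector's actual implementation; it just shows where a hardcoded default like 5 minutes ends up baked in:

    # Illustrative sketch only, not the connector's real @detached decorator:
    # run the wrapped method in a worker thread and fail once the deadline passes.
    from concurrent.futures import ThreadPoolExecutor
    from concurrent.futures import TimeoutError as FutureTimeoutError
    from functools import wraps


    def detached(timeout_minutes: float = 5):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                executor = ThreadPoolExecutor(max_workers=1)
                future = executor.submit(func, *args, **kwargs)
                try:
                    return future.result(timeout=timeout_minutes * 60)
                except FutureTimeoutError:
                    raise TimeoutError(
                        f"Method '{func.__name__}' timed out after {timeout_minutes} minutes."
                    )
                finally:
                    # Don't block on a still-running call; the worker thread is abandoned.
                    executor.shutdown(wait=False)
            return wrapper
        return decorator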

Steps to Reproduce:

  1. Configure a Google Ads source with access to the shopping_performance_view stream.
  2. Set up a sync to any destination (e.g., BigQuery).
  3. Run the sync with a large dataset (e.g., multiple months of data).
  4. Observe the timeout after 5 minutes.

Error Message:

TimeoutError: request_records_job exceeded timeout of 5 minutes.

Pod Metrics:

POD                               NAME           CPU(cores)   MEMORY(bytes)   
replication-job-12831-attempt-0   destination    7m           389Mi           
replication-job-12831-attempt-0   orchestrator   6m           482Mi           
replication-job-12831-attempt-0   source         14m          103Mi     

Additional Information:

  • Airbyte Version: 1.2.0
  • Deployment Method: Kubernetes
  • Google Ads API Stream: shopping_performance_view
  • Environment: GCP Kubernetes Engine

Feature Request/Question:

  1. Is it possible to make the timeout_minutes parameter for the @detached decorator configurable?
  2. Are there any known strategies or optimizations to handle large datasets with this connector?
  3. Could retries or chunked processing be implemented for long-running operations? (A rough sketch of the pattern I have in mind follows this list.)
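
To make question 3 concrete, here is a rough sketch of the retry-plus-chunking pattern, using the backoff library; run_gaql_query is a hypothetical stand-in for the real Google Ads call, not connector code:

    import backoff


    def run_gaql_query(date: str) -> list[dict]:
        # Hypothetical stand-in for the real Google Ads API request for one day.
        return [{"segments.date": date, "metrics.clicks": 0}]


    @backoff.on_exception(backoff.expo, TimeoutError, max_tries=5)
    def fetch_day(date: str) -> list[dict]:
        # A TimeoutError raised here is retried with exponential backoff.
        return run_gaql_query(date)


    def fetch_range(dates: list[str]) -> list[dict]:
        # Chunked processing: one short request per day instead of one long job.
        rows: list[dict] = []
        for date in dates:
            rows.extend(fetch_day(date))
        return rows


    print(fetch_range(["2024-11-01", "2024-11-02"]))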

Relevant log output

2024-12-12 09:58:57 source > Caught retryable error Method 'request_records_job' timed out after 5.0 minutes after 1 tries. Waiting 1 seconds then retrying...

Contribute

  • Yes, I want to contribute
marcosmarxm (Member) commented

@alenoir would it be possible for you to build the connector locally with an increased timeout value and check whether you're able to get the data? If that turns out to be the fix, we can work later on adding a parameter to configure the timeout.

alenoir commented Dec 17, 2024

Thanks for the suggestion!

I performed a few tests locally to address the timeout issue:

  1. Increased the timeout to 10 minutes:
    Unfortunately, this did not resolve the issue. The query still failed when fetching data from the shopping_performance_view stream.

  2. Adjusted slice_duration in the connector:
    I modified the following line in streams.py:

    slice_duration = pendulum.duration(days=0)

    This change allowed the connector to fetch the data correctly without timing out.

It seems that by setting slice_duration to 0 days, the data is retrieved in smaller, more manageable slices, avoiding the timeout problem.
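
To illustrate why this helps, here is a rough sketch of how date-based slicing behaves (not the connector's exact code; the 14-day default below is just a placeholder):

    import pendulum


    def generate_slices(start_date: str, end_date: str, slice_days: int = 14):
        # With slice_days=0 every slice spans a single day, so a month of data
        # becomes ~30 small queries instead of a few large ones that risk the timeout.
        slice_duration = pendulum.duration(days=slice_days)
        start = pendulum.parse(start_date)
        end = pendulum.parse(end_date)
        while start <= end:
            chunk_end = min(start + slice_duration, end)
            yield {"start_date": start.to_date_string(), "end_date": chunk_end.to_date_string()}
            start = chunk_end + pendulum.duration(days=1)


    for s in generate_slices("2024-11-01", "2024-11-05", slice_days=0):
        print(s)  # five one-day slices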


Next Steps:

Would it be possible to add a configurable parameter for slice_duration in the connector? This would allow users to fine-tune the slicing behavior based on their dataset size and avoid hardcoded values.
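
For illustration, the wiring could be as small as reading an optional config key; the slice_duration_days key name and the 14-day fallback below are placeholders, not existing connector options:

    import pendulum


    def resolve_slice_duration(config: dict) -> pendulum.Duration:
        # Hypothetical: fall back to a default when the user doesn't set the option.
        return pendulum.duration(days=int(config.get("slice_duration_days", 14)))


    print(resolve_slice_duration({"slice_duration_days": 0}))  # 0 days -> daily slices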

Let me know how I can assist further!

alenoir commented Dec 17, 2024

I tested with a 30-minute timeout, and it works — the data is fetched without issues.

Would it be possible to increase this timeout in the connector or make it configurable for users?

Let me know your thoughts!
