HOTFIX: Allow the mastodon copy to fail for taking too long #64

Merged
AetherUnbound merged 1 commit into main from fix/rotate-backup-timeout
Dec 18, 2023


AetherUnbound
Contributor

We're seeing an issue where files larger than 30 GB fail in the S3CopyObjectOperator with the following exception:

[2023-12-17, 00:15:09 UTC] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.10/ssl.py", line 1274, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.10/ssl.py", line 1130, in read
    return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/httpsession.py", line 455, in send
    urllib_response = conn.urlopen(
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='sfo3.digitaloceanspaces.com', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/s3.py", line 314, in execute
    s3_hook.copy_object(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 830, in copy_object
    response = self.get_conn().copy_object(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 515, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 917, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/client.py", line 940, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/endpoint.py", line 202, in _send_request
    while self._needs_retry(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/endpoint.py", line 354, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/httpsession.py", line 492, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: [redacted]

Ultimately we'll need to change the botocore timeout config to raise the read timeout above the 60 seconds shown in the traceback (like this post suggests). Until I get a chance to do that, this change allows the cleanup tasks to run even if the copy step fails. Previously, a failed copy step meant the backups kept accumulating and taking up more space (and money). This is a temporary workaround to make sure we aren't incurring extra costs.
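For reference, a minimal sketch of what the longer-term timeout change could look like. This is not the change made in this PR. The helper names (`timeout_settings`, `make_s3_client`) and the 900-second value are assumptions for illustration only. `read_timeout`, `connect_timeout`, and `retries` are existing `botocore.config.Config` options, and how that config reaches the Airflow hook depends on the deployment.

```python
from typing import Any, Dict


def timeout_settings(
    read_timeout_s: int = 900,    # hypothetical value; the traceback shows 60
    connect_timeout_s: int = 60,
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Build keyword arguments for botocore.config.Config (illustrative helper)."""
    if read_timeout_s <= 0 or connect_timeout_s <= 0:
        raise ValueError("timeouts must be positive")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return {
        "read_timeout": read_timeout_s,
        "connect_timeout": connect_timeout_s,
        "retries": {"max_attempts": max_attempts, "mode": "standard"},
    }


def make_s3_client(endpoint_url: str, **overrides: int):
    """Create an S3 client with the longer read timeout (illustrative helper).

    boto3/botocore are imported lazily so that timeout_settings() can be
    used and checked without them installed.
    """
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        config=Config(**timeout_settings(**overrides)),
    )
```

A client built this way, for example `make_s3_client("https://sfo3.digitaloceanspaces.com")`, would wait up to 900 seconds for a response before raising `ReadTimeoutError`. Whether that is long enough for a >30GB server-side copy is untested here.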

AetherUnbound added the bug (Something isn't working) label on Dec 18, 2023
AetherUnbound requested a review from a team as a code owner on December 18, 2023 02:16
AetherUnbound merged commit 1fed2a0 into main on Dec 18, 2023
2 checks passed
AetherUnbound deleted the fix/rotate-backup-timeout branch on December 18, 2023 02:23