[BUG] Failed to launch workflows due to too many requests #5234

Closed
2 tasks done
pingsutw opened this issue Apr 15, 2024 · 3 comments
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working exo flytepropeller

Comments

@pingsutw
Member

Describe the bug

I get the error below when I run a workflow with 100 launch plans, each of which has 100 tasks.

  • max Parallelism: 100
Workflow[flytesnacks:development:workflow.agent.load_test.sleep_wf] failed. RuntimeExecutionError: max number of system 
retry attempts [52/50] exhausted. Last known status message: Workflow[] failed. ErrorRecordingError: failed to publish event, 
caused by: EventSinkError: Error sending event, caused by [rpc error: code = Unavailable desc = unexpected HTTP status 
code received from server: 429 (Too Many Requests); transport: received unexpected content-type "text/html"]

Expected behavior

The workflow should not fail.

Additional context to reproduce

No response

Screenshots

image

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@pingsutw pingsutw added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers flytepropeller and removed untriaged This issues has not yet been looked at by the Maintainers labels Apr 15, 2024
@pingsutw
Member Author

pingsutw commented Apr 15, 2024

Another error:

Workflow[flytesnacks:development:workflow.agent.load_test.load_test_wf] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: Workflow[flytesnacks:development:workflow.agent.load_test.load_test_wf] failed. CausedByError: Failed to propagate Abort for workflow. Error:
0: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
1: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
2: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
3: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
4: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
5: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
6: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
7: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
8: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
9: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
10: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
11: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
12: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
13: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
14: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
15: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
16: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
17: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
18: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]
19: [SystemError] system error, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Internal desc = rpc error: code = Internal desc = unexpected error type for: ERROR: duplicate key value violates unique constraint "execution_operations_pkey" (SQLSTATE 23505)]

@hamersaw hamersaw added exo backlogged For internal use. Reserved for contributor team workflow. labels Apr 22, 2024
@wild-endeavor
Contributor

For the 429 backoff, #5166 should help somewhat. Can we try again locally after it is merged and see whether this still reproduces?
https://github.com/flyteorg/flyte/blob/master/flytepropeller/events/config.go#L26-L27

The other issue is a bit different. We should probably make it a different ticket.
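
To make the 429 backoff idea concrete, here is a minimal sketch of retrying a throttled call with capped exponential backoff and jitter. It is not FlytePropeller's implementation and does not describe what #5166 changes; `TooManyRequestsError`, `send_with_backoff`, and all parameter names and defaults are hypothetical and chosen only for illustration.

```python
import random
import time


class TooManyRequestsError(Exception):
    """Hypothetical stand-in for a 429 (Too Many Requests) response."""


def send_with_backoff(send, event, *, base=0.1, factor=2.0, cap=10.0,
                      max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call send(event), retrying on TooManyRequestsError.

    The wait before each retry grows geometrically (base, base*factor, ...)
    and is capped at `cap`. Each wait is scaled by a random factor in [0, 1)
    so that many throttled clients do not all retry at the same moment.
    """
    delay = base
    for attempt in range(1, max_attempts + 1):
        try:
            return send(event)
        except TooManyRequestsError:
            if attempt == max_attempts:
                # Out of attempts: surface the throttling error to the caller.
                raise
            sleep(min(delay, cap) * rng())
            delay *= factor
```

Passing `sleep` and `rng` as arguments keeps the sketch easy to exercise without real waiting. Whether a cap and attempt limit like these suit a load of this size is an open question that would need to be checked against the actual reproduction.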

@hamersaw
Contributor

Closing this, as the aforementioned PR is the best mitigation.
