Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Handling duplicate calls to provision API #445

Closed
dbwiddis opened this issue Jan 24, 2024 · 3 comments · Fixed by #466
Closed

[DISCUSS] Handling duplicate calls to provision API #445

dbwiddis opened this issue Jan 24, 2024 · 3 comments · Fixed by #466
Assignees
Labels
discuss enhancement New feature or request

Comments

@dbwiddis
Copy link
Member

dbwiddis commented Jan 24, 2024

Is your feature request related to a problem?

If a user calls the provision API twice using the same workflow, its provisioned resources are duplicated. This can be confusing as the combination of workflow step name and workflow step ID is no longer unique, and choosing the correct resource ID requires guesswork. Also the state changes to COMPLETED when the first workflow finishes, even though more resources are being added.

Additionally this results in the additional copy (or copies) consuming cluster resources.

What solution would you like?

Prevent a workflow from being provisioned if its status is PROVISIONING. This status may need to be checked multiple times (like double-checked-locking) to prevent the race condition where two provision calls both see NOT_STARTED before one of them begins provisioning.

While addressing this, we also should discuss how to handle FAILED. We could prevent provisioning and require users to use the deprovision API first (basically only allow provisioning from the NOT_STARTED state) or we could try to "pick up from where we left off", adding logic to skip provisioining steps if the resource already exists (complex and brittle).

What alternatives have you considered?

Leaving the API as is, and/or encouraging use of the create API with the provisioning parameter, which does not have this shortcoming.

Do you have any additional context?

We should probably handle this similarly to trying to create the same Anomaly Detector twice.

@dbwiddis dbwiddis added enhancement New feature or request untriaged discuss labels Jan 24, 2024
@joshpalis
Copy link
Member

Guard rails are a good idea, I think we should block another provision call if the status is PROVISIONING, COMPLETED, or FAILED and only allow provisioning from a NOT_STARTED state. This is somewhat similar behavior implemented for the update API in #416

@owaiskazi19
Copy link
Member

owaiskazi19 commented Jan 24, 2024

The solution looks good to check for the status for the previous provision.

This status may need to be checked multiple times

Trying to understand more here. Do we have to build a functionality like what we have for retry currently to keep checking for the status or a cron job of our own?

While addressing this, we also should discuss how to handle FAILED

If the status of the previous provision is FAILED. We can perform either options:

  1. Let the user know the previous provision was failed for the workflowId, run the deprovision API manually. This would eventually delete the workflowID if deprovisioning is successful. There can be a case where deprovisioning itself failed, then this might result in a deadlock condition.

  2. We can perform deprovisioning on our end, but again we might face the above issue. User has to perform a manual deprovisioning in such cases based on our log message Failed to deprovision some resources.

I am inclined towards option 1.

@dbwiddis
Copy link
Member Author

Trying to understand more here. Do we have to build a functionality like what we have for retry currently to keep checking for the status or a cron job of our own?

No, it's not about retries it's merely to catch a race condition.

  1. Thread A checks status as not started, moves on to next step.
  2. Thread B checks status as not started, moves on to next step.
  3. Thread A changes status to provisioning.
  4. Thread B continues and results in a duplicate.

But relying on the status won't work as once you change it to provisioning... so we really need some other way to "lock" a thread for provisioning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants