This repository has been archived by the owner on May 12, 2021. It is now read-only.

Generic exceptions within storage.write statements are not caught, potentially causing inconsistent state #32

Open
jordanly opened this issue Aug 16, 2018 · 0 comments

Comments

@jordanly
Contributor

A finding from #31.

A user created an update to remove instances from a job. This throws a NullPointerException, as described in the issue above, but the LoggingInterceptor swallows the exception. The exception surfaces there because the initial evaluation of the update runs inside the RPC method that the user calls (follow along the start(...) method if you are not convinced).

Although start(...) throws a NullPointerException, the update is still added to the MemJobUpdateStore but is not persisted to the log. The start(...) code calls saveJobUpdate(...), which adds the update to the in-memory stores. However, because the NullPointerException is thrown before the write lock is released, these operations are never persisted to the log. The scheduler's storage system is transactional by design, so everything is added to the log at the end of the write. As a result, the memory store no longer matches the log store.
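To make the failure mode concrete, here is a minimal sketch of how an exception thrown mid-write can leave memory and log out of sync. It assumes a toy model only: ToyStorage, its memStore/log lists, and InconsistencyDemo are hypothetical names for illustration and are not the scheduler's real classes; only saveJobUpdate(...) echoes a name from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified model of a transactional store; not the scheduler's classes. */
class ToyStorage {
  final List<String> memStore = new ArrayList<>();        // stands in for MemJobUpdateStore
  final List<String> log = new ArrayList<>();             // stands in for the persisted log
  private final List<String> pending = new ArrayList<>(); // ops waiting for end of write
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  interface Work { void apply(ToyStorage s); }

  void saveJobUpdate(String update) {
    memStore.add(update);  // applied to memory immediately
    pending.add(update);   // queued for the log until the write finishes
  }

  void write(Work work) {
    lock.writeLock().lock();
    try {
      work.apply(this);
      log.addAll(pending); // only reached if the work returns normally
    } finally {
      pending.clear();
      lock.writeLock().unlock();
    }
  }
}

class InconsistencyDemo {
  static ToyStorage run() {
    ToyStorage storage = new ToyStorage();
    try {
      storage.write(s -> {
        s.saveJobUpdate("update-1");
        throw new NullPointerException("simulated failure after saveJobUpdate");
      });
    } catch (NullPointerException e) {
      // Swallowed here, much as an interceptor might swallow it.
    }
    return storage;
  }

  public static void main(String[] args) {
    ToyStorage s = run();
    // prints: memStore=[update-1] log=[]
    System.out.println("memStore=" + s.memStore + " log=" + s.log);
  }
}
```

In this toy model the memory store keeps update-1 while the log stays empty, which is the same shape of mismatch described above.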

I think we should catch all unhandled exceptions within the write lock and immediately kill the scheduler. This would keep such errors from leaving the scheduler in a potentially inconsistent state and from corrupting the log, which would prevent an easy rollback.
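One way this proposal could look, as a rough sketch rather than a patch: wrap the work done under the write lock and treat any unchecked exception as fatal. FailFastWriter, onFatal, and haltingJvm() are hypothetical names introduced for illustration; the real storage.write signature and its declared exceptions are not modeled here.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

/** Hypothetical sketch of the proposal; names are illustrative, not the scheduler's API. */
class FailFastWriter {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Consumer<Throwable> onFatal; // injectable so tests need not kill the JVM

  FailFastWriter(Consumer<Throwable> onFatal) {
    this.onFatal = onFatal;
  }

  /** Assumed production-style handler: report the error, then halt the process. */
  static FailFastWriter haltingJvm() {
    return new FailFastWriter(t -> {
      System.err.println("Unhandled exception inside storage write, halting: " + t);
      Runtime.getRuntime().halt(1);
    });
  }

  void write(Runnable work) {
    lock.writeLock().lock();
    try {
      work.run();
    } catch (RuntimeException | Error e) {
      // Any unchecked failure inside the write is treated as fatal.
      onFatal.accept(e);
      throw e; // only reached if the handler did not halt the process
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

With haltingJvm(), the process stops before any later write can build on memory state that never reached the log, so a restart would replay only what was persisted. Passing a different handler, for example one that records the throwable, keeps the sketch testable.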
