
IF: Fix threading issues in qc_chain & chain_pacemaker #1541

Closed
Tracked by #1508
heifner opened this issue Aug 22, 2023 · 3 comments · Fixed by #1550 or #1574

heifner commented Aug 22, 2023

Depends on #1523.

Do not access chain state from net threads. Solution: emit signals from the main thread to update internal HotStuff state; access to that internal state from net threads is then protected by a mutex.

Also, the get-finalizer-state request in the http plugin (protected by a mutex) should read finality state data from a cache that is updated by the threads modifying the finality state, so that readers see a consistent snapshot and never partial updates to the state.
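
A minimal sketch of that pattern, with hypothetical names (illustrative only, not the actual leap API): the main thread mutates the state from a signal handler, and net threads only ever obtain a copy under the same mutex.

#include <cstdint>
#include <mutex>

struct hotstuff_state {                 // assumed stand-in for the internal IF state
   uint32_t last_proposal_height = 0;
};

class if_state_holder {
   mutable std::mutex _mtx;
   hotstuff_state     _state;           // mutated only from the main thread
public:
   // connected to a controller signal; runs on the main thread
   void on_accepted_block(uint32_t height) {
      std::lock_guard g(_mtx);
      _state.last_proposal_height = height;
   }
   // safe to call from net threads
   hotstuff_state get_state() const {
      std::lock_guard g(_mtx);
      return _state;                    // return a copy, never a reference
   }
};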

enf-ci-bot moved this to Todo in Team Backlog Aug 22, 2023
heifner changed the title from "IF: Fix threading issues in qc_chain" to "IF: Fix threading issues in qc_chain & chain_pacemaker" Aug 22, 2023
ericpassmore (Contributor) commented:

Comments in qc_chain.cpp

// Invoked when we could perhaps make a proposal to the network (or to ourselves, if we are the leader).
void qc_chain::on_beat(){
// Non-proposing leaders do not care about on_beat(), because leaders react to a block proposal
// that comes from processing an incoming new block message from a proposer instead.
// on_beat() is called by the pacemaker, which decides when it is time for us to check whether,
// as proposers, we should propose a new hotstuff block to the network (or to ourselves, which
// is faster and doesn't require the bandwidth of an additional gossip round for a new proposed
// block).
// The current criterion for a leader selecting a proposal among all the proposals it receives
// is to go with the first valid one it receives. So if a proposer is also a leader, it silently
// goes with its own proposal, which is hopefully valid at the point of generation, which is also
// the point of consumption.
//
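
An illustrative sketch of the control flow that comment describes; every helper name below is hypothetical, not the actual qc_chain implementation:

void qc_chain::on_beat() {
   if (!am_i_proposer())               // hypothetical helper: non-proposers ignore beats
      return;
   auto proposal = build_proposal();   // hypothetical helper
   if (am_i_leader())
      process_proposal(proposal);      // self-delivery: skips the extra gossip round
   else
      broadcast_proposal(proposal);    // hand the proposal to the network/leader
}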

Comments in qc_chain.hpp
// This mutex synchronizes all writes to the data members of this qc_chain against
// get_state() calls (which ultimately come from e.g. the HTTP plugin).
// This could result in an HTTP query that gets the state of the core while it is
// in the middle of processing a given request, since this is not serializing
// against high-level message or request processing boundaries.
// If that behavior is not desired, we can instead synchronize this against a
// consistent past snapshot of the qc_chain's state for e.g. the HTTP plugin,
// which would be updated at the end of processing every request to the core
// that does alter the qc_chain (hotstuff protocol state).
// And if the chain_pacemaker::_hotstuff_global_mutex locking strategy is ever
// changed, then this probably needs to be reviewed as well.
//
mutable std::mutex _state_mutex;
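
A minimal sketch of how get_state() might use _state_mutex (the signature and member names are assumptions; the real code may differ): writers and get_state() take the same lock, and callers receive a copy, never a reference into live state.

void qc_chain::get_state(chain::finalizer_state& fs) const {
   std::lock_guard g(_state_mutex);   // serialize against concurrent writers
   fs.b_leaf   = _b_leaf;             // _b_leaf and _v_height are assumed members
   fs.v_height = _v_height;
}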

Comments in chain_pacemaker.hpp
// This serializes all messages (high-level requests) to the qc_chain core.
// For maximum safety, the qc_chain core will only process one request at a time.
// These requests can come directly from the net threads, or indirectly from a
// dedicated finalizer thread (TODO: discuss).
#warning discuss
std::mutex _hotstuff_global_mutex;
chain::controller* _chain = nullptr;
qc_chain _qc_chain;
uint32_t _quorum_threshold = 15; //FIXME/TODO: calculate from schedule
};
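
A hedged sketch of the serialization the comment above describes (the handler name is an assumption): each high-level message acquires _hotstuff_global_mutex before reaching the qc_chain core, so the core processes one request at a time and can stay lock-free internally.

void chain_pacemaker::on_hs_vote_msg(const chain::hs_vote_message& msg) {
   std::lock_guard g(_hotstuff_global_mutex);   // may be entered from a net thread
   _qc_chain.on_hs_vote_msg(msg);               // the core itself holds no locks
}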

fcecin commented Aug 23, 2023

Note (1): the upcoming PR for this moves everything in qc_chain that isn't part of its interface to private:, as discussed in the last meeting; that helped with solving the threading issues.

Note (2): the "Non-proposing leaders" / on_beat() comment is not a threading issue, so I'm not addressing it in the upcoming PR (this item can be moved to another ticket if needed).

Status: the solution to the get_state cache is validated. The overhead ended up being just an increment of a uint64 inside the global engine mutex; qc_chain is now 100% lock-free internally; the cache, which now lives in the chain_pacemaker, has an R/W lock; and the cache update check itself is lock-free. I will work on "do not access chain state from net threads" next.
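
A sketch of that caching scheme as described above (all names are hypothetical): writers bump an atomic version counter while already inside the engine mutex; readers compare versions and serve cache hits under a shared (read) lock, taking the exclusive lock only to refresh a stale snapshot.

#include <atomic>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

class finalizer_state_cache {
   std::atomic<uint64_t>     _core_version{0};    // bumped on every qc_chain state change
   mutable std::shared_mutex _cache_mtx;
   mutable uint64_t          _cached_version = 0;
   mutable chain::finalizer_state _cached;        // snapshot type from chain/hotstuff.hpp
public:
   void on_core_updated() { ++_core_version; }    // cheap: a single atomic increment

   chain::finalizer_state read(const qc_chain& core) const {
      const uint64_t v = _core_version.load();
      {
         std::shared_lock rl(_cache_mtx);         // cache hit: shared lock only
         if (_cached_version == v)
            return _cached;
      }
      std::unique_lock wl(_cache_mtx);            // cache miss: refresh the snapshot
      if (_cached_version != v) {
         core.get_state(_cached);                 // assumed out-parameter form of get_state()
         _cached_version = v;
      }
      return _cached;
   }
};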

fcecin added a commit that referenced this issue Aug 23, 2023
- Moved everything in qc_chain.hpp that is not an interface to private:
- Fixed test_hotstuff to use get_state() instead of accessing various (now) private fields
- Fixed test_pacemaker to use get_id() instead of accessing (now) private _id field
- Added get_proposal() helper method to the finalizer_state struct in chain/hotstuff.hpp, since finalizer_state is now used in tests, etc. (see the sketch after this list)
- Made qc_chain class lock-free; all thread synchronization is now done externally
- Solved concurrency get_finalizer_state vs. qc_chain updates by caching chain_pacemaker::get_state(); zero overhead on qc_chain state updating, low overhead on external state read (no global/write locks acquired on cache hits)
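
A plausible shape for that get_proposal() helper (the container name and signature are assumptions; see chain/hotstuff.hpp for the real one):

const chain::hs_proposal_message* chain::finalizer_state::get_proposal(const fc::sha256& proposal_id) const {
   auto it = proposals.find(proposal_id);   // proposals: assumed map keyed by proposal id
   return it != proposals.end() ? &it->second : nullptr;
}
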
arhag linked a pull request Aug 23, 2023 that will close this issue
fcecin added a commit that referenced this issue Aug 24, 2023
IF: Fix threading 1/2 (#1541) & qc_chain private:

#1541 should not be closed by this. Access to the `_chain` controller pointer from net vs. main threads will be dealt with in another PR.

Using the GitHub "Merge pull request" button, which should do the right thing.
heifner added a commit that referenced this issue Aug 28, 2023
heifner added a commit that referenced this issue Aug 28, 2023
heifner added a commit that referenced this issue Aug 29, 2023
IF: Add thread safety to chain_pacemaker access of chain state
heifner self-assigned this Aug 29, 2023
heifner moved this from Todo to In Progress in Team Backlog Aug 29, 2023
heifner moved this from In Progress to Awaiting Review in Team Backlog Aug 29, 2023

heifner commented Aug 29, 2023

Completed by #1550 & #1574

heifner closed this as completed Aug 29, 2023
github-project-automation bot moved this from Awaiting Review to Done in Team Backlog Aug 29, 2023