Transactional row locking
In TokuMX, write transactions have serializable isolation for all documents affected by the write operation. For an insert, this means the primary key; for an update, it means the entire set of documents matched by the query. As a result, an update that looks like db.coll.update({}, { $set: { field: 1 } }, { multi: true }) is pathologically bad: it will serialize with every other write operation that affects any document in coll. TokuMX accomplishes this by storing row lock information in the TokuFT storage engine.
Row locks are represented as simple [begin, end] key range pairs whose endpoints may or may not be inclusive. Taking the above example, the table-scan update statement will store a [ -inf, +inf ] row lock (with both endpoints inclusive) in the primary-key index for coll, which is typically the _id index. Any new transaction that wants to write a row between -inf and +inf (i.e., any row at all, since every key falls into this range) will see that an existing, overlapping lock exists and will block until it is released.
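To make the contrast concrete, here is a minimal mongo-shell sketch (the collection and field names are hypothetical):

```
// Pathological: the empty query matches every document, so this statement takes a
// [ -inf, +inf ] range lock on the primary-key (_id) index and serializes with every
// other write to coll until the transaction completes.
db.coll.update({}, { $set: { flagged: true } }, { multi: true });

// Targeted: matching a single document by _id takes a point lock on that one primary
// key, so it only conflicts with writers touching the same document.
db.coll.update({ _id: 12345 }, { $set: { flagged: true } });
```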
Documents are stored in the primary-key index. For most collections, this is the _id index. All indexes are backed by a dictionary in the TokuFT storage engine, and thus each has a locktree to represent the locks currently held and pending for live transactions. If an update modifies (or matches on) secondary index keys, those keys will be locked in the locktree for that index. This necessarily means that two update statements whose query portions overlap will serialize.
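As a hedged illustration of secondary-index locking (the collection, index, and field names are hypothetical):

```
// A collection with a secondary index on "user" in addition to the _id primary key.
db.orders.ensureIndex({ user: 1 });

// This update matches on "user", so the matching keys are locked in the { user: 1 }
// locktree as well as in the primary-key (_id) locktree for the documents it modifies.
// Another multi-update whose query also matches user: "alice" overlaps on those
// secondary keys and will serialize behind this one.
db.orders.update({ user: "alice" }, { $set: { status: "shipped" } }, { multi: true });
```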
TokuMX provides two commands for row lock introspection: db.showLiveTransactions() and db.showPendingLockRequests(). The show live transactions command dumps the state of live transactions, including any and all row locks currently held. There is one caveat: duplicate row locks may show up in this list if they logically overlap. This is because the TokuFT storage engine records successfully granted lock requests made by a transaction, which may include duplicate locks on ranges already owned by the calling transaction. The reporting will be redundant, but it's (more or less) benign: you'll just have to filter out the overlaps visually. The show pending lock requests command does just what it says: it dumps the current state of waiting transactions and the locks they wish to acquire before proceeding.

Using a combination of these commands allows an advanced user to figure out which of their operations are serializing, and provides clues on how to mitigate it. For example, if transaction A holds a lock on primary key _id: 12345 while transactions B, C, and D are blocked waiting to acquire that lock, it suggests that the application has too many threads operating on the same piece of data. It may make sense to rewrite the application to avoid key-level contention when writing to the database.
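A quick sketch of how you might use these commands from the mongo shell (the output shapes described in the comments are illustrative, not exact transcripts):

```
// List live transactions and the row locks each one currently holds.
db.showLiveTransactions()

// List blocked transactions and the key ranges they are waiting to lock.
db.showPendingLockRequests()
```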
The TokuFT locktree is a binary search tree whose keys are non-overlapping ranges representing the index-level keys locked by the holding transaction. The tree is self-balancing, using AVL-style rotations that are performed as the tree is traversed on the way down. Each node contains a mutex, which is held for roughly as long as it takes to decide which child to traverse next, or to determine that a conflict has been identified or ruled out. Since locks are obtained hand-over-hand (also called "chain locking") on the way down, the locktree has good writer-writer scalability, which enables reasonable multicore utilization when the workload is write-heavy across multiple threads.
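A conceptual sketch of the conflict check, written in JavaScript for consistency with the shell examples above; this is not TokuFT source, and the names are invented for illustration. Ranges are [left, right] with inclusive endpoints:

```
// Two inclusive ranges overlap if neither ends before the other begins.
function rangesOverlap(a, b) {
  return a.left <= b.right && b.left <= a.right;
}

// A requested range conflicts with a held range only if it overlaps and is held
// by a different transaction; a transaction never conflicts with its own locks.
function conflicts(held, heldTxnId, requested, requestingTxnId) {
  if (heldTxnId === requestingTxnId) return false;
  return rangesOverlap(held, requested);
}
```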
In addition, each locktree has an associated set of pending lock requests keyed on the id of the waiting transaction. This tree is a tightly packed OMT (order maintenance tree) and provides only single-threaded access. This means failing to acquire a lock is quite expensive: you must access the single-threaded lock request set and register yourself. The obvious advice here is that, for best results, architect your application to have as few writer-writer conflicts as possible.
When a transaction successfully acquires a lock in a locktree, it writes the obtained keys into a transaction-local buffer called the lock buffer. There is one lock buffer for each locktree in which the transaction holds locks. If a TokuMX collection has a primary key and 4 secondary keys, then an insert will touch 5 locktrees, and thus there will be 5 lock buffers attached to the writing transaction, each with a lock representing the document and keys just indexed. Since both the locktree and the lock buffers keep a copy of every key obtained, the locktree is roughly a factor of two away from space-optimal (a hand-wavy estimate). There is an easy but important optimization where "point" locks (locks [a, b] where a == b) are represented as a single chunk of memory, so they approach space-optimal. Most locks taken are point locks, since a document insert, delete, or update usually affects only one row per index at a time.
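For example, here is a hedged mongo-shell sketch of the "5 locktrees, 5 lock buffers" case described above (the collection, index, and field names are hypothetical):

```
// A collection with four secondary indexes in addition to the _id primary key.
db.items.ensureIndex({ sku: 1 });
db.items.ensureIndex({ vendor: 1 });
db.items.ensureIndex({ price: 1 });
db.items.ensureIndex({ added: 1 });

// This single insert writes one key into each of the 5 indexes, so the writing
// transaction touches 5 locktrees and carries 5 lock buffers, each holding one
// point lock for the key just written into that index.
db.items.insert({ _id: 1, sku: "A-100", vendor: "acme", price: 9.99, added: new Date() });
```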
If the system runs out of the pre-configured memory for locks (see locktreeMaxMemory in TokuMX), then a painful but necessary operation called lock escalation is performed. Lock escalation is the process by which existing row locks are squashed together so that they represent larger ranges but take up less memory. All locktrees are fully locked while this happens, and no write transactions can make progress until it is complete. Escalation is the second-worst thing that can happen to an application, next to crashing. The single best thing you can do to avoid lock escalation (and other pathological serializability issues) is to limit the scope of multi-document update and delete statements. That is, avoid statements that are expected to affect many rows, and instead chunk them up into smaller operations, as in the sketch below.
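A hedged sketch of that chunking approach in the mongo shell (the batch size and the field being set are illustrative, not a drop-in script):

```
// Instead of one multi-update that range-locks the whole collection, walk the
// collection in _id order and update a bounded batch of documents per statement,
// so each statement takes a bounded number of point locks.
var batchSize = 1000;
var last = MinKey;
while (true) {
  var docs = db.coll.find({ _id: { $gt: last } }, { _id: 1 })
                    .sort({ _id: 1 })
                    .limit(batchSize)
                    .toArray();
  if (docs.length === 0) break;
  last = docs[docs.length - 1]._id;
  db.coll.update({ _id: { $in: docs.map(function (d) { return d._id; }) } },
                 { $set: { field: 1 } },
                 { multi: true });
}
```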