
Round robin scheduler #4

Open
wants to merge 23 commits into base: main

Conversation

@art-w commented Oct 3, 2022

(Builds on #3)

I'm not quite convinced yet about this tiny change! The scheduler was exhausting the first thread before moving on, so I thought it could be interesting to alternate between the threads in a round-robin-like fashion:

  • It enables testing spinlocks (otherwise the spinlocked thread could get us stuck)
  • On broken code, it terminates really really fast! I think it's because it's breaking nearly all the CAS operations from the start (rather than testing the happy path)
  • But on correct code, it tries a bunch more paths than before... I'm guessing that the backtracking logic could be improved to consider the involved values of the atomics ("does permuting this failing CAS at this point in the past yield a successful CAS?")

I'll try to improve on the last part when I get some time and push further analysis... In the meantime, I'm very open to suggestions or alternate explanations :)
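The idea can be sketched outside of dscheck as a toy scheduler (hypothetical code, not dscheck's internals): each thread is modelled as a list of steps, and the scheduler runs one step of the front thread before rotating it to the back of the queue, rather than running the first thread to completion.

```ocaml
(* Toy round-robin scheduler: each thread is a list of step thunks.
   Run one step, then rotate the thread to the back of the queue. *)
let round_robin (threads : (unit -> unit) list list) =
  let q = Queue.create () in
  List.iter (fun steps -> Queue.add steps q) threads;
  while not (Queue.is_empty q) do
    match Queue.pop q with
    | [] -> ()                 (* this thread is finished *)
    | step :: rest ->
        step ();               (* run a single step... *)
        Queue.add rest q       (* ...then give the other threads a turn *)
  done
```

With two threads A (two steps) and B (one step), the steps execute in the order A1, B1, A2 instead of A1, A2, B1.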

@art-w (Author) commented Oct 9, 2022

So the naïve round-robin was clearly not good enough... I've been playing with dscheck on various projects and it was still getting stuck on spinlocks. @bartoszmodelski's MPMC queue was a really cool test case for this: the change to the dscheck scheduler did help in making progress (rather than never completing the first run), but then the DFS exploration was only pushing the blocked domain to spinlock a little longer on every run.

By using a trie, dscheck can now store the different traces and backtrack from the root to the closest unexplored branch (= more like a BFS, with a random factor to encourage varied runs). This way it can sample the search space towards more interesting spots, and it's pretty fast at finding bugs (in seconds!). I also like the trie because it can be useful to understand and debug dscheck findings... in fact I still have some cleanup to do on this latest commit, but I couldn't resist sharing the pretty pictures!
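A rough sketch of the trie idea (hypothetical types and names, not the actual dscheck implementation): each node records the exploration status of a scheduling choice, and picking the next run is a breadth-first walk from the root to the shallowest unexplored branch.

```ocaml
(* Hypothetical trie of explored schedules: each node is one scheduling
   step, tagged with how far its subtree has been explored. *)
type status = Todo | Error | Ok_run | Partial

type node = {
  step : string;               (* e.g. "B fetch_and_add(5)" *)
  mutable status : status;
  mutable children : node list;
}

(* Breadth-first search for the closest unexplored branch,
   returning the path of steps leading to it. *)
let closest_todo root =
  let q = Queue.create () in
  Queue.add (root, []) q;
  let rec bfs () =
    if Queue.is_empty q then None
    else
      let node, path = Queue.pop q in
      if node.status = Todo then Some (List.rev (node.step :: path))
      else begin
        List.iter (fun c -> Queue.add (c, node.step :: path) q) node.children;
        bfs ()
      end
  in
  bfs ()
```

Because the walk is breadth-first, a shallow `Todo` sibling near the root is replayed before a deep one, which is what keeps the search from burrowing into a single spinlocking schedule.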

Printing the full trie with graphviz isn't reasonable or useful, so it attempts to compact the useful information surrounding the found error traces:

  • The black nodes lead to an error
  • The grey nodes are the unexplored TODO branches
  • The green nodes are the successful runs (Ok count)
  • The white nodes are partially explored, with no error found
  • The steps are written thread_id atomic_op(atomic_id) with a letter for the thread id (_ is the root domain, A,B, ... are its children)

This shows the first 5 error traces to produce a FIFO ordering bug in the MPMC queue, with 1 consumer + 3 producer threads that attempt to push more than the max capacity of the queue:

mpmc_fifo

It still requires deep knowledge of the exact test and algorithm to read the graph, but seeing the different branches helps a bit to figure out which instruction starts the bug chain. Here we can see at the top B fetch_and_add(5), followed by no progress by B until the end. In the MPMC queue, this translates to B reserving the first cell of the array but not putting an element there -- and so its place gets stolen by another thread once the ring buffer is full and loops around. When we finally start popping, the first element is actually the second to last one pushed to the queue.

Another example with short code:

module Atomic = Dscheck.TracedAtomic

let rec add t i =
  if i > 0 then begin
    let v = Atomic.get t in
    Atomic.set t (v + 1) ;
    let i = if Atomic.get t > v then i - 1 else i in
    add t i
  end

let test () =
  let counter = Atomic.make 0 in
  Atomic.spawn (fun () -> add counter 4) ;
  Atomic.spawn (fun () -> add counter 4) ;
  Atomic.final (fun () -> Atomic.check (fun () -> Atomic.get counter >= 3))

let () = Atomic.trace test

The two threads attempt to increment the counter 4 times... or at least they observe 4 increases. The lower bound of the counter at the end isn't 4, and it's not even 3:

add4

The search space is much smaller, so no branches have been left unexplored!
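For intuition, here is a hand-run of one lost-update interleaving of the add loop above, with a plain ref standing in for the atomic (a sketch, not using dscheck): when both threads read the same value before either writes, each round loses one increment, so two threads that each observe 4 increases can leave the counter at 4 rather than 8. The dscheck graph shows that even worse schedules exist, driving the final value below 3.

```ocaml
(* Replay a fixed lockstep schedule of [add]: in every round both
   threads read the same value before either writes, so one of the
   two increments is lost. *)
let lockstep_rounds rounds =
  let counter = ref 0 in
  for _ = 1 to rounds do
    let va = !counter in                     (* thread A: Atomic.get *)
    let vb = !counter in                     (* thread B reads the same value *)
    counter := va + 1;                       (* thread A: Atomic.set *)
    counter := vb + 1;                       (* thread B overwrites with the same value *)
    assert (!counter > va && !counter > vb)  (* both checks pass, both decrement i *)
  done;
  !counter

(* Both threads "observe" 4 increases each, yet only 4 survive. *)
let () = Printf.printf "final counter = %d\n" (lockstep_rounds 4)
```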

I've also changed the terminal output a tiny bit while debugging the scheduler:

  run   todo depth/max   latest trace
  31k  3_981    17/39    ___A_BABBABA(B5)AABABAABAAABA
  33k  3_867    17/39    ___A_BABABAAABA(B5)AABAAABABB
  35k  3_894    17/39    ___A_BAABABAABAAABA(B7)ABAA
  • The run is as before, the number of tested traces in thousands
  • The todo is the number of branches left to explore in the trie => if it keeps growing, the search is doomed!
  • The depth shows how deep in the trie the closest unexplored branch is
  • The max is the length of the longest trace ever tested => if it keeps growing, you have a spinlock
  • The latest trace helps visualize the search and the static prefix of the trace... and it is fun to look at (the RLE notation (B5) means 5 B steps)
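For illustration, a hypothetical decoder for that RLE notation (my own helper, not part of dscheck): "(B5)" expands to five consecutive B steps, so a compacted trace can be turned back into the flat step-per-character form.

```ocaml
(* Expand the run-length notation used in the traces above:
   "(B5)" becomes "BBBBB"; other characters pass through unchanged. *)
let expand trace =
  let buf = Buffer.create (String.length trace) in
  let n = String.length trace in
  let i = ref 0 in
  while !i < n do
    (match trace.[!i] with
     | '(' ->
         let close = String.index_from trace !i ')' in
         let thread = trace.[!i + 1] in
         let count =
           int_of_string (String.sub trace (!i + 2) (close - !i - 2))
         in
         Buffer.add_string buf (String.make count thread);
         i := close                       (* skip past the ")" *)
     | c -> Buffer.add_char buf c);
    incr i
  done;
  Buffer.contents buf
```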
