
Round robin scheduler #4

Open
wants to merge 23 commits into base: main

Conversation

@art-w commented Oct 3, 2022

(Builds on #3)

I'm not quite convinced yet about this tiny change! The scheduler was exhausting the first thread before moving on, so I thought it could be interesting to alternate between the threads in a round-robin-like fashion:

  • It enables testing spinlocks (otherwise the spinlocked thread could get us stuck)
  • On broken code, it terminates really really fast! I think it's because it's breaking nearly all the CAS operations from the start (rather than testing the happy path)
  • But on correct code, it tries a bunch more paths than before... I'm guessing that the backtracking logic could be improved to consider the involved values of the atomics ("does permuting this failing CAS at this point in the past yield a successful CAS?")

I'll try to improve on the last part when I get some time and push further analysis... In the meantime, I'm very open to suggestions or alternate explanations :)
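The idea can be sketched outside of dscheck as a toy scheduler (hypothetical code, not dscheck's internals): each thread is modelled as a list of steps, and the scheduler runs one step of the front thread before rotating it to the back of the queue, rather than running the first thread to completion.

```ocaml
(* Toy round-robin scheduler: each thread is a list of step thunks.
   Run one step, then rotate the thread to the back of the queue. *)
let round_robin (threads : (unit -> unit) list list) =
  let q = Queue.create () in
  List.iter (fun steps -> Queue.add steps q) threads;
  while not (Queue.is_empty q) do
    match Queue.pop q with
    | [] -> ()                 (* this thread is finished *)
    | step :: rest ->
        step ();               (* run a single step... *)
        Queue.add rest q       (* ...then give the other threads a turn *)
  done
```

With two threads A (two steps) and B (one step), the steps execute in the order A1, B1, A2 instead of A1, A2, B1.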

@art-w (Author) commented Oct 9, 2022

So the naïve round-robin was clearly not good enough... I've been playing with dscheck on various projects and it was still getting stuck on spinlocks. @bartoszmodelski's MPMC queue was a really cool test case for this: the change to the dscheck scheduler did help in making progress (rather than never completing the first run), but then the DFS exploration was only pushing the blocked domain to spinlock a little longer on every run.

By using a trie, dscheck can now store the different traces and backtrack from the root to the closest unexplored branch (= more like a BFS, with a random factor to encourage varied runs). This way it can sample the search space towards more interesting spots, and it's pretty fast at finding bugs (in seconds!). I also like the trie because it can be useful to understand and debug dscheck findings... in fact I still have some cleanup to do on this latest commit, but I couldn't resist sharing the pretty pictures!
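A rough sketch of the trie idea (hypothetical types and names, not the actual dscheck implementation): each node records the exploration status of a scheduling choice, and picking the next run is a breadth-first walk from the root to the shallowest unexplored branch.

```ocaml
(* Hypothetical trie of explored schedules: each node is one scheduling
   step, tagged with how far its subtree has been explored. *)
type status = Todo | Error | Ok_run | Partial

type node = {
  step : string;               (* e.g. "B fetch_and_add(5)" *)
  mutable status : status;
  mutable children : node list;
}

(* Breadth-first search for the closest unexplored branch,
   returning the path of steps leading to it. *)
let closest_todo root =
  let q = Queue.create () in
  Queue.add (root, []) q;
  let rec bfs () =
    if Queue.is_empty q then None
    else
      let node, path = Queue.pop q in
      if node.status = Todo then Some (List.rev (node.step :: path))
      else begin
        List.iter (fun c -> Queue.add (c, node.step :: path) q) node.children;
        bfs ()
      end
  in
  bfs ()
```

Because the walk is breadth-first, a shallow `Todo` sibling near the root is replayed before a deep one, which is what keeps the search from burrowing into a single spinlocking schedule.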

Printing the full trie with graphviz isn't reasonable or useful, so it attempts to compact the useful information surrounding the found error traces:

  • The black nodes lead to an error
  • The grey nodes are the unexplored TODO branches
  • The green nodes are the successful runs (Ok count)
  • The white nodes are partially explored, with no error found
  • The steps are written thread_id atomic_op(atomic_id) with a letter for the thread id (_ is the root domain, A,B, ... are its children)

This shows the first 5 error traces to produce a FIFO ordering bug in the MPMC queue, with 1 consumer + 3 producer threads that attempt to push more than the max capacity of the queue:

mpmc_fifo

It still requires deep knowledge of the exact test and algorithm to read the graph, but seeing the different branches helps a bit to figure out which instruction starts the bug chain. Here we can see at the top B fetch_and_add(5), followed by no progress by B until the end. In the MPMC queue, this translates to B reserving the first cell of the array but not putting an element there -- and so its place gets stolen by another thread once the ring buffer is full and loops around. When we finally start popping, the first element is actually the second to last one pushed to the queue.

Another example with short code:

module Atomic = Dscheck.TracedAtomic

let rec add t i =
  if i > 0 then begin
    let v = Atomic.get t in
    Atomic.set t (v + 1) ;
    let i = if Atomic.get t > v then i - 1 else i in
    add t i
  end

let test () =
  let counter = Atomic.make 0 in
  Atomic.spawn (fun () -> add counter 4) ;
  Atomic.spawn (fun () -> add counter 4) ;
  Atomic.final (fun () -> Atomic.check (fun () -> Atomic.get counter >= 3))

let () = Atomic.trace test

The two threads attempt to increment the counter 4 times... or at least they observe 4 increases. The lower bound of the counter at the end isn't 4, and it's not even 3:

add4

The search space is much smaller, so no branches have been left unexplored!
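For intuition, here is a hand-run of one lost-update interleaving of the add loop above, with a plain ref standing in for the atomic (a sketch, not using dscheck): when both threads read the same value before either writes, each round loses one increment, so two threads that each observe 4 increases can leave the counter at 4 rather than 8. The dscheck graph shows that even worse schedules exist, driving the final value below 3.

```ocaml
(* Replay a fixed lockstep schedule of [add]: in every round both
   threads read the same value before either writes, so one of the
   two increments is lost. *)
let lockstep_rounds rounds =
  let counter = ref 0 in
  for _ = 1 to rounds do
    let va = !counter in                     (* thread A: Atomic.get *)
    let vb = !counter in                     (* thread B reads the same value *)
    counter := va + 1;                       (* thread A: Atomic.set *)
    counter := vb + 1;                       (* thread B overwrites with the same value *)
    assert (!counter > va && !counter > vb)  (* both checks pass, both decrement i *)
  done;
  !counter

(* Both threads "observe" 4 increases each, yet only 4 survive. *)
let () = Printf.printf "final counter = %d\n" (lockstep_rounds 4)
```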

I've also changed the terminal output a tiny bit while debugging the scheduler:

  run   todo depth/max   latest trace
  31k  3_981    17/39    ___A_BABBABA(B5)AABABAABAAABA
  33k  3_867    17/39    ___A_BABABAAABA(B5)AABAAABABB
  35k  3_894    17/39    ___A_BAABABAABAAABA(B7)ABAA
  • The run is as before, the number of tested traces in thousands
  • The todo is the number of branches left to explore in the trie => if it keeps growing, the search is doomed!
  • The depth shows how deep in the trie the closest unexplored branch is
  • The max is the length of the longest trace ever tested => if it keeps growing, you have a spinlock
  • The latest trace helps visualize the search and the static prefix of the trace... and it is fun to look at (the RLE notation (B5) means 5 B steps)
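For illustration, a hypothetical decoder for that RLE notation (my own helper, not part of dscheck): "(B5)" expands to five consecutive B steps, so a compacted trace can be turned back into the flat step-per-character form.

```ocaml
(* Expand the run-length notation used in the traces above:
   "(B5)" becomes "BBBBB"; other characters pass through unchanged. *)
let expand trace =
  let buf = Buffer.create (String.length trace) in
  let n = String.length trace in
  let i = ref 0 in
  while !i < n do
    (match trace.[!i] with
     | '(' ->
         let close = String.index_from trace !i ')' in
         let thread = trace.[!i + 1] in
         let count =
           int_of_string (String.sub trace (!i + 2) (close - !i - 2))
         in
         Buffer.add_string buf (String.make count thread);
         i := close                       (* skip past the ")" *)
     | c -> Buffer.add_char buf c);
    incr i
  done;
  Buffer.contents buf
```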
