# fix: Some flakiness in windows-agent and wsl-pro-service (#873)
## Addressing windows-agent failures

The tests that fail often are contained in `windows-agent/internal/distros/worker/worker_test.go`. They all show the same pattern:

1. `worker.SubmitTasks` submits an immediate task.
2. The already-running goroutine that processes tasks is expected to be notified and dequeue that task.
3. `require.Eventually` asserts the side effects of that task.

I could only find a partial explanation for the failure mode observed:

```
time="2024-07-26T15:14:24Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task processing"
time="2024-07-26T15:14:24Z" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": Submitting tasks [\"Test task\"] to queue"
time="2024-07-26T15:14:25Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": stopping task processing"
```

The second log line confirms a synchronous call is made to `worker.Submit`, and the failed assertion proves it returned. Since it's synchronous, the task was delivered. Processing is asynchronous: the goroutine is running (maybe paused, but that's up to the runtime). If the task had been dequeued we'd see this line in the log:

```
time="2024-07-26T15:14:23Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task \"Test task\""
```

If we augment the task queue's `Push` and `Pull` methods with extra logs to observe how submission and dequeuing behave, we find an interesting pattern: when that failure mode happens, the writer arrives at the channel earlier than the reader.

```diff
diff --git a/windows-agent/internal/distros/worker/task_queue.go b/windows-agent/internal/distros/worker/task_queue.go
index 74f2b9bf..cf9f08b9 100644
--- a/windows-agent/internal/distros/worker/task_queue.go
+++ b/windows-agent/internal/distros/worker/task_queue.go
@@ -4,6 +4,7 @@ import (
 	"context"
 	"sync"
 
+	log "github.com/canonical/ubuntu-pro-for-wsl/common/grpc/logstreamer"
 	"github.com/canonical/ubuntu-pro-for-wsl/windows-agent/internal/distros/task"
 )
 
@@ -87,8 +88,10 @@ func (q *taskQueue) Push(t task.Task) {
 	q.data = append(q.data, t)
 
 	// Notify waiters if there are any
+	log.Warningf(context.TODO(), "+++++++++ NOTIFYING AWAITERS OF TASKS %v on channel", t)
 	select {
 	case q.wait <- struct{}{}:
+		log.Warningf(context.TODO(), "+++++++++ NOTIFY COMPLETE %v", t)
 	default:
 	}
 }
@@ -155,14 +158,16 @@ func (q *taskQueue) Pull(ctx context.Context) task.Task {
 	if task, ok := q.tryPopFront(); ok {
 		return task
 	}
 
 	q.mu.RLock() // This is mostly to appease the race detector
 	wait := q.wait
+	log.Warning(ctx, "+++++++++ Pull: Waiting on the channel")
 	q.mu.RUnlock()
 
 	select {
 	case <-ctx.Done():
+		log.Warning(ctx, "+++++++++ Pull: inside the loop CTX DONE")
 		return nil
 	case <-wait:
 		// ↑
@@ -170,6 +175,7 @@ func (q *taskQueue) Pull(ctx context.Context) task.Task {
 		// | only entry in the queue. Or an empty Load could
 		// | leave an empty "data" behind.
 		// ↓
+		log.Warning(ctx, "+++++++++ TRY POPPING after the lock")
 		if task, ok := q.tryPopFront(); ok {
 			return task
 		}
```

I couldn't understand why the reader would arrive later, since the goroutine supposed to process incoming tasks starts earlier in the test cases, yet sometimes it only blocks on the channel after the writer has attempted to write into it and failed. I'd be surprised if the attempt to acquire a read lock inside the `Pull` method delayed it that much.
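To make that interleaving concrete, here is a minimal standalone sketch (hypothetical code, not the project's queue; names are illustrative) of how a non-blocking send on an unbuffered channel drops the only wakeup a late reader would ever get:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	wait := make(chan struct{}) // unbuffered, like q.wait

	// Writer side, mimicking Push: the reader has already checked the queue,
	// found it empty, but has not parked on the channel yet. The non-blocking
	// send takes the default branch and the notification is lost.
	select {
	case wait <- struct{}{}:
		fmt.Println("a waiter was notified")
	default:
		fmt.Println("no waiter parked yet: notification dropped")
	}

	// Reader side, mimicking Pull: it parks on the channel only now, after
	// the send above already gave up, so no wakeup will ever arrive.
	select {
	case <-wait:
		fmt.Println("woken up; would pop the task")
	case <-time.After(2 * time.Second): // stand-in for the test's timeout
		fmt.Println("timed out: the task sits in the queue unprocessed")
	}
}
```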
The channel is unbuffered, and the select-default statement prevents the writer from blocking when no reader is parked on the other side.

```
=== RUN   TestTaskProcessing/Success_executing_a_task
=== PAUSE TestTaskProcessing/Success_executing_a_task
=== CONT  TestTaskProcessing/Success_executing_a_task
time="2024-08-28T15:26:15-03:00" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": starting task processing"
time="2024-08-28T15:26:15-03:00" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": Submitting tasks [\"Test task\"] to queue"
time="2024-08-28T15:26:15-03:00" level=warning msg="+++++++++ NOTIFYING AWAITERS OF TASKS Test task on channel"
time="2024-08-28T15:26:15-03:00" level=warning msg="+++++++++ Pull: Waiting on the channel"
    worker_test.go:206:
        	Error Trace:	D:/UP4W/cloudinit/windows-agent/internal/distros/worker/worker_test.go:206
        	Error:      	Condition never satisfied
        	Test:       	TestTaskProcessing/Success_executing_a_task
        	Messages:   	distro should have been "Running" after SubmitTask(). Current state is "Stopped"
time="2024-08-28T15:26:20-03:00" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": stopping task processing"
time="2024-08-28T15:26:20-03:00" level=warning msg="+++++++++ Pull: inside the loop CTX DONE"
--- FAIL: TestTaskProcessing (0.00s)
    --- FAIL: TestTaskProcessing/Success_executing_a_task (5.02s)
```

Removing the select-default is not an option: test cases would remain blocked until timeout. In production that's less likely to happen, but still undesirable. I opted for making the `q.wait` channel buffered, so some amount of writers will succeed without blocking, giving the reader more chances to reach the channel; subsequent writers should then have more chances of success as well (see the first sketch at the end of this description). I don't have a good explanation for the exact number of slots the channel should have.

I ran the worker tests 1367 times on my machine without a single failure, so this seems to be a suitable fix: `go test -v .\windows-agent\internal\distros\worker\ -tags=gowslmock -count=1367 -failfast -timeout=120m`. :) Before that, I was experiencing failures quite often when running with `--count=200` with any of the `TestTaskProcessing` or `TestTaskDeduplication` test cases.

---

## wsl-pro-service

There is still one test failing sometimes: `wsl-pro-service/internal/streams/server_test.go`. The test assumes that when `server.Stop()` is called, the goroutine serving the multistream service will exit with an error. It seems that the current implementation of the internal `handlingLoop[Command].run()` method connects the context objects used in the RPCs in a way that allows its loop to interpret a `server.Stop()` as a graceful cancellation. This method derives two context objects mirroring the structure found in the `Server` type: `gCtx` for graceful stop and `ctx` for "brute forcing". Those local context objects added complexity, and it seems we can get rid of them.

Additionally, there was an error in the server constructor `NewServer` causing the `gracefulCtx` to be a child of the `ctx` due to a shadowed declaration (a toy reconstruction appears at the end of this description). Per my current understanding, this change in wsl-pro-service seems not to affect its runtime behaviour apart from responding to force-exit internal requests.

---

UDENG-3311
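For illustration, here is the failing interleaving again with a buffered channel, as in the windows-agent fix: the early notification is retained instead of dropped. This is a minimal sketch; the buffer size below is arbitrary, not the value chosen in the actual change.

```go
package main

import "fmt"

func main() {
	wait := make(chan struct{}, 4) // buffered; 4 is an arbitrary size for this sketch

	// The writer still arrives first, but the non-blocking send now succeeds
	// because the buffer has a free slot: the wakeup is stored.
	select {
	case wait <- struct{}{}:
		fmt.Println("notification buffered")
	default:
		fmt.Println("notification dropped") // only happens once the buffer is full
	}

	// The late reader still finds the stored wakeup and proceeds.
	<-wait
	fmt.Println("reader woken up; it can now pop the task")
}
```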
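And a toy reconstruction of the `NewServer` pitfall described in the wsl-pro-service section (hypothetical code, not the actual constructor; it reproduces the resulting parent/child relationship, not the exact shadowed declaration): cancelling the forceful context also cancels the graceful one, because the latter was accidentally derived from the former.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// Intended design: ctx (forceful stop) and gracefulCtx (graceful stop)
	// as independent children of the caller's context.
	//
	// What the shadowed declaration produced instead: gracefulCtx derived
	// from the cancellable ctx, i.e. a child of the forceful context.
	ctx, cancel := context.WithCancel(context.Background())
	gracefulCtx, gracefulCancel := context.WithCancel(ctx) // should have derived from the caller's context
	defer gracefulCancel()

	cancel() // a forceful stop...
	<-gracefulCtx.Done()
	// ...also cancels the graceful context, so a loop watching gracefulCtx
	// exits as if the stop were graceful, reporting no error.
	fmt.Println("gracefulCtx:", gracefulCtx.Err()) // prints: gracefulCtx: context canceled
}
```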