Streaming branch fixes #496
Hmm, OK, the tests aren't hanging; they're just hideously slow.
(force-pushed from 1d3c6b1 to 8ebc4fd)
So right now I think the tests are failing in three(-ish) ways:
Even going back to the passing tests, it looks like the pibots have always taken ~3-4x longer than other nodes: https://buildkite.com/julialang/dagger-dot-jl/builds/944. The tests went from ~11 minutes to ~20 minutes (I'm assuming the armageddon nodes weren't swapped or had their loads changed).
All of this looks good aside from the revert of #467 (which should be a strictly beneficial change; I have no idea why it would change precompile behavior). The purpose of that assertion is to ensure that all Dagger tasks finish during precompile, since Julia itself will hang or die when trying to finish precompile with tasks still running in the background. So either something is still keeping tasks alive, or this is spurious and we just need to wait a bit longer for Dagger to clean things up.
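For context, a minimal sketch of the invariant that assertion enforces (hypothetical code, not Dagger's actual precompile workload): any task spawned while precompiling must finish before the workload returns.

```julia
# Hypothetical sketch of the precompile invariant described above; not
# Dagger's actual code. Julia can hang or die while finishing precompile
# if tasks are still running in the background, so the workload must wait
# for everything it spawned and assert that nothing is left alive.
module PrecompileInvariantSketch

function precompile_workload()
    t = Threads.@spawn sum(1:100)   # background work started during precompile
    wait(t)                         # block until the task completes
    @assert istaskdone(t) "a task outlived the precompile workload"
end

precompile_workload()               # executed at (pre)compile time

end # module
```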
(force-pushed from 8ebc4fd to a10afd1)
Yeah, makes sense; I've removed the reverting commit so the PR can be merged. I think something is still keeping the tasks alive, because that test has been failing consistently since #467 (albeit I haven't been able to reproduce it locally yet). I did wonder if we were hitting JuliaLang/julia#40626, since it's the only multithreaded test, but I tried it with a single thread and it still failed 🤷
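For reference, forcing a single-threaded run looks roughly like this (a sketch of the kind of check described above; the exact invocation isn't shown in the thread):

```julia
# Launch with a single thread, e.g.:
#   julia -t 1 --project -e 'using Pkg; Pkg.test("Dagger")'
# and confirm the thread count inside the session:
using Base.Threads
@assert nthreads() == 1 "expected a single-threaded session"
```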
(force-pushed from 0a870a7 to ab4d4b1)
Hmm, it seems to be something thread-related, but I can't grok how. Notes:
(force-pushed from c765853 to 9aa8eee)
Well, that was a rabbit hole... @jpsamaroo I went a bit further than we discussed and ended up moving all the dynamic worker support exclusively to the eager API in 9aa8eee, because that seemed cleaner than maintaining both. I don't know if that's too breaking, though. I also fixed a couple of other bugs along the way. The tests pass locally for me; let's see what CI thinks 🤞
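For illustration, a minimal sketch of the eager API in question, assuming `Dagger.@spawn` plus the Distributed stdlib (the exact dynamic-worker entry points moved in 9aa8eee aren't shown here):

```julia
using Distributed
addprocs(1)                  # add a worker process dynamically at runtime
@everywhere using Dagger

t = Dagger.@spawn 1 + 2      # eager task; the scheduler assigns it a worker
@show fetch(t)               # => 3
```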
(force-pushed from 856d4b7 to a61af1c)
TL;DR: I'll come back to these failures another time; I think the PR can be merged now.
Using `myid()` with `workers()` meant that when the context was initialized with a single worker, the processor list would be `[OSProc(1), OSProc(1)]`. `procs()` always includes PID 1 plus any other workers, which is what we want.
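A minimal sketch of the difference, using plain Distributed calls in a session with no extra workers:

```julia
using Distributed

# With no worker processes added, workers() returns [1], so prepending
# myid() (also 1 on the master process) duplicates PID 1:
vcat(myid(), workers())   # => [1, 1]  -> [OSProc(1), OSProc(1)]

# procs() includes PID 1 plus any other workers exactly once:
procs()                   # => [1]     -> [OSProc(1)]
```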
(force-pushed from a61af1c to 3b0a355)
Thanks again!
This fixes some minor-ish things I came across while looking into the tests. I don't know if it'll fix the timeouts in CI; for me, all the tests pass locally.