Streaming branch fixes #496

Merged: 3 commits into JuliaParallel:jps/stream2 from JamesWrigley:streaming-fixes on Apr 11, 2024

Conversation

@JamesWrigley (Collaborator)

This fixes some minor-ish things I came across when looking into the tests. I don't know if it'll fix the timeouts in CI; for me, all the tests pass locally.

@JamesWrigley (Collaborator, Author)

Hmm, OK, the tests aren't hanging; they're just hideously slow.

@JamesWrigley force-pushed the streaming-fixes branch 2 times, most recently from 1d3c6b1 to 8ebc4fd on April 1, 2024 at 14:26
@JamesWrigley (Collaborator, Author)

So right now I think the tests are failing in three(-ish) ways:

@JamesWrigley (Collaborator, Author)

Even going back to the passing tests, it looks like the pibots have always taken ~3-4x longer than other nodes: https://buildkite.com/julialang/dagger-dot-jl/builds/944
So either the tests slowed down a lot recently, or the nodes are slower/under heavier load. This is very anecdotal, but comparing the times on the armageddon nodes from these jobs, https://buildkite.com/julialang/dagger-dot-jl/builds/972 (February) and https://buildkite.com/julialang/dagger-dot-jl/builds/1039 (this PR), I'd guess that it's our tests that have slowed down.

The tests went from ~11 minutes to ~20 minutes (assuming the armageddon nodes weren't swapped and their loads didn't change).

@jpsamaroo (Member)

All of this looks good aside from the revert of #467 (which should be a strictly beneficial change - I have no idea why this would change precompile behavior). The purpose of that assertion is to ensure that all Dagger tasks finish during precompile, as Julia itself will hang or die when trying to finish precompile with tasks still running in the background. So something is either still keeping tasks alive, or this is just spurious and we need to wait a bit longer for Dagger to clean things up.
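For reference, a minimal sketch of the constraint being described (not the actual workload or assertion from Dagger's precompile.jl): every task spawned during the precompile workload has to be fully finished before precompilation returns.

```julia
# Sketch only: Dagger's real precompile workload and assertion are not
# reproduced here. The point is that each spawned task is explicitly waited
# on, because Julia can hang or die if background tasks are still running
# when precompilation tries to finish.
using Dagger

t = Dagger.@spawn sum(1:100)
fetch(t)   # block until the task has completed before precompile can return
```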

@JamesWrigley (Collaborator, Author)

Yeah, makes sense; I've removed the reverting commit so the PR can be merged. I think something is still keeping the tasks alive, because that test has been failing consistently since #467 (although I haven't been able to reproduce it locally yet). I did wonder if we were hitting JuliaLang/julia#40626, since it's the only multithreaded test, but I tried it with a single thread and it still failed 🤷

@JamesWrigley force-pushed the streaming-fixes branch 3 times, most recently from 0a870a7 to ab4d4b1 on April 3, 2024 at 17:53
@JamesWrigley (Collaborator, Author) commented Apr 3, 2024

Hmm, it seems to be something thread-related but I can't grok how. Notes:

  • Using multiple threads on the other tests makes them fail too (despite my last message, which I can no longer reproduce), so it's not specific to DTables.jl/Julia 1.8. Using a single thread in the DTables.jl test makes it succeed. EDIT: I just observed precompilation succeed on 1.10, so it must not be strictly thread-related 🐙
  • What seems to make a difference is just the fact of doing a remotecall in evict_all_chunks!(): I can replace the inner remotecall to evict_chunks!() with one to identity() and the tests will still pass (see the sketch after this list).
  • I can't reproduce it locally at all.
  • Increasing the loop iterations to 50 in precompile.jl does nothing, and the issue is deterministic, so I don't think it's as simple as a race condition.
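
To make the second point concrete, here is a rough sketch of that experiment; the Dagger function names and signatures are paraphrased, not copied from the source:

```julia
# Rough sketch of the experiment above; evict_all_chunks!/evict_chunks! are
# Dagger internals, and their real signatures are not reproduced here.
using Distributed

function evict_all_chunks_experiment(pid)
    # Original behaviour (paraphrased): a remotecall that evicts chunks on
    # the remote worker, something like
    #   remotecall(evict_chunks!, pid, chunks)

    # Swapping that inner remotecall for a no-op does not change the test
    # outcome, so only the fact that a remotecall happens seems to matter:
    fetch(remotecall(identity, pid, nothing))
end

# e.g. evict_all_chunks_experiment(first(workers()))
```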

@JamesWrigley (Collaborator, Author)

Well, that was a rabbit hole... @jpsamaroo, I went a bit further than we discussed and ended up moving all the dynamic worker support exclusively to the eager API in 9aa8eee, because that seemed cleaner than maintaining both. I don't know if that's too breaking, though? I also fixed a couple of other bugs along the way. The tests pass locally for me; let's see what CI thinks 🤞

@JamesWrigley force-pushed the streaming-fixes branch 3 times, most recently from 856d4b7 to a61af1c on April 10, 2024 at 15:54
@JamesWrigley (Collaborator, Author)

TL;DR:

  • On the first attempt there was an error about a peer not being connected (I've seen that before on previous commits), but the OSX scheduler tests passed. The tests were very slow in general, though, and the job timed out (all other platforms passed).
  • I used addprocs(; lazy=false) in an attempt to fix the peer-connection errors (see the sketch below), which did fix them, but the OSX tests seemed to hang at the scheduler tests.
  • I gave up and reverted the eager refactoring, so the branch now only contains miscellaneous bug fixes.
  • And on the last try the scheduler tests passed on OSX...

I'll come back to these failures another time, I think the PR can be merged now.
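
For context, a sketch of the addprocs tweak mentioned in the second bullet (the worker count here is arbitrary, not what CI uses):

```julia
using Distributed

# By default (with the all_to_all topology) worker-to-worker connections are
# set up lazily, on first use. Forcing eager setup with lazy=false was an
# attempt to avoid the "peer not connected" errors.
addprocs(3; lazy=false)
```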

Using `myid()` with `workers()` meant that when the context was initialized with
a single worker, the processor list would be `[OSProc(1), OSProc(1)]`. `procs()`
will always include PID 1 and any other workers, which is what we want.
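
A quick illustration of the Distributed semantics behind that fix (not the original Dagger code):

```julia
using Distributed    # no extra workers added yet

myid()               # 1
workers()            # [1]    -- with no workers added, PID 1 shows up here too
[myid(); workers()]  # [1, 1] -> two OSProc(1) entries in the context
procs()              # [1]    -> what we actually want

addprocs(2)
workers()            # [2, 3]    -- excludes PID 1 once real workers exist
procs()              # [1, 2, 3]
```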
@jpsamaroo merged commit 8423fb8 into JuliaParallel:jps/stream2 on Apr 11, 2024
4 of 8 checks passed
@jpsamaroo (Member)

Thanks again!

@JamesWrigley deleted the streaming-fixes branch on April 11, 2024 at 17:01