Stop workers when there are no more tests to run#107
Conversation
|
A benefit of this change is that it'll reduce the memory pressure towards the end of long running tests, by completely terminating workers as soon as they aren't needed anymore, while the stragglers are still doing their job. This is a problem I've had with a downstream package lately: long-running tests may also use a larger amount of memory, but this results in OOM because the other workers are still alive doing exactly nothing, while squatting memory for no reason. |
I tried with using ParallelTestRunner, Test
@testset "workers stopped at end" begin
testsuite = Dict(
"a" => :(),
"b" => :(),
"c" => :(),
"d" => :(),
"e" => :(),
"f" => :(),
)
procs = Base.Process[]
procs_lock = ReentrantLock()
function test_worker(name)
wrkr = addworker()
Base.@lock procs_lock push!(procs, wrkr.w.proc)
return wrkr
end
io = IOBuffer()
runtests(ParallelTestRunner, Base.ARGS; test_worker, testsuite, stdout=io, stderr=io)
@test all(!Base.process_running, procs)
endbut this test would pass also on |
|
I added some tests. I'm not 100% happy about them, but they do check something. I asked Cursor/Claude for help to write them and it gave me a half-decent idea (counting the children processes of the current process before and after the tests), but it completely botched the way to count the children, so I reworked it completely following https://askubuntu.com/a/512872 (the bot initially used |
| "t6" => quote | ||
| # Make this test run longer than the others so that it runs alone... | ||
| sleep(5) | ||
| children = _count_child_pids($(getpid())) | ||
| # ...then check there's only one worker still running. WARNING: this test may be | ||
| # flaky on very busy systems, if at this point some of the other tests are still | ||
| # running, hope for the best. | ||
| if children >= 0 | ||
| @test children == 1 | ||
| end | ||
| end, |
There was a problem hiding this comment.
This is a bit convoluted and at risk of flakiness (I added a note in the comments about it for future record) because it depends on exact timing, but this is one thing I really want to ensure: there's a single worker still running when the others are done.
|
This is what the code in #101 did, right? Although the version here makes it happen earlier. Why not put the |
OK, that should work as well, I'll try it out when I get back to the computer. |
|
I just looked at the code, we can't stop the worker at because it's defined further belowParallelTestRunner.jl/src/ParallelTestRunner.jl Lines 998 to 1010 in c0188e5 |
I also just verified that all the tests in ParallelTestRunner.jl/test/runtests.jl Lines 407 to 461 in 3530d91 to confirm that the old |
Yes, I meant conceptually. And you dropped my elaboration on this being useful 🙂
|
This should fix #106 as far as I can tell, but I don't know how to test it, as we don't track the PIDs of the workers anymore after #84.