Skip to content

Stop workers when there are no more tests to run#107

Merged
maleadt merged 6 commits intomainfrom
mg/stop-worker
Mar 2, 2026
Merged

Stop workers when there are no more tests to run#107
maleadt merged 6 commits intomainfrom
mg/stop-worker

Conversation

@giordano
Copy link
Collaborator

@giordano giordano commented Mar 1, 2026

This should fix #106 as far as I can tell, but I don't know how to test it, as we don't track the PIDs of the workers anymore after #84.

@giordano giordano requested review from maleadt and vchuravy March 1, 2026 17:37
@giordano
Copy link
Collaborator Author

giordano commented Mar 1, 2026

A benefit of this change is that it'll reduce the memory pressure towards the end of long running tests, by completely terminating workers as soon as they aren't needed anymore, while the stragglers are still doing their job. This is a problem I've had with a downstream package lately: long-running tests may also use a larger amount of memory, but this results in OOM because the other workers are still alive doing exactly nothing, while squatting memory for no reason.

@giordano
Copy link
Collaborator Author

giordano commented Mar 1, 2026

but I don't know how to test it

I tried with

using ParallelTestRunner, Test
@testset "workers stopped at end" begin
    testsuite = Dict(
        "a" => :(),
        "b" => :(),
        "c" => :(),
        "d" => :(),
        "e" => :(),
        "f" => :(),
    )

    procs = Base.Process[]
    procs_lock = ReentrantLock()

    function test_worker(name)
        wrkr = addworker()
        Base.@lock procs_lock push!(procs, wrkr.w.proc)
        return wrkr
    end

    io = IOBuffer()
    runtests(ParallelTestRunner, Base.ARGS; test_worker, testsuite, stdout=io, stderr=io)
    @test all(!Base.process_running, procs)
end

but this test would pass also on main because custom workers defined with test_worker are always cleaned up 😞

@giordano
Copy link
Collaborator Author

giordano commented Mar 1, 2026

I added some tests. I'm not 100% happy about them, but they do check something. I asked Cursor/Claude for help to write them and it gave me a half-decent idea (counting the children processes of the current process before and after the tests), but it completely botched the way to count the children, so I reworked it completely following https://askubuntu.com/a/512872 (the bot initially used pgrep with wrong arguments on macOS, and checked /proc/$(pid)/children on Linux, but that file doesn't exist, it's /proc/$(pid)/task/$(pid)/children, and the proc_tid_children(5) manual says it's not very reliable). In the end, only the initial idea suggested by the bot remained, the final implementation of _count_child_pids is mine. At least I verified that _count_child_pids works on Linux, FreeBSD, and macOS.

Comment on lines +417 to +427
"t6" => quote
# Make this test run longer than the others so that it runs alone...
sleep(5)
children = _count_child_pids($(getpid()))
# ...then check there's only one worker still running. WARNING: this test may be
# flaky on very busy systems, if at this point some of the other tests are still
# running, hope for the best.
if children >= 0
@test children == 1
end
end,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit convoluted and at risk of flakiness (I added a note in the comments about it for future record) because it depends on exact timing, but this is one thing I really want to ensure: there's a single worker still running when the others are done.

@maleadt
Copy link
Collaborator

maleadt commented Mar 2, 2026

This is what the code in #101 did, right? Although the version here makes it happen earlier.

Why not put the Malt.stop outside the while loop? We should exit it as soon as there's no work anymore:

isempty(tests) && break

@giordano
Copy link
Collaborator Author

giordano commented Mar 2, 2026

This is what the code in #101 did, right?

rmprocs is from Distributed, which we don't use anymore, so that line wasn't doing anything anymore. Also, I believe when that line was doing something, it was stopping all the workers only at the very end, with this change I'm stopping each worker as soon as it doesn't have anything else to do.

Why not put the Malt.stop outside the while loop? We should exit it as soon as there's no work anymore:

isempty(tests) && break

OK, that should work as well, I'll try it out when I get back to the computer.

@giordano
Copy link
Collaborator Author

giordano commented Mar 2, 2026

I just looked at the code, we can't stop the worker at

isempty(tests) && break
because it's defined further below
# pass in init_worker_code to custom worker function if defined
wrkr = if init_worker_code == :()
test_worker(test)
else
test_worker(test, init_worker_code)
end
if wrkr === nothing
wrkr = p
end
# if a worker failed, spawn a new one
if wrkr === nothing || !Malt.isrunning(wrkr)
wrkr = p = addworker(; init_worker_code, io_ctx.color)
end

@giordano
Copy link
Collaborator Author

giordano commented Mar 2, 2026

This is what the code in #101 did, right?

I also just verified that all the tests in

# Issue <https://github.com/JuliaTesting/ParallelTestRunner.jl/issues/106>.
@testset "default workers stopped at end" begin
# Use default workers (no test_worker) so the framework creates and should stop them.
# More tests than workers so some tasks finish early and must stop their worker.
testsuite = Dict(
"t1" => :(),
"t2" => :(),
"t3" => :(),
"t4" => :(),
"t5" => :(),
"t6" => quote
# Make this test run longer than the others so that it runs alone...
sleep(5)
children = _count_child_pids($(getpid()))
# ...then check there's only one worker still running. WARNING: this test may be
# flaky on very busy systems, if at this point some of the other tests are still
# running, hope for the best.
if children >= 0
@test children == 1
end
end,
)
before = _count_child_pids()
if before < 0
# Counting child PIDs not supported on this platform
@test_skip false
else
old_id_counter = ParallelTestRunner.ID_COUNTER[]
njobs = 2
io = IOBuffer()
ioc = IOContext(io, :color => true)
try
runtests(ParallelTestRunner, ["--jobs=$(njobs)", "--verbose"];
testsuite, stdout=ioc, stderr=ioc, init_code=:(include($(joinpath(@__DIR__, "utils.jl")))))
catch
# Show output in case of failure, to help debugging.
output = String(take!(io))
printstyled(stderr, "Output of failed test >>>>>>>>>>>>>>>>>>>>\n", color=:red, bold=true)
println(stderr, output)
printstyled(stderr, "End of output <<<<<<<<<<<<<<<<<<<<<<<<<<<<\n", color=:red, bold=true)
rethrow()
end
# Make sure we didn't spawn more workers than expected.
@test ParallelTestRunner.ID_COUNTER[] == old_id_counter + njobs
# Allow a moment for worker processes to exit
for _ in 1:50
sleep(0.1)
after = _count_child_pids()
after >= 0 && after <= before && break
end
after = _count_child_pids()
@test after >= 0
@test after == before
end
end
fail on v2.4.1 (which is before #101)

Error in testset t6:
Test Failed at /home/mose/.julia/dev/ParallelTestRunner/test/runtests.jl:425
  Expression: children == 1
   Evaluated: 21 == 1

[...]

default workers stopped at end: Test Failed at /home/mose/.julia/dev/ParallelTestRunner/test/runtests.jl:459
  Expression: after == before
   Evaluated: 22 == 20

to confirm that the old rmprocs was in fact not doing anything at all, besides silently erroring out.

@maleadt
Copy link
Collaborator

maleadt commented Mar 2, 2026

This is what the code in #101 did, right?

rmprocs is from Distributed, which we don't use anymore, so that line wasn't doing anything anymore. Also, I believe when that line was doing something, it was stopping all the workers only at the very end, with this change I'm stopping each worker as soon as it doesn't have anything else to do.

Yes, I meant conceptually. And you dropped my elaboration on this being useful 🙂

Although the version here makes it happen earlier.


I just looked at the code, we can't stop the worker at

isempty(tests) && break

because it's defined further below

p is; wrkr is only the local version of that (which may be an ephemeral custom worker to be stopped after this test anyway). I've pushed a simplification.

@maleadt maleadt merged commit e40a168 into main Mar 2, 2026
23 checks passed
@maleadt maleadt deleted the mg/stop-worker branch March 2, 2026 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workers aren't stopped when there's no more work to do

3 participants