Stop workers when there are no more tests to run by giordano · Pull Request #107 · JuliaTesting/ParallelTestRunner.jl

giordano · 2026-03-01T17:37:17Z

This should fix #106 as far as I can tell, but I don't know how to test it, as we don't track the PIDs of the workers anymore after #84.

giordano · 2026-03-01T17:44:44Z

A benefit of this change is that it'll reduce the memory pressure towards the end of long-running tests, by completely terminating workers as soon as they aren't needed anymore, while the stragglers are still doing their job. This is a problem I've had with a downstream package lately: long-running tests may also use a larger amount of memory, but this results in OOM because the other workers are still alive doing exactly nothing, while squatting on memory for no reason.

giordano · 2026-03-01T18:05:17Z

> but I don't know how to test it

I tried with

using ParallelTestRunner, Test
@testset "workers stopped at end" begin
    testsuite = Dict(
        "a" => :(),
        "b" => :(),
        "c" => :(),
        "d" => :(),
        "e" => :(),
        "f" => :(),
    )

    procs = Base.Process[]
    procs_lock = ReentrantLock()

    function test_worker(name)
        wrkr = addworker()
        Base.@lock procs_lock push!(procs, wrkr.w.proc)
        return wrkr
    end

    io = IOBuffer()
    runtests(ParallelTestRunner, Base.ARGS; test_worker, testsuite, stdout=io, stderr=io)
    @test all(!Base.process_running, procs)
end

but this test would also pass on main, because custom workers defined with test_worker are always cleaned up 😞

giordano · 2026-03-01T20:00:55Z

I added some tests. I'm not 100% happy about them, but they do check something. I asked Cursor/Claude for help writing them and it gave me a half-decent idea (counting the child processes of the current process before and after the tests), but it completely botched the way to count the children, so I reworked it following https://askubuntu.com/a/512872. The bot initially used pgrep with wrong arguments on macOS, and checked /proc/$(pid)/children on Linux; that file doesn't exist (it's /proc/$(pid)/task/$(pid)/children), and the proc_tid_children(5) manual says it's not very reliable. In the end, only the initial idea suggested by the bot remained; the final implementation of _count_child_pids is mine. At least I verified that _count_child_pids works on Linux, FreeBSD, and macOS.
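For readers who want to see the general idea of counting child processes, here is a minimal sketch. It is not the `_count_child_pids` helper from this PR (whose implementation is not shown in this conversation); the names `count_children` and `count_child_pids` are hypothetical, and it assumes a `ps` that accepts `-A` and repeated `-o name=` options. Like the PR's helper, it returns a negative value when counting is not possible.

```julia
# Hypothetical sketch; NOT the `_count_child_pids` helper from the PR.

"""
    count_children(ps_output, parent)

Count rows of `PID PPID COMMAND` text whose PPID equals `parent`.
Rows whose command is `ps` are skipped, because the process producing
the listing is itself a child of the caller.
"""
function count_children(ps_output::AbstractString, parent::Integer)
    n = 0
    for line in split(ps_output, '\n')
        fields = split(line)                      # split on any whitespace
        length(fields) >= 3 || continue           # skip blank or malformed rows
        tryparse(Int, fields[2]) == parent || continue
        # Assumption: a command whose basename is exactly "ps" is the listing itself.
        length(fields) == 3 && basename(fields[3]) == "ps" && continue
        n += 1
    end
    return n
end

"""
    count_child_pids(parent = getpid())

Return the number of direct children of `parent`, or -1 if `ps` cannot be used.
"""
function count_child_pids(parent::Integer = getpid())
    Sys.which("ps") === nothing && return -1
    output = try
        # Separate `-o` options with empty headers suppress the header line.
        read(`ps -A -o pid= -o ppid= -o comm=`, String)
    catch
        return -1
    end
    return count_children(output, parent)
end

# The parsing step can be exercised on a fixed listing, independent of the platform.
sample = """
  100     1 julia
  101   100 julia
  102   100 /usr/bin/julia
  103   100 ps
  104   101 julia
"""
println("children of 100 in the sample: ", count_children(sample, 100))
println("children of this process: ", count_child_pids())
```

Keeping the parsing separate from the `ps` call makes the counting logic testable without spawning any process. Excluding children by command name is a simplification: a genuine child that happens to be called `ps` would be missed.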

giordano · 2026-03-01T22:55:57Z

test/runtests.jl

+        "t6" => quote
+            # Make this test run longer than the others so that it runs alone...
+            sleep(5)
+            children = _count_child_pids($(getpid()))
+            # ...then check there's only one worker still running. WARNING: this test may be
+            # flaky on very busy systems, if at this point some of the other tests are still
+            # running, hope for the best.
+            if children >= 0
+                @test children == 1
+            end
+        end,


This is a bit convoluted and at risk of flakiness (I added a note in the comments about it for future record) because it depends on exact timing, but this is one thing I really want to ensure: there's a single worker still running when the others are done.

maleadt · 2026-03-02T06:05:34Z

This is what the code in #101 did, right? Although the version here makes it happen earlier.

Why not put the Malt.stop outside the while loop? We should exit it as soon as there's no work anymore:

ParallelTestRunner.jl/src/ParallelTestRunner.jl

Line 989 in c0188e5

isempty(tests) && break

giordano · 2026-03-02T07:48:40Z

> This is what the code in #101 did, right?

rmprocs is from Distributed, which we don't use anymore, so that line was no longer doing anything. Also, I believe that when that line did work, it stopped all the workers only at the very end; with this change I'm stopping each worker as soon as it has nothing else to do.

> Why not put the Malt.stop outside the while loop? We should exit it as soon as there's no work anymore:
>
> ParallelTestRunner.jl/src/ParallelTestRunner.jl
>
> Line 989 in c0188e5
>
> isempty(tests) && break

OK, that should work as well, I'll try it out when I get back to the computer.

giordano · 2026-03-02T12:34:22Z

I just looked at the code, we can't stop the worker at

ParallelTestRunner.jl/src/ParallelTestRunner.jl

Line 989 in c0188e5

isempty(tests) && break

because it's defined further below

ParallelTestRunner.jl/src/ParallelTestRunner.jl

Lines 998 to 1010 in c0188e5

    
           # pass in init_worker_code to custom worker function if defined 
        
           wrkr = if init_worker_code == :() 
        
               test_worker(test) 
        
           else 
        
               test_worker(test, init_worker_code) 
        
           end 
        
           if wrkr === nothing 
        
               wrkr = p 
        
           end 
        
           # if a worker failed, spawn a new one 
        
           if wrkr === nothing || !Malt.isrunning(wrkr) 
        
               wrkr = p = addworker(; init_worker_code, io_ctx.color) 
        
           end

giordano · 2026-03-02T12:42:23Z

> This is what the code in #101 did, right?

I also just verified that all the tests in

ParallelTestRunner.jl/test/runtests.jl

Lines 407 to 461 in 3530d91

    
           # Issue <https://github.com/JuliaTesting/ParallelTestRunner.jl/issues/106>. 
        
           @testset "default workers stopped at end" begin 
        
               # Use default workers (no test_worker) so the framework creates and should stop them. 
        
               # More tests than workers so some tasks finish early and must stop their worker. 
        
               testsuite = Dict( 
        
                   "t1" => :(), 
        
                   "t2" => :(), 
        
                   "t3" => :(), 
        
                   "t4" => :(), 
        
                   "t5" => :(), 
        
                   "t6" => quote 
        
                       # Make this test run longer than the others so that it runs alone... 
        
                       sleep(5) 
        
                       children = _count_child_pids($(getpid())) 
        
                       # ...then check there's only one worker still running. WARNING: this test may be 
        
                       # flaky on very busy systems, if at this point some of the other tests are still 
        
                       # running, hope for the best. 
        
                       if children >= 0 
        
                           @test children == 1 
        
                       end 
        
                   end, 
        
               ) 
        
               before = _count_child_pids() 
        
               if before < 0 
        
                   # Counting child PIDs not supported on this platform 
        
                   @test_skip false 
        
               else 
        
                   old_id_counter = ParallelTestRunner.ID_COUNTER[] 
        
                   njobs = 2 
        
                   io = IOBuffer() 
        
                   ioc = IOContext(io, :color => true) 
        
                   try 
        
                       runtests(ParallelTestRunner, ["--jobs=$(njobs)", "--verbose"]; 
        
                                testsuite, stdout=ioc, stderr=ioc, init_code=:(include($(joinpath(@__DIR__, "utils.jl"))))) 
        
                   catch 
        
                       # Show output in case of failure, to help debugging. 
        
                       output = String(take!(io)) 
        
                       printstyled(stderr, "Output of failed test >>>>>>>>>>>>>>>>>>>>\n", color=:red, bold=true) 
        
                       println(stderr, output) 
        
                       printstyled(stderr, "End of output <<<<<<<<<<<<<<<<<<<<<<<<<<<<\n", color=:red, bold=true) 
        
                       rethrow() 
        
                   end 
        
                   # Make sure we didn't spawn more workers than expected. 
        
                   @test ParallelTestRunner.ID_COUNTER[] == old_id_counter + njobs 
        
                   # Allow a moment for worker processes to exit 
        
                   for _ in 1:50 
        
                       sleep(0.1) 
        
                       after = _count_child_pids() 
        
                       after >= 0 && after <= before && break 
        
                   end 
        
                   after = _count_child_pids() 
        
                   @test after >= 0 
        
                   @test after == before 
        
               end 
        
           end

fail on v2.4.1 (which is before #101)

Error in testset t6:
Test Failed at /home/mose/.julia/dev/ParallelTestRunner/test/runtests.jl:425
  Expression: children == 1
   Evaluated: 21 == 1

[...]

default workers stopped at end: Test Failed at /home/mose/.julia/dev/ParallelTestRunner/test/runtests.jl:459
  Expression: after == before
   Evaluated: 22 == 20

to confirm that the old rmprocs was in fact not doing anything at all, besides silently erroring out.
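As a side note on the "allow a moment for worker processes to exit" loop in the test quoted above: the same poll-until-true-or-timeout pattern is available in Base as `timedwait`. The sketch below is only an illustration of that function, not what the PR does; `workers_gone` is a hypothetical stand-in for a check like `_count_child_pids() <= before`, and here it simply becomes true after a few polls.

```julia
# Stand-in for the real condition: becomes true once it has been polled three times.
polls = Ref(0)
workers_gone() = (polls[] += 1) >= 3

# Poll the condition every 0.1 s for at most 5 s.
# `timedwait` returns :ok if the condition became true, :timed_out otherwise.
status = timedwait(workers_gone, 5.0; pollint = 0.1)

println("status = ", status, ", polls = ", polls[])
```

A caller would then assert on the returned status (for example `@test status === :ok`) instead of re-reading the count after the loop.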

src/ParallelTestRunner.jl

maleadt · 2026-03-02T13:12:05Z

> > This is what the code in #101 did, right?
>
> rmprocs is from Distributed, which we don't use anymore, so that line was no longer doing anything. Also, I believe that when that line did work, it stopped all the workers only at the very end; with this change I'm stopping each worker as soon as it has nothing else to do.

Yes, I meant conceptually. And you dropped my elaboration on this being useful 🙂

> Although the version here makes it happen earlier.

> I just looked at the code, we can't stop the worker at
>
> ParallelTestRunner.jl/src/ParallelTestRunner.jl
>
> Line 989 in c0188e5
>
> isempty(tests) && break
>
> because it's defined further below

p is; wrkr is only the local version of that (which may be an ephemeral custom worker to be stopped after this test anyway). I've pushed a simplification.
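To make the distinction between `p` and `wrkr` concrete, here is a minimal, self-contained sketch of a task loop that stops its persistent worker as soon as the queue is empty. It uses mock workers rather than Malt, and `MockWorker`, `stop!`, `run_queue!`, and `custom_worker` are hypothetical names chosen for illustration; it is not the code pushed in the simplification commit.

```julia
# Mock stand-in for a worker process; real code would use Malt workers and Malt.stop.
mutable struct MockWorker
    id::Int
    running::Bool
end
stop!(w::MockWorker) = (w.running = false; nothing)

function run_queue!(tests::Vector{String}, lk::ReentrantLock, id::Int,
                    stopped::Vector{Int}; custom_worker = _ -> nothing)
    p = MockWorker(id, true)                  # persistent worker owned by this task
    while true
        test = lock(lk) do
            isempty(tests) ? nothing : popfirst!(tests)
        end
        test === nothing && break             # no work left: leave the loop right away
        custom = custom_worker(test)          # optional ephemeral worker for this test
        wrkr = custom === nothing ? p : custom
        @assert wrkr.running
        sleep(test == "slow" ? 0.5 : 0.01)    # pretend to run the test on `wrkr`
        custom === nothing || stop!(custom)   # ephemeral workers end with their test
    end
    stop!(p)                                  # `p` is in scope here, outside the loop
    lock(lk) do
        push!(stopped, id)
    end
    return p
end

tests = ["slow", "a", "b", "c"]
lk = ReentrantLock()
stopped = Int[]
@sync for id in 1:2
    @async run_queue!(tests, lk, id, stopped)
end
println("stop order: ", stopped)
```

In this run, the task that picked up the quick tests drains the queue and stops its worker while the other task is still busy with the slow test, which is the behaviour the PR is after: idle workers do not linger until the stragglers finish.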

Stop workers when there are no more tests to run

b91b1e2

giordano requested review from maleadt and vchuravy March 1, 2026 17:37

Add tests for checking workers are stopped at the end

7cf6f3c

Slightly simplify _count_child_pids

1c9cdbc

giordano force-pushed the mg/stop-worker branch from 7a09a4a to 1c9cdbc Compare March 1, 2026 20:39

Add test for checking long-running test is running alone

3530d91

vchuravy reviewed Mar 2, 2026

giordano and others added 2 commits March 2, 2026 12:56

Delete entry from running_tests before stopping the worker

d31bca9

Simplify.

e9d8608

maleadt approved these changes Mar 2, 2026

maleadt merged commit e40a168 into main Mar 2, 2026
23 checks passed

maleadt deleted the mg/stop-worker branch March 2, 2026 13:16

maleadt merged 6 commits into main from mg/stop-worker