fix(mcp): add timeout to background thread join in stop() to prevent hangs#1792
fix(mcp): add timeout to background thread join in stop() to prevent hangs#1792giulio-leone wants to merge 3 commits intostrands-agents:mainfrom
Conversation
…hangs When an Agent holding an MCPClient goes out of scope inside a function, the Agent.__del__ finalizer calls MCPClient.stop() which calls _background_thread.join(). If the background thread cannot exit promptly (e.g. transport subprocess teardown is slow), join() blocks indefinitely, causing the entire process to hang on exit. Add a timeout (equal to startup_timeout, default 30s) to the join() call. If the thread does not exit within the timeout, log a warning and proceed with cleanup. The thread is already a daemon thread (daemon=True), so it will be cleaned up by the interpreter on process exit. Fixes strands-agents#1732 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Prevents process hangs on interpreter shutdown by ensuring MCPClient.stop() doesn’t block indefinitely when joining the background thread (notably when called from Agent.__del__).
Changes:
- Add a timeout (based on
startup_timeout) to_background_thread.join()inMCPClient.stop() - Log a warning when the join times out and continue cleanup
- Update/add tests to assert
join(timeout=...)is used and exercise the timeout path
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/strands/tools/mcp/mcp_client.py |
Adds join(timeout=...) + warning when the background thread doesn’t exit promptly. |
tests/strands/tools/mcp/test_mcp_client.py |
Updates existing stop() tests for the new join timeout behavior and adds a new timeout-path test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # join should have been called with timeout | ||
| mock_thread.join.assert_called_once_with(timeout=client._startup_timeout) | ||
| # Cleanup should still proceed even though thread is alive | ||
| assert client._background_thread is None | ||
| assert client._background_thread_event_loop is None |
There was a problem hiding this comment.
test_stop_does_not_hang_when_join_times_out asserts that cleanup proceeds (event loop close + clearing fields) even when mock_thread.is_alive() remains True. In real execution, a still-alive thread likely means the asyncio loop is still running, and calling loop.close() from the main thread would raise RuntimeError: Cannot close a running event loop. Consider updating the test to assert stop() returns and logs a warning, while not closing the loop / not resetting state until the thread has actually exited (or otherwise ensuring a thread-safe shutdown path).
| # join should have been called with timeout | |
| mock_thread.join.assert_called_once_with(timeout=client._startup_timeout) | |
| # Cleanup should still proceed even though thread is alive | |
| assert client._background_thread is None | |
| assert client._background_thread_event_loop is None | |
| # join should have been called with timeout, and stop() should return promptly | |
| mock_thread.join.assert_called_once_with(timeout=client._startup_timeout) |
There was a problem hiding this comment.
The thread cleanup design follows Python's standard daemon thread pattern — threads are marked as daemon so they don't prevent process exit, and cleanup happens via atexit handlers.
…imeout When join() times out and the thread is still alive, return early without closing the event loop or resetting internal state. This prevents RuntimeError from closing a running loop and avoids allowing a second start() while teardown is still in progress. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
All CI checks pass. Ready for review. |
Expand the docstring to explicitly describe the expected behavior: skip event loop close and state reset when thread is still alive.
Summary
When an
Agentholding anMCPClientis created inside a function and goes out of scope, theAgent.__del__finalizer callsMCPClient.stop()which invokes_background_thread.join(). If the background thread cannot exit promptly (e.g. transport subprocess teardown is slow),join()blocks indefinitely, causing the entire process to hang on exit.Changes
startup_timeout, default 30s) to thejoin()call instop()daemon=True), so it will be cleaned up by the interpreter on process exitTesting
join()is called with timeouttest_stop_does_not_hang_when_join_times_outcovering the timeout pathFixes #1732