Support MIG UUID and respect CUDA_VISIBLE_DEVICES in nnUNetV2Runner#8716
Support MIG UUID and respect CUDA_VISIBLE_DEVICES in nnUNetV2Runner#8716ytl0623 wants to merge 8 commits intoProject-MONAI:devfrom
Conversation
Signed-off-by: ytl0623 <david89062388@gmail.com>
📝 WalkthroughWalkthroughReworks nnUNetV2Runner to construct commands as a tuple (cmd: list[str], env: dict[str, str]) instead of a single shell string. train_single_model_command accepts gpu_id as int | str | tuple | list, validates empty GPU lists, computes num_gpus, preserves an existing CUDA_VISIBLE_DEVICES when gpu_id resolves to 0, or sets it for single/multi-GPU selections, and builds a list-based nnUNetv2_train command with forwarded kwargs and optional --npz. train_single_model, train_parallel, and predict are updated to propagate and use the returned env for subprocess invocations; logging now uses shlex-joined command representations. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 827-830: The current check uses gpu_id == 0 (int) so a string "0"
won’t match and will overwrite CUDA_VISIBLE_DEVICES; update the logic around
gpu_id, os.environ and logger.info in nnunetv2_runner.py to compare string
forms: check if "CUDA_VISIBLE_DEVICES" in os.environ and str(gpu_id) ==
os.environ.get("CUDA_VISIBLE_DEVICES") then log and do not overwrite, otherwise
set os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id); refer to the gpu_id
variable, logger.info call, and the CUDA_VISIBLE_DEVICES env handling
(consistent with train_single_model_command behavior).
🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)
791-791: Docstring should reflect new behavior.The
gpu_idparameter now accepts MIG UUIDs (strings) and respects existingCUDA_VISIBLE_DEVICES. Update the docstring at line 825 to document this.Suggested docstring
- gpu_id: which GPU to use for prediction. + gpu_id: GPU device index (int) or MIG UUID (str) for prediction. + If CUDA_VISIBLE_DEVICES is already set and gpu_id is 0, the existing + environment variable is preserved.Also applies to: 825-825
Signed-off-by: ytl0623 <david89062388@gmail.com>
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)
531-565: Missing docstring and type hints.Per coding guidelines, all definitions require Google-style docstrings documenting parameters and return values. This method also lacks type annotations.
Proposed signature and docstring
- def train_single_model_command(self, config, fold, gpu_id, kwargs): + def train_single_model_command( + self, config: str, fold: int, gpu_id: int | str | tuple | list, kwargs: dict[str, Any] + ) -> str: + """ + Build the shell command string for training a single nnU-Net model. + + Args: + config: Configuration name (e.g., "3d_fullres"). + fold: Cross-validation fold index (0-4). + gpu_id: Device selector—int, str (MIG UUID), or tuple/list for multi-GPU. + kwargs: Additional CLI arguments forwarded to nnUNetv2_train. + + Returns: + Shell command string. + """
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 537-544: The code path for gpu_id being a tuple/list doesn't
handle empty sequences, leaving device_setting and num_gpus in unintended
states; update the branch that checks isinstance(gpu_id, (tuple, list)) to
explicitly handle len(gpu_id) == 0 (e.g., raise a ValueError or TypeError with a
clear message including the received gpu_id), ensuring you reference and set
device_setting and num_gpus deterministically (or abort) instead of falling
through silently; keep the existing branches for len>1 and len==1 unchanged.
🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)
553-557: Leading whitespace whendevice_settingis empty.When the existing env var is reused,
device_setting = ""produces" nnUNetv2_train ...".Trim the prefix
- cmd = ( - f"{device_setting} nnUNetv2_train " + prefix = f"{device_setting} " if device_setting else "" + cmd = ( + f"{prefix}nnUNetv2_train "
1. added docstring and type hints 2. lead whitespace when device_setting is empty 3. empty tuple/list falls through silently Signed-off-by: ytl0623 <david89062388@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 531-533: Add direct unit tests for the new
train_single_model_command method and the updated predict signature: write tests
that call train_single_model_command with representative gpu_id types (int, str,
tuple/list) and sample kwargs and assert the returned command string contains
the expected flags/arguments and correctly interpolates values; add unit tests
for predict that call the method with the new signature parameters (including
optional/edge cases) and assert it accepts the new parameters, invokes the
expected internal behavior (mocking subprocess/IO as needed) and returns/raises
the expected results; focus on exercising argument parsing, kwarg handling, and
error branches, and use mocking for external side effects so tests remain pure
unit tests.
🧹 Nitpick comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)
534-545: Docstring missingRaisessection.Per coding guidelines, document the
ValueErrorraised whengpu_idis an empty tuple/list.Suggested docstring addition
Returns: Shell command string. + + Raises: + ValueError: If gpu_id is an empty tuple or list. """
Signed-off-by: ytl0623 <david89062388@gmail.com>
8c79b9e to
d053eb4
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 528-529: train_single_model_command currently builds and returns a
shell string that is passed to run_cmd (which expects list[str]) and then
executed with shell=True risking injection; change train_single_model_command to
build and return a list[str] (argv-style) instead of an f-string, move any
dynamic settings (e.g. kwargs, fold, trainer_class_name, config) into either
explicit CLI args in that list or into an env dict, then call run_cmd(cmd_list,
env=env_dict) without shell=True; update the caller in nnunetv2_runner.py to
pass the returned list and env to run_cmd and remove shell=True, and add a unit
test for train_single_model_command that asserts the returned value is a list of
strings with correct argument ordering and that environment variables are
returned/handled as expected (validate env contains expected keys like trainer
class / kwargs).
♻️ Duplicate comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)
531-533: Add unit coverage for the new command builder and GPU handling.
Please add direct tests fortrain_single_model_command(int/str/tuple/list, MIG UUIDs, env preservation) andpredict(gpu_id str/int paths). As per coding guidelines, ...
Signed-off-by: ytl0623 <david89062388@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/apps/nnunet/nnunetv2_runner.py (1)
489-529:⚠️ Potential issue | 🟡 MinorAdd
strtype hint totrain_single_modeland fix CUDA_VISIBLE_DEVICES docstring.Line 489 signature omits
strfor MIG UUID support (present intrain_single_model_command). Line 492 docstring incorrectly states the environment variable is always overridden—code preserves it whengpu_id=0and already set. Also, line 544 Returns docstring misidentifies the return type as a string rather than a tuple.📝 Proposed updates
- def train_single_model(self, config: Any, fold: int, gpu_id: tuple | list | int = 0, **kwargs: Any) -> None: + def train_single_model(self, config: Any, fold: int, gpu_id: tuple | list | int | str = 0, **kwargs: Any) -> None: """ Run the training on a single GPU with one specified configuration provided. - Note: this will override the environment variable `CUDA_VISIBLE_DEVICES`. + Note: if CUDA_VISIBLE_DEVICES is already set and gpu_id resolves to 0, the existing value is preserved; + otherwise it is set to gpu_id. Args: config: configuration that should be trained. Examples: "2d", "3d_fullres", "3d_lowres". fold: fold of the 5-fold cross-validation. Should be an int between 0 and 4. - gpu_id: an integer to select the device to use, or a tuple/list of GPU device indices used for multi-GPU - training (e.g., (0,1)). Default: 0. + gpu_id: an int, MIG UUID (str), or tuple/list of GPU indices for multi-GPU training (e.g., (0,1)). Default: 0.Returns: - Shell command string. + Tuple of (command list, environment dict) for the training process.
🤖 Fix all issues with AI agents
In `@monai/apps/nnunet/nnunetv2_runner.py`:
- Around line 534-545: The Returns section of the train_single_model_command
docstring is incorrect: the function signature and callsite expect a
tuple[list[str], dict[str, str]] but the docstring still says "Shell command
string." Update the Returns block in train_single_model_command to clearly state
it returns a tuple where the first element is a list of command string parts
(list[str]) and the second element is an environment dictionary (dict[str, str])
that should be used when running the command, matching the function signature
and how callers unpack the result.
- Around line 549-591: The call to train_single_model_command returns (cmd, env)
but the caller stored the tuple instead of unpacking it, causing later code that
expects a list of command strings (e.g., stage[device_id] used with ";
".join(stage[device_id])) to fail; fix by unpacking the return into cmd, env
when invoking train_single_model_command and append/extend only the cmd (not the
tuple) into stage[device_id] (or whatever collection holds commands), and ensure
the environment is actually passed to subprocess invocations (e.g.,
subprocess.run/launch) by supplying env=env so child processes inherit the
CUDA_VISIBLE_DEVICES and other settings.
…ES docstring. 2. Fix train_single_model_command return docstring. 3. Unpack tuple return from train_single_model_command and propagate environment to subprocesses. Signed-off-by: ytl0623 <david89062388@gmail.com>
Fixes #7497
Description
This PR fixes two critical issues when running
nnUNetV2Runneron NVIDIA MIG (Multi-Instance GPU) environments or when using a specificCUDA_VISIBLE_DEVICESconfiguration.Types of changes
./runtests.sh -f -u --net --coverage../runtests.sh --quick --unittests --disttests.make htmlcommand in thedocs/folder.