Conversation
Code Review
This pull request introduces support for the vllm-metal backend, which is a great addition for Apple Silicon users. The changes are comprehensive, including Makefile targets for building and installing, the backend implementation itself, and necessary adjustments in the CLI and scheduler to integrate it. The approach of distributing the Python dependencies via a Docker image is clever. My review includes a few points for improvement: a bug fix for handling streaming responses, a suggestion to improve Makefile maintainability, and minor enhancements to error handling and security in the new backend code.
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Force-pushed from 3d0e6f4 to ea914fc
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Add a `make vllm-metal-dev VLLM_METAL_PATH=../vllm-metal` target to install vllm-metal from local source in editable mode. It installs the vLLM CPU requirements before and after vllm-metal to ensure compatible dependency versions (transformers < 5). Also recognize "dev" as a valid version in the Go backend, to skip re-downloading from Docker Hub when using a local development install. Signed-off-by: Dorin Geman <dorin.geman@docker.com>
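The "dev" version skip described in this commit could look roughly like the sketch below (hypothetical helper name; the actual Go code in the PR may differ):

```go
package main

import "fmt"

// needsDownload reports whether the vllm-metal installation should be
// (re)downloaded from Docker Hub. A local editable install is tagged with
// the version string "dev" and must never be overwritten.
// (Hypothetical helper; the PR's actual function name may differ.)
func needsDownload(installedVersion, wantVersion string) bool {
	if installedVersion == "dev" {
		return false // local development install, keep it
	}
	return installedVersion != wantVersion
}

func main() {
	fmt.Println(needsDownload("dev", "v0.1.0"))    // false: editable install wins
	fmt.Println(needsDownload("v0.0.9", "v0.1.0")) // true: version mismatch
}
```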
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
None of these flagged security issues takes user input. 🤔
Hey - I've found 4 security issues, and left some high level feedback:
Security issues:
- (×4, see Comments 1–4 below) Detected non-static command inside Command. Audit the input to `exec.Command`. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code. (link)
General comments:
- The new non-streaming handling in `ChatWithMessagesContext` relies on the first line not starting with `data: `, which can be brittle for chunked or pretty-printed JSON responses; consider detecting streaming via headers (e.g., `Content-Type` or `Transfer-Encoding`) or attempting a JSON decode first before falling back to SSE parsing.
- Python 3.12 detection and venv setup logic is now duplicated across the Makefile targets, the `vllmmetal` backend (`findSystemPython` / `downloadAndExtract`), and `build-vllm-metal-tarball.sh`; consolidating this into a single reusable helper or clearly aligning the flows would reduce the risk of version skew or behavior drift.
- In `sandbox_darwin.Create`, when `configuration` is empty the process is no longer sandboxed but error messages still refer to a "sandboxed process"; updating the error text or differentiating the code path would make behavior clearer and easier to debug.
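The header-based streaming detection suggested in the first point could be sketched as follows (illustrative helper, not the PR's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// isSSEStream reports whether a response should be parsed as Server-Sent
// Events based on its Content-Type header, instead of sniffing the body's
// first line for a "data: " prefix, which is brittle for chunked or
// pretty-printed JSON. (Illustrative helper, not the PR's actual code.)
func isSSEStream(contentType string) bool {
	// Content-Type may carry parameters, e.g. "text/event-stream; charset=utf-8".
	mediaType := strings.TrimSpace(strings.SplitN(contentType, ";", 2)[0])
	return strings.EqualFold(mediaType, "text/event-stream")
}

func main() {
	fmt.Println(isSSEStream("text/event-stream; charset=utf-8")) // true
	fmt.Println(isSSEStream("application/json"))                 // false
}
```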
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new non-streaming handling in `ChatWithMessagesContext` relies on the first line not starting with `data: `, which can be brittle for chunked or pretty-printed JSON responses; consider detecting streaming via headers (e.g., `Content-Type` or `Transfer-Encoding`) or attempting JSON decode first before falling back to SSE parsing.
- Python 3.12 detection and venv setup logic is now duplicated across the Makefile targets, `vllmmetal` backend (`findSystemPython` / `downloadAndExtract`), and `build-vllm-metal-tarball.sh`; consolidating this into a single reusable helper or clearly aligning the flows would reduce the risk of version skew or behavior drift.
- In `sandbox_darwin.Create`, when `configuration` is empty the process is no longer sandboxed but error messages still refer to a "sandboxed process"; updating the error text or differentiating the code path would make behavior clearer and easier to debug.
## Individual Comments
### Comment 1
<location> `pkg/inference/backends/vllmmetal/vllmmetal.go:141` </location>
<code_context>
out, err := exec.CommandContext(ctx, pythonPath, "--version").Output()
</code_context>
<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.
*Source: opengrep*
</issue_to_address>
### Comment 2
<location> `pkg/inference/backends/vllmmetal/vllmmetal.go:209` </location>
<code_context>
venvCmd := exec.CommandContext(ctx, pythonPath, "-m", "venv", v.installDir)
</code_context>
<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.
*Source: opengrep*
</issue_to_address>
### Comment 3
<location> `pkg/inference/backends/vllmmetal/vllmmetal.go:274` </location>
<code_context>
cmd := exec.CommandContext(ctx, v.pythonPath, "-c", "import vllm_metal")
</code_context>
<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.
*Source: opengrep*
</issue_to_address>
### Comment 4
<location> `pkg/sandbox/sandbox_darwin.go:127` </location>
<code_context>
command = exec.CommandContext(ctx, name, arg...)
</code_context>
<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.
*Source: opengrep*
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
@sourcery-ai resolve

@sourcery-ai dismiss
ilopezluna
left a comment
I'm super excited about this one 👏
ericcurtin
left a comment
This LGTM. Want to answer @ilopezluna's queries, @doringeman?
Force-pushed from e15e244 to 9258bbf
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Force-pushed from 9258bbf to 9f64701
…extraction Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
@sourcery-ai dismiss
You can build and install it manually using:
or, let DMR pull vllm-metal from Hub (https://hub.docker.com/layers/docker/model-runner/vllm-metal-v0.1.0-20260126-121650/images/sha256-9f4eeb45c168889baa14e935f5a9200bf1dcc6ab9d5be392fb1092a414fcc8cf from https://github.com/vllm-project/vllm-metal/releases/tag/v0.1.0-20260126-121650)
or, install vllm-metal from local source (for vllm-metal development)
Running `vllm-metal` requires Python 3.12, the same version used for compiling the wheel. The `vllm-metal` files are downloaded and installed at `~/.docker/model-runner/vllm-metal`. Then, pull and use an MLX model.

At the moment `vllm-metal` doesn't seem to support streaming, so the CLI needs a tweak.
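The CLI tweak could be as simple as a per-backend capability check that forces non-streaming requests for vllm-metal (hypothetical sketch; the actual change in the PR may differ):

```go
package main

import "fmt"

// supportsStreaming reports whether a backend can serve SSE streaming
// responses. vllm-metal currently cannot, so the CLI would fall back to a
// single non-streaming completion request for it.
// (Hypothetical sketch; the actual CLI change in the PR may differ.)
func supportsStreaming(backend string) bool {
	return backend != "vllm-metal"
}

func main() {
	for _, b := range []string{"llama.cpp", "vllm-metal"} {
		fmt.Printf("backend=%s stream=%v\n", b, supportsStreaming(b))
	}
}
```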