Docker Model Runner

Overview

Property                     Details
Description                  Docker Model Runner allows you to run large language models locally using Docker Desktop.
Provider Route on LiteLLM    docker_model_runner/
Link to Provider Doc         Docker Model Runner ↗ (https://docs.docker.com/ai/model-runner/)
Base URL                     http://localhost:22088
Supported Operations         /chat/completions

We support all Docker Model Runner models; just set the docker_model_runner/ prefix when sending completion requests.

Quick Start

Docker Model Runner is a Docker Desktop feature that lets you run AI models locally. It provides fast local inference while remaining OpenAI-compatible, so LiteLLM can use it like any other OpenAI-compatible endpoint.

Installation

  1. Install Docker Desktop
  2. Enable Docker Model Runner in Docker Desktop settings
  3. Download your preferred model through Docker Desktop (a quick connectivity check is sketched below)
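
Once a model is downloaded, you can sanity-check that the runner is reachable before pointing LiteLLM at it. This is a minimal sketch that assumes the default local port (22088), the llama.cpp engine path, and that the engine exposes the standard OpenAI-style GET /v1/models route; adjust the base URL if your setup differs.

Verify the local endpoint (sketch)
import requests

# Assumed defaults: local port 22088 and the llama.cpp engine path
base = "http://localhost:22088/engines/llama.cpp"

# List the models the runner is currently serving (OpenAI-style /v1/models route)
resp = requests.get(f"{base}/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model.get("id"))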

Environment Variables

Environment Variables
import os

os.environ["DOCKER_MODEL_RUNNER_API_BASE"] = "http://localhost:22088/engines/llama.cpp"  # Optional - defaults to this
os.environ["DOCKER_MODEL_RUNNER_API_KEY"] = "dummy-key"  # Optional - Docker Model Runner may not require auth for local instances

Note:

  • Docker Model Runner typically runs locally and may not require authentication. LiteLLM will use a dummy key by default if no key is provided (you can also pass api_base and api_key explicitly, as shown below).
  • The API base should include the engine path (e.g., /engines/llama.cpp)
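
If you prefer not to rely on environment variables, the same values can be passed directly to completion(). This is a minimal sketch assuming the default local endpoint; the model name is illustrative and should match a model you have downloaded.

Passing api_base and api_key per call (sketch)
from litellm import completion

response = completion(
    model="docker_model_runner/llama-3.1",                 # illustrative model name
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:22088/engines/llama.cpp",   # include the engine path
    api_key="dummy-key",                                   # placeholder - local instances may not require auth
)
print(response)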

API Base Structure

Docker Model Runner uses a unique URL structure:

http://model-runner.docker.internal/engines/{engine}/v1/chat/completions

Where {engine} is the engine you want to use (typically llama.cpp).

Important: Specify the engine in your api_base URL, not in the model name (see the illustration after this list):

  • ✅ Correct: api_base="http://localhost:22088/engines/llama.cpp", model="docker_model_runner/llama-3.1"
  • ❌ Incorrect: api_base="http://localhost:22088", model="docker_model_runner/llama.cpp/llama-3.1"
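
Putting this together: the engine lives in the api_base, and the OpenAI-style path is appended per request. The snippet below is only an illustration of how the final endpoint is assumed to be composed, matching the URL structure shown above.

Endpoint composition (illustration)
# The engine is part of the base URL...
api_base = "http://localhost:22088/engines/llama.cpp"

# ...and the OpenAI-style chat completions path is appended per request
resolved_endpoint = f"{api_base}/v1/chat/completions"

print(resolved_endpoint)
# -> http://localhost:22088/engines/llama.cpp/v1/chat/completions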

Usage - LiteLLM Python SDK

Non-streaming

Docker Model Runner Non-streaming Completion
import os
import litellm
from litellm import completion

# Specify the engine in the api_base URL
os.environ["DOCKER_MODEL_RUNNER_API_BASE"] = "http://localhost:22088/engines/llama.cpp"

messages = [{"content": "Hello, how are you?", "role": "user"}]

# Docker Model Runner call
response = completion(
    model="docker_model_runner/llama-3.1",
    messages=messages
)

print(response)

Streaming

Docker Model Runner Streaming Completion
import os
import litellm
from litellm import completion

# Specify the engine in the api_base URL
os.environ["DOCKER_MODEL_RUNNER_API_BASE"] = "http://localhost:22088/engines/llama.cpp"

messages = [{"content": "Hello, how are you?", "role": "user"}]

# Docker Model Runner call with streaming
response = completion(
    model="docker_model_runner/llama-3.1",
    messages=messages,
    stream=True
)

for chunk in response:
    print(chunk)

Custom API Base and Engine

Custom API Base with Different Engine
import litellm
from litellm import completion

messages = [{"content": "Hello, how are you?", "role": "user"}]

# Specify the engine in the api_base URL
# Using a different host and engine
response = completion(
    model="docker_model_runner/llama-3.1",
    messages=messages,
    api_base="http://model-runner.docker.internal/engines/llama.cpp"
)

print(response)

Using Different Engines

Using a Different Engine
import litellm
from litellm import completion

messages = [{"content": "Hello, how are you?", "role": "user"}]

# To use a different engine, specify it in the api_base
# For example, if Docker Model Runner supports other engines:
response = completion(
    model="docker_model_runner/mistral-7b",
    messages=messages,
    api_base="http://localhost:22088/engines/custom-engine"
)

print(response)

Usage - LiteLLM Proxy

Add the following to your LiteLLM Proxy configuration file:

config.yaml
model_list:
  - model_name: llama-3.1
    litellm_params:
      model: docker_model_runner/llama-3.1
      api_base: http://localhost:22088/engines/llama.cpp

  - model_name: mistral-7b
    litellm_params:
      model: docker_model_runner/mistral-7b
      api_base: http://localhost:22088/engines/llama.cpp

Start your LiteLLM Proxy server:

Start LiteLLM Proxy
litellm --config config.yaml

# RUNNING on http://0.0.0.0:4000

Docker Model Runner via Proxy - Non-streaming
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-proxy-api-key"       # Your proxy API key
)

# Non-streaming response
response = client.chat.completions.create(
    model="llama-3.1",
    messages=[{"role": "user", "content": "hello from litellm"}]
)

print(response.choices[0].message.content)

Docker Model Runner via Proxy - Streaming
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-proxy-api-key"       # Your proxy API key
)

# Streaming response
response = client.chat.completions.create(
    model="llama-3.1",
    messages=[{"role": "user", "content": "hello from litellm"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

For more detailed information on using the LiteLLM Proxy, see the LiteLLM Proxy documentation.

API Reference

For detailed API information, see the Docker Model Runner API Reference.