Overview

Set model list, api_base, api_key, temperature & proxy server settings (master-key) on the config.yaml.

Param Name | Description
model_list | List of supported models on the server, with model-specific configs
router_settings | litellm Router settings, e.g. routing_strategy="least-busy"
litellm_settings | litellm module settings, e.g. litellm.drop_params=True, litellm.set_verbose=True, litellm.api_base, litellm.cache
general_settings | Server settings, e.g. master_key: sk-my_special_key
environment_variables | Environment variables, e.g. REDIS_HOST, REDIS_PORT

Complete List: Check the Swagger UI docs on <your-proxy-url>/#/config.yaml (e.g. http://0.0.0.0:4000/#/config.yaml), for everything you can pass in the config.yaml.

Quick Start

Set a model alias for your deployments.

In the config.yaml the model_name parameter is the user-facing name to use for your deployment.

In the config below:

  • model_name: the name to pass TO litellm from the external client
  • litellm_params.model: the model string passed to the litellm.completion() function

E.g.:

  • model=vllm-models will route to openai/facebook/opt-125m.
  • model=gpt-4o will load balance between azure/gpt-4o-eu and azure/gpt-4o-ca.

model_list:
  - model_name: gpt-4o ### RECEIVED MODEL NAME ###
    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
      model: azure/gpt-4o-eu ### MODEL NAME sent to `litellm.completion()` ###
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: "os.environ/AZURE_API_KEY_EU" # does os.getenv("AZURE_API_KEY_EU")
      rpm: 6 # [OPTIONAL] Rate limit for this deployment: in requests per minute (rpm)
  - model_name: bedrock-claude-v1
    litellm_params:
      model: bedrock/anthropic.claude-instant-v1
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: "os.environ/AZURE_API_KEY_CA"
      rpm: 6
  - model_name: anthropic-claude
    litellm_params:
      model: bedrock/anthropic.claude-instant-v1
      ### [OPTIONAL] SET AWS REGION ###
      aws_region_name: us-east-1
  - model_name: vllm-models
    litellm_params:
      model: openai/facebook/opt-125m # the `openai/` prefix tells litellm it's openai compatible
      api_base: http://0.0.0.0:4000/v1
      api_key: none
      rpm: 1440
    model_info:
      version: 2

  # Use this if you want to make requests to `claude-3-haiku-20240307`,`claude-3-opus-20240229`,`claude-2.1` without defining them on the config.yaml
  # Default models
  # Works for ALL Providers and needs the default provider credentials in .env
  - model_name: "*"
    litellm_params:
      model: "*"

litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
  drop_params: True
  success_callback: ["langfuse"] # OPTIONAL - if you want to start sending LLM Logs to Langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your env

general_settings:
  master_key: sk-1234 # [OPTIONAL] Only use this if you want to require all calls to contain this key (Authorization: Bearer sk-1234)
  alerting: ["slack"] # [OPTIONAL] If you want Slack Alerts for Hanging LLM requests, Slow llm responses, Budget Alerts. Make sure to set `SLACK_WEBHOOK_URL` in your env
info

For more provider-specific info, go here
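
To make the alias behaviour concrete, here is a minimal Python sketch that groups a parsed model_list by model_name. It is only an illustration of the mapping described above, not LiteLLM's internal code; the helper name group_deployments is hypothetical, and the list is a trimmed copy of the config above.

```python
from collections import defaultdict

# Trimmed copy of the model_list above, as it would look after YAML parsing.
model_list = [
    {"model_name": "gpt-4o", "litellm_params": {"model": "azure/gpt-4o-eu"}},
    {"model_name": "bedrock-claude-v1", "litellm_params": {"model": "bedrock/anthropic.claude-instant-v1"}},
    {"model_name": "gpt-4o", "litellm_params": {"model": "azure/gpt-4o-ca"}},
    {"model_name": "vllm-models", "litellm_params": {"model": "openai/facebook/opt-125m"}},
]


def group_deployments(models):
    """Map each user-facing model_name to the litellm model strings behind it."""
    groups = defaultdict(list)
    for entry in models:
        groups[entry["model_name"]].append(entry["litellm_params"]["model"])
    return dict(groups)


groups = group_deployments(model_list)
print(groups["gpt-4o"])       # the two deployments that share this alias
print(groups["vllm-models"])  # a single deployment
```

An alias with more than one entry is a load-balanced group; an alias with one entry is a plain rename.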

Step 2: Start Proxy with config

$ litellm --config /path/to/config.yaml
tip

Run with --detailed_debug if you need detailed debug logs

$ litellm --config /path/to/config.yaml --detailed_debug

Step 3: Test it

This sends a request to the model with model_name=gpt-4o in config.yaml.

If multiple deployments share model_name=gpt-4o, the proxy load balances between them.

Langchain, OpenAI SDK Usage Examples

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
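
The same request can be built from Python with only the standard library. This is a minimal sketch that assumes the proxy is reachable at http://0.0.0.0:4000 and that no master_key is configured (otherwise an Authorization: Bearer header would also be needed). The helper names build_chat_request and send are hypothetical.

```python
import json
import urllib.request

PROXY_URL = "http://0.0.0.0:4000"  # assumption: proxy running locally on the default port


def build_chat_request(model, content, base_url=PROXY_URL):
    """Build (but do not send) a POST to the proxy's /chat/completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request("gpt-4o", "what llm are you")
# send(request) performs the call; it needs a running proxy, so it is not invoked here.
```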

LLM configs model_list

Model-specific params (API Base, Keys, Temperature, Max Tokens, Organization, Headers etc.)

You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

All input params

Step 1: Create a config.yaml file

model_list:
  - model_name: gpt-4-team1
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      azure_ad_token: eyJ0eXAiOiJ
      seed: 12
      max_tokens: 20
  - model_name: gpt-4-team2
    litellm_params:
      model: azure/gpt-4
      api_key: sk-123
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
      temperature: 0.2
  - model_name: openai-gpt-4o
    litellm_params:
      model: openai/gpt-4o
      extra_headers: {"AI-Resource Group": "ishaan-resource"}
      api_key: sk-123
      organization: org-ikDc4ex8NB
      temperature: 0.2
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Expected Logs:

Look for this line in your console logs to confirm the config.yaml was loaded in correctly.

LiteLLM: Proxy initialized with Config, Set models:

Embedding Models - Use Sagemaker, Bedrock, Azure, OpenAI, XInference

See supported Embedding Providers & Models here

model_list:
  - model_name: bedrock-cohere
    litellm_params:
      model: "bedrock/cohere.command-text-v14"
      aws_region_name: "us-west-2"
  - model_name: bedrock-cohere
    litellm_params:
      model: "bedrock/cohere.command-text-v14"
      aws_region_name: "us-east-2"
  - model_name: bedrock-cohere
    litellm_params:
      model: "bedrock/cohere.command-text-v14"
      aws_region_name: "us-east-1"

Start Proxy

litellm --config config.yaml

Make Request

Sends Request to bedrock-cohere

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "bedrock-cohere",
"messages": [
{
"role": "user",
"content": "gm"
}
]
}'

Multiple OpenAI Organizations

Add all OpenAI models across all OpenAI organizations with just one model definition.

model_list:
  - model_name: "*" # quoted, because a bare * starts a YAML alias
    litellm_params:
      model: openai/*
      api_key: os.environ/OPENAI_API_KEY
      organization:
        - org-1
        - org-2
        - org-3

LiteLLM will automatically create separate deployments for each org.
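
As a rough mental model of that expansion, the sketch below turns one definition with a list of organizations into one deployment per org. It is an illustrative assumption about the resulting shape, not LiteLLM's actual implementation, and expand_organizations is a hypothetical helper.

```python
import copy


def expand_organizations(deployment):
    """Return one deployment per organization when `organization` is a list."""
    orgs = deployment["litellm_params"].get("organization")
    if not isinstance(orgs, list):
        return [deployment]  # nothing to expand
    expanded = []
    for org in orgs:
        clone = copy.deepcopy(deployment)  # keep the original definition untouched
        clone["litellm_params"]["organization"] = org
        expanded.append(clone)
    return expanded


definition = {
    "model_name": "*",
    "litellm_params": {
        "model": "openai/*",
        "api_key": "os.environ/OPENAI_API_KEY",
        "organization": ["org-1", "org-2", "org-3"],
    },
}

deployments = expand_organizations(definition)
print([d["litellm_params"]["organization"] for d in deployments])
```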

Confirm this via

curl --location 'http://0.0.0.0:4000/v1/model/info' \
--header "Authorization: Bearer ${LITELLM_KEY}" \
--data ''

Load Balancing

info

For more on this, go to this page

Use this to call multiple instances of the same model and configure things like routing strategy.

For optimal performance:

  • Set tpm/rpm per model deployment. Weighted picks are then based on the established tpm/rpm.
  • Select your optimal routing strategy in router_settings:routing_strategy.

LiteLLM supports the following routing strategies:

["simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing"], default="simple-shuffle"

When tpm/rpm is set and routing_strategy==simple-shuffle, litellm uses a weighted pick based on the set tpm/rpm. In our load tests, setting tpm/rpm for all deployments with routing_strategy==simple-shuffle maximized throughput.

  • When using multiple LiteLLM Servers / Kubernetes, set the redis settings (router_settings:redis_host etc.)

model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
      rpm: 60 # Optional[int]: When rpm/tpm set - litellm uses weighted pick for load balancing. rpm = Rate limit for this deployment: in requests per minute (rpm).
      tpm: 1000 # Optional[int]: tpm = Tokens Per Minute
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
      rpm: 600
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
      rpm: 60000
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: <my-openai-key>
      rpm: 200
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: <my-openai-key>
      rpm: 100

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise Timeout error if call takes longer than 10s. Sets litellm.request_timeout
  fallbacks: [{"zephyr-beta": ["gpt-4o"]}] # fallback to gpt-4o if call fails num_retries
  context_window_fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo-16k"]}, {"gpt-4o": ["gpt-3.5-turbo-16k"]}] # fallback to gpt-3.5-turbo-16k if context window error
  allowed_fails: 3 # cooldown model if it fails > 3 calls in a minute.

router_settings: # router_settings are optional
  routing_strategy: simple-shuffle # Literal["simple-shuffle", "least-busy", "usage-based-routing","latency-based-routing"], default="simple-shuffle"
  model_group_alias: {"gpt-4": "gpt-4o"} # all requests with `gpt-4` will be routed to models with `gpt-4o`
  num_retries: 2
  timeout: 30 # 30 seconds
  redis_host: <your redis host> # set this when using multiple litellm proxy deployments, load balancing state stored in redis
  redis_password: <your redis password>
  redis_port: 1992
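
To illustrate what a weighted pick based on rpm means, here is a minimal sketch using Python's random.choices. It is a simplified stand-in for the idea, not the router's actual code. weighted_pick is a hypothetical helper, and the deployments mirror the three zephyr-beta entries above.

```python
import random


def weighted_pick(deployments, rng=random):
    """Pick one deployment at random, weighted by its rpm value."""
    weights = [d["rpm"] for d in deployments]
    return rng.choices(deployments, weights=weights, k=1)[0]


zephyr = [
    {"api_base": "http://0.0.0.0:8001", "rpm": 60},
    {"api_base": "http://0.0.0.0:8002", "rpm": 600},
    {"api_base": "http://0.0.0.0:8003", "rpm": 60000},
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
counts = {d["api_base"]: 0 for d in zephyr}
for _ in range(10_000):
    counts[weighted_pick(zephyr, rng)["api_base"]] += 1

print(counts)  # the rpm=60000 deployment receives the vast majority of picks
```

With weights of 60, 600 and 60000, roughly 99% of picks should land on the third deployment, which is why setting realistic rpm values per deployment matters.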

You can view your cost once you set up Virtual keys or custom_callbacks

Load API Keys / config values from Environment

If you have secrets saved in your environment, and don't want to expose them in the config.yaml, here's how to load model-specific keys from the environment. This works for ANY value on the config.yaml

os.environ/<YOUR-ENV-VAR> # runs os.getenv("YOUR-ENV-VAR")

model_list:
  - model_name: gpt-4-team1
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key: os.environ/AZURE_NORTH_AMERICA_API_KEY # 👈 KEY CHANGE

See Code

s/o to @David Manouchehri for helping with this.
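
The os.environ/ convention above can be pictured with a small resolver. This is a sketch of the idea only, not the proxy's own code path. resolve_env_refs is a hypothetical helper, and the environment value is set inline purely for the demo.

```python
import os

PREFIX = "os.environ/"


def resolve_env_refs(value):
    """Recursively replace "os.environ/NAME" strings with os.getenv("NAME")."""
    if isinstance(value, str) and value.startswith(PREFIX):
        return os.getenv(value[len(PREFIX):])
    if isinstance(value, dict):
        return {key: resolve_env_refs(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env_refs(item) for item in value]
    return value  # numbers, booleans and plain strings pass through unchanged


os.environ["AZURE_NORTH_AMERICA_API_KEY"] = "example-key"  # demo value only
params = {
    "model": "azure/chatgpt-v-2",
    "api_key": "os.environ/AZURE_NORTH_AMERICA_API_KEY",
}
print(resolve_env_refs(params)["api_key"])  # example-key
```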

Centralized Credential Management

Define credentials once and reuse them across multiple models. This helps with:

  • Secret rotation
  • Reducing config duplication

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      litellm_credential_name: default_azure_credential # Reference credential below

credential_list:
  - credential_name: default_azure_credential
    credential_values:
      api_key: os.environ/AZURE_API_KEY # Load from environment
      api_base: os.environ/AZURE_API_BASE
      api_version: "2023-05-15"
    credential_info:
      description: "Production credentials for EU region"
      custom_llm_provider: "azure"

Key Parameters

  • credential_name: Unique identifier for the credential set
  • credential_values: Key-value pairs of credentials/secrets (supports os.environ/ syntax)
  • credential_info: Key-value pairs of user provided credentials information. No key-value pairs are required, but the dictionary must exist.
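
One way to picture how a named credential is applied is the sketch below, which copies credential_values into every model that references the credential by name. This is an illustrative assumption about the lookup, not LiteLLM's implementation. apply_credentials is a hypothetical helper, and the rule that model-level values win over shared ones is also an assumption.

```python
def apply_credentials(model_list, credential_list):
    """Copy credential_values into each model that names a credential."""
    by_name = {c["credential_name"]: c["credential_values"] for c in credential_list}
    resolved = []
    for entry in model_list:
        params = dict(entry["litellm_params"])
        name = params.pop("litellm_credential_name", None)
        if name is not None:
            # Assumption: values set directly on the model override shared ones.
            params = {**by_name[name], **params}
        resolved.append({**entry, "litellm_params": params})
    return resolved


models = [
    {
        "model_name": "gpt-4o",
        "litellm_params": {
            "model": "azure/gpt-4o",
            "litellm_credential_name": "default_azure_credential",
        },
    }
]
credentials = [
    {
        "credential_name": "default_azure_credential",
        "credential_values": {
            "api_key": "os.environ/AZURE_API_KEY",
            "api_base": "os.environ/AZURE_API_BASE",
            "api_version": "2023-05-15",
        },
        "credential_info": {},  # must exist, may be empty
    }
]

merged = apply_credentials(models, credentials)
print(sorted(merged[0]["litellm_params"]))
```

Rotating a secret then means changing one credential_values entry (or the environment variable behind it) rather than every model definition.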

Load API Keys from Secret Managers (Azure Vault, etc)

Using Secret Managers with LiteLLM Proxy

Set Supported Environments for a model - production, staging, development

Use this if you want to control which model is exposed on a specific litellm environment

Supported Environments:

  • production
  • staging
  • development

  1. Set LITELLM_ENVIRONMENT="<environment>" in your environment. It can be one of production, staging or development.

  2. For each model, set the list of supported environments in model_info.supported_environments.

model_list:
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: openai/gpt-3.5-turbo-16k
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      supported_environments: ["development", "production", "staging"]
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      supported_environments: ["production", "staging"]
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      supported_environments: ["production"]
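
The effect of this config can be sketched as a simple filter over the parsed model_list. The helper models_for_environment is hypothetical, and treating a model without supported_environments as unrestricted is an assumption made for the sketch.

```python
import os


def models_for_environment(model_list, environment=None):
    """Return the model_name values visible in the given environment."""
    environment = environment or os.getenv("LITELLM_ENVIRONMENT")
    visible = []
    for entry in model_list:
        supported = entry.get("model_info", {}).get("supported_environments")
        # Assumption: no list, or no environment set, means no restriction.
        if supported is None or environment is None or environment in supported:
            visible.append(entry["model_name"])
    return visible


model_list = [
    {"model_name": "gpt-3.5-turbo-16k",
     "model_info": {"supported_environments": ["development", "production", "staging"]}},
    {"model_name": "gpt-4",
     "model_info": {"supported_environments": ["production", "staging"]}},
    {"model_name": "gpt-4o",
     "model_info": {"supported_environments": ["production"]}},
]

print(models_for_environment(model_list, "staging"))      # ['gpt-3.5-turbo-16k', 'gpt-4']
print(models_for_environment(model_list, "development"))  # ['gpt-3.5-turbo-16k']
```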

Set Custom Prompt Templates

LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a huggingface model has a saved chat template in its tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:

Step 1: Save your prompt template in a config.yaml

# Model-specific parameters
model_list:
  - model_name: mistral-7b # model alias
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1"
      api_base: "<your-api-base>"
      api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
      initial_prompt_value: "\n"
      roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
      final_prompt_value: "\n"
      bos_token: " "
      eos_token: " "
      max_tokens: 4096

Step 2: Start server with config

$ litellm --config /path/to/config.yaml
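
To see how the roles, initial_prompt_value and final_prompt_value settings shape the final prompt, here is a simplified rendering sketch. It wraps each message in its role's pre_message and post_message and ignores bos_token and eos_token. It is an approximation for illustration, not LiteLLM's template code, and render_prompt is a hypothetical helper.

```python
ROLES = {
    "system": {"pre_message": "<|im_start|>system\n", "post_message": "<|im_end|>"},
    "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"},
    "user": {"pre_message": "<|im_start|>user\n", "post_message": "<|im_end|>"},
}


def render_prompt(messages, roles=ROLES, initial_prompt_value="\n", final_prompt_value="\n"):
    """Concatenate messages, wrapping each one in its role's pre/post markers."""
    parts = [initial_prompt_value]
    for message in messages:
        wrapper = roles[message["role"]]
        parts.append(wrapper["pre_message"] + message["content"] + wrapper["post_message"])
    parts.append(final_prompt_value)
    return "".join(parts)


prompt = render_prompt([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
print(repr(prompt))
```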

Set custom tokenizer

If you're using the /utils/token_counter endpoint, and want to set a custom huggingface tokenizer for a model, you can do so in the config.yaml

model_list:
  - model_name: openai-deepseek
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      access_groups: ["restricted-models"]
      custom_tokenizer:
        identifier: deepseek-ai/DeepSeek-V3-Base
        revision: main
        auth_token: os.environ/HUGGINGFACE_API_KEY

Spec

custom_tokenizer:
  identifier: str # huggingface model identifier
  revision: str # huggingface model revision (usually 'main')
  auth_token: Optional[str] # huggingface auth token

General Settings general_settings (DB Connection, etc)

Configure DB Pool Limits + Connection Timeouts

general_settings:
  database_connection_pool_limit: 10 # sets connection pool per worker for prisma client to postgres db (default: 10, recommended: 10-20)
  database_connection_timeout: 60 # sets a 60s timeout for any connection call to the db

How to calculate the right value:

The connection limit is applied per worker process, not per instance. This means if you have multiple workers, each worker will create its own connection pool.

Formula:

database_connection_pool_limit = MAX_DB_CONNECTIONS ÷ (number_of_instances × number_of_workers_per_instance)

Example:

  • Your database allows a maximum of 100 connections
  • You're running 1 instance of LiteLLM
  • Each instance has 8 workers (set via --num_workers 8)

Calculation: 100 ÷ (1 × 8) = 12.5

Since you shouldn't use 12.5, round down to 10 to leave a safety buffer. This means:

  • Each of the 8 workers will have a connection pool limit of 10
  • Total maximum connections: 8 workers × 10 connections = 80 connections
  • This stays safely under your database's 100 connection limit
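
The arithmetic above can be checked with a few lines of Python. The function names are hypothetical. max_pool_limit returns the hard ceiling from the formula, and the example then picks a lower value (10) to leave a buffer.

```python
def max_pool_limit(max_db_connections, instances, workers_per_instance):
    """Largest whole per-worker pool size that stays within the DB's limit."""
    return max_db_connections // (instances * workers_per_instance)


def total_connections(pool_limit, instances, workers_per_instance):
    """Worst-case number of connections opened across all workers."""
    return pool_limit * instances * workers_per_instance


ceiling = max_pool_limit(100, instances=1, workers_per_instance=8)
print(ceiling)                                                        # 12 (12.5 rounded down)
print(total_connections(10, instances=1, workers_per_instance=8))     # 80
```

Remember to recompute the value whenever the number of instances or workers changes, since the pool is per worker.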

Cap Idle DB Connections + Pass Extra Prisma URL Params

If you're seeing a large number of idle Prisma connections that never close, set database_socket_timeout so Prisma closes any connection that's been silent past the threshold. You can also bound how long Prisma waits to open a new connection with database_connect_timeout, and pass arbitrary extra query-string params through to Prisma via database_extra_connection_params.

These map to the Prisma PostgreSQL connection URL params of the same name (minus the database_ prefix), and LiteLLM appends them to both DATABASE_URL and DIRECT_URL.

general_settings:
  database_connection_pool_limit: 20
  database_socket_timeout: 300 # close any connection idle/slow for >5 min
  database_connect_timeout: 15 # fail fast if a new connection can't be established within 15s
  database_extra_connection_params:
    pgbouncer: "true" # set if running behind PgBouncer
    statement_cache_size: 0
    sslmode: "require"

Notes:

  • database_socket_timeout is the main knob for capping idle DB connections from LiteLLM.
  • database_connect_timeout and database_socket_timeout are omitted from the URL when unset, so Prisma's defaults apply.
  • database_extra_connection_params is an untyped passthrough: any key you set here overrides the LiteLLM-set defaults for that key (e.g. you can override pool_timeout from this dict). Use it for sslmode, pgbouncer, statement_cache_size, or any other Prisma URL param.
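
The sketch below shows how such settings could end up as query-string params on a connection URL, with the extra params applied last so they win on conflicts. It is an illustration of the behaviour described in the notes, not LiteLLM's code. with_connection_params is a hypothetical helper and the database URL is a made-up example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_connection_params(database_url, connect_timeout=None, socket_timeout=None, extra=None):
    """Append timeout and passthrough params to a connection URL's query string."""
    parts = urlsplit(database_url)
    params = dict(parse_qsl(parts.query))
    # Unset timeouts are left out of the URL, so the driver's defaults apply.
    if connect_timeout is not None:
        params["connect_timeout"] = str(connect_timeout)
    if socket_timeout is not None:
        params["socket_timeout"] = str(socket_timeout)
    # Extra params are applied last and override anything set above.
    for key, value in (extra or {}).items():
        params[key] = str(value)
    return urlunsplit(parts._replace(query=urlencode(params)))


url = with_connection_params(
    "postgresql://user:pass@localhost:5432/litellm",  # hypothetical URL
    connect_timeout=15,
    socket_timeout=300,
    extra={"pgbouncer": "true", "statement_cache_size": 0, "sslmode": "require"},
)
print(url)
```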

Disable Server-Side Prepared Statements

Set database_disable_prepared_statements: true to stop Prisma from reusing server-side prepared statements. It appends pgbouncer=true to the Prisma connection URL, so each query is prepared fresh instead of reusing a cached plan.

general_settings:
  database_disable_prepared_statements: true

Use this when:

  • LiteLLM connects to Postgres through PgBouncer in transaction pooling mode, where reused prepared statements break because consecutive queries can land on different server connections.
  • You run rolling deployments with schema migrations and see "cached plan must not change result type" errors. The error fires when a migration changes the result type of a column referenced by a plan that a pooled connection still holds; with this flag on there is no reused plan to invalidate, so the migration is harmless.

The tradeoff is that every query pays the prepare cost instead of amortizing it, which adds a small per-query overhead. An explicit pgbouncer key in database_extra_connection_params takes precedence over this flag.

LiteLLM License Key (Enterprise)

To enable LiteLLM Enterprise features, set your license key as an environment variable:

export LITELLM_LICENSE="eyJ..."

The license key is a JWT token provided when you purchase a LiteLLM Enterprise license. Once set, LiteLLM will automatically detect and activate enterprise features.

You can also add it to your .env file:

LITELLM_LICENSE="eyJ..."

Extras

Disable Swagger UI

To disable the Swagger docs from the base url, set

NO_DOCS="True"

in your environment, and restart the proxy.

Disable Redoc

To disable the Redoc docs (defaults to <your-proxy-url>/redoc), set

NO_REDOC="True"

in your environment, and restart the proxy.

Use CONFIG_FILE_PATH for proxy (Easier Azure container deployment)

  1. Setup config.yaml

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  2. Store filepath as env var

CONFIG_FILE_PATH="/path/to/config.yaml"

  3. Start Proxy

$ litellm

# RUNNING on http://0.0.0.0:4000

Providing LiteLLM config.yaml file as an S3 or GCS Bucket Object/URL

Use this if you cannot mount a config file on your deployment service (example - AWS Fargate, Railway etc)

LiteLLM Proxy will read your config.yaml from an s3 Bucket or GCS Bucket

Set the following .env vars

LITELLM_CONFIG_BUCKET_TYPE = "gcs" # set this to "gcs"
LITELLM_CONFIG_BUCKET_NAME = "litellm-proxy" # your bucket name on GCS
LITELLM_CONFIG_BUCKET_OBJECT_KEY = "proxy_config.yaml" # object key on GCS

Start litellm proxy with these env vars - litellm will read your config from GCS

docker run --name litellm-proxy \
-e DATABASE_URL=<database_url> \
-e LITELLM_CONFIG_BUCKET_NAME=<bucket_name> \
-e LITELLM_CONFIG_BUCKET_OBJECT_KEY="<object_key>" \
-e LITELLM_CONFIG_BUCKET_TYPE="gcs" \
-p 4000:4000 \
docker.litellm.ai/berriai/litellm-database:main-latest --detailed_debug