[Pre Release] v1.72.2-stable

Krrish Dholakia
Ishaan Jaffer
info

The release candidate is live now.

The production release will be live on Wednesday.

Deploy this version​

```shell
docker run \
  -e STORE_MODEL_IN_DB=True \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-v1.72.2.rc
```

TLDR​

  • Why Upgrade
    • Performance Improvements for /v1/messages: LiteLLM Proxy overhead for this endpoint is now down to 50ms at 250 RPS.
    • Accurate Rate Limiting: Multi-instance rate limiting now tracks rate limits across keys, models, teams, and users with 0 spillover.
    • Audit Logs on UI: Track when Keys, Teams, and Models were deleted by viewing Audit Logs on the LiteLLM UI.
    • /v1/messages all models support: You can now use all LiteLLM models (gpt-4.1, o1-pro, gemini-2.5-pro) with /v1/messages API.
    • Anthropic MCP: Use remote MCP Servers with Anthropic Models.
  • Who Should Read
    • Teams using /v1/messages API (Claude Code)
    • Proxy Admins using LiteLLM Virtual Keys and setting rate limits
  • Risk of Upgrade
    • Medium
      • Upgraded ddtrace==3.8.0; if you use DataDog tracing, this is a medium-level risk. We recommend monitoring logs for any issues.

/v1/messages Performance Improvements​

This release brings significant performance improvements to the /v1/messages API on LiteLLM.

For this endpoint, LiteLLM Proxy overhead latency is now down to 50ms, and each instance can handle 250 RPS. We validated these improvements through load testing with payloads containing over 1,000 streaming chunks.

This is great for real-time use cases with large requests (e.g. multi-turn conversations, Claude Code, etc.).

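The /v1/messages route now accepts any LiteLLM model, so Anthropic-format clients (such as Claude Code) can be pointed at the proxy while routing to non-Anthropic models. Below is a minimal sketch using the requests library; it assumes the proxy from the deploy command above is running on localhost:4000 and that sk-1234 is a valid virtual key (both are placeholders).

```python
import requests

PROXY_BASE = "http://localhost:4000"  # assumed local proxy URL
API_KEY = "sk-1234"                   # placeholder virtual key

# Anthropic Messages API request shape, routed to a non-Anthropic model.
payload = {
    "model": "gpt-4.1",  # any model configured on your proxy
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Summarize this release in one sentence."}
    ],
}

resp = requests.post(
    f"{PROXY_BASE}/v1/messages",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
# Anthropic-style response: a list of content blocks.
print(resp.json()["content"][0]["text"])
```
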
Multi-Instance Rate Limiting Improvements​

LiteLLM v1.72.2.rc now accurately tracks rate limits across keys, models, teams, and users with 0 spillover.

This is a significant improvement over the previous version, which faced issues with leakage and spillover in high-traffic, multi-instance setups.

Key Changes:

  • Redis is now part of the rate limit check, instead of being a background sync. This ensures accuracy and reduces read/write operations during low activity.
  • LiteLLM now uses Lua scripts to ensure all checks are atomic.
  • In-memory caching uses Redis values. This prevents drift and reduces Redis queries once objects are over their limit.

These changes are currently behind the feature flag ENABLE_MULTI_INSTANCE_RATE_LIMITING=True. We plan to GA this in our next release, subject to feedback.

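To try the new rate limiter, set ENABLE_MULTI_INSTANCE_RATE_LIMITING=True (plus a shared Redis) on every proxy instance, then attach limits to a virtual key. A minimal sketch, assuming a proxy on localhost:4000 with admin master key sk-1234 (both placeholders) and the key-level rpm_limit / tpm_limit fields on /key/generate:

```python
import requests

PROXY_BASE = "http://localhost:4000"  # assumed proxy URL
MASTER_KEY = "sk-1234"                # placeholder admin master key

# Create a virtual key with request- and token-per-minute limits.
# With ENABLE_MULTI_INSTANCE_RATE_LIMITING=True on every instance (and a
# shared Redis), these limits are checked atomically across all instances.
resp = requests.post(
    f"{PROXY_BASE}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={"rpm_limit": 100, "tpm_limit": 100_000},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["key"])  # the newly issued virtual key
```
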
Audit Logs on UI​

This release introduces support for viewing audit logs in the UI. As a Proxy Admin, you can now check if and when a key was deleted, along with who performed the action.

LiteLLM tracks changes to the following entities and actions:

  • Entities: Keys, Teams, Users, Models
  • Actions: Create, Update, Delete, Regenerate

New Models / Updated Models​

Newly Added Models

| Provider | Model | Context Window | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|---|
| Anthropic | claude-4-opus-20250514 | 200K | $15.00 | $75.00 |
| Anthropic | claude-4-sonnet-20250514 | 200K | $3.00 | $15.00 |
| VertexAI, Google AI Studio | gemini-2.5-pro-preview-06-05 | 1M | $1.25 | $10.00 |
| OpenAI | codex-mini-latest | 200K | $1.50 | $6.00 |
| Cerebras | qwen-3-32b | 128K | $0.40 | $0.80 |
| SambaNova | DeepSeek-R1 | 32K | $5.00 | $7.00 |
| SambaNova | DeepSeek-R1-Distill-Llama-70B | 131K | $0.70 | $1.40 |

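The prices above are what LiteLLM uses for cost tracking on these models. A minimal sketch calling one of the newly added models via the Python SDK and computing its cost, assuming ANTHROPIC_API_KEY is set in your environment:

```python
import litellm

# Assumes ANTHROPIC_API_KEY is exported in the environment.
response = litellm.completion(
    model="anthropic/claude-4-sonnet-20250514",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(response.choices[0].message.content)

# Cost is computed from the per-token prices registered for the model.
print(litellm.completion_cost(completion_response=response))
```
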
Model Updates​

  • Anthropic
    • Cost tracking added for new Claude models - PR
      • claude-4-opus-20250514
      • claude-4-sonnet-20250514
    • Support for MCP tool calling with Anthropic models - PR (see the sketch after this list)
  • Google AI Studio
    • Google Gemini 2.5 Pro Preview 06-05 support - PR
    • Gemini streaming thinking content parsing with reasoning_content - PR
    • Support for no reasoning option for Gemini models - PR
    • URL context support for Gemini models - PR
    • Gemini embeddings-001 model prices and context window - PR
  • OpenAI
    • Cost tracking for codex-mini-latest - PR
  • Vertex AI
    • Cache token tracking on streaming calls - PR
    • Return response_id matching upstream response ID for stream and non-stream - PR
  • Cerebras
    • Cerebras/qwen-3-32b model pricing and context window - PR
  • HuggingFace
    • Fixed embeddings using non-default input_type - PR
  • DataRobot
    • New provider integration for enterprise AI workflows - PR
  • DeepSeek
    • DeepSeek R1 family model configuration via Together AI - PR
    • DeepSeek R1 pricing and context window configuration - PR

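As noted above, Anthropic models can now call remote MCP servers. A hedged sketch of what such a request might look like through the proxy's /v1/messages route, assuming the proxy passes Anthropic's mcp_servers parameter and beta header through unchanged (the proxy URL, key, server URL, and header value are placeholders; see the linked PR for the exact shape):

```python
import requests

PROXY_BASE = "http://localhost:4000"  # assumed proxy URL
API_KEY = "sk-1234"                   # placeholder virtual key

payload = {
    "model": "claude-4-sonnet-20250514",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "List the tools available on the example server."}
    ],
    # Anthropic's remote MCP connector format; assumed to be forwarded as-is.
    "mcp_servers": [
        {
            "type": "url",
            "url": "https://example-mcp-server.example.com/sse",  # placeholder
            "name": "example-server",
        }
    ],
}

resp = requests.post(
    f"{PROXY_BASE}/v1/messages",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        # Anthropic's MCP connector beta header; assumed to be required here.
        "anthropic-beta": "mcp-client-2025-04-04",
    },
    json=payload,
    timeout=60,
)
print(resp.json())
```
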
LLM API Endpoints​

  • Images API
    • Azure endpoint support for image endpoints - PR
  • Anthropic Messages API
    • Support for ALL LiteLLM Providers (OpenAI, Azure, Bedrock, Vertex, DeepSeek, etc.) on /v1/messages API Spec - PR
    • Performance improvements for /v1/messages route - PR
    • Return streaming usage statistics when using LiteLLM with Bedrock models - PR
  • Embeddings API
    • Provider-specific optional params handling for embedding calls - PR
    • Proper Sagemaker request attribute usage for embeddings - PR
  • Rerank API
    • New HuggingFace rerank provider support - PR, Guide (see the sketch after this list)

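For the new HuggingFace rerank provider referenced above, here is a minimal sketch using litellm.rerank; the model name below is an assumption for illustration, so substitute one from the linked guide:

```python
import litellm

# Hypothetical HuggingFace reranker model name; check the linked guide for
# supported models and any required HF credentials.
response = litellm.rerank(
    model="huggingface/BAAI/bge-reranker-base",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    top_n=1,
)
print(response.results)  # Cohere-style rerank results (index + relevance score)
```
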
Spend Tracking​

  • Added token tracking for Anthropic batch calls via the /anthropic passthrough route - PR

Management Endpoints / UI​

  • SSO/Authentication
    • SSO configuration endpoints and UI integration with persistent settings - PR
    • Update proxy admin ID role in DB + Handle SSO redirects with custom root path - PR
    • Support returning virtual key in custom auth - PR
    • User ID validation to ensure it is not an email or phone number - PR
  • Teams
    • Fixed Create/Update team member API 500 error - PR
    • Enterprise feature gating for RegenerateKeyModal in KeyInfoView - PR
  • SCIM
    • Fixed SCIM running patch operation case sensitivity - PR
  • General
    • Converted action buttons to sticky footer action buttons - PR
    • Custom Server Root Path - support for serving UI on a custom root path - Guide

Logging / Guardrails Integrations​

Logging​

  • S3
    • Async + Batched S3 Logging for improved performance - PR
  • DataDog
    • Add instrumentation for streaming chunks - PR
    • Add DD profiler to monitor the Python CPU profile of LiteLLM - PR
    • Bump DD trace version - PR
  • Prometheus
    • Pass custom metadata labels in litellm_total_token metrics - PR
  • GCS
    • Update GCSBucketBase to handle GSM project ID if passed - PR

Guardrails​

  • Presidio
    • Add presidio_language yaml configuration support for guardrails - PR

Performance / Reliability Improvements​

  • Performance Optimizations
    • Don't run auth on /health/liveliness endpoints - PR
    • Don't create 1 task for every hanging request alert - PR
    • Add debugging endpoint to track active /asyncio-tasks - PR
    • Make batch size for maximum retention in spend logs controllable - PR
    • Expose flag to disable token counter - PR
    • Support pipeline redis lpop for older redis versions - PR

Bug Fixes​

  • LLM API Fixes
    • Anthropic: Fix regression when passing file URLs to the 'file_id' parameter - PR
    • Vertex AI: Fix Vertex AI any_of issues for Description and Default - PR
    • Fix transcription model name mapping - PR
    • Image Generation: Fix None values in usage field for gpt-image-1 model responses - PR
    • Responses API: Fix _transform_responses_api_content_to_chat_completion_content not supporting the file content type - PR
    • Fireworks AI: Fix rate limit exception mapping - detect "rate limit" text in error messages - PR
  • Spend Tracking/Budgets
    • Respect user_header_name property for budget selection and user identification - PR
  • MCP Server
    • Remove duplicate server_id MCP config servers - PR
  • Function Calling
    • supports_function_calling works with llm_proxy models - PR
  • Knowledge Base
    • Fixed Knowledge Base Call returning error - PR

New Contributors​


Demo Instance​

Here's a Demo Instance to test changes:

Git Diff​