[Pre Release] v1.72.2-stable

Krrish Dholakia
Ishaan Jaffer
info

The release candidate is live now.

The production release will be live on Wednesday.

Deploy this version​

```shell
docker run \
  -e STORE_MODEL_IN_DB=True \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-v1.72.2.rc
```

TLDR​

  • Why Upgrade
    • Performance Improvements for /v1/messages: LiteLLM Proxy overhead for this endpoint is now down to 50ms at 250 RPS.
    • Accurate Rate Limiting: Multi-instance rate limiting now tracks rate limits across keys, models, teams, and users with 0 spillover.
    • Audit Logs on UI: Track when Keys, Teams, and Models were deleted by viewing Audit Logs on the LiteLLM UI.
    • /v1/messages all models support: You can now use all LiteLLM models (gpt-4.1, o1-pro, gemini-2.5-pro) with /v1/messages API.
    • Anthropic MCP: Use remote MCP Servers with Anthropic Models.
  • Who Should Read
    • Teams using /v1/messages API (Claude Code)
    • Proxy Admins using LiteLLM Virtual Keys and setting rate limits
  • Risk of Upgrade
    • Medium
      • Upgraded ddtrace==3.8.0; if you use DataDog tracing, this is a medium-level risk. We recommend monitoring logs for any issues.

/v1/messages Performance Improvements​

This release brings significant performance improvements to the /v1/messages API on LiteLLM.

For this endpoint, LiteLLM Proxy overhead latency is now down to 50ms, and each instance can handle 250 RPS. We validated these improvements through load testing with payloads containing over 1,000 streaming chunks.

This is great for real-time use cases with large requests (e.g. multi-turn conversations, Claude Code, etc.).

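The /v1/messages route now accepts any LiteLLM model, so Anthropic-format clients (such as Claude Code) can be pointed at the proxy while routing to non-Anthropic models. Below is a minimal sketch using the requests library; it assumes the proxy from the deploy command above is running on localhost:4000 and that sk-1234 is a valid virtual key (both are placeholders).

```python
import requests

PROXY_BASE = "http://localhost:4000"  # assumed local proxy URL
API_KEY = "sk-1234"                   # placeholder virtual key

# Anthropic Messages API request shape, routed to a non-Anthropic model.
payload = {
    "model": "gpt-4.1",  # any model configured on your proxy
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Summarize this release in one sentence."}
    ],
}

resp = requests.post(
    f"{PROXY_BASE}/v1/messages",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
# Anthropic-style response: a list of content blocks.
print(resp.json()["content"][0]["text"])
```
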
Multi-Instance Rate Limiting Improvements​

LiteLLM v1.72.2.rc now accurately tracks rate limits across keys, models, teams, and users with 0 spillover.

This is a significant improvement over the previous version, which faced issues with leakage and spillover in high-traffic, multi-instance setups.

Key Changes:

  • Redis is now part of the rate limit check, instead of being a background sync. This ensures accuracy and reduces read/write operations during low activity.
  • LiteLLM now uses Lua scripts to ensure all checks are atomic.
  • In-memory caching uses Redis values. This prevents drift and reduces Redis queries once objects are over their limit.

These changes are currently behind the feature flag ENABLE_MULTI_INSTANCE_RATE_LIMITING=True. We plan to GA this in our next release, subject to feedback.

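To try the new rate limiter, set ENABLE_MULTI_INSTANCE_RATE_LIMITING=True (plus a shared Redis) on every proxy instance, then attach limits to a virtual key. A minimal sketch, assuming a proxy on localhost:4000 with admin master key sk-1234 (both placeholders) and the key-level rpm_limit / tpm_limit fields on /key/generate:

```python
import requests

PROXY_BASE = "http://localhost:4000"  # assumed proxy URL
MASTER_KEY = "sk-1234"                # placeholder admin master key

# Create a virtual key with request- and token-per-minute limits.
# With ENABLE_MULTI_INSTANCE_RATE_LIMITING=True on every instance (and a
# shared Redis), these limits are checked atomically across all instances.
resp = requests.post(
    f"{PROXY_BASE}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={"rpm_limit": 100, "tpm_limit": 100_000},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["key"])  # the newly issued virtual key
```
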
Audit Logs on UI​

This release introduces support for viewing audit logs in the UI. As a Proxy Admin, you can now check if and when a key was deleted, along with who performed the action.

LiteLLM tracks changes to the following entities and actions:

  • Entities: Keys, Teams, Users, Models
  • Actions: Create, Update, Delete, Regenerate

New Models / Updated Models​

Newly Added Models

| Provider | Model | Context Window | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|---|
| Anthropic | claude-4-opus-20250514 | 200K | $15.00 | $75.00 |
| Anthropic | claude-4-sonnet-20250514 | 200K | $3.00 | $15.00 |
| VertexAI, Google AI Studio | gemini-2.5-pro-preview-06-05 | 1M | $1.25 | $10.00 |
| OpenAI | codex-mini-latest | 200K | $1.50 | $6.00 |
| Cerebras | qwen-3-32b | 128K | $0.40 | $0.80 |
| SambaNova | DeepSeek-R1 | 32K | $5.00 | $7.00 |
| SambaNova | DeepSeek-R1-Distill-Llama-70B | 131K | $0.70 | $1.40 |

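The prices above are what LiteLLM uses for cost tracking on these models. A minimal sketch calling one of the newly added models via the Python SDK and computing its cost, assuming ANTHROPIC_API_KEY is set in your environment:

```python
import litellm

# Assumes ANTHROPIC_API_KEY is exported in the environment.
response = litellm.completion(
    model="anthropic/claude-4-sonnet-20250514",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(response.choices[0].message.content)

# Cost is computed from the per-token prices registered for the model.
print(litellm.completion_cost(completion_response=response))
```
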
Model Updates​

  • Anthropic
    • Cost tracking added for new Claude models - PR
      • claude-4-opus-20250514
      • claude-4-sonnet-20250514
    • Support for MCP tool calling with Anthropic models - PR (see the sketch after this list)
  • Google AI Studio
    • Google Gemini 2.5 Pro Preview 06-05 support - PR
    • Gemini streaming thinking content parsing with reasoning_content - PR
    • Support for no reasoning option for Gemini models - PR
    • URL context support for Gemini models - PR
    • Gemini embeddings-001 model prices and context window - PR
  • OpenAI
    • Cost tracking for codex-mini-latest - PR
  • Vertex AI
    • Cache token tracking on streaming calls - PR
    • Return response_id matching upstream response ID for stream and non-stream - PR
  • Cerebras
    • Cerebras/qwen-3-32b model pricing and context window - PR
  • HuggingFace
    • Fixed embeddings using non-default input_type - PR
  • DataRobot
    • New provider integration for enterprise AI workflows - PR
  • DeepSeek
    • DeepSeek R1 family model configuration via Together AI - PR
    • DeepSeek R1 pricing and context window configuration - PR

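As noted above, Anthropic models can now call remote MCP servers. A hedged sketch of what such a request might look like through the proxy's /v1/messages route, assuming the proxy passes Anthropic's mcp_servers parameter and beta header through unchanged (the proxy URL, key, server URL, and header value are placeholders; see the linked PR for the exact shape):

```python
import requests

PROXY_BASE = "http://localhost:4000"  # assumed proxy URL
API_KEY = "sk-1234"                   # placeholder virtual key

payload = {
    "model": "claude-4-sonnet-20250514",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "List the tools available on the example server."}
    ],
    # Anthropic's remote MCP connector format; assumed to be forwarded as-is.
    "mcp_servers": [
        {
            "type": "url",
            "url": "https://example-mcp-server.example.com/sse",  # placeholder
            "name": "example-server",
        }
    ],
}

resp = requests.post(
    f"{PROXY_BASE}/v1/messages",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        # Anthropic's MCP connector beta header; assumed to be required here.
        "anthropic-beta": "mcp-client-2025-04-04",
    },
    json=payload,
    timeout=60,
)
print(resp.json())
```
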
LLM API Endpoints​

  • Images API
    • Azure endpoint support for image endpoints - PR
  • Anthropic Messages API
    • Support for ALL LiteLLM Providers (OpenAI, Azure, Bedrock, Vertex, DeepSeek, etc.) on /v1/messages API Spec - PR
    • Performance improvements for /v1/messages route - PR
    • Return streaming usage statistics when using LiteLLM with Bedrock models - PR
  • Embeddings API
    • Provider-specific optional params handling for embedding calls - PR
    • Proper Sagemaker request attribute usage for embeddings - PR
  • Rerank API
    • New HuggingFace rerank provider support - PR, Guide (see the sketch after this list)

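For the new HuggingFace rerank provider referenced above, here is a minimal sketch using litellm.rerank; the model name below is an assumption for illustration, so substitute one from the linked guide:

```python
import litellm

# Hypothetical HuggingFace reranker model name; check the linked guide for
# supported models and any required HF credentials.
response = litellm.rerank(
    model="huggingface/BAAI/bge-reranker-base",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    top_n=1,
)
print(response.results)  # Cohere-style rerank results (index + relevance score)
```
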
Spend Tracking​

  • Added token tracking for Anthropic batch calls via the /anthropic passthrough route - PR

Management Endpoints / UI​

  • SSO/Authentication
    • SSO configuration endpoints and UI integration with persistent settings - PR
    • Update proxy admin ID role in DB + Handle SSO redirects with custom root path - PR
    • Support returning virtual key in custom auth - PR
    • User ID validation to ensure it is not an email or phone number - PR
  • Teams
    • Fixed Create/Update team member API 500 error - PR
    • Enterprise feature gating for RegenerateKeyModal in KeyInfoView - PR
  • SCIM
    • Fixed SCIM running patch operation case sensitivity - PR
  • General
    • Converted action buttons to sticky footer action buttons - PR
    • Custom Server Root Path - support for serving UI on a custom root path - Guide

Logging / Guardrails Integrations​

Logging​

  • S3
    • Async + Batched S3 Logging for improved performance - PR
  • DataDog
    • Add instrumentation for streaming chunks - PR
    • Add DD profiler to monitor the Python CPU profile of LiteLLM - PR
    • Bump DD trace version - PR
  • Prometheus
    • Pass custom metadata labels in litellm_total_token metrics - PR
  • GCS
    • Update GCSBucketBase to handle GSM project ID if passed - PR

Guardrails​

  • Presidio
    • Add presidio_language yaml configuration support for guardrails - PR

Performance / Reliability Improvements​

  • Performance Optimizations
    • Don't run auth on /health/liveliness endpoints - PR
    • Don't create 1 task for every hanging request alert - PR
    • Add debugging endpoint to track active /asyncio-tasks - PR
    • Make batch size for maximum retention in spend logs controllable - PR
    • Expose flag to disable token counter - PR
    • Support pipeline redis lpop for older redis versions - PR

Bug Fixes​

  • LLM API Fixes
    • Anthropic: Fix regression when passing file URLs to the 'file_id' parameter - PR
    • Vertex AI: Fix Vertex AI any_of issues for Description and Default - PR
    • Fix transcription model name mapping - PR
    • Image Generation: Fix None values in usage field for gpt-image-1 model responses - PR
    • Responses API: Fix _transform_responses_api_content_to_chat_completion_content not supporting the file content type - PR
    • Fireworks AI: Fix rate limit exception mapping - detect "rate limit" text in error messages - PR
  • Spend Tracking/Budgets
    • Respect user_header_name property for budget selection and user identification - PR
  • MCP Server
    • Remove duplicate server_id MCP config servers - PR
  • Function Calling
    • supports_function_calling works with llm_proxy models - PR
  • Knowledge Base
    • Fixed Knowledge Base Call returning error - PR

New Contributors​


Demo Instance​

Here's a Demo Instance to test changes:

Git Diff​