
Benchmarks

Benchmarks for LiteLLM Gateway (Proxy Server) tested against a fake OpenAI endpoint.

Use this config for testing:

```yaml
model_list:
  - model_name: "fake-openai-endpoint"
    litellm_params:
      model: openai/any
      api_base: https://your-fake-openai-endpoint.com/chat/completions
      api_key: "test"
```
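
To sanity-check this config before benchmarking, you can send one request through the proxy with the OpenAI Python SDK. This is a minimal sketch; it assumes the proxy is reachable at http://localhost:4000 and that `sk-1234567890` is a key the proxy accepts, so adjust both for your deployment.

```python
# Minimal smoke test through the proxy (assumed URL and key; adjust for your deployment).
from openai import OpenAI

client = OpenAI(
    api_key="sk-1234567890",           # replace with a key your proxy accepts
    base_url="http://localhost:4000",  # replace with your proxy's URL
)

response = client.chat.completions.create(
    model="fake-openai-endpoint",  # model_name from the config above
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```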

2 Instance LiteLLM Proxy

These tests measure baseline latency characteristics against the fake-openai-endpoint.

Performance Metrics

| Type | Name | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Current RPS |
|------|------|-------------|-------------|-------------|--------------|-------------|
| POST | /chat/completions | 200 | 630 | 1200 | 262.46 | 1035.7 |
| Custom | LiteLLM Overhead Duration (ms) | 12 | 29 | 43 | 14.74 | 1035.7 |
|  | Aggregated | 100 | 430 | 930 | 138.6 | 2071.4 |

4 Instances

| Type | Name | Median (ms) | 95%ile (ms) | 99%ile (ms) | Average (ms) | Current RPS |
|------|------|-------------|-------------|-------------|--------------|-------------|
| POST | /chat/completions | 100 | 150 | 240 | 111.73 | 1170 |
| Custom | LiteLLM Overhead Duration (ms) | 2 | 8 | 13 | 3.32 | 1170 |
|  | Aggregated | 77 | 130 | 180 | 57.53 | 2340 |

Key Findings

  • Doubling from 2 to 4 LiteLLM instances halves median latency: 200 ms → 100 ms.
  • High-percentile latencies drop even more sharply: P95 630 ms → 150 ms, P99 1,200 ms → 240 ms (recomputed in the sketch below).
  • Setting the number of workers equal to the CPU count gave the best performance in these tests.
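
As a quick sanity check, the improvements quoted above can be re-derived from the two tables. The sketch below only recomputes the percentage reductions from the table values; it introduces no new measurements.

```python
# Re-derive the Key Findings deltas from the benchmark tables above.
two_instances = {"median": 200, "p95": 630, "p99": 1200}   # ms
four_instances = {"median": 100, "p95": 150, "p99": 240}   # ms

for metric, before in two_instances.items():
    after = four_instances[metric]
    reduction = (1 - after / before) * 100
    print(f"{metric}: {before} ms -> {after} ms ({reduction:.0f}% lower)")

# median: 200 ms -> 100 ms (50% lower)
# p95: 630 ms -> 150 ms (76% lower)
# p99: 1200 ms -> 240 ms (80% lower)
```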

Machine Spec used for testing

Each machine deploying LiteLLM had the following specs:

  • 4 CPUs
  • 8 GB RAM

Locust Settings

  • 1,000 users
  • 500 user ramp up

How to measure LiteLLM Overhead

All responses from LiteLLM include the `x-litellm-overhead-duration-ms` header, which reports the latency overhead in milliseconds added by the LiteLLM Proxy.
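
For a quick one-off check of this header, a plain `requests` call works. This is a sketch under the same assumptions as the smoke test above (proxy at http://localhost:4000, `sk-1234567890` as an accepted key, and the `fake-openai-endpoint` model from the test config).

```python
# Read the LiteLLM overhead header from a single request.
import requests

response = requests.post(
    "http://localhost:4000/chat/completions",            # adjust to your proxy URL
    headers={"Authorization": "Bearer sk-1234567890"},   # adjust to a key your proxy accepts
    json={
        "model": "fake-openai-endpoint",
        "messages": [{"role": "user", "content": "ping"}],
    },
)
print(response.headers.get("x-litellm-overhead-duration-ms"), "ms of LiteLLM overhead")
```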

If you want to measure this in Locust, you can use the following code:

Locust Code for measuring LiteLLM Overhead
```python
import os
import uuid

from locust import HttpUser, task, between, events

# Collected LiteLLM overhead durations (ms), one entry per response
overhead_durations = []


@events.request.add_listener
def on_request(request_type, name, response_time, response_length, response=None, exception=None, context=None, **kwargs):
    # Skip events without an HTTP response (e.g. the custom metric fired below)
    if response is not None and hasattr(response, "headers"):
        overhead_duration = response.headers.get("x-litellm-overhead-duration-ms")
        if overhead_duration:
            try:
                duration_ms = float(overhead_duration)
                overhead_durations.append(duration_ms)
                # Report the overhead as its own entry in the Locust stats
                events.request.fire(
                    request_type="Custom",
                    name="LiteLLM Overhead Duration (ms)",
                    response_time=duration_ms,
                    response_length=0,
                    exception=None,
                    context={},
                )
            except (ValueError, TypeError):
                pass


class MyUser(HttpUser):
    wait_time = between(0.5, 1)  # Random wait time between requests

    def on_start(self):
        self.api_key = os.getenv("API_KEY", "sk-1234567890")
        self.client.headers.update({"Authorization": f"Bearer {self.api_key}"})

    @task
    def litellm_completion(self):
        # Unique UUID prefix and repeated text ensure no cache hits and a large context
        payload = {
            "model": "fake-openai-endpoint",  # model_name from the proxy config above
            "messages": [
                {
                    "role": "user",
                    "content": f"{uuid.uuid4()} This is a test there will be no cache hits and we'll fill up the context" * 150,
                }
            ],
            "user": "my-new-end-user-1",
        }
        response = self.client.post("chat/completions", json=payload)

        if response.status_code != 200:
            # Log error responses to error.txt
            with open("error.txt", "a") as error_log:
                error_log.write(response.text + "\n")
```
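
If you also want a summary of the collected overheads when the run finishes, you can append a `test_stop` listener to the same locustfile. This is an optional sketch that reuses the `overhead_durations` list defined above; in distributed mode each worker only sees its own requests.

```python
# Optional: summarize the collected overhead values when the test stops.
# Assumes this is appended to the locustfile above (reuses overhead_durations).
import statistics

from locust import events


@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    if overhead_durations:
        print(
            f"LiteLLM overhead: n={len(overhead_durations)}, "
            f"median={statistics.median(overhead_durations):.2f} ms, "
            f"mean={statistics.mean(overhead_durations):.2f} ms"
        )
```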

Logging Callbacks

GCS Bucket Logging

Using GCS Bucket logging has no measurable impact on RPS or median latency compared to the basic LiteLLM Proxy.

| Metric | Basic LiteLLM Proxy | LiteLLM Proxy with GCS Bucket Logging |
|--------|---------------------|---------------------------------------|
| RPS | 1133.2 | 1137.3 |
| Median Latency (ms) | 140 | 138 |

LangSmith Logging

Using LangSmith logging has no measurable impact on RPS or median latency compared to the basic LiteLLM Proxy.

| Metric | Basic LiteLLM Proxy | LiteLLM Proxy with LangSmith |
|--------|---------------------|------------------------------|
| RPS | 1133.2 | 1135 |
| Median Latency (ms) | 140 | 132 |