2 posts tagged with "performance"

Your Middleware Could Be a Bottleneck

Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM
Ryan Crabbe
Performance Engineer, LiteLLM

How we improved LiteLLM proxy latency and throughput by replacing a single, simple middleware base class


Our Setup

The LiteLLM proxy server has two middleware layers. The first is Starlette's CORSMiddleware (re-exported by FastAPI), which is a pure ASGI middleware. Then we have a simple BaseHTTPMiddleware called PrometheusAuthMiddleware.

The job of PrometheusAuthMiddleware is to authenticate requests to the /metrics endpoint. It's off by default; you enable it with a flag in your proxy config:

Proxy config flag
litellm_settings:
  require_auth_for_metrics_endpoint: true

The middleware checks two things: is the request hitting /metrics, and is auth even enabled? If either check fails, which is the case for the vast majority of requests, it just passes the request through unchanged.

PrometheusAuthMiddleware source
class PrometheusAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        if self._is_prometheus_metrics_endpoint(request):
            if self._should_run_auth_on_metrics_endpoint() is True:
                try:
                    await user_api_key_auth(request=request, api_key=...)
                except Exception as e:
                    return JSONResponse(status_code=401, content=...)
        response = await call_next(request)
        return response

    @staticmethod
    def _is_prometheus_metrics_endpoint(request: Request):
        if "/metrics" in request.url.path:
            return True
        return False

Looks harmless. Subclass BaseHTTPMiddleware, implement dispatch(), and you're done. This is what you will see in Starlette's documentation[1].
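The subtitle gives away the fix: replace the BaseHTTPMiddleware base class. As a point of comparison, here is a minimal sketch of the same pass-through check written as a pure ASGI middleware, which hands the raw ASGI events straight to the inner app instead of re-wrapping each request and response. The class name, constructor, and wiring here are our illustration, not LiteLLM's actual code, and the auth call is elided:

```python
import asyncio

class PrometheusAuthASGIMiddleware:
    """Sketch of a pure ASGI middleware (hypothetical name/shape).

    Unlike BaseHTTPMiddleware, there is no Request/Response object
    built per call; the common case is a single awaited pass-through.
    """

    def __init__(self, app, require_auth: bool = False):
        self.app = app
        self.require_auth = require_auth

    async def __call__(self, scope, receive, send):
        if (
            scope["type"] == "http"
            and self.require_auth
            and "/metrics" in scope.get("path", "")
        ):
            # Auth would run here (LiteLLM's user_api_key_auth); on
            # failure, a 401 would be emitted directly via `send`.
            pass
        # Fast path: forward the raw ASGI call to the inner app.
        await self.app(scope, receive, send)
```

The point of the comparison is structural: the hot path is one conditional plus one awaited call, with no intermediate response assembly.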

Achieving Sub-Millisecond Proxy Overhead

Alexsander Hamir
Performance Engineer, LiteLLM
Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM

Sidecar architecture: Python control plane vs. sidecar hot path

Introduction

Our Q1 performance target is to aggressively move toward sub-millisecond proxy overhead on a single instance with 4 CPUs and 8 GB of RAM, and to continue pushing that boundary over time. Our broader goal is to make LiteLLM inexpensive to deploy, lightweight, and fast. This post outlines the architectural direction behind that effort.

Proxy overhead refers to the latency introduced by LiteLLM itself, independent of the upstream provider.

To measure it, we run the same workload directly against the provider and through LiteLLM at identical QPS (for example, 1,000 QPS) and compare the latency delta. To reduce noise, the load generator, LiteLLM, and a mock LLM endpoint all run on the same machine, ensuring the difference reflects proxy overhead rather than network latency.
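The measurement described above reduces to a latency delta between two runs of the same workload. A small sketch of that arithmetic, assuming we compare medians of per-request latencies (the helper name and the choice of the median are our assumptions, not LiteLLM's published harness):

```python
from statistics import median

def proxy_overhead_ms(direct_ms: list[float], proxied_ms: list[float]) -> float:
    """Proxy overhead = latency through LiteLLM minus latency hitting the
    mock endpoint directly, measured at the same QPS on the same machine
    so network latency cancels out. (Illustrative helper, not LiteLLM's
    actual benchmark code.)"""
    return median(proxied_ms) - median(direct_ms)

# e.g. direct median 2.0 ms, through-proxy median 2.8 ms
# -> roughly 0.8 ms of overhead attributable to the proxy itself
```

Running the load generator, LiteLLM, and the mock endpoint on one machine is what makes this subtraction meaningful: both runs share the same network and scheduling noise.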