OpenAI Moderation

Overview

| Property | Details |
|----------|---------|
| Description | Use OpenAI's built-in Moderation API to detect and block harmful content, including hate speech, harassment, self-harm, sexual content, and violence. |
| Provider | OpenAI Moderation API |
| Supported Actions | BLOCK (raises an HTTP 400 exception when violations are detected) |
| Supported Modes | pre_call, during_call, post_call |
| Streaming Support | ✅ Full support for streaming responses |
| API Requirements | OpenAI API key |

Quick Start

1. Define Guardrails on your LiteLLM config.yaml

Define your guardrails under the guardrails section:

config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY

guardrails:
  - guardrail_name: "openai-moderation-pre"
    litellm_params:
      guardrail: openai_moderation
      mode: "pre_call"
      api_key: os.environ/OPENAI_API_KEY     # Optional if already set globally
      model: "omni-moderation-latest"        # Optional, defaults to omni-moderation-latest
      api_base: "https://api.openai.com/v1"  # Optional, defaults to OpenAI API

Supported values for mode

  • pre_call: runs before the LLM call, on user input
  • during_call: runs on user input in parallel with the LLM call; the response is not returned until the guardrail check completes
  • post_call: runs after the LLM call, on the LLM response

Supported OpenAI Moderation Models

  • omni-moderation-latest (default) - Latest multimodal moderation model
  • text-moderation-latest - Latest text-only moderation model
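
To preview what either model flags before wiring it into the gateway, you can call OpenAI's Moderation API directly. A minimal sketch using the official openai Python SDK (the input string matches the test request used later in this guide):

Direct moderation check (Python)
import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Run the same check the guardrail performs, using the default model.
result = client.moderations.create(
    model="omni-moderation-latest",  # or "text-moderation-latest"
    input="I hate all people and want to hurt them",
)

moderation = result.results[0]
print("flagged:", moderation.flagged)  # boolean verdict
print(moderation.category_scores)      # 0.0-1.0 confidence score per category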

2. Start LiteLLM Gateway

litellm --config config.yaml --detailed_debug

3. Test request

Expect this to fail since the request contains harmful content:

curl -i http://0.0.0.0:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "I hate all people and want to hurt them"}
    ],
    "guardrails": ["openai-moderation-pre"]
  }'

Expected response on failure:

{
  "error": {
    "message": {
      "error": "Violated OpenAI moderation policy",
      "moderation_result": {
        "violated_categories": ["hate", "violence"],
        "category_scores": {
          "hate": 0.95,
          "violence": 0.87,
          "harassment": 0.12,
          "self-harm": 0.01,
          "sexual": 0.02
        }
      }
    },
    "type": "None",
    "param": "None",
    "code": "400"
  }
}
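
The same test can be driven from Python. A minimal sketch using the openai SDK pointed at the gateway, reusing the local URL and virtual key from the curl example; the SDK surfaces the block as a BadRequestError:

Test request (Python)
import openai

client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000/v1")

try:
    client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "I hate all people and want to hurt them"}],
        # "guardrails" is not part of the OpenAI spec, so pass it via extra_body.
        extra_body={"guardrails": ["openai-moderation-pre"]},
    )
except openai.BadRequestError as e:
    # e.body carries the moderation details shown in the expected response above.
    print(e.status_code, e.body)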

Advanced Configuration

Multiple Guardrails for Input and Output

You can configure separate guardrails for user input and LLM responses:

Multiple Guardrails Config
guardrails:
  - guardrail_name: "openai-moderation-input"
    litellm_params:
      guardrail: openai_moderation
      mode: "pre_call"
      api_key: os.environ/OPENAI_API_KEY

  - guardrail_name: "openai-moderation-output"
    litellm_params:
      guardrail: openai_moderation
      mode: "post_call"
      api_key: os.environ/OPENAI_API_KEY
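
As in the curl example earlier, which guardrails run can be chosen per request by listing their names. A short sketch applying both guardrails to one request; the names must match the guardrail_name entries above:

Input and output guardrails per request (Python)
import openai

client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000/v1")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize our acceptable-use policy"}],
    # Moderate the user input (pre_call) and the LLM response (post_call).
    extra_body={"guardrails": ["openai-moderation-input", "openai-moderation-output"]},
)
print(response.choices[0].message.content)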

Custom API Configuration

Configure custom OpenAI API endpoints or different models:

Custom API Config
guardrails:
  - guardrail_name: "openai-moderation-custom"
    litellm_params:
      guardrail: openai_moderation
      mode: "pre_call"
      api_key: os.environ/OPENAI_API_KEY
      api_base: "https://your-custom-openai-endpoint.com/v1"
      model: "text-moderation-latest"

Streaming Support

The OpenAI Moderation guardrail fully supports streaming responses. When used in post_call mode, it will:

  1. Collect all streaming chunks
  2. Assemble the complete response
  3. Apply moderation to the full content
  4. Block the entire stream if violations are detected
  5. Return the original stream if content is safe

Streaming Config
guardrails:
  - guardrail_name: "openai-moderation-streaming"
    litellm_params:
      guardrail: openai_moderation
      mode: "post_call"  # Works with streaming responses
      api_key: os.environ/OPENAI_API_KEY
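
A minimal streaming sketch against this config. Exactly how a blocked stream surfaces (an HTTP error up front vs. an error raised while iterating) can vary, so the sketch catches the SDK's generic APIError:

Streaming test (Python)
import openai

client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000/v1")

try:
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Write a short product blurb"}],
        stream=True,
        extra_body={"guardrails": ["openai-moderation-streaming"]},
    )
    # Safe content: the original chunks are returned unchanged.
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
except openai.APIError as e:
    # Violating content: the assembled response fails moderation and the stream is blocked.
    print("Blocked by moderation:", e)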

Content Categories

The OpenAI Moderation API detects the following categories of harmful content:

| Category | Description |
|----------|-------------|
| hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste |
| harassment | Content that harasses, bullies, or intimidates an individual |
| self-harm | Content that promotes, encourages, or depicts acts of self-harm |
| sexual | Content meant to arouse sexual excitement or promote sexual services |
| violence | Content that depicts death, violence, or physical injury |

Each category is evaluated with both a boolean flag and a confidence score (0.0 to 1.0).
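
The guardrail blocks based on OpenAI's boolean category flags, but if you post-process the error payload yourself you can apply stricter per-category thresholds. A client-side sketch, assuming the category_scores shape from the expected response above (the thresholds are illustrative, not guardrail config options):

Client-side threshold check (Python)
# Example thresholds per category; tune to your own risk tolerance.
THRESHOLDS = {"hate": 0.5, "harassment": 0.5, "self-harm": 0.3,
              "sexual": 0.5, "violence": 0.5}

def over_threshold(category_scores: dict[str, float]) -> list[str]:
    """Return categories whose confidence score meets or exceeds its threshold."""
    return [cat for cat, score in category_scores.items()
            if score >= THRESHOLDS.get(cat, 0.5)]

print(over_threshold({"hate": 0.95, "violence": 0.87, "harassment": 0.12,
                      "self-harm": 0.01, "sexual": 0.02}))
# -> ['hate', 'violence']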

Error Handling

When content violates OpenAI's moderation policy:

  • HTTP Status: 400 Bad Request
  • Error Type: HTTPException
  • Error Details: Includes violated categories and confidence scores
  • Behavior: Request is immediately blocked

Best Practices

1. Use Pre-call for User Input

guardrails:
  - guardrail_name: "input-moderation"
    litellm_params:
      guardrail: openai_moderation
      mode: "pre_call"  # Block harmful user inputs early

2. Use Post-call for LLM Responses

guardrails:
  - guardrail_name: "output-moderation"
    litellm_params:
      guardrail: openai_moderation
      mode: "post_call"  # Ensure LLM responses are safe

3. Combine with Other Guardrails

guardrails:
  - guardrail_name: "openai-moderation"
    litellm_params:
      guardrail: openai_moderation
      mode: "pre_call"

  - guardrail_name: "custom-pii-detection"
    litellm_params:
      guardrail: presidio
      mode: "pre_call"

Troubleshooting

Common Issues

  1. Invalid API Key: Ensure your OpenAI API key is correctly set

    export OPENAI_API_KEY="sk-your-actual-key"
  2. Rate Limiting: OpenAI Moderation API has rate limits. Monitor usage in high-volume scenarios.

  3. Network Issues: Verify connectivity to OpenAI's API endpoints (see the check below).
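
For the last two items, a quick connectivity check, assuming the gateway runs locally on port 4000 and that /health/liveliness is available as the proxy's unauthenticated liveness probe:

Connectivity check (Python)
import requests

# Gateway liveness probe (no API key required).
print(requests.get("http://0.0.0.0:4000/health/liveliness", timeout=5).text)

# OpenAI reachability: a 401 without a key still proves the endpoint is reachable.
print(requests.get("https://api.openai.com/v1/models", timeout=5).status_code)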

Debug Mode

Enable detailed logging to troubleshoot issues:

litellm --config config.yaml --detailed_debug

Look for logs starting with OpenAI Moderation: to trace guardrail execution.

API Costs

The OpenAI Moderation API is free to use for content policy compliance, making it a cost-effective guardrail option compared to commercial moderation services.

Need Help?

For additional support: