
LlamaGate

Overview​

| Property | Details |
|----------|---------|
| Description | LlamaGate is an OpenAI-compatible API gateway for open-source LLMs with credit-based billing. Access 26+ open-source models including Llama, Mistral, DeepSeek, and Qwen at competitive prices. |
| Provider Route on LiteLLM | `llamagate/` |
| Link to Provider Doc | LlamaGate Documentation ↗ |
| Base URL | `https://api.llamagate.dev/v1` |
| Supported Operations | `/chat/completions`, `/embeddings` |

What is LlamaGate?​

LlamaGate provides access to open-source LLMs through an OpenAI-compatible API:

  • 26+ Open-Source Models: Llama 3.1/3.2, Mistral, Qwen, DeepSeek R1, and more
  • OpenAI-Compatible API: Drop-in replacement for OpenAI SDK
  • Vision Models: Qwen VL, LLaVA, olmOCR, UI-TARS for multimodal tasks
  • Reasoning Models: DeepSeek R1, OpenThinker for complex problem-solving
  • Code Models: CodeLlama, DeepSeek Coder, Qwen Coder, StarCoder2
  • Embedding Models: Nomic, Qwen3 Embedding for RAG and search
  • Competitive Pricing: $0.02-$0.55 per 1M tokens

Required Variables​

Environment Variables

```python
import os

os.environ["LLAMAGATE_API_KEY"] = ""  # your LlamaGate API key
```

Get your API key from llamagate.dev.

Supported Models​

General Purpose​

| Model | Model ID |
|-------|----------|
| Llama 3.1 8B | `llamagate/llama-3.1-8b` |
| Llama 3.2 3B | `llamagate/llama-3.2-3b` |
| Mistral 7B v0.3 | `llamagate/mistral-7b-v0.3` |
| Qwen 3 8B | `llamagate/qwen3-8b` |
| Dolphin 3 8B | `llamagate/dolphin3-8b` |

Reasoning Models​

| Model | Model ID |
|-------|----------|
| DeepSeek R1 8B | `llamagate/deepseek-r1-8b` |
| DeepSeek R1 Distill Qwen 7B | `llamagate/deepseek-r1-7b-qwen` |
| OpenThinker 7B | `llamagate/openthinker-7b` |

Code Models​

| Model | Model ID |
|-------|----------|
| Qwen 2.5 Coder 7B | `llamagate/qwen2.5-coder-7b` |
| DeepSeek Coder 6.7B | `llamagate/deepseek-coder-6.7b` |
| CodeLlama 7B | `llamagate/codellama-7b` |
| CodeGemma 7B | `llamagate/codegemma-7b` |
| StarCoder2 7B | `llamagate/starcoder2-7b` |

Vision Models​

| Model | Model ID |
|-------|----------|
| Qwen 3 VL 8B | `llamagate/qwen3-vl-8b` |
| LLaVA 1.5 7B | `llamagate/llava-7b` |
| Gemma 3 4B | `llamagate/gemma3-4b` |
| olmOCR 7B | `llamagate/olmocr-7b` |
| UI-TARS 1.5 7B | `llamagate/ui-tars-7b` |

Embedding Models​

| Model | Model ID |
|-------|----------|
| Nomic Embed Text | `llamagate/nomic-embed-text` |
| Qwen 3 Embedding 8B | `llamagate/qwen3-embedding-8b` |
| EmbeddingGemma 300M | `llamagate/embeddinggemma-300m` |

Usage - LiteLLM Python SDK​

Non-streaming​

LlamaGate Non-streaming Completion

```python
import os
from litellm import completion

os.environ["LLAMAGATE_API_KEY"] = ""  # your LlamaGate API key

messages = [{"role": "user", "content": "What is the capital of France?"}]

# LlamaGate call
response = completion(
    model="llamagate/llama-3.1-8b",
    messages=messages,
)

print(response)
```

Streaming​

LlamaGate Streaming Completion

```python
import os
from litellm import completion

os.environ["LLAMAGATE_API_KEY"] = ""  # your LlamaGate API key

messages = [{"role": "user", "content": "Write a short poem about AI"}]

# LlamaGate call with streaming
response = completion(
    model="llamagate/llama-3.1-8b",
    messages=messages,
    stream=True,
)

for chunk in response:
    print(chunk)
```
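Streamed chunks follow the OpenAI delta format, so the full reply is the concatenation of each chunk's `delta` content. A minimal sketch of that accumulation, using mock dicts in place of the real chunk objects returned by `completion(..., stream=True)`:

```python
# Sketch: assembling streamed deltas into the full response text.
# The dicts below are mocks in the OpenAI streaming format; real chunks
# come from the `completion(..., stream=True)` iterator.
mock_chunks = [
    {"choices": [{"delta": {"content": "Silicon "}}]},
    {"choices": [{"delta": {"content": "minds "}}]},
    {"choices": [{"delta": {"content": "dream."}}]},
    {"choices": [{"delta": {}}]},  # the final chunk may carry no content
]

text = "".join(
    chunk["choices"][0]["delta"].get("content") or ""
    for chunk in mock_chunks
)
print(text)  # Silicon minds dream.
```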

Vision​

LlamaGate Vision Completion

```python
import os
from litellm import completion

os.environ["LLAMAGATE_API_KEY"] = ""  # your LlamaGate API key

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }
]

# LlamaGate vision call
response = completion(
    model="llamagate/qwen3-vl-8b",
    messages=messages,
)

print(response)
```
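To send a local image rather than a URL, OpenAI-compatible vision endpoints generally accept a base64 data URL in the `image_url` field. A sketch of building that message (the byte string is a placeholder standing in for a real file read such as `open("photo.jpg", "rb").read()`):

```python
import base64

# Sketch: embedding a local image as a base64 data URL.
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-bytes"  # placeholder, not a real JPEG
b64 = base64.b64encode(image_bytes).decode("utf-8")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ],
}
print(message["content"][1]["image_url"]["url"][:30])
```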

Embeddings​

LlamaGate Embeddings

```python
import os
from litellm import embedding

os.environ["LLAMAGATE_API_KEY"] = ""  # your LlamaGate API key

# LlamaGate embedding call
response = embedding(
    model="llamagate/nomic-embed-text",
    input=["Hello world", "How are you?"],
)

print(response)
```
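For the RAG and search use cases these models target, the typical next step is ranking documents by cosine similarity against a query embedding. A minimal sketch, with short hand-written vectors standing in for real embedding output:

```python
import math

# Sketch: ranking documents by cosine similarity. The 3-dim vectors here
# stand in for real embedding output from the embedding() call above.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.8, 0.2]
docs = {"greeting": [0.1, 0.9, 0.1], "farewell": [0.9, 0.1, 0.3]}
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # greeting
```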

Usage - LiteLLM Proxy Server​

1. Save key in your environment​

```bash
export LLAMAGATE_API_KEY=""
```

2. Start the proxy​

```yaml
model_list:
  - model_name: llama-3.1-8b
    litellm_params:
      model: llamagate/llama-3.1-8b
      api_key: os.environ/LLAMAGATE_API_KEY
  - model_name: deepseek-r1
    litellm_params:
      model: llamagate/deepseek-r1-8b
      api_key: os.environ/LLAMAGATE_API_KEY
  - model_name: qwen-coder
    litellm_params:
      model: llamagate/qwen2.5-coder-7b
      api_key: os.environ/LLAMAGATE_API_KEY
```

Supported OpenAI Parameters​

LlamaGate supports all standard OpenAI-compatible parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `messages` | array | Required. Array of message objects with `role` and `content` |
| `model` | string | Required. Model ID |
| `stream` | boolean | Optional. Enable streaming responses |
| `temperature` | float | Optional. Sampling temperature (0-2) |
| `top_p` | float | Optional. Nucleus sampling parameter |
| `max_tokens` | integer | Optional. Maximum tokens to generate |
| `frequency_penalty` | float | Optional. Penalize frequent tokens |
| `presence_penalty` | float | Optional. Penalize tokens based on presence |
| `stop` | string/array | Optional. Stop sequences |
| `tools` | array | Optional. List of available tools/functions |
| `tool_choice` | string/object | Optional. Control tool/function calling |
| `response_format` | object | Optional. JSON mode or JSON schema |

Pricing​

LlamaGate offers competitive per-token pricing:

| Model Category | Input (per 1M) | Output (per 1M) |
|----------------|----------------|-----------------|
| Embeddings | $0.02 | - |
| Small (3-4B) | $0.03-$0.04 | $0.08 |
| Medium (7-8B) | $0.03-$0.15 | $0.05-$0.55 |
| Code Models | $0.06-$0.10 | $0.12-$0.20 |
| Reasoning | $0.08-$0.10 | $0.15-$0.20 |
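Since rates are quoted per 1M tokens, a request's cost is each token count times its rate, divided by 1,000,000. A sketch using the Medium-tier low end ($0.03 in / $0.05 out) purely for illustration; check the table for your actual model's rates:

```python
# Sketch: estimating request cost from per-1M-token rates.
def estimate_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Rates are USD per 1M tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 2,000 prompt tokens + 500 completion tokens at $0.03 / $0.05 per 1M
cost = estimate_cost(2_000, 500, in_rate=0.03, out_rate=0.05)
print(f"${cost:.6f}")  # $0.000085
```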

Additional Resources​