Skip to main content

Huggingface

LiteLLM supports the following types of Huggingface models:

Usage​

Open In Colab

You need to tell LiteLLM when you're calling Huggingface. This is done by adding the "huggingface/" prefix to model, example completion(model="huggingface/<model_name>",...).

By default, LiteLLM will assume a huggingface call follows the TGI format.

import os 
from litellm import completion

# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]

# e.g. Call 'WizardLM/WizardCoder-Python-34B-V1.0' hosted on HF Inference endpoints
response = completion(
model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0",
messages=messages,
api_base="https://my-endpoint.huggingface.cloud"
)

print(response)

Streaming​

Open In Colab

You need to tell LiteLLM when you're calling Huggingface. This is done by adding the "huggingface/" prefix to model, example completion(model="huggingface/<model_name>",...).

import os 
from litellm import completion

# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]

# e.g. Call 'facebook/blenderbot-400M-distill' hosted on HF Inference endpoints
response = completion(
model="huggingface/facebook/blenderbot-400M-distill",
messages=messages,
api_base="https://my-endpoint.huggingface.cloud",
stream=True
)

print(response)
for chunk in response:
print(chunk)

Embedding​

LiteLLM supports Huggingface's text-embedding-inference format.

from litellm import embedding
import os
os.environ['HUGGINGFACE_API_KEY'] = ""
response = embedding(
model='huggingface/microsoft/codebert-base',
input=["good morning from litellm"]
)

Advanced​

Setting API KEYS + API BASE​

If required, you can set the api key + api base, set it in your os environment. Code for how it's sent

import os 
os.environ["HUGGINGFACE_API_KEY"] = ""
os.environ["HUGGINGFACE_API_BASE"] = ""

Viewing Log probs​

Using decoder_input_details - OpenAI echo​

The echo param is supported by OpenAI Completions - Use litellm.text_completion() for this

from litellm import text_completion
response = text_completion(
model="huggingface/bigcode/starcoder",
prompt="good morning",
max_tokens=10, logprobs=10,
echo=True
)

Output​

{
"id":"chatcmpl-3fc71792-c442-4ba1-a611-19dd0ac371ad",
"object":"text_completion",
"created":1698801125.936519,
"model":"bigcode/starcoder",
"choices":[
{
"text":", I'm going to make you a sand",
"index":0,
"logprobs":{
"tokens":[
"good",
" morning",
",",
" I",
"'m",
" going",
" to",
" make",
" you",
" a",
" s",
"and"
],
"token_logprobs":[
"None",
-14.96875,
-2.2285156,
-2.734375,
-2.0957031,
-2.0917969,
-0.09429932,
-3.1132812,
-1.3203125,
-1.2304688,
-1.6201172,
-0.010292053
]
},
"finish_reason":"length"
}
],
"usage":{
"completion_tokens":9,
"prompt_tokens":2,
"total_tokens":11
}
}

Models with Prompt Formatting​

For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.

Models with natively Supported Prompt Templates​

Model NameWorks for ModelsFunction CallRequired OS Variables
mistralai/Mistral-7B-Instruct-v0.1mistralai/Mistral-7B-Instruct-v0.1completion(model='huggingface/mistralai/Mistral-7B-Instruct-v0.1', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']
meta-llama/Llama-2-7b-chatAll meta-llama llama2 chat modelscompletion(model='huggingface/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']
tiiuae/falcon-7b-instructAll falcon instruct modelscompletion(model='huggingface/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']
mosaicml/mpt-7b-chatAll mpt chat modelscompletion(model='huggingface/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']
codellama/CodeLlama-34b-Instruct-hfAll codellama instruct modelscompletion(model='huggingface/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']
WizardLM/WizardCoder-Python-34B-V1.0All wizardcoder modelscompletion(model='huggingface/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']
Phind/Phind-CodeLlama-34B-v2All phind-codellama modelscompletion(model='huggingface/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")os.environ['HUGGINGFACE_API_KEY']

What if we don't support a model you need? You can also specify you're own custom prompt formatting, in case we don't have your model covered yet.

Does this mean you have to specify a prompt for all models? No. By default we'll concatenate your message content to make a prompt.

Default Prompt Template

def default_pt(messages):
return " ".join(message["content"] for message in messages)

Code for how prompt formats work in LiteLLM

Custom prompt templates​

# Create your own custom prompt template works 
litellm.register_prompt_template(
model="togethercomputer/LLaMA-2-7B-32K",
roles={
"system": {
"pre_message": "[INST] <<SYS>>\n",
"post_message": "\n<</SYS>>\n [/INST]\n"
},
"user": {
"pre_message": "[INST] ",
"post_message": " [/INST]\n"
},
"assistant": {
"post_message": "\n"
}
}
)

def test_huggingface_custom_model():
model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
print(response['choices'][0]['message']['content'])
return response

test_huggingface_custom_model()

Implementation Code

Deploying a model on huggingface​

You can use any chat/text model from Hugging Face with the following steps:

  • Copy your model id/url from Huggingface Inference Endpoints
  • Set it as your model name
  • Set your HUGGINGFACE_API_KEY as an environment variable

Need help deploying a model on huggingface? Check out this guide.

output

Same as the OpenAI format, but also includes logprobs. See the code

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "\ud83d\ude31\n\nComment: @SarahSzabo I'm",
"role": "assistant",
"logprobs": -22.697942825499993
}
}
],
"created": 1693436637.38206,
"model": "https://ji16r2iys9a8rjk2.us-east-1.aws.endpoints.huggingface.cloud",
"usage": {
"prompt_tokens": 14,
"completion_tokens": 11,
"total_tokens": 25
}
}

FAQ

Does this support stop sequences?

Yes, we support stop sequences - and you can pass as many as allowed by Huggingface (or any provider!)

How do you deal with repetition penalty?

We map the presence penalty parameter in openai to the repetition penalty parameter on Huggingface. See code.

We welcome any suggestions for improving our Huggingface integration - Create an issue/Join the Discord!