Works with: CodeLlama, Starcoder, Replit-Code-v1, Phind-CodeLlama, GPT-4, Claude-2, etc.
LIVE DEMO - https://litellm.ai/playground
- Default model: Uses Together AI's CodeLlama to answer coding questions, with GPT-4 + Claude-2 as backups (you can easily switch this to any model from Huggingface, Replicate, Cohere, AI21, Azure, OpenAI, etc.)
- Together AI model + keys - https://together.ai/
- Guardrail prompts:
system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."
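A minimal sketch of how a guardrail prompt like this gets combined with a user question when calling litellm directly (the model string is illustrative and may need a provider prefix for Together AI):

from litellm import completion

system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."

# The guardrail prompt goes in as a system message ahead of the user's question
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "write me a function to print hello world"},
]

response = completion(model="togethercomputer/CodeLlama-34b-Instruct", messages=messages)
print(response['choices'][0]['message']['content'])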
- Consistent Input/Output Format (see the first sketch after this list)
  - Call all models using the OpenAI format - completion(model, messages)
  - Text responses will always be available at ['choices'][0]['message']['content']
  - Stream responses will always be available at ['choices'][0]['delta']['content']
- Error Handling - Uses model fallbacks (if CodeLlama fails, try GPT-4) with cooldowns and retries (see the fallback sketch below)
- Prompt Tracking - Integrates with PromptLayer for model + prompt tracking (see the PromptLayer sketch below)
  - Example: Logs sent to PromptLayer
- Token Usage & Spend - Track input + completion tokens used + spend per model - https://docs.litellm.ai/docs/token_usage
- Caching - Provides in-memory cache + GPT-Cache integration for more advanced usage (see the caching sketch below) - https://docs.litellm.ai/docs/caching/gpt_cache
- Streaming & Async Support - Return generators to stream text responses - TEST IT 👉 https://litellm.ai/
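The sketches below illustrate the features above. First, the uniform call pattern: the same completion(model, messages) signature works across providers, text and token usage come back in one shape, and stream=True returns a generator (model names are placeholders; dict-style access follows the output format described above):

from litellm import completion

messages = [{"role": "user", "content": "write me a function to print hello world"}]

# Same call signature regardless of the underlying provider
response = completion(model="gpt-4", messages=messages)

# Text responses are always at ['choices'][0]['message']['content']
print(response['choices'][0]['message']['content'])

# Token usage (prompt, completion, total) is returned on every call
print(response['usage'])

# Streaming: each chunk exposes text at ['choices'][0]['delta']['content']
for chunk in completion(model="claude-2", messages=messages, stream=True):
    token = chunk['choices'][0]['delta']['content']
    if token:
        print(token, end="")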
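The fallback behaviour can be pictured as a loop like the one below. This is only a hand-rolled sketch of the idea, not the server's actual implementation; the model order, retry count, and cooldown length are assumptions:

import time
import litellm

# Try CodeLlama first, then fall back to GPT-4, then Claude-2
FALLBACK_MODELS = ["togethercomputer/CodeLlama-34b-Instruct", "gpt-4", "claude-2"]

def completion_with_fallbacks(messages, retries=2, cooldown=5):
    for model in FALLBACK_MODELS:
        for _ in range(retries):
            try:
                return litellm.completion(model=model, messages=messages)
            except Exception:
                # brief cooldown before retrying or moving on to the next model
                time.sleep(cooldown)
    raise RuntimeError("All fallback models failed")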
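For PromptLayer logging, litellm supports success callbacks; a minimal sketch might look like this (it assumes a PROMPTLAYER_API_KEY environment variable and the "promptlayer" callback name - check the litellm docs before relying on it):

import os
import litellm
from litellm import completion

os.environ["PROMPTLAYER_API_KEY"] = "YOUR_PROMPTLAYER_API_KEY"

# Log every successful completion (model + prompt) to PromptLayer
litellm.success_callback = ["promptlayer"]

completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "write me a function to print hello world"}],
)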
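And the in-memory caching idea, approximated by hand (the real GPT-Cache integration is configured through litellm - see the caching docs link above):

import json
from litellm import completion

_cache = {}

def cached_completion(model, messages):
    # Key on model + messages so repeated identical questions reuse the earlier response
    key = (model, json.dumps(messages, sort_keys=True))
    if key not in _cache:
        _cache[key] = completion(model=model, messages=messages)
    return _cache[key]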
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use Llama2, GPT-4, Claude-2, etc.
This API endpoint accepts all inputs in raw JSON and expects the following inputs:
- prompt (string, required): The user's coding-related question
- Additional optional parameters: temperature, functions, function_call, top_p, n, stream. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
{
  "prompt": "write me a function to print hello world"
}
import requests
import json

# POST a coding question to the local liteLLM server
url = "http://localhost:4000/chat/completions"

payload = json.dumps({
  "prompt": "write me a function to print hello world"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": ".\n\n```\ndef print_hello_world():\n print(\"hello world\")\n",
        "role": "assistant"
      }
    }
  ],
  "created": 1693279694.6474009,
  "model": "togethercomputer/CodeLlama-34b-Instruct",
  "usage": {
    "completion_tokens": 14,
    "prompt_tokens": 28,
    "total_tokens": 42
  }
}
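Continuing the requests example above, the generated code can be pulled straight out of that JSON:

# Extract only the assistant's reply from the server's JSON response
data = response.json()
print(data["choices"][0]["message"]["content"])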
- Clone the liteLLM repository to your local machine:
git clone https://github.com/BerriAI/litellm-CodeLlama-server
- Install the required dependencies using pip:
pip install -r requirements.txt
- Set your LLM API keys
os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY" or set OPENAI_API_KEY in your .env file
- Run the server:
python main.py
- Quick Start: Deploy on Railway
- GCP, AWS, Azure
  This project includes a Dockerfile, allowing you to build and deploy a Docker image on the provider of your choice.
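As a rough guide, a local build and run could look like the following (the image name is made up, and the port mapping assumes the server listens on 4000 as in the example above):

docker build -t litellm-codellama-server .
docker run -p 4000:4000 litellm-codellama-server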
- Our calendar 👋
- Community Discord 💭
- Our numbers 📞 +1 (770) 878-3106 / +1 (412) 618-6238
- Our emails ✉️ [email protected] / [email protected]
- Implement user-based rate-limiting
- Spending controls per project - expose key creation endpoint
- Store keys in a DB -> map created keys to their alias (i.e. project name)
- Easily add new models as backups / as the entry-point (add this to the available model list)