ScaleLLM

An efficient LLM Inference solution

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama3, Gemma, Bloom, GPT-NeoX, and more.

ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our Roadmap for more details.

News:

[06/2024] - ScaleLLM is now available on PyPI. You can install it using pip install scalellm.
[03/2024] - Advanced features support for ✅ CUDA graph, ✅ prefix cache, ✅ chunked prefill and ✅ speculative decoding.
[11/2023] - First release with support for popular open-source models.

Key Features

High Efficiency: Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like Flash Attention, Paged Attention, Continuous batching, and more.
Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
OpenAI-compatible API: An OpenAI-compatible REST API server that supports both chat and completions.
Huggingface models: Seamless integration with most popular HF models, supporting safetensors.
Customizable: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
Production Ready: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.

Getting Started

ScaleLLM is available as a Python Wheel package on PyPI. You can install it using pip:

# Install ScaleLLM with CUDA 12.1 and PyTorch 2.4.0
pip install scalellm

If you want to install ScaleLLM with a different version of CUDA and PyTorch, pass pip the index URL for that version. For example, to install ScaleLLM with CUDA 12.1 and PyTorch 2.2.2, use the following command:

pip install scalellm -i https://whl.vectorch.com/cu121/torch2.2.2/
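
The index URL above appears to encode the CUDA and PyTorch versions in its path. The sketch below builds such a URL from version strings. The helper name wheel_index_url is hypothetical, and the layout is inferred from the single example above, so confirm that an index exists for your combination before relying on it.

```python
def wheel_index_url(cuda_version: str, torch_version: str) -> str:
    """Build an index URL following the pattern of the example above.

    Assumption: other CUDA/PyTorch combinations use the same
    cu<major><minor>/torch<version> layout. Not every combination
    is guaranteed to be published.
    """
    cuda_tag = "cu" + cuda_version.replace(".", "")
    return f"https://whl.vectorch.com/{cuda_tag}/torch{torch_version}/"


url = wheel_index_url("12.1", "2.2.2")
print(f"pip install scalellm -i {url}")
```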

Build from source

If no wheel package is available for your configuration, you can build ScaleLLM from source. After cloning the repository, run the following commands from its root directory to build and install the wheel:

python setup.py bdist_wheel
pip install dist/scalellm-*.whl

OpenAI-Compatible Server

You can start the OpenAI-compatible REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct

Chatbot UI

A local Chatbot UI is also available at localhost:3000. You can start it with the latest image using the following commands:

docker pull docker.io/vectorchai/chatbot-ui:latest
docker run -it --net=host \
  -e OPENAI_API_HOST=http://127.0.0.1:8080 \
  -e OPENAI_API_KEY=YOUR_API_KEY \
  docker.io/vectorchai/chatbot-ui:latest

Usage Examples

You can use ScaleLLM for offline batch inference, or online distributed inference. Below are some examples to help you get started. More examples can be found in the examples folder.

Chat Completions

Start the REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct

You can query the chat completions with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

or with the OpenAI Python client:

import openai

client = openai.Client(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# List available models
models = client.models.list()
print("==== Available models ====")
for model in models.data:
    print(model.id)

# choose the first model
model = models.data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    stream=True,
)

print(f"==== Model: {model} ====")
for chunk in stream:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.content:
        print(delta.content, end="")
print()

Completions

Start the REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B

For regular completions, you can use this example:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "hello",
    "max_tokens": 32,
    "temperature": 0.7,
    "stream": true
  }'
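
With "stream": true, the response arrives incrementally rather than as one JSON document. The sketch below assumes the server follows the OpenAI streaming convention of server-sent events, where each event is a line of the form "data: <json>" and the stream ends with "data: [DONE]". That framing is an assumption about the server, and iter_completion_text is a hypothetical helper, not part of ScaleLLM.

```python
import json
from typing import Iterable, Iterator


def iter_completion_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each streamed completion chunk."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI convention
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Hand-written sample lines, not captured server output.
sample = [
    'data: {"choices": [{"index": 0, "text": "Hello"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": " world"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_completion_text(sample)))
```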

The same request with the OpenAI Python client:

import openai

client = openai.Client(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# List available models
models = client.models.list()

print("==== Available models ====")
for model in models.data:
    print(model.id)

# choose the first model
model = models.data[0].id

stream = client.completions.create(
    model=model,
    prompt="hello",
    max_tokens=32,
    temperature=0.7,
    stream=True,
)

print(f"==== Model: {model} ====")
for chunk in stream:
    choice = chunk.choices[0]
    if choice.text:
        print(choice.text, end="")
print()

Advanced Features

CUDA Graph

CUDA Graph can improve performance by reducing the overhead of launching kernels. ScaleLLM supports CUDA Graph for decoding by default. In addition, it lets users specify which batch sizes to capture by setting the --cuda_graph_batch_sizes flag.

For example:

python3 -m scalellm.serve.api_server \
  --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable_cuda_graph=true \
  --cuda_graph_batch_sizes=1,2,4,8

The limitations of CUDA Graph could cause problems during development and debugging. If you encounter any issues related to it, you can disable CUDA Graph by setting the --enable_cuda_graph=false flag.
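
A captured graph replays kernels with fixed shapes, which is why the set of captured batch sizes matters. The sketch below shows one plausible lookup policy: pad a decode batch up to the next captured size, and run without a graph when no captured size is large enough. This policy and the helper pick_graph_batch_size are assumptions for illustration; the document does not state how ScaleLLM selects a graph.

```python
import bisect
from typing import Optional, Sequence


def pick_graph_batch_size(batch_size: int, captured: Sequence[int]) -> Optional[int]:
    """Return the smallest captured batch size that fits, or None.

    Hypothetical policy: pad the decode batch up to the next captured
    size; return None (run without a graph) when the batch is larger
    than every captured size.
    """
    sizes = sorted(captured)
    idx = bisect.bisect_left(sizes, batch_size)
    return sizes[idx] if idx < len(sizes) else None


captured_sizes = [1, 2, 4, 8]  # mirrors --cuda_graph_batch_sizes=1,2,4,8
for n in (1, 3, 8, 9):
    print(n, "->", pick_graph_batch_size(n, captured_sizes))
```

Under this policy, a batch of 3 would run in the graph captured for 4, and a batch of 9 would fall back to regular kernel launches.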

Prefix Cache

The KV cache is a technique that caches the intermediate key/value (KV) states to avoid redundant computation during LLM inference. Prefix cache extends this idea by allowing KV caches with the same prefix to be shared among different requests.

ScaleLLM supports Prefix Cache and enables it by default. You can disable it by setting the --enable_prefix_cache=false flag.
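
To make the sharing idea concrete, here is a toy sketch of prefix matching at block granularity. It tracks only which token prefixes have cached blocks, not the KV tensors themselves. The class ToyPrefixCache, the block size of 4, and the dictionary-based lookup are all illustrative assumptions and do not describe ScaleLLM's internal data structures.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block; illustrative value


class ToyPrefixCache:
    """Maps a token prefix (in whole blocks) to a shared KV block id."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], int] = {}

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV blocks."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) not in self._blocks:
                break
            matched = end
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Register every full block of the sequence, keyed by its prefix."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self._blocks.setdefault(tuple(tokens[:end]), len(self._blocks))


cache = ToyPrefixCache()
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]  # shared by both requests
cache.insert(system_prompt + [21, 22, 23])
reused = cache.match(system_prompt + [31, 32, 33, 34])
print(f"tokens reused from cache: {reused}")
```

The second request shares the 8-token system prompt with the first, so only its last block would need fresh computation in this toy model.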

Chunked Prefill

Chunked Prefill splits a long user prompt into multiple chunks and fills the remaining slots in each batch with decodes. This technique can improve decoding throughput and avoid the long stalls that a large prefill would otherwise cause for in-flight requests. However, it may slightly increase Time to First Token (TTFT). ScaleLLM supports Chunked Prefill, and its behavior can be controlled by setting the following flags:

--max_tokens_per_batch: The maximum tokens for each batch, default is 512.
--max_seqs_per_batch: The maximum sequences for each batch, default is 128.
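
The interaction between the two limits can be sketched with a toy scheduler. It is a simplified illustration under stated assumptions: decodes are scheduled first at one token each, and the leftover token budget goes to prompt chunks. The names Request and build_batch are hypothetical, and the actual ScaleLLM scheduling policy may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    prompt_left: int  # prompt tokens not yet prefilled; 0 means decoding


def build_batch(requests: List[Request], max_tokens: int, max_seqs: int) -> List[Tuple[str, int]]:
    """Return (request name, tokens scheduled) pairs for one step."""
    batch: List[Tuple[str, int]] = []
    budget = max_tokens
    # Decodes first: each running sequence needs exactly one token.
    for req in requests:
        if req.prompt_left == 0 and budget > 0 and len(batch) < max_seqs:
            batch.append((req.name, 1))
            budget -= 1
    # Fill the remaining token budget with prompt chunks.
    for req in requests:
        if req.prompt_left > 0 and budget > 0 and len(batch) < max_seqs:
            chunk = min(req.prompt_left, budget)
            batch.append((req.name, chunk))
            req.prompt_left -= chunk
            budget -= chunk
    return batch


reqs = [Request("decode-a", 0), Request("decode-b", 0), Request("long-prompt", 1200)]
for step in range(3):
    print(step, build_batch(reqs, max_tokens=512, max_seqs=128))
```

In this sketch the 1200-token prompt is prefilled over three steps while the two decoding requests keep producing one token per step instead of waiting for the whole prompt.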

Speculative Decoding

Speculative Decoding is a commonly used technique to speed up LLM inference without changing the output distribution. During inference, it uses a cheaper approximation to generate speculative tokens, which are then validated by the target model. For now, ScaleLLM supports Speculative Decoding with a draft model that generates the draft tokens; enable it by configuring a draft model and setting the number of speculative tokens.

For example:

python3 -m scalellm.serve.api_server \
  --model=google/gemma-7b-it \
  --draft_model=google/gemma-2b-it \
  --num_speculative_tokens=5 \
  --device=cuda:0 \
  --draft_device=cuda:0
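
The propose-then-verify loop can be sketched with toy models. This is the greedy variant, in which a draft token is accepted only if it equals the target model's own choice, so the output matches target-only greedy decoding. Preserving the distribution under sampling requires a rejection-sampling acceptance rule that is not shown here. The functions below are illustrative stand-ins, not ScaleLLM APIs; k plays the role of --num_speculative_tokens.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken, context: List[int], k: int) -> List[int]:
    """Return the tokens accepted in one speculative step (greedy variant)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # 2. The target model checks each position. A real engine scores all
    #    k positions in a single forward pass; this loop is for clarity.
    accepted: List[int] = []
    for token in proposal:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # 3. Every draft token matched, so the target contributes a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int) -> List[int]:
    """Generate n tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n]


# Toy "models": the target counts upward; the draft disagrees after multiples of 4.
def target_model(ctx: List[int]) -> int:
    return ctx[-1] + 1


def draft_model(ctx: List[int]) -> int:
    return ctx[-1] + (2 if ctx[-1] % 4 == 0 else 1)


print(generate(target_model, draft_model, prompt=[1], n=8, k=5))
```

Each step yields at least one token from the target model, so a poor draft model costs speed but never changes the result in this greedy setting.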

Quantization

Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization (GPTQ) and Activation-aware Weight Quantization (AWQ), with seamless integration into the following libraries: autogptq, exllama, exllamav2, and awq.

By default, exllamav2 is employed for GPTQ 4-bit quantization. However, you can choose a specific implementation with the --qlinear_gptq_impl option, which accepts exllama, exllamav2, or auto.
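
As a rough guide to the memory savings, the arithmetic below estimates weight storage for 16-bit and group-wise 4-bit formats. It is a back-of-the-envelope sketch: the 7B parameter count, the group size of 128, and the per-group scale and zero-point widths are illustrative assumptions, and the estimate ignores activations, the KV cache, and any layers kept in higher precision.

```python
from typing import Optional


def weight_bytes(num_params: int, bits: int, group_size: Optional[int] = None, scale_bits: int = 16) -> float:
    """Approximate weight storage in bytes.

    With group_size set, each group of weights also stores one scale and
    one zero point (zero point assumed to use the same width as a weight).
    """
    total_bits = num_params * bits
    if group_size is not None:
        groups = num_params / group_size
        total_bits += groups * (scale_bits + bits)
    return total_bits / 8


params = 7_000_000_000  # illustrative 7B-parameter model
fp16 = weight_bytes(params, bits=16)
int4 = weight_bytes(params, bits=4, group_size=128)
print(f"fp16: {fp16 / 1e9:.2f} GB, 4-bit (group 128): {int4 / 1e9:.2f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.6 GB, with the per-group metadata adding only a few percent on top of the 4-bit payload.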

Supported Models

Models	Tensor Parallel	Quantization	Chat API	HF models examples
Aquila	Yes	Yes	Yes	BAAI/Aquila-7B, BAAI/AquilaChat-7B
Bloom	Yes	Yes	No	bigscience/bloom
Baichuan	Yes	Yes	Yes	baichuan-inc/Baichuan2-7B-Chat
ChatGLM3	Yes	Yes	Yes	THUDM/chatglm3-6b
Gemma	Yes	Yes	Yes	google/gemma-2b
GPT_j	Yes	Yes	No	EleutherAI/gpt-j-6b
GPT_NeoX	Yes	Yes	No	EleutherAI/gpt-neox-20b
GPT2	Yes	Yes	No	gpt2
InternLM	Yes	Yes	Yes	internlm/internlm-7b
Llama3/2	Yes	Yes	Yes	meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-8B, meta-llama/Llama-2-7b
Mistral	Yes	Yes	Yes	mistralai/Mistral-7B-v0.1
MPT	Yes	Yes	Yes	mosaicml/mpt-30b
Phi2	Yes	Yes	No	microsoft/phi-2
Qwen	Yes	Yes	Yes	Qwen/Qwen-72B-Chat
Yi	Yes	Yes	Yes	01-ai/Yi-6B, 01-ai/Yi-34B-Chat-4bits, 01-ai/Yi-6B-200K

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on GitHub Issues.

Limitations

There are several known limitations we are looking to address in the coming months, including:

Only GPUs newer than the Turing architecture are supported.

Contributing

If you have any questions or want to contribute, please don't hesitate to ask in our Discussions forum or join our Discord chat room. We welcome your input and contributions to make ScaleLLM even better. Please follow CONTRIBUTING.md to get started.

Acknowledgements

The following open-source projects have been used in this project, either in their original form or modified to meet our needs:

License

This project is released under the Apache 2.0 license.

