itshusky01/rkllm_openai_like_api

RKLLM API Server

Introduction

A lightweight, high-performance API server for Rockchip NPUs (RKLLM), providing drop-in compatibility with the OpenAI, Claude, and Ollama API formats. This allows you to seamlessly integrate large language models hosted locally on Rockchip hardware with existing AI tools, frontends, and frameworks.

Features

  • 🚀 Hardware Optimized: Leverages Rockchip's NPU for fast inference.
  • 🔄 Triple API Compatibility: Supports the standard OpenAI, Claude, and Ollama API endpoints.
  • 🌊 Real-time Streaming: Full support for Server-Sent Events (SSE) streaming token output.
  • 🐳 Docker Ready: Minimal footprint containerization for easy deployment.
  • 🛠️ No External Tokenizers: Operates independently without needing Hugging Face transformers or AutoTokenizer.

Supported Platforms

  • Hardware: RK3588 Series, RK3576 Series
  • RKNPU Driver Version: v0.9.8 (Recommended)

Note: Check your RKNPU version before proceeding:

cat /sys/kernel/debug/rknpu/version

If this command returns no output, your Linux kernel does not currently support the RKNPU.


🐳 Quickstart (Docker Recommended)

The easiest way to run the server is via Docker.

Option A: Docker CLI

docker run -d \
  --name rkllm-server \
  --restart unless-stopped \
  --privileged \
  -p 8080:8080 \
  -v /dev:/dev \
  -v /YOUR/PATH/TO/MODELS:/rkllm_server/models \
  -e TARGET_PLATFORM=rk3588 \
  -e RKLLM_MODEL_PATH=YOUR_MODEL_FILE_NAME.rkllm \
  -e PORT=8080 \
  dukihiroi/rkllm-server:latest

Option B: Docker Compose

Create a docker-compose.yml file:

services:
  rkllm-server:
    image: dukihiroi/rkllm-server:latest
    container_name: rkllm-server
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /dev:/dev
      - ./models:/rkllm_server/models
    environment:
      - TARGET_PLATFORM=rk3588
      - RKLLM_MODEL_PATH=qwen3-vl-2b-instruct_w8a8_rk3588.rkllm
      - PORT=8080

Then start the server:

mkdir models # Place your .rkllm files here
docker compose up -d

Test the deployment:

curl http://localhost:8080/health
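Once the health check succeeds, you can exercise the OpenAI-compatible endpoint from any HTTP client. A minimal sketch using only the Python standard library (the base URL and the `rkllm` model name are assumptions; substitute whatever model you mounted):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "rkllm", stream: bool = False) -> dict:
    """OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST a non-streaming completion and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Example (requires the server to be running):
#   print(chat("http://localhost:8080", "Hello!"))
```

Because the server speaks the OpenAI wire format, the official `openai` Python SDK should also work by pointing its `base_url` at `http://localhost:8080/v1`.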

🛠️ Manual Installation

If you prefer to run the server directly on the host OS without Docker:

1. Clone the repository:

git clone https://github.com/anand34577/rkllm_openai.git
cd rkllm_openai

2. Install RKLLM Dynamic Libraries:

sudo cp lib/*.so /usr/lib
sudo ldconfig

3. Install uv (Fast Python Package Installer):

curl -LsSf https://astral.sh/uv/install.sh | sh

4. Sync Dependencies:

uv sync

5. Run the Server:

uv run server.py \
  --rkllm_model_path=models/qwen3-vl-2b-instruct_w8a8_rk3588.rkllm \
  --target_platform=rk3588 \
  --port=8080

🔌 API Endpoints

Once running, the server listens on the configured port (default 8080).

| API Type | Endpoint                    | Description                                                      |
|----------|-----------------------------|------------------------------------------------------------------|
| Server   | `GET /health`               | Check server status and NPU availability.                        |
| OpenAI   | `POST /v1/chat/completions` | Standard chat completion (supports `stream: true`).              |
| OpenAI   | `GET /v1/models`            | Returns the currently loaded RKLLM model ID.                     |
| Claude   | `POST /v1/messages`         | Anthropic-compatible message completion (supports `stream: true`). |
| Ollama   | `POST /api/chat`            | Ollama-compatible chat completion.                               |
| Ollama   | `GET /api/tags`             | Ollama-compatible model listing.                                 |
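The same question can be posed in any of the three formats; only the endpoint and field names differ. A sketch of minimal request bodies (field shapes follow the public OpenAI, Anthropic, and Ollama wire formats; the `rkllm` model name is a placeholder):

```python
# Target endpoints:
#   OpenAI: POST /v1/chat/completions
#   Claude: POST /v1/messages
#   Ollama: POST /api/chat

def openai_body(prompt: str) -> dict:
    return {
        "model": "rkllm",
        "messages": [{"role": "user", "content": prompt}],
    }

def claude_body(prompt: str) -> dict:
    # Anthropic's Messages API requires max_tokens on every request.
    return {
        "model": "rkllm",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }

def ollama_body(prompt: str) -> dict:
    return {
        "model": "rkllm",
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
    }
```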

Testing with the Built-in Client

You can test the OpenAI streaming implementation using the included Python client:

uv run client.py --host http://localhost:8080 --prompt "Explain quantum mechanics briefly." --stream
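If you prefer to consume the stream from your own code, the server emits standard SSE lines of the form `data: {json chunk}` terminated by `data: [DONE]`. A minimal consumer sketch (standard library only; the base URL and model name are assumptions):

```python
import json
import urllib.request

def parse_sse_line(line: bytes):
    """Return the decoded JSON chunk from one SSE line, or None for
    blank keep-alive lines and the terminating [DONE] sentinel."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    data = text[len("data:"):].strip()
    if not data or data == "[DONE]":
        return None
    return json.loads(data)

def stream_chat(base_url: str, prompt: str):
    """Yield content deltas from a streaming /v1/chat/completions call."""
    payload = {
        "model": "rkllm",
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = parse_sse_line(line)
            if chunk:
                delta = chunk["choices"][0].get("delta", {})
                if "content" in delta:
                    yield delta["content"]

# Example (requires the server to be running):
#   for token in stream_chat("http://localhost:8080", "Hi"):
#       print(token, end="", flush=True)
```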

⚠️ Important Limitations & Notes

Hardware Concurrency Limit: Because the NPU handles one inference task at a time, the server can only process one conversation at a time.

  • Do not use this server for heavy background tasks (such as bulk title/tag generation) if you also want it to remain responsive for interactive chat.
  • If a new request arrives while the NPU is busy, the server waits briefly. If the NPU does not free up, it returns an HTTP 503 Service Unavailable error rather than crashing.
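Scripted clients can treat that 503 as "NPU busy, try again later" rather than a hard failure. A minimal retry sketch (the function names and linear backoff policy are my own, not part of this project):

```python
import time
import urllib.error
import urllib.request

def backoff_schedule(retries: int, base: float) -> list:
    """Linear backoff delays: base, 2*base, 3*base, ..."""
    return [base * (i + 1) for i in range(retries)]

def urlopen_with_retry(req, retries: int = 3, base_delay: float = 2.0):
    """Retry a request while the server reports the NPU is busy (HTTP 503)."""
    delays = backoff_schedule(retries, base_delay)
    for attempt, delay in enumerate(delays):
        try:
            return urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == retries - 1:
                raise  # not a busy-NPU error, or out of attempts
            time.sleep(delay)
```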

📦 Model Zoo

To download pre-converted .rkllm models, please refer to the official Rockchip rknn-llm Model Zoo.


About

An RKLLM server compatible with the OpenAI, Ollama, and Claude API formats.
