A lightweight, high-performance API server for Rockchip NPUs (RKLLM), providing drop-in compatibility with the OpenAI, Claude, and Ollama API formats. This allows you to seamlessly integrate locally hosted large language models on Rockchip hardware with existing AI tools, frontends, and frameworks.
- 🚀 Hardware Optimized: Leverages Rockchip's NPU for fast inference.
- 🔄 Triple API Compatibility: Supports the standard OpenAI, Claude, and Ollama API endpoints.
- 🌊 Real-time Streaming: Full support for Server-Sent Events (SSE) streaming token output.
- 🐳 Docker Ready: Minimal footprint containerization for easy deployment.
- 🛠️ No External Tokenizers: Operates independently without needing Hugging Face `transformers` or `AutoTokenizer`.
- Hardware: RK3588 Series, RK3576 Series
- RKNPU Driver Version:
v0.9.8 (recommended)
Note: Check your RKNPU driver version before proceeding:

```shell
cat /sys/kernel/debug/rknpu/version
```

If this command returns no output, your Linux kernel does not currently support the RKNPU.
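If you want to gate a setup script on the reported driver version, here is a minimal sketch. The exact format of the version string depends on your kernel build, so the parsing regex below is an assumption; adjust it to what your device actually reports.

```python
import re


def parse_rknpu_version(text: str) -> tuple[int, ...]:
    """Extract a dotted version like '0.9.8' from the driver version string.

    Assumes the string contains something like 'v0.9.8'; the surrounding
    text varies between kernel builds.
    """
    match = re.search(r"v?(\d+(?:\.\d+)+)", text)
    if not match:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def driver_is_recommended(text: str, minimum: tuple[int, ...] = (0, 9, 8)) -> bool:
    """True if the reported driver version is at least the recommended v0.9.8."""
    return parse_rknpu_version(text) >= minimum


# Example: read the same file the `cat` command above inspects.
# with open("/sys/kernel/debug/rknpu/version") as f:
#     print(driver_is_recommended(f.read()))
```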
The easiest way to run the server is via Docker.
```shell
docker run -d \
  --name rkllm-server \
  --restart unless-stopped \
  --privileged \
  -p 8080:8080 \
  -v /dev:/dev \
  -v /YOUR/PATH/TO/MODELS:/rkllm_server/models \
  -e TARGET_PLATFORM=rk3588 \
  -e RKLLM_MODEL_PATH=YOUR_MODEL_FILE_NAME.rkllm \
  -e PORT=8080 \
  dukihiroi/rkllm-server:latest
```
Create a `docker-compose.yml` file:
```yaml
services:
  rkllm-server:
    image: dukihiroi/rkllm-server:latest
    container_name: rkllm-server
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /dev:/dev
      - ./models:/rkllm_server/models
    environment:
      - TARGET_PLATFORM=rk3588
      - RKLLM_MODEL_PATH=qwen3-vl-2b-instruct_w8a8_rk3588.rkllm
      - PORT=8080
```
Then start the server:
```shell
mkdir models   # Place your .rkllm files here
docker compose up -d
```
Test the deployment:
```shell
curl http://localhost:8080/health
```
If you prefer to run the server directly on the host OS without Docker:
1. Clone the repository:
   ```shell
   git clone https://github.com/anand34577/rkllm_openai.git
   cd rkllm_openai
   ```
2. Install RKLLM Dynamic Libraries:
   ```shell
   sudo cp lib/*.so /usr/lib
   sudo ldconfig
   ```
3. Install uv (Fast Python Package Installer):
   ```shell
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```
4. Sync Dependencies:
   ```shell
   uv sync
   ```
5. Run the Server:
   ```shell
   uv run server.py \
     --rkllm_model_path=models/qwen3-vl-2b-instruct_w8a8_rk3588.rkllm \
     --target_platform=rk3588 \
     --port=8080
   ```
Once running, the server listens on the configured port (default 8080).
| API Type | Endpoint | Description |
|---|---|---|
| Server | `GET /health` | Check server status and NPU availability. |
| OpenAI | `POST /v1/chat/completions` | Standard chat completion (supports `stream: true`). |
| OpenAI | `GET /v1/models` | Returns the currently loaded RKLLM model ID. |
| Claude | `POST /v1/messages` | Anthropic-compatible message completion (supports `stream: true`). |
| Ollama | `POST /api/chat` | Ollama-compatible chat completion. |
| Ollama | `GET /api/tags` | Ollama-compatible model listing. |
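As a sketch of the OpenAI-style request and response shape behind `POST /v1/chat/completions` (the model name below is a placeholder; query `GET /v1/models` for the ID your server actually reports):

```python
import json


def build_chat_request(model: str, user_prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": stream,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a non-streaming chat completion response."""
    return response["choices"][0]["message"]["content"]


# POST this body to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json, using urllib.request, requests, etc.
body = build_chat_request("YOUR_MODEL_ID", "Hello!")
print(json.dumps(body))
```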
You can test the OpenAI streaming implementation using the included Python client:
```shell
uv run client.py --host http://localhost:8080 --prompt "Explain quantum mechanics briefly." --stream
```
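If you write your own streaming client instead, each SSE event arrives as a `data: …` line carrying a JSON chunk, with `data: [DONE]` marking the end of the stream. A minimal parser sketch, assuming the chunks follow the standard OpenAI streaming layout (verify against your server's actual output):

```python
import json


def iter_stream_content(lines):
    """Yield text deltas from OpenAI-style SSE lines until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]


# Example over canned SSE lines:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # -> Hello
```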
**Hardware Concurrency Limit:** Because the NPU handles one inference task at a time, the server can only process one conversation at a time.

- Do not use this server for heavy background tasks (such as bulk title/tag generation) if you also want it to remain responsive for interactive chat.
- If a new request arrives while the NPU is busy, the server will briefly wait. If the NPU does not free up in time, it returns an HTTP `503 Service Unavailable` error rather than crashing.
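On the client side, a 503 from a busy NPU is safely retryable. A hedged sketch of retry-with-backoff, where `send` is a placeholder for whatever HTTP call your client makes (it just needs to return a status code and body):

```python
import time


def post_with_retry(send, retries: int = 5, backoff: float = 0.5):
    """Call send() until it returns a status other than 503.

    `send` is any zero-argument callable returning (status_code, body).
    Sleeps backoff, 2*backoff, ... between attempts, and raises if the
    NPU stays busy for every attempt.
    """
    delay = backoff
    for attempt in range(retries):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < retries - 1:
            time.sleep(delay)
            delay *= 2  # exponential backoff while the NPU is busy
    raise RuntimeError("server still busy (HTTP 503) after retries")
```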
To download pre-converted .rkllm models, please refer to the official Rockchip rknn-llm Model Zoo.