A single local API server that can discover and serve multiple model types from local files:

- LLM (`.gguf`) via `llama-cpp-python`
- Image generation (`.safetensors` / diffusers folders) via `diffusers`
- Speech models (`.onnx`) via `onnxruntime`

It supports multi-model routing, lazy loading (models load only on first use), request logging, and a health endpoint.
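The lazy-loading behavior can be sketched as a registry that caches each loader's result, so a model is loaded at most once, on first request. The `ModelRegistry` name and loader signature below are illustrative, not the server's actual code:

```python
from typing import Any, Callable, Dict

class ModelRegistry:
    """Maps model names to loader callables; loads each model lazily on first use."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        # Registering is cheap; nothing is loaded yet.
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        # Load only on first access; afterwards serve from the cache.
        if name not in self._cache:
            if name not in self._loaders:
                raise KeyError(f"unknown model: {name}")
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

# Example: the (stand-in) loader runs only when the model is first requested.
registry = ModelRegistry()
registry.register("mistral", lambda: "loaded-llm")  # stand-in for e.g. Llama(model_path=...)
```

The same idea applies per model type: `/text`, `/image`, and `/speech` each call `get()` on their own registry entry.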
Create a virtual environment and install dependencies:

```bash
cd ai-server
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

Run the server:

```bash
uvicorn server:app --host 127.0.0.1 --port 8000
```

Put your local models here:

```
ai-server/
  models/
    llm/        # .gguf files (e.g. mistral.gguf)
    diffusion/  # diffusers folders OR single-file .safetensors
    speech/     # .onnx files
  output/       # generated images saved here
```
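Discovery over this layout amounts to scanning each subfolder for files with the expected extension (plus subdirectories for diffusers models). A minimal sketch — the function name `discover_models` and the extension map are assumptions, not the server's actual code:

```python
from pathlib import Path
from typing import Dict, List

# Assumed extension map mirroring the layout above.
MODEL_DIRS = {
    "llm": ".gguf",
    "diffusion": ".safetensors",
    "speech": ".onnx",
}

def discover_models(root: str) -> Dict[str, List[str]]:
    """Return model names (file stems) found under <root>/models/<kind>/."""
    found: Dict[str, List[str]] = {}
    for kind, ext in MODEL_DIRS.items():
        folder = Path(root) / "models" / kind
        names = [p.stem for p in folder.glob(f"*{ext}")] if folder.is_dir() else []
        # Diffusers models may also be full folders; count any subdirectory as a model.
        if folder.is_dir() and kind == "diffusion":
            names += [p.name for p in folder.iterdir() if p.is_dir()]
        found[kind] = sorted(names)
    return found
```

The discovered names are what the `model=` query parameter selects against.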
Place one or more `.gguf` files into:

```
ai-server/models/llm/
```

Example: `ai-server/models/llm/mistral.gguf`
Either:

- put a full diffusers model folder inside `ai-server/models/diffusion/<model_name>/` (recommended), or
- put a single `.safetensors` file directly in `ai-server/models/diffusion/`

Examples:

- `ai-server/models/diffusion/sd15/` (contains `model_index.json`, etc.)
- `ai-server/models/diffusion/sd15.safetensors`
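Telling the two layouts apart is straightforward: a diffusers folder carries a `model_index.json`, while a single-file checkpoint is just a `.safetensors` file. A hedged sketch — the helper name is illustrative:

```python
from pathlib import Path

def diffusion_source_kind(path: str) -> str:
    """Classify a diffusion model source as 'folder', 'single_file', or 'unknown'."""
    p = Path(path)
    if p.is_dir() and (p / "model_index.json").exists():
        return "folder"        # loadable with DiffusionPipeline.from_pretrained(path)
    if p.suffix == ".safetensors":
        return "single_file"   # loadable with a pipeline's from_single_file(path)
    return "unknown"
```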
Place one or more `.onnx` files into:

```
ai-server/models/speech/
```
### `GET /`

Shows server status, available models, and defaults:

```bash
curl http://127.0.0.1:8000/
```

### `POST /text?model=mistral`
If model is omitted, the first discovered LLM becomes the default.
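The fallback rule can be sketched as a small resolver — pick the requested model if it exists, otherwise fall back to the first discovered one. The function name is illustrative:

```python
from typing import List, Optional

def resolve_model(requested: Optional[str], available: List[str]) -> str:
    """Resolve the model name for a request, defaulting to the first discovered model."""
    if requested is not None:
        if requested not in available:
            raise ValueError(f"unknown model: {requested}")
        return requested
    if not available:
        raise ValueError("no models discovered")
    return available[0]  # first discovered model becomes the default
```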
```bash
curl -X POST "http://127.0.0.1:8000/text?model=mistral" ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\":\"Explain AI\",\"max_tokens\":256}"
```

Response:

```json
{ "response": "..." }
```

### `POST /image?model=sd15`
```bash
curl -X POST "http://127.0.0.1:8000/image?model=sd15" ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\":\"a futuristic city\",\"filename\":\"image.png\"}"
```

Response:

```json
{ "image_path": "output/image.png" }
```

### `POST /speech?model=whisper`
This endpoint is implemented as a generic ONNX runner (speech models vary widely).

- If you pass a path to an existing `.wav` file, the server will attempt inference by feeding the audio into the first ONNX input.
- If you pass plain text, the server returns it as-is.
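That dispatch rule can be sketched as below. The function name and feed shape are assumptions (real speech models often expect preprocessed features such as log-mel spectrograms rather than a raw waveform); the ONNX calls use the standard `onnxruntime.InferenceSession` API:

```python
from pathlib import Path

def run_speech(model_path: str, payload: str):
    """Generic runner: .wav paths go through ONNX inference, plain text echoes back."""
    p = Path(payload)
    if p.suffix.lower() == ".wav" and p.exists():
        # Lazy imports so the text branch works without onnxruntime installed.
        import wave
        import numpy as np
        import onnxruntime as ort

        with wave.open(str(p), "rb") as wf:
            frames = wf.readframes(wf.getnframes())
        audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0

        session = ort.InferenceSession(model_path)
        first_input = session.get_inputs()[0].name
        # Feed the waveform into the first ONNX input; a specific model may
        # require reshaping or feature extraction before this step.
        return session.run(None, {first_input: audio[None, :]})
    # Plain text: returned as-is, matching the documented behavior.
    return payload
```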
```bash
curl -X POST "http://127.0.0.1:8000/speech?model=whisper" ^
  -H "Content-Type: application/json" ^
  -d "{\"input\":\"D:/path/to/audio.wav\"}"
```

Response:

```json
{ "result": [...] }
```

- Lazy loading: the server only loads a model when you call an endpoint that uses it.
- Diffusion GPU/CPU: if `torch.cuda.is_available()` is true, diffusion runs on CUDA with `float16`; otherwise CPU `float32`.
- GGUF: LLM inference runs via `llama-cpp-python` (CPU by default).
- Graceful missing-model handling: endpoints return clean JSON errors if no models are present.
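The graceful missing-model behavior can be sketched as a helper that turns a lookup failure into a clean JSON-style error payload plus an HTTP status, instead of a traceback. The names and status codes below are illustrative, not the server's actual code:

```python
from typing import Any, Dict, Tuple

def load_or_error(name: str, available: Dict[str, Any]) -> Tuple[dict, int]:
    """Return (payload, status): the model on success, a JSON error otherwise."""
    if not available:
        # No models discovered at all: point the user at the models/ layout.
        return {"error": "no models found; add files under ai-server/models/"}, 503
    if name not in available:
        return {"error": f"unknown model '{name}'", "available": sorted(available)}, 404
    return {"model": available[name]}, 200
```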