A single local API server that can discover and serve multiple model types from local files:

- LLM (`.gguf`) via `llama-cpp-python`
- Image generation (`.safetensors` / diffusers folders) via `diffusers`
- Speech models (`.onnx`) via `onnxruntime`

It supports multi-model routing, lazy loading (models load only on first use), request logging, and a health endpoint.
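The lazy-loading behavior can be sketched as a registry that caches each loader's result, so a model is loaded at most once, on first request. The `ModelRegistry` name and loader signature below are illustrative, not the server's actual code:

```python
from typing import Any, Callable, Dict

class ModelRegistry:
    """Maps model names to loader callables; loads each model lazily on first use."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        # Registering is cheap; nothing is loaded yet.
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        # Load only on first access; afterwards serve from the cache.
        if name not in self._cache:
            if name not in self._loaders:
                raise KeyError(f"unknown model: {name}")
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

# Example: the (stand-in) loader runs only when the model is first requested.
registry = ModelRegistry()
registry.register("mistral", lambda: "loaded-llm")  # stand-in for e.g. Llama(model_path=...)
```

The same idea applies per model type: `/text`, `/image`, and `/speech` each call `get()` on their own registry entry.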
Create a virtual environment and install dependencies:

```bash
cd ai-server
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

Run the server:

```bash
uvicorn server:app --host 127.0.0.1 --port 8000
```

Put your local models here:

```
ai-server/
  models/
    llm/        # .gguf files (e.g. mistral.gguf)
    diffusion/  # diffusers folders OR single-file .safetensors
    speech/     # .onnx files
  output/       # generated images saved here
```
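Discovery over this layout amounts to scanning each subfolder for files with the expected extension (plus subdirectories for diffusers models). A minimal sketch — the function name `discover_models` and the extension map are assumptions, not the server's actual code:

```python
from pathlib import Path
from typing import Dict, List

# Assumed extension map mirroring the layout above.
MODEL_DIRS = {
    "llm": ".gguf",
    "diffusion": ".safetensors",
    "speech": ".onnx",
}

def discover_models(root: str) -> Dict[str, List[str]]:
    """Return model names (file stems) found under <root>/models/<kind>/."""
    found: Dict[str, List[str]] = {}
    for kind, ext in MODEL_DIRS.items():
        folder = Path(root) / "models" / kind
        names = [p.stem for p in folder.glob(f"*{ext}")] if folder.is_dir() else []
        # Diffusers models may also be full folders; count any subdirectory as a model.
        if folder.is_dir() and kind == "diffusion":
            names += [p.name for p in folder.iterdir() if p.is_dir()]
        found[kind] = sorted(names)
    return found
```

The discovered names are what the `model=` query parameter selects against.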
Place one or more `.gguf` files into:

```
ai-server/models/llm/
```

Example: `ai-server/models/llm/mistral.gguf`
Either:

- put a full diffusers model folder inside `ai-server/models/diffusion/<model_name>/` (recommended), or
- put a single `.safetensors` file directly in `ai-server/models/diffusion/`

Examples:

- `ai-server/models/diffusion/sd15/` (contains `model_index.json`, etc.)
- `ai-server/models/diffusion/sd15.safetensors`
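Telling the two layouts apart is straightforward: a diffusers folder carries a `model_index.json`, while a single-file checkpoint is just a `.safetensors` file. A hedged sketch — the helper name is illustrative:

```python
from pathlib import Path

def diffusion_source_kind(path: str) -> str:
    """Classify a diffusion model source as 'folder', 'single_file', or 'unknown'."""
    p = Path(path)
    if p.is_dir() and (p / "model_index.json").exists():
        return "folder"        # loadable with DiffusionPipeline.from_pretrained(path)
    if p.suffix == ".safetensors":
        return "single_file"   # loadable with a pipeline's from_single_file(path)
    return "unknown"
```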
Place one or more `.onnx` files into:

```
ai-server/models/speech/
```
### `GET /`

Shows server status, available models, and defaults:

```bash
curl http://127.0.0.1:8000/
```

### `POST /text?model=mistral`
If model is omitted, the first discovered LLM becomes the default.
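The fallback rule can be sketched as a small resolver — pick the requested model if it exists, otherwise fall back to the first discovered one. The function name is illustrative:

```python
from typing import List, Optional

def resolve_model(requested: Optional[str], available: List[str]) -> str:
    """Resolve the model name for a request, defaulting to the first discovered model."""
    if requested is not None:
        if requested not in available:
            raise ValueError(f"unknown model: {requested}")
        return requested
    if not available:
        raise ValueError("no models discovered")
    return available[0]  # first discovered model becomes the default
```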
```bash
curl -X POST "http://127.0.0.1:8000/text?model=mistral" ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\":\"Explain AI\",\"max_tokens\":256}"
```

Response:

```json
{ "response": "..." }
```

### `POST /image?model=sd15`
```bash
curl -X POST "http://127.0.0.1:8000/image?model=sd15" ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\":\"a futuristic city\",\"filename\":\"image.png\"}"
```

Response:

```json
{ "image_path": "output/image.png" }
```

### `POST /speech?model=whisper`
This endpoint is implemented as a generic ONNX runner (speech models vary widely).

- If you pass a path to an existing `.wav` file, the server will attempt inference by feeding the audio into the first ONNX input.
- If you pass plain text, the server returns it as-is.
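That dispatch rule can be sketched as below. The function name and feed shape are assumptions (real speech models often expect preprocessed features such as log-mel spectrograms rather than a raw waveform); the ONNX calls use the standard `onnxruntime.InferenceSession` API:

```python
from pathlib import Path

def run_speech(model_path: str, payload: str):
    """Generic runner: .wav paths go through ONNX inference, plain text echoes back."""
    p = Path(payload)
    if p.suffix.lower() == ".wav" and p.exists():
        # Lazy imports so the text branch works without onnxruntime installed.
        import wave
        import numpy as np
        import onnxruntime as ort

        with wave.open(str(p), "rb") as wf:
            frames = wf.readframes(wf.getnframes())
        audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0

        session = ort.InferenceSession(model_path)
        first_input = session.get_inputs()[0].name
        # Feed the waveform into the first ONNX input; a specific model may
        # require reshaping or feature extraction before this step.
        return session.run(None, {first_input: audio[None, :]})
    # Plain text: returned as-is, matching the documented behavior.
    return payload
```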
```bash
curl -X POST "http://127.0.0.1:8000/speech?model=whisper" ^
  -H "Content-Type: application/json" ^
  -d "{\"input\":\"D:/path/to/audio.wav\"}"
```

Response:

```json
{ "result": [...] }
```

- Lazy loading: the server only loads a model when you call an endpoint that uses it.
- Diffusion GPU/CPU: if `torch.cuda.is_available()` is true, diffusion runs on CUDA with `float16`; otherwise CPU `float32`.
- GGUF: LLM inference runs via `llama-cpp-python` (CPU by default).
- Graceful missing-model handling: endpoints return clean JSON errors if no models are present.
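The graceful missing-model behavior can be sketched as a helper that turns a lookup failure into a clean JSON-style error payload plus an HTTP status, instead of a traceback. The names and status codes below are illustrative, not the server's actual code:

```python
from typing import Any, Dict, Tuple

def load_or_error(name: str, available: Dict[str, Any]) -> Tuple[dict, int]:
    """Return (payload, status): the model on success, a JSON error otherwise."""
    if not available:
        # No models discovered at all: point the user at the models/ layout.
        return {"error": "no models found; add files under ai-server/models/"}, 503
    if name not in available:
        return {"error": f"unknown model '{name}'", "available": sorted(available)}, 404
    return {"model": available[name]}, 200
```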