GLD110/Unified-Local-AI-Server

Unified Local AI Server (FastAPI)

A single local API server that can discover and serve multiple model types from local files:

  • LLM (.gguf) via llama-cpp-python
  • Image generation (.safetensors / diffusers folders) via diffusers
  • Speech models (.onnx) via onnxruntime

It supports multi-model routing, lazy loading (models load only on first use), request logging, and a health endpoint.

Setup

Create a virtual environment and install dependencies:

cd ai-server
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/macOS
pip install -r requirements.txt

Run the server:

uvicorn server:app --host 127.0.0.1 --port 8000

Where to place models

Put your local models here:

ai-server/
  models/
    llm/        # .gguf files (e.g. mistral.gguf)
    diffusion/  # diffusers folders OR single-file .safetensors
    speech/     # .onnx files
  output/       # generated images saved here
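The layout above is what the server scans at startup. As a rough sketch of how that discovery could work (function and key names here are assumptions, not the actual `server.py` API), each subfolder maps to a model type by extension, with diffusers folders recognized by their `model_index.json`:

```python
from pathlib import Path

# Hypothetical sketch of startup model discovery; the real server.py
# may use different names and structure.
def discover_models(root: str) -> dict:
    """Scan the models/ tree and map model names to file or folder paths."""
    base = Path(root) / "models"
    models = {"llm": {}, "diffusion": {}, "speech": {}}

    # LLMs: any .gguf file, keyed by its stem (e.g. "mistral")
    for f in (base / "llm").glob("*.gguf"):
        models["llm"][f.stem] = f

    # Diffusion: either a diffusers folder (identified by model_index.json)
    # or a single-file .safetensors checkpoint
    diff_dir = base / "diffusion"
    if diff_dir.is_dir():
        for entry in diff_dir.iterdir():
            if entry.is_dir() and (entry / "model_index.json").exists():
                models["diffusion"][entry.name] = entry
            elif entry.suffix == ".safetensors":
                models["diffusion"][entry.stem] = entry

    # Speech: any .onnx file
    for f in (base / "speech").glob("*.onnx"):
        models["speech"][f.stem] = f

    return models
```

A model's name in the API (e.g. `?model=mistral`) is simply the file stem or folder name discovered here.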

LLM (.gguf)

Place one or more .gguf files into:

ai-server/models/llm/

Example:

  • ai-server/models/llm/mistral.gguf

Diffusion (.safetensors or diffusers folder)

Either:

  • put a full diffusers model folder inside ai-server/models/diffusion/<model_name>/ (recommended), or
  • put a single .safetensors file directly in ai-server/models/diffusion/

Examples:

  • ai-server/models/diffusion/sd15/ (contains model_index.json, etc.)
  • ai-server/models/diffusion/sd15.safetensors

Speech (.onnx)

Place one or more .onnx files into:

ai-server/models/speech/

API

Health check

GET /

Shows server status, available models, and defaults:

curl http://127.0.0.1:8000/

Text generation

POST /text?model=mistral

If model is omitted, the first discovered LLM becomes the default.

curl -X POST "http://127.0.0.1:8000/text?model=mistral" ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\":\"Explain AI\",\"max_tokens\":256}"

Response:

{ "response": "..." }
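The same call can be made from Python without curl. A minimal stdlib-only client (the function name and defaults are illustrative, not part of the server):

```python
import json
import urllib.request

# Hypothetical client for the /text endpoint, equivalent to the curl call
# above. Assumes the server is running on 127.0.0.1:8000 by default.
def generate_text(prompt: str, model: str = "mistral", max_tokens: int = 256,
                  base_url: str = "http://127.0.0.1:8000") -> str:
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        f"{base_url}/text?model={model}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The `/image` endpoint can be called the same way, swapping the path and the JSON body fields.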

Image generation

POST /image?model=sd15

curl -X POST "http://127.0.0.1:8000/image?model=sd15" ^
  -H "Content-Type: application/json" ^
  -d "{\"prompt\":\"a futuristic city\",\"filename\":\"image.png\"}"

Response:

{ "image_path": "output/image.png" }

Speech (ONNX)

POST /speech?model=whisper

This endpoint is implemented as a generic ONNX runner (speech models vary widely).

  • If you pass a path to an existing .wav file, the server will attempt inference by feeding the audio into the first ONNX input.
  • If you pass plain text, the server returns it as-is.

curl -X POST "http://127.0.0.1:8000/speech?model=whisper" ^
  -H "Content-Type: application/json" ^
  -d "{\"input\":\"D:/path/to/audio.wav\"}"

Response:

{ "result": [...] }
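Feeding a .wav file into the first ONNX input implies some preprocessing. As a sketch under assumed conventions (16-bit mono PCM; the real server's decoding may differ), the audio can be decoded to normalized floats with the stdlib before being handed to the session:

```python
import struct
import wave

# Hypothetical preprocessing for the generic ONNX runner: decode a 16-bit
# mono PCM .wav into floats in [-1, 1).
def load_wav_as_floats(path: str) -> list:
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return [s / 32768.0 for s in samples]

# With onnxruntime, feeding the model's first input would then look roughly
# like this (input shape and dtype vary per model, so treat as a sketch):
#   import numpy as np, onnxruntime as ort
#   sess = ort.InferenceSession("models/speech/whisper.onnx")
#   name = sess.get_inputs()[0].name
#   audio = np.asarray(load_wav_as_floats("audio.wav"), dtype=np.float32)
#   result = sess.run(None, {name: audio[None, :]})
```

Because speech ONNX models disagree on input shapes, dtypes, and pre/post-processing, the endpoint's generic behavior is best-effort rather than model-aware.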

Notes / Behavior

  • Lazy loading: the server only loads a model when you call an endpoint that uses it.
  • Diffusion GPU/CPU: if torch.cuda.is_available() returns True, diffusion runs on CUDA with float16; otherwise on CPU with float32.
  • GGUF: LLM inference runs via llama-cpp-python (CPU by default).
  • Graceful missing-model handling: endpoints return clean JSON errors if no models are present.
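The lazy-loading behavior described above amounts to a load-on-first-use cache. A minimal sketch (names are illustrative; the real server.py likely differs in detail):

```python
# Cache of models already loaded into memory, keyed by model name.
_loaded = {}

def get_model(name: str, loader):
    """Call loader() on the first request for `name`; reuse the cached
    instance on every request after that."""
    if name not in _loaded:
        _loaded[name] = loader()
    return _loaded[name]

# Example: the /text endpoint could resolve its backend like
#   llm = get_model("mistral",
#                   lambda: Llama(model_path="models/llm/mistral.gguf"))
```

This keeps startup fast and memory low when only some of the discovered models are actually used.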
