Running LLMs on your own machine.
Three tools. One open model format. Zero cloud. A practical guide to putting a capable AI on your laptop.
GPT4All · LM Studio · Ollama
For the tech-curious — Est. 9 min · 10 slides
Why bother running models locally?
- Privacy — Your data never leaves the device. Nothing is logged by OpenAI, Anthropic, or anyone else.
- Cost — Once the model is downloaded, inference is free. No per-token fees, no monthly bills.
- Offline — Works on a plane, in a basement, or on a laptop with no internet. The model lives on your disk.
- Control — Swap models, tune prompts, build private assistants that know your work. You're the platform.
The tradeoff: local models are slower than frontier models and not as smart. But for a lot of work — summarizing, drafting, Q&A over your notes — they're plenty.
Three tools dominate the local-LLM space.
- GPT4All — for beginners & non-technical users. A friendly desktop app that looks like a chat app and hides the complexity. Great first step if you've never done this before. → gpt4all.io
- LM Studio — for model browsers & explorers. A feature-rich desktop GUI with a built-in Hugging Face browser and every knob exposed. → lmstudio.ai
- Ollama — for developers & scripters. A tiny command-line tool plus a minimal Mac app, with a dead-simple API that speaks OpenAI. → ollama.com
All free. All run GGUF models. All can share each other's models.
Choose by your device.
- Apple Silicon (M1–M4): Any of the three. LM Studio edges ahead with MLX support — roughly 30% faster on Apple chips.
- Intel Mac / older laptop: Reach for GPT4All or Ollama. LM Studio works, but its MLX advantage disappears.
- Windows / Linux: All three work. Ollama leads for headless server use, LM Studio for desktop.
| RAM | Pick |
|---|---|
| 8 GB | 3B models — Phi, Gemma 2B/3B |
| 16 GB | 7–8B models — Llama 3.1 8B, Qwen 7B, Mistral 7B (sweet spot) |
| 32 GB | 13B comfortable; 70B quantized if patient |
| 64 GB+ | 70B models at good quality |
More parameters = smarter but slower. You don't need a 70B model to summarize your notes.
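If you want to sanity-check the table, a back-of-the-envelope estimate gets you close: the weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus a gigabyte or two for the runtime and KV cache. A rough Python sketch (the overhead figure is an assumption, not a measurement):
# Rule-of-thumb RAM estimate for a quantized model (approximate, not a vendor spec).
def approx_ram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # billions of params x bytes per weight
    return weights_gb + overhead_gb

for name, params in [("Gemma 2B", 2), ("Llama 3.1 8B", 8), ("13B", 13), ("70B", 70)]:
    print(f"{name:>13}: ~{approx_ram_gb(params, 4):.1f} GB at 4-bit, "
          f"~{approx_ram_gb(params, 8):.1f} GB at 8-bit")
An 8B model at 4-bit quantization lands around 5.5 GB, which is why it sits comfortably on a 16 GB machine with room left for the OS.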
Deep dive: What about phones? — On-device LLMs on iOS & Android
GPT4All — the friendliest front door.
The tool I'd hand to a non-technical friend. Install, pick a model from a short curated list, start chatting. No terminal. No decision paralysis.
- Looks like a normal chat app — no CLI, no config files
- Built-in LocalDocs: drop a folder, it becomes searchable context
- Curated model list — no decision paralysis
- Cross-platform, free, open source
- Familiar thread-style chat UX, like ChatGPT
LocalDocs is the killer feature. Private RAG over your own files with zero setup. You can always graduate to LM Studio or Ollama later.
Deep dive: LocalDocs — setup, RAG diagram, tips
LM Studio — everything surfaced.
Where GPT4All hides complexity, LM Studio surfaces it. Every model parameter is a visible knob.
- Built-in model discovery — browse Hugging Face without leaving the app
- MLX backend on Apple Silicon (noticeably faster)
- Per-chat system prompt, temperature, context length, quantization
- Local server mode with OpenAI-compatible API
- lms CLI for scripting workflows
The tradeoff: more surface area means more to learn. If you're brand new, it can feel overwhelming.
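The local server mode in the list above is what lets your own code use LM Studio. Start the server from the app and it listens on port 1234 by default, speaking the same OpenAI dialect as the cloud. A minimal sketch, assuming a model is already loaded (the model name below is illustrative; copy the exact identifier the server page shows):
from openai import OpenAI

# LM Studio's local server defaults to port 1234; the key is ignored but must be non-empty.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # illustrative; use whatever the server page lists
    messages=[{"role": "user", "content": "Give me three good uses for a local LLM."}],
)
print(response.choices[0].message.content)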
Ollama — the foundation layer.
# one-time install
$ brew install ollama
# pull a model
$ ollama pull llama3.1:8b
# run it
$ ollama run llama3.1:8b
>>> Write a haiku about Monday morning.
Alarm clock lying,
coffee not strong enough yet —
the week stares me down.
>>> /? # list slash commands
>>> /bye # exit
- Runs as a background daemon — no window to babysit
- Exposes an OpenAI-compatible API at localhost:11434
- Any app that speaks ChatGPT can instead speak to your local Ollama
The real magic isn't the CLI — it's that every other chat app in the ecosystem can talk to the same Ollama instance: Open WebUI, Chatbox, Msty, Page Assist, Obsidian plugins, your own scripts.
Deep dive: API cookbook — curl, Python, streaming, tools
Bake personas into the model itself.
The Ollama Mac app doesn't have a system-prompt field. At first that seems limiting — until you realize you can bake the personality straight into a model file and it becomes portable across every client.
FROM gemma3:4b
SYSTEM """You are Steve Jobs. Direct.
Simplicity over features. Ask piercing
questions. Demand insanely great."""
$ ollama create steve-jobs -f Modelfile
$ ollama run steve-jobs
Flow: Modelfile → ollama create → shows in every client's model dropdown → portable persona.
Deep dive: Persona gallery — 5 ready-to-copy Modelfiles + syntax ref
Obsidian: your notes become the AI's memory.
Local LLMs get exponentially more useful when they can reach into your actual knowledge — not just their training data. If you keep notes in Obsidian, three plugins turn your vault into live context.
- Plugin 01 — Copilot for Obsidian: Chat sidebar inside your vault. Points at your local Ollama or LM Studio — no cloud calls.
- Plugin 02 — Smart Connections: Embeds every note into a local vector DB. The LLM retrieves the most relevant pages before answering.
- Protocol 03 — Ollama MCP: Your vault becomes a live context source any MCP-aware client can query on demand.
Your notes are the brain. The LLM is the mouth.
Deep dive: Setup walkthrough — plugins, CORS fix, starter prompts
Where to begin.
- Never tried local AI? → GPT4All
- Apple Silicon, want it all? → LM Studio
- Developer who wants flexibility? → Ollama
You can install all three. They share models through the GGUF format — your downloads aren't locked to any one tool.
- gpt4all.io — APP
- lmstudio.ai — GUI
- ollama.com — CLI
- huggingface.co/models — MODELS
Now go make your laptop a little smarter.
LocalDocs — private RAG, zero setup.
Point GPT4All at a folder. It reads, indexes, and cites your documents when it answers — all on your machine, no cloud, no account.
LocalDocs turns a folder on your disk into retrieval context for whatever model you're chatting with. Ask "what's in my meeting notes from last Tuesday?" and it will find the relevant passages, hand them to the LLM, and ground the answer in your actual documents. The pattern is called retrieval-augmented generation (RAG) — it's how every "chat with your PDFs" product works under the hood.
Documents → Chunks → Embeddings → Retrieve → Answer
- Documents — PDFs, notes, text
- Chunks — split into passages
- Embeddings — vectors per chunk
- Retrieve — match your query
- Answer — LLM + context
Your question pulls the three-to-five most relevant chunks from your vault. The LLM only sees those — plus the question.
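To make the flow concrete, here is a minimal sketch of the same pipeline in plain Python. GPT4All does all of this for you behind the LocalDocs button; the sketch leans on the OpenAI-compatible endpoint Ollama serves locally (covered in the API cookbook), and the documents, chunk size, and model names are illustrative assumptions:
from openai import OpenAI

# Talk to the local Ollama daemon through its OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Documents: tiny stand-ins for real files on disk.
docs = {
    "meeting-notes.md": "Tuesday sync: we agreed to ship the beta on the 14th.",
    "roadmap.md": "Q3 priorities: onboarding flow, billing rewrite, docs refresh.",
}

# Chunks: naive fixed-size split into passages.
chunks = [(name, text[i:i + 500])
          for name, text in docs.items()
          for i in range(0, len(text), 500)]

# Embeddings: one vector per chunk, computed locally.
def embed(text: str) -> list[float]:
    return client.embeddings.create(model="nomic-embed-text", input=text).data[0].embedding

vectors = [embed(chunk) for _, chunk in chunks]

# Retrieve: cosine similarity between the question and every chunk.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

question = "What did we decide in Tuesday's meeting?"
q_vec = embed(question)
top = sorted(zip(chunks, vectors), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)[:3]

# Answer: the LLM sees only the question plus the retrieved passages.
context = "\n\n".join(f"[{name}]\n{chunk}" for (name, chunk), _ in top)
answer = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)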
Step 01 — Install GPT4All and download a chat model. Grab the installer from gpt4all.io. Open the app, go to Models, and pick a model that fits your RAM — Llama 3.1 8B Instruct is a solid default for 16 GB machines.
Step 02 — Create a LocalDocs collection. Click LocalDocs in the sidebar, then + Add Collection. Give it a name (e.g. research, company-wiki) and point it at a folder. GPT4All will watch that folder and reindex when files change. Good folders to try: your Obsidian vault, a folder of PDFs you've been meaning to read, a dump of exported Notion pages, or a project-specific docs directory.
Step 03 — Pick an embedding model. The first time you add a collection, GPT4All will download a small embedding model (Nomic's embedder, ~137 MB). This runs locally too — no API calls. Once it finishes indexing, you'll see a chunk count next to the collection name.
Step 04 — Chat with it enabled. Open a new chat, toggle your collection on in the top bar, and ask a question. GPT4All will cite the source files it used at the bottom of each answer. Click a citation to open the source passage.
Works well:
- Markdown and text notes — clean structure, easy to chunk
- PDFs with real text (not scanned images)
- Medium-sized collections — hundreds to low-thousands of documents
- Focused questions with specific keywords
- Stable content that doesn't change daily
Struggles with:
- Huge codebases — structure-aware tools do better
- Scanned PDFs without OCR — nothing to embed
- Vague questions like "what's important?"
- Cross-document reasoning that needs five sources synthesized
- Fast-changing content — reindexing has cost
- Start narrow. One focused collection (e.g. "Q4 planning docs") outperforms one giant "everything" collection — retrieval is easier when the haystack is smaller.
- Name collections like categories. You can toggle them per chat, so it helps to have legal, research, meeting-notes rather than one lump.
- Check the citations. If the model cites the wrong chunk, the issue is almost always retrieval, not the model — try rephrasing your question with more specific terms.
- Chunk size matters. GPT4All's defaults are reasonable, but if answers feel fragmented, you may want larger chunks. Fragmented retrieval = fragmented answers.
Privacy note: LocalDocs never sends your documents to any server. The embedder runs locally, the LLM runs locally, the vector store lives on your disk. If you need to prove that to a security team, point them at the open-source code on GitHub.
Your vault, made chat-able.
A step-by-step for wiring Obsidian Copilot to a local Ollama, with the gotchas that will otherwise cost you an afternoon.
Four pieces: Obsidian (your notes), Ollama (runs the model), a chat model (answers questions), and an embedding model (makes your notes searchable by meaning, not just keywords). The glue is the Copilot plugin by Logan Yang — it supports cloud models too, but we're going fully local.
Step 01 — Install Ollama and pull two models. One model for chat, one for embeddings. nomic-embed-text is a small, fast embedder that pairs well with most chat models.
# chat model
$ ollama pull llama3.1:8b
# embedding model for RAG
$ ollama pull nomic-embed-text
Step 02 — Fix the CORS trap (the afternoon-saving step). Obsidian calls Ollama from its renderer process, and Ollama will refuse the call unless you whitelist the Obsidian origin. This is the single most common reason "it just won't connect." Quit the Ollama app, then in a terminal:
# macOS / Linux
$ OLLAMA_ORIGINS=app://obsidian.md* ollama serve
Windows (PowerShell): $env:OLLAMA_ORIGINS="app://obsidian.md*"; ollama serve
Make this permanent on macOS: launchctl setenv OLLAMA_ORIGINS "app://obsidian.md*", then restart the Ollama app. You won't have to run ollama serve manually anymore.
Step 03 — Install the Copilot plugin. In Obsidian: Settings → Community plugins → turn off restricted mode if it's on → Browse → search Copilot → install and enable the one by Logan Yang.
Step 04 — Wire up the chat model. Open Copilot settings. Under Add Custom Model:
- Model name: llama3.1:8b — must match exactly what ollama list shows
- Provider: Ollama
- Base URL: http://localhost:11434 (default, usually auto-filled)
- CORS: enable it in the model form
Click Verify. Green check means you're in. Set it as your default chat model at the top of settings.
Step 05 — Wire up the embedding model. In Copilot's QA settings, set:
- Embedding model: nomic-embed-text
- Provider: Ollama
Then click Index Vault (or similar). First-time indexing takes a few minutes for large vaults. Subsequent reindexes are incremental.
Starter prompts to try:
- Summarize: "Summarize my notes from the last 30 days about [topic]."
- Retrieve: "What have I written about [person / project / concept]?"
- Connect: "Find connections between [note A] and [note B]."
- Extract: "Pull out every action item I've written this week."
- Teach: "Based on my notes, what don't I understand yet about [topic]?"
If Copilot feels like too much apparatus, Smart Connections is a simpler plugin. It embeds your notes into a local vector DB and shows the most similar notes in a sidebar as you write — no chat, just discovery. Pair it with Copilot for chat, or use it alone for serendipity.
It won't connect:
- CORS: re-check OLLAMA_ORIGINS is set and Ollama was restarted
- Port conflict: is anything else on 11434?
- Firewall: try curl localhost:11434 — should return "Ollama is running"
Answers are thin:
- Context window too short — try num_ctx 32768 in the model settings
- Wrong embedding model — small vaults do fine with default; large vaults benefit from bigger embedders
- Chat model too small — a 3B model will be weak on synthesis; try 8B+
Tip on context: Copilot respects the num_ctx you set on the Ollama model. If you're getting truncated answers on long notes, run ollama run llama3.1:8b, then /set parameter num_ctx 32768, then /save llama3.1:8b to bake it in.
The Ollama daemon speaks two protocols: its own native API and an OpenAI-compatible one. Pick the second and most of your existing code works unchanged.
Ollama runs a local HTTP server at http://localhost:11434. Any client that can make an HTTP request can use it. If that client already speaks the OpenAI API, point its base_url at http://localhost:11434/v1 and the API key can be any non-empty string.
$ curl http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "why is the sky blue?"}],
"stream": false
}'
If you already use openai, change two lines and your whole codebase runs locally.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required, but unused
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "system", "content": "You are a terse assistant."},
{"role": "user", "content": "Summarize the French Revolution in 3 bullets."},
],
)
print(response.choices[0].message.content)
Streaming works the same way; just iterate over the chunks:
stream = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "tell me a short story"}],
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True)
Recent Ollama versions and recent models (Llama 3.1, Qwen 2.5, Mistral) support structured tool calls. Same OpenAI-compatible shape:
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}]
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "weather in Paris?"}],
tools=tools,
)
Heads up: smaller models are unreliable at tool-calling. Stick to 7B+ and check response.choices[0].message.tool_calls — it may be None if the model decided not to call the tool.
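If the model does decide to call the tool, you run the function yourself and hand the result back as a tool-role message so the model can phrase the final answer. A minimal sketch of that round trip, continuing from the snippet above (the get_weather body is a stand-in, and it assumes your Ollama version accepts tool-role messages):
import json

def get_weather(city: str) -> str:
    # Stand-in implementation; swap in a real lookup.
    return f"Sunny, 22°C in {city}"

msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)  # e.g. {"city": "Paris"}
    result = get_weather(**args)

    # Send the tool result back; the model turns it into a natural-language answer.
    followup = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "user", "content": "weather in Paris?"},
            msg,  # the assistant turn containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
    )
    print(followup.choices[0].message.content)
else:
    print(msg.content)  # the model answered directly instead of calling the tool
Embeddings go through the same client: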
response = client.embeddings.create(
model="nomic-embed-text",
input="the quick brown fox",
)
vector = response.data[0].embedding  # list[float], 768 dims
Want a chat UI in the browser on top of the same daemon? Open WebUI runs in Docker and talks to Ollama out of the box:
$ docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui --restart always \
ghcr.io/open-webui/open-webui:main
Visit http://localhost:3000. Create an account (stored locally), and it auto-discovers your Ollama models.
- "Connection refused" — Ollama isn't running. Start it:
ollama serve(or open the Mac app). - "Model not found" — you didn't pull it yet.
ollama pull <model>. - "CORS error" from a browser app — set
OLLAMA_ORIGINSto your app's origin before starting Ollama. - Answers get cut off — context too short. Save a version of the model with a bigger
num_ctx.
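For that last one, baking a bigger num_ctx in with /save is the set-and-forget fix, but the native /api/chat endpoint also accepts per-request options when you only need the larger window occasionally. A minimal sketch with the requests library (32768 is an assumption; pick what your RAM tolerates):
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Summarize the following notes."}],
        "options": {"num_ctx": 32768},  # per-request context window
        "stream": False,
    },
    timeout=300,  # long prompts on a laptop can take a while
)
resp.raise_for_status()
print(resp.json()["message"]["content"])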
Five ready-to-use personas plus the syntax reference you'll keep coming back to. Copy, customize, share.
A Modelfile is a plain text file — no extension required.
- FROM <model> — the base model to build on (required)
- SYSTEM """...""" — the system prompt, baked in
- PARAMETER temperature 0.7 — 0.0 is deterministic, 1.0+ is chaotic
- PARAMETER num_ctx 32768 — context window size
- PARAMETER top_p 0.9 — nucleus sampling cutoff
- TEMPLATE """...""" — chat template override (rarely needed)
- MESSAGE user "...", MESSAGE assistant "..." — few-shot examples baked into context
Build any persona with:
$ ollama create <name> -f Modelfile
$ ollama run <name>
Direct. Simplicity over features. Good for product critiques, pitch-deck review, and "should we even build this?" questions. Will tell you your idea is junk.
FROM llama3.1:8b
SYSTEM """You are Steve Jobs. You care about one thing: insanely great products.
Rules:
- Simplicity over features. Subtract before you add.
- Ask piercing questions about who this is for and why it matters.
- Demand taste. Call out mediocrity.
- Short, declarative sentences. No hedging.
- When something is bad, say so."""
PARAMETER temperature 0.7
Taste as a discipline. Good for creative work — writing, music, visual art. Asks what you're really trying to say. Slow, patient, unafraid of silence.
FROM llama3.1:8b
SYSTEM """You are Rick Rubin, the record producer. Your job is to help the artist find what they are really trying to say.
Rules:
- Ask more than you answer. Questions before opinions.
- Point at the feeling, not the technique.
- Name what is fake without meanness.
- Celebrate what is true.
- Short, calm sentences. No filler."""
PARAMETER temperature 0.8
Show your work. Good for fact-checking, argument-sharpening, and steelmanning. Will refuse to speculate without flagging it.
FROM llama3.1:8b
SYSTEM """You are a skeptical research assistant.
Rules:
- Distinguish clearly: what is established, what is contested, what is speculation.
- Flag confidence levels: high, medium, low, unknown.
- When you do not know, say so clearly. Do not invent citations.
- Offer the strongest counterargument before giving your own view.
- Prefer primary sources. Name them when you can."""
PARAMETER temperature 0.3
No preamble, no praise. Good for real PR review. Will not say "great start!" It reads the code and ships a list.
FROM qwen2.5-coder:7b
SYSTEM """You are a code reviewer. Assume the author is senior and short on time.
Rules:
- No preamble, no praise, no summaries.
- Output a terse bulleted list of issues, grouped: Correctness, Performance, Readability, Security.
- For each issue: file:line if known, then one sentence, then suggested fix.
- If the code is fine, say LGTM and stop.
- Never invent problems to look useful."""
PARAMETER temperature 0.2
Outputs diagrams, never prose. Good for system-design chats and flowcharts. Describe what you want; it returns a valid Mermaid diagram block and nothing else.
FROM llama3.1:8b
SYSTEM """You output Mermaid diagrams, nothing else.
Rules:
- Every response is a valid Mermaid code block: three backticks, then mermaid, then the diagram, then three backticks.
- No explanatory prose before or after.
- Pick the best Mermaid type (flowchart, sequenceDiagram, stateDiagram, erDiagram, classDiagram) for the request.
- Prefer clear node labels over clever ones.
- If the request is ambiguous, make a reasonable choice and diagram that."""
PARAMETER temperature 0.4
A Modelfile is just a text file — drop it in a gist, email it, commit it to a repo. The person on the other end runs ollama create with it and gets the exact same persona. The most lightweight way to share AI tools that exists: no API keys, no accounts, no SaaS dependency. You can also publish built models with ollama push yourname/model-name.
Temperature cheat sheet: 0.0–0.3 for code, extraction, and facts. 0.5–0.7 for general chat. 0.8–1.2 for creative writing and brainstorming. Above 1.5 and the model starts losing coherence.
AI in your pocket. On-device LLMs aren't a laptop-only story. Modern phones run small, capable models without a network.
Phones added neural processors years before they had anything to do. Now they do: Gemma 4, Llama 3.2, Phi-3 Mini, and Qwen 2.5 all ship variants designed specifically for mobile hardware. The pitch is the same as on a laptop — privacy, offline, zero per-query cost — with one sharper edge: your phone is always with you. A plane, a subway, a remote trail, a hospital waiting room. The cloud is intermittent there. Your device is not.
Realistic expectations. Mobile LLMs are small — usually 1B to 4B parameters. Great at summarization, short Q&A, transcription, image description, and structured tasks. Not a replacement for Claude or GPT on complex reasoning. A competent pocket assistant, not a research companion.
App 01 · Google — Google AI Edge Gallery. The reference implementation from the folks making Gemma. Google's open-source showcase app for on-device AI. Download Gemma 4 E2B or E4B right inside the app. Fully offline after the first download, no account required. Feels most like a real product while being fully transparent — source on GitHub.
- AI Chat with Thinking Mode — watch the model's reasoning step by step
- Ask Image — take a photo, ask questions about it (multimodal)
- Audio Scribe — on-device transcription and translation
- Prompt Lab — tune temperature, top-k, test single-turn prompts
- Agent Skills — tool-using agents that can hit Wikipedia, show maps, generate QR codes
- Benchmarks — see tokens/sec on your specific device
Platform: iOS 17+ / Android 12+ · Free · Open source
App 02 · Community — PocketPal AI. Any GGUF model, right on your phone. Free, open-source, cross-platform. Built on llama.cpp, so anything you'd run on a laptop with Ollama can run in PocketPal — pick from its catalog or sideload a GGUF you already have. The power-user flavor.
- Enable Metal acceleration on iOS (shift work from CPU to GPU) for a dramatic speed jump
- Turn on Flash Attention on capable devices for a further speed boost
- Make "Pals" — system-prompt personas, exactly like Ollama Modelfiles but in your pocket
- Import your own GGUF from Hugging Face
Platform: iOS / Android · Free · Open source
App 03 · Apple — Apollo AI. Open-source iOS chat, community-driven. iOS-only, llama.cpp under the hood. More technical than its polish-forward competitors — the "let me tinker" app. Fine control over model parameters on iPhone or iPad.
Platform: iOS / iPadOS · Free · Open source
App 04 · Apple — Haplo AI. Polished, plug-and-play, across your Apple devices. Slick interface, no account required, runs across iPhone, iPad, and Mac with the same library. Good pick if you want local AI that feels like a real product.
Platform: iOS / iPadOS / macOS · Free tier + paid · Closed source
Works well on mobile:
- Short Q&A, text rewriting, grammar fixes
- Summarizing a single article or email
- Image description and OCR (with multimodal models like Gemma 3n/4)
- Voice transcription (Audio Scribe, on-device Whisper variants)
- Brainstorming short lists, titles, captions
Works poorly on mobile:
- Long documents — context windows are small (often 512–2k tokens)
- Heavy reasoning — the model is 2B, not 200B
- Coding tasks beyond snippets
- Sustained chat — battery drain and heat become noticeable
- Accurate factual recall on obscure topics
Make AI Edge Gallery your first install (smooth, reliable, multimodal), then add PocketPal if you want to run your own models or experiment more. Use them for the narrow set of things mobile LLMs are genuinely good at — transcription, photo Q&A, quick rewrites — and let your laptop handle everything else.
Battery warning. LLM inference is compute-heavy. Expect noticeable battery drain and phone warmth during long sessions. If your phone gets hot, pause — the OS will throttle, and you'll stop getting good tokens/sec anyway.