Why Run AI Locally
Every prompt you send to ChatGPT, Claude, or any cloud API leaves your machine, crosses the internet, and lands on someone else's server. Your code snippets, personal notes, financial data, medical questions — all of it can be logged, stored, and potentially used for training. With OpenClaw connected to Ollama running on your own Mac, none of that happens. Your data stays on your SSD. No telemetry, no logging, no third-party access. For developers handling client code, lawyers reviewing sensitive documents, or anyone who values digital privacy, local inference isn't a luxury — it's the only responsible choice.
Beyond privacy, the economics are compelling. Cloud API costs add up fast: a heavy GPT-4o user easily spends $50–200/month on tokens. A 14B parameter model running locally on a Mac Mini handles roughly 80% of everyday tasks — drafting emails, summarizing documents, generating code, answering questions — for exactly $0 in API fees. The Mac pays for itself within months. And because inference happens on-device, you get zero network latency. Responses start streaming the instant you hit enter, with no round-trip to a data center thousands of miles away.
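The break-even math is easy to sanity-check. A quick sketch (the $100/month figure is an assumed mid-range cloud spend, not a measurement):

```shell
# Months until the hardware pays for itself, given an assumed monthly API spend.
# Rounds up: a partial month still counts as a month of spend avoided.
breakeven_months() {
  awk -v price="$1" -v monthly="$2" 'BEGIN { printf "%d\n", (price + monthly - 1) / monthly }'
}
breakeven_months 599 100    # entry Mac Mini M4: 6 months
breakeven_months 1399 100   # Mac Mini M4 Pro: 14 months
```

At a heavier $200/month cloud spend, even the Pro model breaks even inside a year.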
Local AI also means offline capability. On a plane, in a remote cabin, or during an internet outage, your AI agent keeps working. OpenClaw paired with Ollama gives you a fully autonomous assistant that doesn't depend on any external service. Pull your models once, and you're set. This guide walks you through the complete setup: hardware selection, Ollama installation, model choices by RAM tier, OpenClaw configuration, and ClashX hybrid routing for when you do need cloud APIs alongside local inference.
Hardware Guide: Which Mac to Buy
Apple Silicon's unified memory architecture makes Macs uniquely suited for local LLM inference. Unlike discrete GPUs with limited VRAM, the CPU and GPU share the same memory pool — so a 24GB Mac can load a 14B model entirely into GPU-accessible memory without any offloading penalty.
| Tier | Mac | RAM | Best Models | Speed | Price |
|---|---|---|---|---|---|
| Entry | Mac Mini M4 | 16 GB | Qwen 3.5 7B, Llama 3.2 | 30–45 t/s | $599 |
| Sweet Spot | Mac Mini M4 Pro | 24 GB | Phi-4 14B, DeepSeek-R1 14B | 20–25 t/s | $1,399 |
| Power | Mac Mini M4 Pro | 48 GB | Qwen 2.5 32B, DeepSeek-R1 32B | 10–15 t/s | $1,999 |
24GB unified memory runs 14B models at 25 tokens/sec — the sweet spot for always-on local AI agents that handle real work.
View Mac Mini M4 Pro on Amazon →
Why does unified memory matter so much for LLMs? Traditional PCs split memory between system RAM and GPU VRAM. A model that needs 10GB of VRAM requires a discrete GPU with at least that much — and consumer GPUs top out at 16–24GB for $800+. On Apple Silicon, the entire memory pool is accessible to the GPU at full bandwidth. A Mac Mini M4 Pro with 24GB of unified memory can load a Q4-quantized 14B model entirely into GPU-accessible space, running inference at 200 GB/s memory bandwidth. No other machine at this price point offers that combination of capacity and throughput for LLM workloads.
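As a rule of thumb, a quantized model needs params × bits-per-weight / 8 bytes, plus runtime overhead. A rough sketch (the 1.2× overhead factor is an assumption covering KV cache and runtime buffers, not a measured constant):

```shell
# Approximate memory footprint of a quantized model:
# params (billions) * bits-per-weight / 8 bytes, times an assumed ~20%
# overhead for KV cache and runtime buffers (grows with context length).
model_mem_gb() {
  awk -v p="$1" -v bits="${2:-4}" 'BEGIN { printf "%.1f\n", p * bits / 8 * 1.2 }'
}
model_mem_gb 14     # Q4 14B: ~8.4 GB — fits comfortably in 24 GB unified memory
model_mem_gb 32     # Q4 32B: ~19.2 GB — wants the 48 GB tier once context grows
```

This is why the 24GB tier is the sweet spot: a Q4 14B model plus macOS and your apps fit with headroom to spare.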
Install Ollama on macOS
Ollama is the simplest way to run open-source LLMs locally. One command to install, one command to pull a model, one command to run it.
1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
2. Verify Installation
ollama --version
# Expected output: ollama version is 0.18.3
3. Pull Your First Model
ollama pull qwen3.5
4. Test It
ollama run qwen3.5 "Hello"
5. Configure Ollama for OpenClaw
Ollama's default settings are conservative. For use with OpenClaw as an always-on agent backend, apply these optimizations:
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_CONTEXT_LENGTH 8192
launchctl setenv OLLAMA_NUM_PARALLEL 4
launchctl setenv OLLAMA_KEEP_ALIVE "30m"
OLLAMA_HOST "0.0.0.0:11434" — Listens on all network interfaces, not just localhost. Required if OpenClaw connects from a container or another device on your network.
OLLAMA_CONTEXT_LENGTH 8192 — Increases the default context window from 2048 to 8192 tokens. Lets the model see more of the conversation history, critical for agent tasks that reference earlier instructions.
OLLAMA_NUM_PARALLEL 4 — Allows 4 simultaneous inference requests. OpenClaw often fires multiple queries in parallel (e.g., summarize + classify + respond). Without this, requests queue up.
OLLAMA_KEEP_ALIVE "30m" — Keeps the model loaded in memory for 30 minutes after the last request. Prevents the cold-start delay (5–10 seconds) when OpenClaw sends a new task after a brief pause.
After setting these variables, restart Ollama for the changes to take effect. If you installed Ollama via the macOS app, quit and relaunch it. If running as a service, restart with brew services restart ollama.
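To confirm the overrides took effect, you can query them back. A small sketch — launchctl getenv is macOS-only, so it falls back to the plain process environment elsewhere:

```shell
# Print the four Ollama tuning variables as launchd (or the shell) sees them.
check_ollama_env() {
  for var in OLLAMA_HOST OLLAMA_CONTEXT_LENGTH OLLAMA_NUM_PARALLEL OLLAMA_KEEP_ALIVE; do
    # launchctl getenv is macOS-only; fall back to the process environment
    val="$(launchctl getenv "$var" 2>/dev/null || printenv "$var" || true)"
    printf '%s=%s\n' "$var" "$val"
  done
}
check_ollama_env
```

An empty value after a restart means the setenv didn't stick — re-run the launchctl commands and relaunch Ollama.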
Best Models by RAM Tier
Not all models are created equal. Here's what actually runs well on each RAM tier, tested on M4 Pro hardware with Q4_K_M quantization:
| RAM | Model | Speed (M4 Pro) | Best For |
|---|---|---|---|
| 16 GB | Qwen 3.5 7B | 45 t/s | Daily chat, quick tasks |
| 16 GB | Llama 3.2 7B | 38 t/s | General purpose |
| 24 GB | Phi-4 14B | 25 t/s | Complex reasoning |
| 24 GB | DeepSeek-R1 14B | 22 t/s | Math, logic |
| 24 GB | Qwen 2.5-coder 14B | 24 t/s | Code generation |
| 48 GB | Qwen 2.5 32B | 12 t/s | Near GPT-4 quality |
| 48 GB | DeepSeek-R1 32B | 10 t/s | Advanced reasoning |
Our practical recommendation: start with Qwen 3.5 7B for daily use. It's the fastest model in the table and handles 80% of common tasks — answering questions, drafting text, light coding, summarization — with surprisingly good quality for its size. When you hit a task that needs deeper reasoning (complex code refactoring, multi-step logic problems, nuanced writing), switch to Phi-4 14B or DeepSeek-R1 14B on a 24GB machine. You can keep multiple models pulled and switch between them instantly with OpenClaw's model routing.
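Switching models per task can be as simple as a small dispatch helper. A sketch — the pick_model function and the exact tag strings are illustrative, so check `ollama list` for the tags you actually pulled:

```shell
# Map a task type to a model tag from the table above (tags are illustrative).
pick_model() {
  case "$1" in
    code)      echo "qwen2.5-coder:14b" ;;   # code generation
    reasoning) echo "deepseek-r1:14b"   ;;   # math, multi-step logic
    *)         echo "qwen3.5"           ;;   # fast default for daily tasks
  esac
}
pick_model code        # prints qwen2.5-coder:14b
# Usage: ollama run "$(pick_model reasoning)" "Walk through this proof step by step"
```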
Install and Configure OpenClaw
With Ollama running and models pulled, it's time to install OpenClaw — the AI agent framework that turns your local LLM into an autonomous assistant.
1. Install OpenClaw
curl -fsSL https://openclaw.ai/install.sh | bash
2. Run Onboarding
openclaw onboard --install-daemon
The onboarding wizard walks you through initial setup. When prompted for an LLM provider, select Ollama. OpenClaw auto-discovers all models you've pulled locally — no manual endpoint configuration needed.
3. Set Environment Variable
export OLLAMA_API_KEY="ollama-local"
Ollama doesn't require an API key, but OpenClaw expects one to be set. Use any placeholder string.
4. Start the Gateway
openclaw gateway --port 18789 --verbose
The gateway is OpenClaw's central router — it receives requests and dispatches them to the appropriate model (local or cloud). Port 18789 is the default; --verbose shows real-time request logs, useful for debugging.
5. Open the Dashboard
openclaw dashboard
This launches the web-based control panel in your browser, where you can monitor active agents, view logs, switch models, and configure routing rules.
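Before opening the dashboard, it's worth confirming both services are actually listening. A bash-only sketch using the built-in /dev/tcp pseudo-device (the port numbers are the defaults used above):

```shell
# Probe a local TCP port using bash's built-in /dev/tcp (no nc or lsof needed).
check_port() {
  if (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null; then
    echo "port $1: listening"
  else
    echo "port $1: not listening"
  fi
}
check_port 11434   # Ollama API
check_port 18789   # OpenClaw gateway
```

If Ollama isn't listening, relaunch the app; if the gateway isn't, re-run the openclaw gateway command.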
macOS-Specific Features
OpenClaw on macOS includes several platform-exclusive features. The menu bar app gives you quick access to agent status, model switching, and gateway controls without opening Terminal. Voice Wake lets you activate the agent with a hotword — say "Hey Claw" and start dictating a task. Talk Mode overlay provides a translucent floating window for voice conversations with your agent, similar to the Apple Intelligence Siri overlay but connected to your local models.
ClashX Hybrid Routing: Local + Cloud
Most users will want a hybrid setup: local Ollama for everyday tasks, cloud APIs (OpenAI, Anthropic) for tasks that demand frontier-model quality. The key is routing — local traffic should stay on your machine (DIRECT), while cloud API calls go through your proxy for privacy and geo-optimization.
Here's a minimal ClashX rule configuration for hybrid OpenClaw routing:
rules:
- IP-CIDR,127.0.0.0/8,DIRECT
- IP-CIDR,192.168.0.0/16,DIRECT
- DOMAIN-SUFFIX,openai.com,🤖 AI Agent
- DOMAIN-SUFFIX,anthropic.com,🤖 AI Agent
- DOMAIN-SUFFIX,deepseek.com,DIRECT
  - MATCH,DIRECT
The first two rules ensure all local traffic (Ollama on 127.0.0.1:11434, OpenClaw gateway on 127.0.0.1:18789, and any LAN devices) bypasses the proxy entirely. OpenAI and Anthropic API calls route through your 🤖 AI Agent proxy group for privacy and optimal routing. DeepSeek is globally accessible without restrictions, so it goes direct to minimize latency. Everything else defaults to DIRECT.
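Clash evaluates rules top-down and takes the first match. The decision logic of those five rules can be sketched as a simulation (for illustration only — this is not how ClashX itself is configured, and the proxy-group emoji is dropped for brevity):

```shell
# Simulate Clash's top-down rule matching: first hit wins, MATCH is the fallback.
route_for() {
  case "$1" in
    127.*|192.168.*)                echo "DIRECT" ;;     # IP-CIDR: loopback + LAN
    openai.com|*.openai.com)        echo "AI Agent" ;;   # DOMAIN-SUFFIX
    anthropic.com|*.anthropic.com)  echo "AI Agent" ;;
    deepseek.com|*.deepseek.com)    echo "DIRECT" ;;
    *)                              echo "DIRECT" ;;     # MATCH fallback
  esac
}
route_for api.openai.com    # prints AI Agent
route_for 127.0.0.1         # prints DIRECT
```

Note that rule order matters: if MATCH came first, nothing below it would ever be evaluated.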
The rules above are a minimal starting point. For the full configuration including dedicated proxy groups, messaging platform routing, stability-optimized DNS, and 24/7 Mac Mini settings, see our OpenClaw + ClashX Proxy Routing Guide.
ClashX's basic version requires manually editing config.yaml for all rule changes. ClashFX provides a visual rule editor — add proxy groups and routing rules with clicks instead of code. If you're not comfortable with YAML, download ClashFX first.
VPS Alternative: No Mac Required
Not everyone has a Mac — or wants to keep one running 24/7. A cloud VPS is a viable alternative for running OpenClaw with Ollama as an always-on AI agent server. The tradeoff: no Apple Silicon GPU acceleration, so you're limited to CPU inference (significantly slower), but the server runs independently with guaranteed uptime.
For a budget-friendly option, Contabo Cloud VPS S (4 vCPU, 8GB RAM, $6.99/mo) can run 7B models at usable speeds via CPU inference. The setup is nearly identical to macOS — install Ubuntu 22.04, then run the same Ollama and OpenClaw commands:
# On your VPS (Ubuntu 22.04)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3.5
curl -fsSL https://openclaw.ai/install.sh | bash
openclaw onboard --install-daemon
Run OpenClaw 24/7 on a VPS instead of keeping your Mac on. High-performance Cloud VPS from $6.99/mo with 11 global data centers.
View Contabo VPS Plans →
Disclosure: affiliate link. See our ad policy.
Important limitation: standard VPS instances lack GPU hardware, so all inference runs on CPU. Expect 3–8 tokens/second for a 7B model on a 4-core VPS — usable for background tasks and async agents, but noticeably slower than Apple Silicon's 30–45 t/s. For real-time conversational use, a Mac with Apple Silicon remains the better choice. GPU-equipped cloud instances (A10G, L4) exist but cost $0.50–1.50/hour, which defeats the cost-saving purpose of local inference.
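To see what those throughput numbers mean in practice, here's the wait time for a typical ~480-token reply (the speeds are the rough figures quoted above, not benchmarks):

```shell
# Seconds to generate n tokens at a given tokens-per-second rate.
tokens_eta() { awk -v n="$1" -v tps="$2" 'BEGIN { printf "%d s\n", n / tps }'; }
tokens_eta 480 6    # 80 s on a 4-core CPU VPS (mid-range of 3-8 t/s)
tokens_eta 480 40   # 12 s on Apple Silicon (mid-range of 30-45 t/s)
```

Over a minute of waiting per reply is fine for a background agent churning through a queue, but painful for interactive chat — which is exactly the split the paragraph above describes.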
FAQ
Q: Can OpenClaw work 100% offline?
A: Yes. When configured with local Ollama models only, OpenClaw runs entirely offline. No internet connection is needed after the initial model download. Your data never leaves your machine — all inference, storage, and processing happens locally on your Mac.
Q: How much RAM do I really need?
A: 16GB is the minimum for running 7B parameter models comfortably. 24GB is the recommended sweet spot — it lets you run 14B models (Phi-4, DeepSeek-R1) that offer dramatically better quality while still generating 20–25 tokens per second. 48GB is only necessary if you need 32B models approaching GPT-4 quality.
Q: Can I mix local and cloud models?
A: Yes. OpenClaw supports simultaneous routing to Ollama for local inference and OpenAI/Anthropic for cloud-powered tasks. You can even set rules — e.g., use local Qwen 3.5 for quick tasks and route complex reasoning to Claude via API. ClashX handles the network split: local Ollama traffic stays DIRECT while cloud API calls go through your proxy.
Q: Ollama vs LM Studio — which is better?
A: Performance is comparable on the same hardware and models. Ollama is CLI-first, lighter on resources, and designed for headless server operation — ideal for a Mac Mini running 24/7 without a display. LM Studio provides a polished GUI for model browsing and management. For always-on OpenClaw deployments, Ollama wins on simplicity and resource efficiency.
Q: How to keep Mac Mini running 24/7?
A: Three steps: run caffeinate -s -d & in Terminal to prevent sleep, disable all sleep options in System Settings > Energy, and set ClashX to launch at login via its Preferences panel. For remote management, enable Screen Sharing in System Settings > General > Sharing so you can access the Mac from anywhere.
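A backgrounded caffeinate dies when your Terminal session ends. A more durable sketch is a LaunchAgent that relaunches it automatically (the label and filename are illustrative; load it on macOS with launchctl):

```shell
# Keep caffeinate running across logins via a LaunchAgent (macOS).
mkdir -p "$HOME/Library/LaunchAgents"
PLIST="$HOME/Library/LaunchAgents/com.local.caffeinate.plist"
cat > "$PLIST" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.local.caffeinate</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/caffeinate</string>
    <string>-s</string>
    <string>-d</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
</dict>
</plist>
EOF
# Then, on macOS: launchctl load "$PLIST"
```

KeepAlive restarts caffeinate if it ever exits, so the Mac stays awake even after reboots and crashes.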