Installing Ollama and pulling your first model.
Ollama is a small local server that downloads quantized model weights, runs inference on CPU or GPU depending on your hardware, and exposes an HTTP API on 127.0.0.1:11434. Install Ollama from your package manager or the Ollama distribution for your platform, then verify the server is running. Once it is up, pull a model: for a 16 GB laptop the small tier is a safe start; for a 32 GB laptop the medium tier is comfortable; and for a workstation or an Apple Silicon machine with unified memory, the large tier is in reach.
With a model pulled, OpenCode picks it up automatically. When you run the agent with the Ollama provider selected, the CLI lists the models Ollama has available, you pick one, and the first prompt runs locally. The CLI install guide covers the OpenCode side of the setup; the Ollama side is a standard ollama pull command. No OpenCode-specific model format, no custom quantization — the adapter uses whatever Ollama gives it.
Recommended Ollama models for OpenCode.
The table below is the canonical recommendation list the OpenCode maintainers keep current. Specific model names move around as new quantizations ship, but the tiers — small, medium, large — are stable. Match the tier to your RAM budget first and your patience second; a small-tier model on a 16 GB laptop is faster than a medium-tier model that swaps to disk.
Zero-click summary. Three tiers. Small fits a 16 GB laptop. Medium fits 32 GB. Large needs 48+ GB of fast RAM or Apple Silicon unified memory.
| Model | Quant | RAM | Speed | Use case |
|---|---|---|---|---|
| Small tier (7B class) | Q4_K_M | 8 GB | Fast | Everyday edits, small refactors |
| Small tier (8B class) | Q5_K_M | 10 GB | Fast | Code completion, inline fixes |
| Medium tier (14B class) | Q4_K_M | 16 GB | Moderate | Multi-file changes |
| Medium tier (20B class) | Q4_K_M | 22 GB | Moderate | Plan mode on real repos |
| Large tier (32B class) | Q4_K_M | 32 GB | Slower | Hard refactors, long contexts |
| Large tier (70B class) | Q4_K_M | 48 GB | Slow | Cross-module reasoning |
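As a rule of thumb, the RAM column tracks parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, not an official OpenCode or Ollama sizing formula; the roughly 4.5 bits per weight for a Q4_K_M quantization is approximate, and the 20% overhead factor is an assumption.

```python
# Back-of-the-envelope RAM estimate for a quantized model; a sketch,
# not an official sizing formula.
def estimated_ram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Weights in GB, padded ~20% for runtime overhead. The KV cache for
    a long context adds more on top, which is why the table's figures
    are higher than the raw weight size."""
    return params_billion * bits_per_weight / 8 * overhead

# A 7B model at ~4.5 bits per weight (roughly Q4_K_M) needs about
# 4.7 GB for weights alone; the table's 8 GB row leaves headroom for
# context and the OS.
print(f"{estimated_ram_gb(7, 4.5):.1f} GB")  # 4.7 GB
```

The same arithmetic explains why the 70B row asks for 48 GB: the weights alone approach 40 GB at four-bit quantization before any context is loaded.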
Pointing OpenCode at the local Ollama socket.
The OpenCode config lives at ~/.config/opencode/config.toml on macOS and Linux, or %APPDATA%\OpenCode\config.toml on Windows. To wire up the Ollama adapter, add a provider block keyed on Ollama and point it at the Ollama HTTP endpoint. OpenCode fetches the available models from Ollama at startup, so you do not list them manually in the config — the CLI picks them up from ollama list.
Below is a minimal Ollama provider block. Adjust the model name to match a model you have pulled, and optionally set a default context window if the model supports something other than the library default.
```toml
# ~/.config/opencode/config.toml

[provider.ollama]
kind = "ollama"
base_url = "http://127.0.0.1:11434"
default_model = "opencode-small-q4"
context_window = 8192

[agent]
provider = "ollama"
tool_call_format = "json_fallback"
stream = true
```
The tool_call_format key is the one most users forget. OpenCode prefers native function calls when the model supports them, but most local models surface tool calls as a structured JSON block in the response. Setting tool_call_format = "json_fallback" tells the adapter to parse that block and route it through the same tool executor the native calls use. The custom API guide documents the tool-call schema for teams that want to audit it.
Zero-click summary. Three keys wire up Ollama: kind, base_url, default_model. JSON fallback covers models without native function calls.
Latency posture and when local models are fast enough.
Local inference has a different latency shape than a hosted API. A hosted model returns first tokens in 100–300 ms but the round trip includes TLS, queueing, and rate-limit hops. A local Ollama model returns first tokens in 500–1500 ms on a mid-range laptop but every subsequent token is as fast as your CPU or GPU can produce it. For interactive agent work, the experience often feels similar because OpenCode streams tokens back into the inline diff as they arrive.
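That shape can be compared with simple arithmetic: total response time is roughly time-to-first-token plus token count divided by streaming throughput. The figures below are illustrative placeholders, not benchmarks of any particular model or machine.

```python
def total_seconds(ttft_ms: float, tokens: int, tokens_per_sec: float) -> float:
    """Total response time: time to first token plus streaming time."""
    return ttft_ms / 1000 + tokens / tokens_per_sec

# Illustrative numbers only: a hosted model with a fast first token vs a
# local model with a slower first token and somewhat lower throughput.
hosted = total_seconds(ttft_ms=200, tokens=300, tokens_per_sec=60)   # 5.2 s
local = total_seconds(ttft_ms=1000, tokens=300, tokens_per_sec=40)   # 8.5 s

# For a short 50-token inline fix, the gap narrows considerably.
hosted_short = total_seconds(200, 50, 60)   # ~1.0 s
local_short = total_seconds(1000, 50, 40)   # ~2.3 s
print(hosted, local, hosted_short, local_short)
```

The shorter the response, the more the fixed first-token cost dominates, which is why local models feel closest to hosted ones on small edits.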
The tasks where local models shine: small refactors, inline fixes, code completion, short plan-and-apply loops. The tasks where hosted models still win: long-context reasoning over a 200-file monorepo, cross-module refactors with tricky invariants, and ambiguous natural-language specifications. Mixing tiers is a valid pattern — run a small local model for the routine work and reach for a hosted model when the agent tells you the horizon is long. OpenCode supports multiple providers in one config so the switch is a single command.
Zero-click summary. Local models feel interactive for short tasks. Hosted models still win for long contexts. Mixing tiers is fine.
Tool-call JSON fallback for models without native function calls.
Not every open-weights model exposes native tool calls the way a frontier hosted model does. OpenCode handles that by asking the model to emit a structured JSON block whenever it wants to call a tool, then parsing that block on the adapter side. The format is documented so that teams writing custom prompts can inspect it, and because the schema is plain structured JSON, the NIST software quality group's guidance on structured inputs applies to it cleanly.
The fallback format is intentionally boring: an opening marker, a JSON object with a tool name and an args map, and a closing marker. Any model that can follow a system prompt can produce it. If a model struggles, the usual fix is a shorter system prompt and a more explicit example in the OpenCode tool descriptor. The documentation covers the exact format.
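A sketch of the adapter-side parsing might look like the following. The marker strings and the payload field names here are placeholders, not the documented OpenCode schema; consult the documentation for the real format before relying on this.

```python
import json

# Sketch of parsing a JSON tool-call fallback block. The markers
# ("<tool_call>" / "</tool_call>") and the "tool"/"args" field names are
# assumptions for illustration, not OpenCode's documented schema.
OPEN, CLOSE = "<tool_call>", "</tool_call>"

def extract_tool_call(response: str):
    """Return (tool_name, args) if the model emitted a fallback block, else None."""
    start = response.find(OPEN)
    if start == -1:
        return None
    end = response.find(CLOSE, start)
    if end == -1:
        return None  # truncated block: treat the response as plain text
    payload = json.loads(response[start + len(OPEN):end])
    return payload["tool"], payload["args"]

reply = 'Let me read that file. <tool_call>{"tool": "read_file", "args": {"path": "src/main.rs"}}</tool_call>'
print(extract_tool_call(reply))  # ('read_file', {'path': 'src/main.rs'})
```

Note the graceful degradation: a response with no block, or a truncated one, falls through as ordinary text instead of raising, which matches the "any model that can follow a system prompt" posture.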
Offline installs and air-gapped workstations.
OpenCode with Ollama is one of the few coding-agent combinations that actually works air-gapped. The OpenCode CLI is a single static binary, so you can vendor it onto an internal package mirror. Ollama model weights can be pulled on a connected workstation, exported as blobs, and imported on the air-gapped machine. Once both are in place, set telemetry off in OpenCode, start Ollama, and the CLI never reaches for the network.
For enterprise deployments, the common pattern is to mirror the OpenCode release bundle, a curated set of Ollama model weights, and an internal signing key through the endpoint management system. The trust and safety page documents the SBOM and signing attestations that make a mirrored deployment auditable. Guidance on OSS supply-chain posture from CMU research on reproducible builds informed our mirror design.
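For the mirror itself, a checksum pass before import keeps the air-gapped copy honest. This is a generic sketch: the manifest-of-hashes format is hypothetical, and real deployments would verify the signed attestations described on the trust and safety page rather than bare digests.

```python
import hashlib

# Sketch: verify mirrored artifacts against a checksum manifest before
# importing them on the air-gapped machine. The {path: sha256} manifest
# format is hypothetical, for illustration only.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large model blobs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest: dict[str, str]) -> list[str]:
    """Return the paths whose on-disk hash does not match the manifest."""
    return [p for p, expected in manifest.items() if sha256_of(p) != expected]
```

An empty return value from verify means every mirrored blob matches; anything else names the artifacts to re-transfer.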
Troubleshooting the Ollama adapter.
The most common failure mode is a pulled model that OpenCode cannot see. That usually means Ollama is running under a different user than the OpenCode CLI, or the base URL in the OpenCode config points at a stopped Ollama instance. Run curl http://127.0.0.1:11434/api/tags — if it returns a JSON list, OpenCode will see the same list; if it fails, restart Ollama and re-check. The second most common failure is a slow first response: that is usually the model warming up and is normal on the first prompt after a restart.
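When the curl check succeeds, the response is a JSON object with a models array. The snippet below parses a sample payload the way a diagnostic script might; the sample is abbreviated (real entries also carry size and digest fields), and the model names are carried over from the config example rather than taken from a real install.

```python
import json

# Abbreviated sample of a healthy /api/tags response; real entries
# include additional fields (size, digest, modified_at) omitted here.
sample = '{"models": [{"name": "opencode-small-q4:latest"}, {"name": "llama-medium:latest"}]}'

# The same names OpenCode lists at startup.
names = [m["name"] for m in json.loads(sample)["models"]]
print(names)  # ['opencode-small-q4:latest', 'llama-medium:latest']
```

If this list is empty or the request fails outright, fix Ollama first; no OpenCode configuration change will surface models the server itself cannot see.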
The third issue engineers hit is a context overflow — a long repository does not fit in the small-tier model's context window. The fix is either a larger-tier model, a tighter selection in the OpenCode VSCode extension, or a context_window override in the config if Ollama supports it for your model. The documentation covers the overflow strategy.
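One way to spot an overflow before it happens is a rough token estimate. The four-characters-per-token ratio below is a common heuristic, not a tokenizer (real counts vary by model and language), and the reserved output budget is an assumption.

```python
# Rough fit check before sending a selection to the model. The ~4 chars
# per token ratio is a heuristic, not a tokenizer; the reserved output
# budget of 1024 tokens is an assumption for illustration.
def fits_in_context(text: str, context_window: int,
                    reserved_for_output: int = 1024) -> bool:
    approx_tokens = len(text) // 4
    return approx_tokens + reserved_for_output <= context_window

# A 20k-character selection against the 8192-token window from the
# example config: ~5000 tokens plus the reserved budget still fits.
selection = "x" * 20_000
print(fits_in_context(selection, 8192))  # True
```

Doubling the selection to 40k characters tips the estimate past the window, which is exactly the point at which a larger tier or a tighter selection is the right move.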
> "We legally cannot send source to SaaS endpoints. OpenCode plus Ollama was the first agent we tried where local inference was a first-class path, not a forgotten subroutine. The plan/apply flow feels identical to the hosted case."

> "The JSON tool-call fallback is the unsung hero. It meant we could run open-weight models that don't ship native function calling without writing glue."

> "On a 64 GB Apple Silicon laptop the medium tier is fast enough that I default to local for all the routine work. I reach for hosted only when the task horizon is long."