Ollama Just Got Way Faster on Mac — MLX Integration Makes Apple Silicon a Serious AI Workstation
If you’ve been running AI models locally on a Mac, things just got a major upgrade. Ollama — the go-to tool for running open-source AI models on your own hardware — has released version 0.19, rebuilt on top of MLX, Apple’s native machine learning framework.
The result? Dramatically faster performance on every Apple Silicon chip, from M1 to the latest M5.
What Changed Under the Hood
Ollama previously relied on its own inference engine. Now it leverages MLX, which is purpose-built to exploit Apple’s unified memory architecture — the design where the CPU and GPU share the same memory pool instead of shuffling data back and forth.
On Apple’s M5, M5 Pro, and M5 Max chips, Ollama takes advantage of the new GPU Neural Accelerators — dedicated hardware for AI inference. This boosts both:
- Time to first token (TTFT) — how fast the model starts generating a response
- Tokens per second — how fast it produces output
In benchmarks using Alibaba’s Qwen3.5-35B-A3B model, the prefill speed hit 1,851 tokens per second with int4 quantization. That’s local inference performance that would have been unthinkable on a laptop even a year ago.
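Those two metrics are easy to compute yourself. Ollama's REST API returns timing fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds) alongside each response; the helper below, a small sketch with an illustrative name, turns them into prefill and generation speeds. The sample numbers are made up to mirror the benchmark figure above.

```python
def throughput_metrics(resp: dict) -> dict:
    """Derive prefill and generation speeds from the timing fields
    Ollama returns with an /api/generate response (nanoseconds)."""
    ns = 1e9
    prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / ns)
    return {
        "prefill_tokens_per_s": prefill_tps,   # prompt processing speed
        "generation_tokens_per_s": gen_tps,    # output speed
    }

# Illustrative numbers: 1,024 prompt tokens prefilled in ~0.553 s,
# then 256 output tokens generated in 3.2 s.
metrics = throughput_metrics({
    "prompt_eval_count": 1024,
    "prompt_eval_duration": 553_000_000,    # ns
    "eval_count": 256,
    "eval_duration": 3_200_000_000,         # ns
})
print(round(metrics["prefill_tokens_per_s"]))   # prints 1852
```

TTFT is dominated by prefill speed, which is exactly where the new GPU Neural Accelerators help most.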
NVFP4: Production-Grade Quantization
The update also introduces support for NVIDIA’s NVFP4 format — a quantization method that shrinks model size while preserving accuracy. Why does this matter for Mac users?
Because more AI providers are deploying models in NVFP4 for production. Running the same format locally means your results match what you'd get from a cloud API serving the same model. No more wondering whether your local quantization is quietly giving you different answers than the hosted version.
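To make the idea concrete, here is a toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4. The real format stores an FP8 scale per 16-value block and packs values as E2M1 floats; this simplified version keeps the scale as a plain Python float so the mechanism stays visible. Function names are illustrative, not from any library.

```python
# The eight magnitudes representable by a 4-bit E2M1 float (plus sign bit).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Quantize one block of floats to FP4 levels plus one shared scale."""
    scale = max(abs(v) for v in values) / 6.0 or 1.0  # map the block max to 6.0
    q = []
    for v in values:
        mag = min(FP4_LEVELS, key=lambda lvl: abs(abs(v) / scale - lvl))
        q.append(mag if v >= 0 else -mag)
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

scale, q = quantize_block([0.02, -0.31, 0.18, 0.6])
restored = dequantize_block(scale, q)
# Each weight now needs 4 bits instead of 16 or 32, at the cost of
# snapping to the nearest representable level within the block's range.
```

The win over older int4 schemes is that the per-block floating-point scale tracks outliers more gracefully, which is why the format holds up in production.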
Smarter Caching for Coding Agents
Here’s where it gets practical for developers. Ollama’s cache system has been overhauled:
- Cross-conversation reuse — shared system prompts (common in tools like Claude Code) are cached once and reused, cutting memory usage
- Intelligent checkpoints — the cache stores snapshots at smart points in the prompt, reducing reprocessing
- Smarter eviction — shared prefixes survive longer even when old conversation branches are dropped
If you use Claude Code, OpenCode, or any local coding assistant on a Mac, this means noticeably snappier responses during long coding sessions.
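The caching idea is simple enough to sketch. The toy class below (names are illustrative, not Ollama's internals) snapshots the cache state at fixed token intervals; a new request reuses the longest checkpoint whose token prefix matches, so a system prompt shared across conversations is processed only once.

```python
import hashlib

CHECKPOINT_EVERY = 4  # real systems pick checkpoint positions more cleverly

def _key(tokens):
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

class PrefixCache:
    """Toy prefix cache: maps a hash of a token prefix to its length."""
    def __init__(self):
        self._snapshots = {}

    def store(self, tokens):
        # Snapshot the cache state every CHECKPOINT_EVERY tokens.
        for end in range(CHECKPOINT_EVERY, len(tokens) + 1, CHECKPOINT_EVERY):
            self._snapshots[_key(tokens[:end])] = end

    def longest_reusable_prefix(self, tokens):
        # Find the longest stored checkpoint matching this prompt's prefix.
        best = 0
        for end in range(CHECKPOINT_EVERY, len(tokens) + 1, CHECKPOINT_EVERY):
            if self._snapshots.get(_key(tokens[:end])) == end:
                best = end
        return best  # tokens[:best] need no reprocessing

system = list(range(8))                       # shared system prompt, 8 tokens
cache = PrefixCache()
cache.store(system + [101, 102, 103, 104])    # first conversation
reused = cache.longest_reusable_prefix(system + [201, 202])  # second one
print(reused)  # prints 8: only the new user turn needs prefill
```

Smarter eviction then amounts to keeping checkpoints that many conversations share alive longer than ones unique to a single abandoned branch.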
What You Need
A Mac with 32GB+ unified memory. The preview release is optimized for the Qwen3.5-35B-A3B model with coding-tuned sampling parameters. To get started:
ollama run qwen3.5:35b-a3b-coding-nvfp4
Or launch it with Claude Code or OpenCode directly:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
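Once the model is pulled, anything can talk to it through Ollama's local REST API on its default port (11434). A minimal sketch, assuming the server is running and the model tag from the pull command above; the `ask` helper is illustrative, not part of any client library.

```python
import json
import urllib.request

def chat_payload(model, prompt):
    """Build a non-streaming request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt, model="qwen3.5:35b-a3b-coding-nvfp4"):
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# ask("Write a binary search in Python.")  # requires `ollama serve` running
```

Any editor plugin or agent that speaks the Ollama API gets the MLX speedups for free.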
The Bottom Line
This isn’t just a performance bump — it’s a statement. Apple Silicon is becoming a legitimate platform for serious AI inference work, not just experimentation. With MLX under the hood, NVFP4 quantization, and smart caching, Ollama 0.19 makes a MacBook Pro with enough RAM competitive with cloud-based inference for many workloads.
For developers, AI enthusiasts, and anyone who prefers keeping their data on their own machine, this is the update to download today.
