Ollama Just Got Way Faster on Mac — MLX Integration Makes Apple Silicon a Serious AI Workstation
If you’ve been running AI models locally on a Mac, things just got a major upgrade. Ollama — the go-to tool for running open-source AI models on your own hardware — has released version 0.19, rebuilt on top of MLX, Apple’s native machine learning framework.
The result? Dramatically faster performance on every Apple Silicon chip, from M1 to the latest M5.
What Changed Under the Hood
Ollama previously relied on its own inference engine. Now it leverages MLX, which is purpose-built to exploit Apple’s unified memory architecture — the design where the CPU and GPU share the same memory pool instead of shuffling data back and forth.
On Apple’s M5, M5 Pro, and M5 Max chips, Ollama takes advantage of the new GPU Neural Accelerators — dedicated hardware for AI inference. This boosts both:
- Time to first token (TTFT) — how fast the model starts generating a response
- Tokens per second — how fast it produces output
In benchmarks using Alibaba’s Qwen3.5-35B-A3B model, the prefill speed hit 1,851 tokens per second with int4 quantization. That’s local inference performance that would have been unthinkable on a laptop even a year ago.
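Those two metrics are easy to compute yourself. Ollama's REST API returns timing fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds) alongside each response; the helper below, a small sketch with an illustrative name, turns them into prefill and generation speeds. The sample numbers are made up to mirror the benchmark figure above.

```python
def throughput_metrics(resp: dict) -> dict:
    """Derive prefill and generation speeds from the timing fields
    Ollama returns with an /api/generate response (nanoseconds)."""
    ns = 1e9
    prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / ns)
    return {
        "prefill_tokens_per_s": prefill_tps,   # prompt processing speed
        "generation_tokens_per_s": gen_tps,    # output speed
    }

# Illustrative numbers: 1,024 prompt tokens prefilled in ~0.553 s,
# then 256 output tokens generated in 3.2 s.
metrics = throughput_metrics({
    "prompt_eval_count": 1024,
    "prompt_eval_duration": 553_000_000,    # ns
    "eval_count": 256,
    "eval_duration": 3_200_000_000,         # ns
})
print(round(metrics["prefill_tokens_per_s"]))   # prints 1852
```

TTFT is dominated by prefill speed, which is exactly where the new GPU Neural Accelerators help most.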
NVFP4: Production-Grade Quantization
The update also introduces support for NVIDIA’s NVFP4 format — a quantization method that shrinks model size while preserving accuracy. Why does this matter for Mac users?
Because more AI providers are deploying models in NVFP4 for production. Running the same format locally means your results match what you'd get from a cloud API serving the same model. No more wondering whether your local quantization is quietly giving you different answers than the hosted version.
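To make the idea concrete, here is a toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4. The real format stores an FP8 scale per 16-value block and packs values as E2M1 floats; this simplified version keeps the scale as a plain Python float so the mechanism stays visible. Function names are illustrative, not from any library.

```python
# The eight magnitudes representable by a 4-bit E2M1 float (plus sign bit).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Quantize one block of floats to FP4 levels plus one shared scale."""
    scale = max(abs(v) for v in values) / 6.0 or 1.0  # map the block max to 6.0
    q = []
    for v in values:
        mag = min(FP4_LEVELS, key=lambda lvl: abs(abs(v) / scale - lvl))
        q.append(mag if v >= 0 else -mag)
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

scale, q = quantize_block([0.02, -0.31, 0.18, 0.6])
restored = dequantize_block(scale, q)
# Each weight now needs 4 bits instead of 16 or 32, at the cost of
# snapping to the nearest representable level within the block's range.
```

The win over older int4 schemes is that the per-block floating-point scale tracks outliers more gracefully, which is why the format holds up in production.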
Smarter Caching for Coding Agents
Here’s where it gets practical for developers. Ollama’s cache system has been overhauled:
- Cross-conversation reuse — shared system prompts (common in tools like Claude Code) are cached once and reused, cutting memory usage
- Intelligent checkpoints — the cache stores snapshots at smart points in the prompt, reducing reprocessing
- Smarter eviction — shared prefixes survive longer even when old conversation branches are dropped
If you use Claude Code, OpenCode, or any local coding assistant on a Mac, this means noticeably snappier responses during long coding sessions.
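The caching idea is simple enough to sketch. The toy class below (names are illustrative, not Ollama's internals) snapshots the cache state at fixed token intervals; a new request reuses the longest checkpoint whose token prefix matches, so a system prompt shared across conversations is processed only once.

```python
import hashlib

CHECKPOINT_EVERY = 4  # real systems pick checkpoint positions more cleverly

def _key(tokens):
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

class PrefixCache:
    """Toy prefix cache: maps a hash of a token prefix to its length."""
    def __init__(self):
        self._snapshots = {}

    def store(self, tokens):
        # Snapshot the cache state every CHECKPOINT_EVERY tokens.
        for end in range(CHECKPOINT_EVERY, len(tokens) + 1, CHECKPOINT_EVERY):
            self._snapshots[_key(tokens[:end])] = end

    def longest_reusable_prefix(self, tokens):
        # Find the longest stored checkpoint matching this prompt's prefix.
        best = 0
        for end in range(CHECKPOINT_EVERY, len(tokens) + 1, CHECKPOINT_EVERY):
            if self._snapshots.get(_key(tokens[:end])) == end:
                best = end
        return best  # tokens[:best] need no reprocessing

system = list(range(8))                       # shared system prompt, 8 tokens
cache = PrefixCache()
cache.store(system + [101, 102, 103, 104])    # first conversation
reused = cache.longest_reusable_prefix(system + [201, 202])  # second one
print(reused)  # prints 8: only the new user turn needs prefill
```

Smarter eviction then amounts to keeping checkpoints that many conversations share alive longer than ones unique to a single abandoned branch.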
What You Need
A Mac with 32GB+ unified memory. The preview release is optimized for the Qwen3.5-35B-A3B model with coding-tuned sampling parameters. To get started:
ollama run qwen3.5:35b-a3b-coding-nvfp4
Or launch it with Claude Code or OpenCode directly:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
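Once the model is pulled, anything can talk to it through Ollama's local REST API on its default port (11434). A minimal sketch, assuming the server is running and the model tag from the pull command above; the `ask` helper is illustrative, not part of any client library.

```python
import json
import urllib.request

def chat_payload(model, prompt):
    """Build a non-streaming request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt, model="qwen3.5:35b-a3b-coding-nvfp4"):
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# ask("Write a binary search in Python.")  # requires `ollama serve` running
```

Any editor plugin or agent that speaks the Ollama API gets the MLX speedups for free.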
The Bottom Line
This isn’t just a performance bump — it’s a statement. Apple Silicon is becoming a legitimate platform for serious AI inference work, not just experimentation. With MLX under the hood, NVFP4 quantization, and smart caching, Ollama 0.19 makes a MacBook Pro with enough RAM competitive with cloud-based inference for many workloads.
For developers, AI enthusiasts, and anyone who prefers keeping their data on their own machine, this is the update to download today.
