Samuel Gregory

14,300 subscribers

11,479 views

Finally, The CORRECT Way to Run Local AI on a Mac

Video Overview & Insights

READ MORE

“

oMLX has trouble managing context size, so I went back to Ollama + MLX.

— @ThanadejSubsit

Download oMLX: https://omlx.ai/

“

Question: does anyone really care about the SSD cache? I don't find it useful because I clear my cache all the time.

— @DavidDalcu

This video explores why oMLX is the definitive choice for founders looking to reclaim their data and run powerful LLMs locally on Mac hardware.

Key Takeaways:

“

I worry about the SSD KV cache destroying the SSD

— @that_rendle

- Why oMLX is superior to Ollama and LM Studio for professional Mac workflows.

- The technical benefits of SSD-backed caching and LRU policies for persistent context.

“

I only have a 16 GB M2. I just want an LLM to tinker with my Home Assistant. I like the idea of Hermes. It seems that I should get at least 64 GB of memory in order to run a decent LLM.

— @PaulEddie

- How to set up agentic models like Qwen 3.6 MoE for real-world coding tasks.

- A breakdown of why the M5 Max is the current sweet spot for personal AI infrastructure.

“

Be advised: Apple Intelligence isn't. I do not recommend running it while using oMLX.

— @JeepMarshall

- Practical steps to integrate local models into tools like Pie and OpenCode.
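For anyone unsure what the LRU policy in the takeaways means, here is a minimal Python sketch. It is an illustration only and not oMLX's code: the `LRUCache` class, its method names, and the prompt keys are hypothetical, and the values stand in for cached KV blocks. The idea is simply that when the cache is full, the entry that has gone unused the longest is evicted first.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical least-recently-used cache (not oMLX's implementation)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Ordered oldest-first: the front of the dict is the next eviction.
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached value and mark it as most recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value; return the evicted key, or None if nothing was evicted."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            evicted_key, _ = self._items.popitem(last=False)
            return evicted_key
        return None

    def keys(self):
        """Keys ordered from least to most recently used."""
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("prompt-a", b"kv-a")
cache.put("prompt-b", b"kv-b")
cache.get("prompt-a")  # touching prompt-a makes prompt-b the oldest entry
evicted = cache.put("prompt-c", b"kv-c")
print(evicted)  # prompt-b
```

A real SSD-backed cache would also have to track sizes in bytes and write blocks to disk, which this sketch leaves out.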

Code examples: https://samuelgregory.co.uk/videos/finally-the-correct-way-to-run-local-ai-on-a-mac

“

Apple only.

— @TubeSkaterRudy

Work with me: https://samuelgregory.co.uk

---

“

Hey, just wanted a second opinion on this: I have been eyeing a refurbished M5 MacBook Pro with 32 GB to run local LLMs for code completions (I know it won't be enough for code harnesses). Should be fine, right? By the way, you are the reason I am going for local-first AI. I've been watching your videos for a month now. Thanks!

— @ujjawalverma99

Support the content: https://www.patreon.com/0x5am5

Twitter: @0x5am5

“

Dude, you just put me on to game so hard you don't even know. I've been running MLX versions through LM Studio on my 256 M3 Ultra (yes, a beast).
I have a voice agent and was unhappy with the latency. I was getting about 10 seconds of latency. Then Nvidia just dropped that 0.8B and took my latency from 10 seconds to 5. Huge difference. It's running at Jarvis level. I think with this I'm about to have Godspeed.
Can't wait to get into this. 5,000 thank yous! I should probably finish the video 😅

— @juslostone

$ cat tools.txt

────────────────────────────────

“

Love oMLX, but I gotta say, LM Studio's LM Link feature is hard to let go of.

— @DarioDAversa

Kilo: https://samuelgregory.co.uk/kilo-code

Replit (Favourite Vibe Code Tool): https://samuelgregory.co.uk/replit

“

Try POP Switcher; only the onboarding is in French. Better than Ollama.

— @florentbrunet5059

Perplexity (deep research): https://samuelgregory.co.uk/perplexity

Claude Code: https://claude.ai/api/referral/jZ9vnMedyQ&v=p-CzOtUYEyA

“

Have you looked into Inferencer, including its SSD cache, batching, and multiprocessing? I’m not advertising for them, I just want a more expert opinion. There is also a similar project, vMLX, but for lower-RAM systems, I guess.

— @Tititototo

Warp Terminal: https://samuelgregory.co.uk/warp

⚒️ more at https://samuelgregory.co.uk/tools

“

The fact that this feature was not the first thing built, before even developing LLMs, tells me how lazy AI development has become.

— @dadlord689

$ cat services.txt

────────────────────────────────

“

Ollama is no longer missing MLX support on Mac. Since March 30, 2026, Ollama has full MLX support on Apple Silicon with its own MLX runner/engine. oMLX is a serving layer on top of Apple’s MLX / mlx-lm stack, similar to LM Studio’s MLX integration, while Ollama implemented its own MLX runner path. That makes a difference. If you don't trust me, ask your favorite LLM. Ollama can be run in headless server mode, which minimizes the load on the system. You don't need to have the app installed. I don’t mean this as a knock against your recommendation. I’m also a heavy local-model user and spend a lot of time with oMLX, Ollama, and other runtimes.

For my own daily workflow, I still prefer Ollama over oMLX, even with the smaller MLX model selection. oMLX is definitely interesting because it exposes more tuning options and system parameters, but more control doesn’t necessarily translate into better performance for every use case.

— @roygbiv138

Domain Names: https://samuelgregory.co.uk/namecheap

Hosting: https://www.hostg.xyz/aff_c?offer_id=6&aff_id=130549

“

oMLX’s SSD-backed KV cache is a very interesting feature for local agent workflows, but users should be aware that persisting KV blocks to disk can add extra SSD write activity over time. For heavy local LLM use with long contexts, that write volume may not be trivial.

I’m not saying this automatically kills an SSD, but on Macs with internal storage that isn’t realistically user-replaceable, I think it’s a tradeoff worth mentioning. Personally, I’d prefer using this kind of cache on an external SSD, or at least monitoring total bytes written.

— @roygbiv138
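The SSD-wear concern above can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement of oMLX: the model dimensions, the tokens persisted per day, and the 600 TB write budget are all hypothetical placeholders, and the per-token formula assumes a plain transformer that stores one key and one value vector per layer and KV head (hybrid architectures, cache compression, or quantized caches would change it).

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys plus values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def daily_write_gb(tokens_persisted_per_day, bytes_per_token):
    """Gigabytes (10**9 bytes) written per day if every token's KV block hits disk."""
    return tokens_persisted_per_day * bytes_per_token / 1e9


def years_to_reach_budget(write_budget_tb, gb_per_day):
    """Years until a chosen write budget (in TB) is used up at a constant daily rate."""
    return write_budget_tb * 1000 / gb_per_day / 365


# Hypothetical model: 48 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)

# Hypothetical workload: 2 million tokens of KV cache persisted per day.
gb_per_day = daily_write_gb(2_000_000, per_token)

# Hypothetical budget: 600 TB. Substitute whatever figure you trust for your drive.
years = years_to_reach_budget(600, gb_per_day)

print(f"{per_token} bytes per token, {gb_per_day:.1f} GB/day, {years:.1f} years")
```

Plugging in your own model's dimensions and a realistic daily token count matters far more than the defaults here, since both can shift the result by an order of magnitude.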

Online Storage ($200 credit): https://samuelgregory.co.uk/digital-ocean


“

The correct way would be to rent it out when it's idle. Otherwise it's simply not competitive.

— @SebGreen

$ cat gear.txt

────────────────────────────────

“

AFAIK DS4 (Dwarf Star 4) is by far the best local AI you can run on a Mac. It is based on DeepSeek 4 Flash but as we speak it is being programmed to work with GLM5.2.

— @matazmataz2493

Sony A7c II: https://amzn.to/40qaYEJ

Lens Sigma 16-28mm: https://amzn.to/3IaDzqx

“

Try MLX-Serve if you have a Mac. It goes beyond model and harness and fixes tool calling at the API layer, fixing all models and harnesses.

— @DavidD475

Microphone Samson Q2U: https://amzn.to/3TkshCE

MacBook Pro M1 Max: https://amzn.to/48736M6

“

I’m on the same page, using oMLX every day with Hermes and OpenCode. On the M1 Ultra 64GB we have enough resources to run models like Darwin 35B A3B oQ4 and Ornith 1.0 35B oQ5 with decent speed (35 - 40 tok/s) and intelligence, and no doubt these models run even better on M2 through M5.

— @ScottLahteine

$ cat books.txt

────────────────────────────────

“

Don't know what LRU is... "Vibe coders"

— @SpartanR61

The Full Stack Agency: https://flowst8.dev/store

Lingo: Agile: https://thefullstackagency.gumroad.com/l/agile-lingo

“

Nice work. Clear and thorough. Does oMLX support SSD streaming for LLMs that exceed available physical RAM? I've seen it suggested that SwiftLM does. I've also seen that GLM 5.2 produces acceptable results when quantized and shrunk to run in 256GB. So 2+2 = how fast can SSD streaming run GLM 5.2 at 256GB on a 128GB MacBook Pro M5 Max?

— @yogaman4113

Lingo: Startup: https://thefullstackagency.gumroad.com/l/startup-lingo

$ cat timestamps.txt

“

I don't know but LM Studio seems to be waay better than omlx on my 128GB M4 Max

— @theabsdude

────────────────────────────────

00:00 Finally, the correct way to run AI on a Mac

“

As a beginner, I use LM Studio on a Mac Studio M1 Max 32 GB for coding with Claude Code.
I use GGUF since I found that MLX crashes more than GGUF. I know that MLX is faster, but maybe my config is wrong.

Is there a correct way to set up MLX in LM Studio?
Does oMLX fix this issue?

— @dpfam

00:29 oMLX has a special trick up its sleeve

02:40 Where do Ollama and LM Studio land?

“

I'm running the same 128GB M5 Max. I'll share my biggest revelations:

1) Deepseek-V4-Flash QAT-Mixed-4B2B8 on the DS4 inference engine is unequivocally the best model you can run on this system by a long shot. Though keep in mind that DS4 is a proprietary inference engine only for specific Deepseek 4 models. You'll still want to configure all of your auxiliary models to something like Gemma 4 E4B QAT running through oMLX.

2) oMLX is great, but it has its flaws. The biggest issue I've had was running MTP models, which had major memory leak issues that filled up my RAM and ROM cache until my system was unusable. I used MTPLX for a while, but shelved it after using DS4 - Deepseek V4.

3) Artificially increase your context window in Hermes by creating a "second brain," which is just a robust memory system, and do this as soon as possible. I use the OKF memory standard established by Google, mapped to Obsidian Vault (this is an application created by the man who also created the OKF standard) so I can visibly see what content my model is choosing to store. You'll still need to negotiate with your model about what sort of information you want it to store automatically, and most importantly you have to work out some sort of context drift prevention to implement in your memory system.

The second brain was possibly the biggest game changer for my system. It allows me to turn my profiles into actual professionals on any given subject, just by instructing it to perform deep research or scrape data from databases.

— @SimplestUsername

03:27 Downloading oMLX

03:48 Rundown of the UI and downloading models

“

I recently built LightLX, free on GitHub. It allowed me to stream the full GLM 5.2 model (1.5TB) on my 24GB MacBook Pro.

— @DewaldN

04:49 Serving your local model

06:12 Seeing the cache in action

“

oMLX is my main go-to on an M5 Max 128, although I've been experimenting with antirez's ds4.

— @donwellsmedia

06:54 Playing around with parameters

07:29 Configuring providers in harnesses

“

Doesn't oMLX degrade your SSD?

— @6alisk

#LocalLLM #LocalAI #AI

“

Hi, I really like your videos. With this 128GB Mac, can you produce code efficiently, similar to what frontier models do?

— @bikepass

More User Perspectives

I finally have 128GB of RAM and I still spend most of my time worrying about how many Chrome tabs I have open. Is it even a real workstation if you aren't constantly checking the Activity Monitor?

@0x5am5