Running LLMs Locally: From Magic Trick to Daily Driver.

Running LLMs Locally: From Magic Trick to Daily Driver

Table Of Contents

TL;DR

  • Running a capable LLM entirely on your own machine - no cloud, no API key, no data leaving your laptop - is one of the two genuine “wow” moments AI still hands an experienced engineer. (The other is writing your own agent.)
  • I first tried this two years ago with Ollama, an M1 Mac, and Llama 3. It worked for simple things like translating text - and it felt magical - but it was painfully slow. Not practical for anything real.
  • Two years later, with an M5 Mac and 64GB of RAM (I should have ordered 128GB), I can run models like Gemma 4 E4B or Qwen 3.6 27B / 35B A3B locally. They are fast, and they hold up for one-shot tasks and agentic tool use. Even tiny models like Gemma 4B get a surprising amount done.
  • LM Studio makes this almost trivial: it only offers models that will actually run on your hardware, downloads them, and serves an OpenAI-compatible API on localhost.
  • Two settings save you a lot of pain: pick a large enough context window (aim for 100K+ tokens), and tell LM Studio to stop at the context limit instead of silently truncating - otherwise models quietly “forget” things and produce baffling results.
  • The kicker: because the API is OpenAI-compatible, you can point Claude Code at your local Qwen model and run a coding agent that talks to nothing but your own machine.

The first “wow”, two years ago

I have written before about my two recurring “wow” moments in the age of AI: building your own agent, and running an LLM locally. This post is about the second one - and about how dramatically it has improved.

The first time I ran a model on my own laptop was about two years ago. The setup was Ollama, an M1 Mac, and Llama 3. I typed a prompt, hit enter, and watched a real language model answer me from a file on disk - no cloud, no API key, no data leaving the machine. That is genuinely magic the first time you see it. I even got it doing something useful: translating chunks of text from one language to another, entirely offline.

But there was a catch, and it was a big one. It was slow. Painfully, watch-the-tokens-dribble-out slow. Fine for a party trick or a one-off translation you could walk away from, but nowhere near fast enough to fold into daily work. So I filed it under “amazing demo, not yet practical” and moved on.

Two years later, the game changed

I recently came back to local LLMs almost by accident, and the difference is night and day.

A few things compounded at once. The hardware moved on by several chip generations - I am now on an M5 Mac with 64GB of unified memory (and honestly, I wish I had sprung for 128GB). The models got dramatically better and smaller for the same capability. And the tooling matured.

The result is that I can now comfortably run models like Gemma 4 E4B and Qwen 3.6 27B / 35B A3B locally - and they are fast. Not “technically running” fast, but “this is genuinely usable” fast. They handle one-shot tasks - summaries, translations, rewrites, quick code questions - without breaking a sweat. More surprisingly, they hold up for agentic use: reasoning across multiple steps, calling tools, and following the kind of strict format an agent loop depends on.

What really got me was how much even the small models can do. A 4B model like Gemma 4B has no business being as competent as it is. For a lot of everyday tasks it is more than enough - and it runs on hardware that would have laughed at the idea two years ago.

The thing that was a slow magic trick in 2024 has quietly become a daily driver.

Know your models

Running locally teaches you something the cloud APIs hide: models have personalities, and the trade-offs between them are very real. “It runs” is not the same as “it is the right tool for this job”. Here is how the three I reach for most actually behave in practice.

Gemma E4B - the quick helper. Good for simple, one-shot tasks. It does not do elaborate thinking and it does not go off and research a problem - it answers what is in front of it. It can come across as a bit clumsy, and it messes up tool calls now and then, so I would not lean on it for agentic work. But for a fast translation, a quick rewrite, or a simple question, it is more than enough - and it is remarkably capable for such a tiny model.

Qwen 3.6 35B A3B - fast and good at code. A large model, but a mixture-of-experts one: only about 3B parameters are active at any moment, which is why it feels so quick. It has a tendency to answer fast and does not do much research, so like Gemma it is best for simple one-shot answers - but it is notably good at coding. If you want speed and the task is well-defined, this is a strong pick.

Qwen 3.6 27B - the one I trust. This is the model that actually does research and combines things - it reasons across steps, pulls pieces together, and handles more complex tasks. It is a bit slower than the other two, which is the price of that extra deliberation. Of the three, it is the one I trust most when the answer matters.

The pattern is the one you would expect: the faster, lighter models are great for quick, well-scoped work, and the slower, more deliberate model earns its keep when the task is genuinely hard. Knowing which is which - and not reaching for the heavyweight when a 4B model would do - is half the skill of working locally.

LM Studio is your friend

Two years ago, getting a model running meant fiddling with command-line tools, quantization formats, and a fair amount of “will this even fit in memory?” guesswork. Today I reach for LM Studio, and it removes almost all of that friction.

The single best thing it does is only show you models that will actually run on your setup. No more downloading a 40GB model only to discover it swaps your machine into oblivion. LM Studio knows your hardware and filters accordingly, so the models it offers are ones you can realistically load and run. You pick one, it downloads, and you are chatting in minutes.

That alone turns “a weekend project” into “a coffee break”.

Two settings that save you a lot of pain

LM Studio is easy to start with, but two defaults will bite you if you do not change them. Both cost me some confused debugging before I worked out what was going on.

1. Give the model a large enough context window

LM Studio sometimes loads models with a surprisingly small context window. For anything beyond a toy prompt - and especially for agentic use, where the conversation accumulates thoughts, actions, and observations - a small window fills up fast.

My rule of thumb: aim for 100K tokens or more to comfortably get started. Most modern local models support large contexts; you just have to ask for them. Bump the context window up before you start real work, not after you have been mystified by the results.

2. Stop at the limit - do not silently truncate

This one is subtle and genuinely nasty. When a conversation grows larger than the context window, LM Studio’s default behavior can be to silently truncate the messages - quietly dropping the oldest part of the conversation to make room.

The symptoms are bizarre. The model starts “forgetting” things it was just told. In agentic loops it re-runs steps it already completed, because the record of having done them got trimmed away. You stare at the output convinced the model is broken, when really the harness is amputating its memory behind its back.

In my opinion, silent truncation is the wrong default. If the messages no longer fit, the right thing to do is return an error, not pretend everything is fine and hand the model a lobotomized history. An error tells you exactly what happened; silent truncation sends you on a wild goose chase.

LM Studio lets you choose this behavior. Set it to stop at the limit:

Inference  =>  Context Window  =>  Stop at Limit

With that flipped on, an oversized conversation fails loudly and honestly instead of degrading into nonsense. Much better.

The real kicker: a local OpenAI-compatible API

Here is where it goes from “neat” to “genuinely useful for engineering work”.

LM Studio does not just let you chat with a model. It can load a model and expose it through an OpenAI-compatible API running right on your machine - the same request and response shape that countless tools already speak, served from localhost.

That compatibility is the unlock. Any tool that talks to the OpenAI API can, with a change of base URL, talk to your local model instead. No code changes, no special SDK - just point it at your machine.

Which means you can do something that still makes me grin: point Claude Code at your local Qwen model. Set the base URL to LM Studio’s local endpoint, pick your model, and you have a coding agent reasoning, planning, and calling tools - with every token generated on your own laptop. Nothing leaves the machine. For experimenting, for working offline, or just for understanding what these tools really need from a model, it is a fantastic way to learn.

It also ties neatly back to the agent I built from scratch: that little Go agent talks to an OpenAI-compatible endpoint too, so swapping in a local model is a one-line change. The same loop, the same tools - just a brain running on my desk instead of in a data center.

Why this matters

Two years ago, “run an LLM locally” was a magic trick: impressive, slow, and impractical. Today it is something I actually reach for - fast enough, capable enough, and private by construction.

In keeping with my general bias toward simplicity and toward understanding tools rather than treating them as magic, running models locally is one of the most clarifying things you can do. You feel exactly what costs what - context length, model size, memory - and you stop thinking of “the LLM” as a remote oracle and start thinking of it as a program that runs on a computer you own.

If you tried this a couple of years ago and walked away unimpressed, it is worth another look. The hardware caught up, the models got better, and the tooling got out of the way. Download LM Studio, grab a model it recommends, give it a big context window, tell it to stop at the limit - and watch your laptop think.

That is the first wow, and two years on, it finally graduated from demo to daily driver.

Related posts