DwarfStar (DS4): The Local Inference Revolution Is Just Beginning

Some projects start quietly, and some explode in a matter of days. DwarfStar, better known as DS4 — the local inference engine for DeepSeek V4 written by Salvatore Sanfilippo (antirez), the legendary creator of Redis — firmly belongs to the second category. Launched in early May 2026, it gathered over 13,000 GitHub stars in a month, catching the attention of Georgi Gerganov (author of llama.cpp) and even the CEO of Y Combinator.

What makes DS4 so special? And why is it sparking a debate that goes far beyond the circle of AI enthusiasts?

Not Just Another Inference Engine

DS4 is not yet another wrapper around llama.cpp. It's an engine written entirely from scratch in C, specifically optimized for DeepSeek V4 Flash (284 billion parameters, of which only 11 billion are active per token thanks to the Mixture of Experts architecture) and for DeepSeek V4 PRO. It supports Metal (Mac), CUDA (NVIDIA), and ROCm (AMD, including Strix Halo) backends.

The fundamental difference, as Sanfilippo explains, is that DS4 is designed as a finished product, not a research project. Every component — from the API server to the coding agent, from the on-disk KV cache to the directional steering tools — is engineered to work together coherently, with the goal of making local inference not just possible, but pleasant to use.

[Figure: DS4 terminal during local inference]

The On-Disk KV Cache

One of the key differentiators is the implementation of KV cache on disk. Models with enormous context windows (DeepSeek V4 reaches 1 million tokens) generate a sizeable cache. Saving and restoring this cache on SSD allows you to:

- Resume conversations without recalculating the entire context
- Preserve the cache across server sessions, with instant hits on restart
- Work with massive system prompts (e.g., 25,000 tokens from Claude Code), paying the prefill cost only once

On modern SSDs, sequential write and read of the KV cache is surprisingly fast, because it follows a linear access pattern that SSD controllers handle extremely efficiently.
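The save-and-restore idea above can be sketched in a few lines of Python. This is a toy illustration, not DS4's actual format or code: the names `DiskKVCache` and `prefix_keys`, the one-file-per-prefix layout, and the use of `pickle` are all assumptions made for the example. A real engine would write raw tensors sequentially rather than pickled objects.

```python
import hashlib
import os
import pickle
import tempfile

def prefix_keys(tokens):
    """Yield (n, key) where key is a hash of tokens[:n], computed incrementally."""
    h = hashlib.sha256()
    for n, tok in enumerate(tokens, start=1):
        h.update(tok.to_bytes(4, "little"))  # assumes token ids fit in 32 bits
        yield n, h.copy().hexdigest()

class DiskKVCache:
    """Hypothetical store: one file per cached prompt prefix."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key + ".kv")

    def save(self, tokens, kv_state):
        if not tokens:
            raise ValueError("cannot cache an empty prefix")
        key = list(prefix_keys(tokens))[-1][1]
        path = self._path(key)
        tmp = path + ".tmp"
        # Write to a temporary file, then rename, so a crash never
        # leaves a half-written entry behind.
        with open(tmp, "wb") as f:
            pickle.dump(kv_state, f)
        os.replace(tmp, path)

    def load_longest_prefix(self, tokens):
        """Return (n_cached_tokens, kv_state), or (0, None) on a miss."""
        best_n, best_path = 0, None
        for n, key in prefix_keys(tokens):
            path = self._path(key)
            if os.path.exists(path):
                best_n, best_path = n, path
        if best_path is None:
            return 0, None
        with open(best_path, "rb") as f:
            # pickle is only safe for files this process wrote itself.
            return best_n, pickle.load(f)

if __name__ == "__main__":
    cache = DiskKVCache(tempfile.mkdtemp())
    system_prompt = [11, 22, 33, 44]
    cache.save(system_prompt, {"layers": [[0.1, 0.2]]})
    n, state = cache.load_longest_prefix(system_prompt + [55, 66])
    print(n, state)  # only the tokens after position n still need prefill
```

Keying entries by a hash of the token prefix means a new request that starts with the same system prompt finds the stored state and only has to prefill the tokens that follow it.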

Directional Steering: Controlling Model Behavior

Another feature that captured attention is directional steering. Sanfilippo reimplemented in DS4 the findings of a 2024 paper showing that refusal in large language models is largely mediated by a single direction in the model's activation space, regardless of the type of request being refused.

With DS4 you can load a tensor representing this direction and apply a force to cancel it, obtaining a model that no longer refuses to answer certain questions. But the most interesting use is probably the "creative" one: you can generate adapters to subtly modify the model's behavior — make it more or less verbose, change the tone of responses, adapt it to specific contexts like cybersecurity.
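The core operation behind this kind of steering is a vector projection: subtract from a hidden state its component along the chosen direction. The sketch below shows only that arithmetic on plain Python lists. The function names and the `strength` parameter are hypothetical, and where or how often DS4 applies the operation inside the model is not covered here.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def steer(hidden, direction, strength=1.0):
    """Remove `strength` times the component of `hidden` along `direction`.

    strength=1.0 cancels the direction entirely; a negative value
    amplifies it instead.
    """
    unit = normalize(direction)
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h - strength * projection * u for h, u in zip(hidden, unit)]

if __name__ == "__main__":
    hidden = [3.0, 4.0]
    refusal_direction = [2.0, 0.0]  # made-up 2-D direction for illustration
    print(steer(hidden, refusal_direction))        # [0.0, 4.0]
    print(steer(hidden, refusal_direction, -1.0))  # [6.0, 4.0]
```

With `strength=1.0` the result has no component left along the direction, which is the "cancel it" case described above. Intermediate or negative strengths correspond to the subtler adjustments, such as changing tone or verbosity.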

Why All The Attention?

Sanfilippo says he was surprised by the enormous attention received. In a recent video, he reflects that the pieces for local inference already existed: llama.cpp, quantized GGUF files, Hugging Face full of models. What was missing, in his view, was the product — something that puts all the pieces together in a coherent, tested way that actually works for daily use.

"The difference," he explains, "is the product. Aside from some technical details — the disk-based KV cache, the insight that DeepSeek V4 Flash was the perfect model for laptops due to its extreme sparsity, the symmetric quantization that squeezes the model in half without losing quality — the real innovation is that everything works together."

And the results speak for themselves: with DS4 on a MacBook with 128 GB of RAM, DeepSeek V4 Flash generates code, implements Tetris in C with SDL, runs benchmarks, writes entire applications. Prefill flies at 240 tokens per second, generation runs at around 13-14 tokens per second. Not quite at OpenAI server levels, but for a 284-billion parameter model running entirely locally, it's extraordinary.
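To put those two speeds in perspective, here is a back-of-the-envelope estimate using only figures quoted in this article. It is a rough sketch: it assumes constant speeds (in practice generation slows as context grows) and ignores the time needed to read a cached state from disk. The function name and the 500-token answer length are made up for the example.

```python
PREFILL_TPS = 240   # prefill speed quoted above, tokens per second
DECODE_TPS = 13.5   # midpoint of the 13-14 tokens per second quoted above

def response_seconds(prompt_tokens, output_tokens, cached_tokens=0):
    """Rough wall-clock time: prefill of uncached prompt tokens plus decoding."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / PREFILL_TPS + output_tokens / DECODE_TPS

if __name__ == "__main__":
    # A 25,000-token system prompt followed by a 500-token answer.
    cold = response_seconds(25_000, 500)
    warm = response_seconds(25_000, 500, cached_tokens=25_000)
    print(f"cold: {cold:.0f} s, warm: {warm:.0f} s")  # cold: 141 s, warm: 37 s
```

Most of the cold-start time is prefill, which is why restoring the KV cache from disk matters so much for agents with large system prompts.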

SSD Streaming: The Breakthrough for 64 GB Machines

The most recent — and perhaps most important — development is support for machines with 64 GB of RAM (and potentially 32 GB). The idea, suggested to Sanfilippo by a developer named Liu (involved in the Draw Things project for macOS), is to perform SSD streaming inference: model weights are continuously loaded from disk instead of residing entirely in RAM.

The challenge is both technical and fascinating:

- During prefill (when the model reads the prompt), you can compute one layer while loading the next in the background, almost completely masking disk latency. The result is that prefill drops from 400 to about 250 tokens per second, an acceptable degradation.
- During decoding (when the model generates text), the problem is harder because the model is strictly autoregressive: each new token must pass through every layer, and which experts it needs is only known once the router runs, so there is little computation to hide disk reads behind. The adopted solution is twofold: a static table of the most frequently used experts (generated offline with a script) so they can be pre-loaded, and an LRU algorithm to decide which of the remaining experts to keep in memory.
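The prefill overlap described in the first point is a classic double-buffering pattern. Here is a minimal sketch, assuming hypothetical `load_layer` and `compute_layer` callables that stand in for the disk read and the layer's computation. It illustrates the scheduling only and makes no claim about how DS4 implements it in C.

```python
from concurrent.futures import ThreadPoolExecutor

def streamed_prefill(n_layers, load_layer, compute_layer, activations):
    """Run layers in order, reading layer i+1 from disk while layer i computes."""
    if n_layers == 0:
        return activations
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer, 0)
        for i in range(n_layers):
            weights = pending.result()  # blocks only if the read is not done yet
            if i + 1 < n_layers:
                # Start the next read before computing, so the two overlap.
                pending = io.submit(load_layer, i + 1)
            activations = compute_layer(weights, activations)
    return activations

if __name__ == "__main__":
    # Stand-ins: "weights" are just numbers, "compute" is simple arithmetic.
    result = streamed_prefill(
        3,
        load_layer=lambda i: i + 1,
        compute_layer=lambda w, x: x * 2 + w,
        activations=0,
    )
    print(result)  # 11
```

As long as computing a layer over the whole prompt takes at least as long as reading the next layer, the disk time is almost entirely hidden. During decoding each layer does far less work per step, which is why the same trick does not carry over.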

Preliminary results on machines with 64 GB of RAM show that 70-75% of the time the needed experts are already in memory, with generation performance around 13-14 tokens per second. For comparison, Qwen 27B (a much smaller model) fully in memory does 18-19 tokens per second, but scales terribly with context because it lacks DeepSeek's compressed attention.
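A hit rate like this can be measured with a small cache model that combines the two ingredients: a pre-loaded set of frequent experts and an LRU for everything else. In the sketch below, `ExpertCache` and its parameters are hypothetical, and keeping the pre-loaded experts permanently pinned is an assumption. The static table could equally be used just to warm up the LRU.

```python
from collections import OrderedDict

class ExpertCache:
    """Pinned hot experts plus an LRU for the rest (illustrative only)."""

    def __init__(self, capacity, hot_experts, load_expert):
        hot = list(dict.fromkeys(hot_experts))  # drop duplicates, keep order
        if len(hot) > capacity:
            raise ValueError("hot set does not fit in the cache")
        self.load_expert = load_expert
        # Assumption: hot experts are loaded once and never evicted.
        self.pinned = {e: load_expert(e) for e in hot}
        self.lru = OrderedDict()
        self.lru_capacity = capacity - len(self.pinned)
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.pinned:
            self.hits += 1
            return self.pinned[expert_id]
        if expert_id in self.lru:
            self.hits += 1
            self.lru.move_to_end(expert_id)  # mark as most recently used
            return self.lru[expert_id]
        self.misses += 1
        weights = self.load_expert(expert_id)  # the slow path: read from SSD
        if self.lru_capacity > 0:
            self.lru[expert_id] = weights
            if len(self.lru) > self.lru_capacity:
                self.lru.popitem(last=False)  # evict the least recently used
        return weights

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

if __name__ == "__main__":
    cache = ExpertCache(capacity=3, hot_experts=[0],
                        load_expert=lambda e: f"weights-{e}")
    for expert in [0, 1, 2, 1, 3, 2, 0]:
        cache.get(expert)
    print(cache.hits, cache.misses)  # 3 4
```

In a real Mixture of Experts model the key would identify both the layer and the expert, for example a `(layer, expert)` tuple, since every layer has its own set of experts.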

[Figure: DeepSeek V4 Flash performance graph, tokens per second]

[Figure: DeepSeek V4 PRO performance graph, tokens per second]

Implications for Privacy and Autonomy

Sanfilippo is explicit about the project's deeper motivations: "Local inference liberates us from paid providers who centralize their gravitational force, from which you cannot escape because they decide what we can use, how we can use it, and to what level we must communicate our internal data."

Conversation data is valuable for reinforcement learning. Even if the signal is weak, those with powerful LLMs can screen conversations that matter and use them as training data. Then there's the freedom issue: you can't have an erotic chat, you can't do certain cybersecurity tasks, you can't explore topics considered "sensitive." It's also a limitation on the freedom to access artificial intelligence.

And there's the concrete risk of being banned: "If at some point someone decides I'm banned from OpenAI's servers, I have to make another account. If there's a billing issue, if they check my credit card, I can be cut off from something that has become essential."

Having a near-frontier model running locally means no longer having to choose between privacy and power.

The Future: Specialized Models and Local Agents

Sanfilippo looks ahead with optimism. DeepSeek appears ready to release specialized versions of its V4 Flash — a Flash Coder, a Flash Math, and so on — leveraging the cross-distillation technique described in their papers. The result would be 284-billion parameter models specialized by domain, capable of competing with frontier models in specific areas, running on a high-end MacBook.

"If you specialize Flash for law, for medicine, for coding, you essentially get models that are very similar to frontier models."

The paradigm shift is around the corner: with powerful specialized models available locally, you can switch from one to another in seconds, depending on the task. The local coding agent is no longer an experiment, but a daily reality. Sanfilippo himself admits to using DS4 for most of his programming work, resorting to remote services only in exceptional cases.

Conclusions

DS4 represents much more than a simple open source project. It's the proof that quality local inference is not only possible, but is becoming practical, affordable, and desirable. With SSD streaming, the dream of running near-frontier models on machines with 64 GB of RAM — and perhaps one day 32 GB — is getting closer.

For anyone who cares about privacy, technological autonomy, and the freedom to use AI without restrictions, DS4 is not just an interesting project: it's a turning point.