
Steering Mixture-of-Experts Models by Choosing the Route

A continuation of activation steering: in sparse MoE models, the router itself starts to look like part of the control surface.

Peter Flo · Spring 2026 · arXiv:2509.09660v1

TL;DR. In the earlier steering posts, I was thinking mostly in geometric terms: find a direction in activation space, push the model along it, and see behavior move. This paper argues that mixture-of-experts models expose a second lever. Instead of only perturbing representations, you can perturb the router that decides which experts get to process each token. By identifying experts associated with document-groundedness or safety and then softly activating or deactivating them at inference time, you can improve faithfulness and safety without retraining. Uncomfortably, you can also push in the other direction. The paper feels important less because it solves steering, and more because it makes routing look like part of the control surface.

Paper: https://arxiv.org/pdf/2509.09660v1

The paper in one sentence

The core claim is simple: in an MoE language model, some experts are more associated with one behavior than another, and if you detect those experts from contrastive prompt pairs, you can steer the model by biasing the router toward or away from them at test time, without modifying weights.

Why this felt like a continuation of the earlier steering posts

In the concept-vector post, the picture was geometric. A concept looked like a direction in activation space, and steering meant adding that direction back into the forward pass. In the attention-guided post, the emphasis shifted from the intervention to the measurement problem: maybe steering looked brittle in part because we were reading out the wrong token, steering the wrong layers, or pretending concept activity was cleaner than it really was.

This paper lives in the same neighborhood, but the substrate is different. Instead of nudging a hidden state inside a fixed computation, it nudges which sparse subnetwork gets to run in the first place. That sounds like an implementation detail. It may be something deeper. In dense models, the metaphor was a semantic knob. In MoEs, the better metaphor may be a switchboard.

The core move

The detection procedure is almost disarmingly plain. For faithfulness, the authors compare a question with its supporting document against the same question without the document. For safety, they compare an unsafe instruction paired with a refusal against the same instruction paired with a harmful answer. Experts that fire more often under one condition than the other become candidates for steering.

At inference time, nothing is fine-tuned: you nudge the router so some experts are more likely to be chosen and others less, then renormalize. The next section spells out exactly what “nudge” means in logits and probabilities.

A detail I liked is that deactivation often works better than activation. In a sparse MoE, forcing one expert on everywhere is a hard intervention. Turning some experts off is weaker: the router still has room to recover using alternatives, and fluency degrades more slowly.

How the steering actually works

The slogan—bias the router toward certain experts—sounds high level, but the implementation is clean: first you measure which experts correlate with a behavior, then you shift the router’s scores to promote or suppress them.

Step 1: Find behavior-linked experts

Start with paired prompts \((x^{(1)}, x^{(2)})\) that differ only in the behavior you care about. Typical contrasts:

  • Faithfulness: (question + document) vs. (question alone)
  • Safety: (unsafe prompt + refusal) vs. (unsafe prompt + harmful answer)

For each expert \(i\), count how often it is selected on tokens from each condition, and compare activation rates (not raw counts) so long and short prompts stay on the same footing. Write \(A_i^{(1)}\) for the number of token positions where expert \(i\) is selected under \(x^{(1)}\), and \(N^{(1)}\) for the total number of tokens in that forward pass (and analogously for \((2)\)). Then

$$p_i^{(1)} = \frac{A_i^{(1)}}{N^{(1)}}, \quad p_i^{(2)} = \frac{A_i^{(2)}}{N^{(2)}}.$$

The contrastive score is simply the gap between those rates:

$$\Delta_i = p_i^{(1)} - p_i^{(2)}.$$

In words, \(\Delta_i\) asks: how much more often does this expert get used when the model exhibits behavior (1) instead of (2)?

Large positive \(\Delta_i\) means expert \(i\) tracks behavior (1); large negative values track behavior (2). What I like here is that experts are treated as measurable switches, not as opaque blobs: you are doing literal counting and differencing.
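To make the counting concrete, here is a minimal NumPy sketch of Step 1, assuming we can record which experts the router selects at each token position. The function names and the random stand-in data are illustrative, not from the paper's code.

```python
import numpy as np

def activation_rates(selected_experts: np.ndarray, num_experts: int) -> np.ndarray:
    """selected_experts: (num_tokens, top_k) array of chosen expert ids.
    Returns p_i = fraction of token positions at which expert i is selected."""
    counts = np.bincount(selected_experts.ravel(), minlength=num_experts)
    return counts / selected_experts.shape[0]

def contrastive_scores(sel_1: np.ndarray, sel_2: np.ndarray, num_experts: int) -> np.ndarray:
    """Delta_i = p_i^(1) - p_i^(2); large positive -> expert tracks behavior (1)."""
    return activation_rates(sel_1, num_experts) - activation_rates(sel_2, num_experts)

# Toy example: random routing traces standing in for the two conditions.
rng = np.random.default_rng(0)
E = 8
sel_doc = rng.integers(0, E, size=(100, 2))    # stand-in for (question + document)
sel_no_doc = rng.integers(0, E, size=(60, 2))  # stand-in for (question alone)
delta = contrastive_scores(sel_doc, sel_no_doc, E)

promote = np.argsort(delta)[-2:]   # candidate A+ set (most behavior-(1)-linked)
suppress = np.argsort(delta)[:2]   # candidate A- set (most behavior-(2)-linked)
```

Using rates rather than raw counts is what keeps the two conditions comparable when the prompts have different lengths, as noted above.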

Step 2: Bias the router

From those scores you form two sets: \(A^+\), experts to promote, and \(A^-\), experts to suppress. At each token the router produces logits \(z = (z_1, \ldots, z_E)\). The paper works in log-probability space: \(s = \log \mathrm{softmax}(z)\), so \(s\) is a vector of log routing weights. The intervention is a small, structured shift: let \(s_{\max} = \max_j s_j\) and \(s_{\min} = \min_j s_j\), pick a step size \(\epsilon\), then

  • for \(k \in A^+\) (activate): \(s_k \leftarrow s_{\max} + \epsilon\);
  • for \(k \in A^-\) (deactivate): \(s_k \leftarrow s_{\min} - \epsilon\).

Finally renormalize to a proper distribution:

$$p_i = \frac{e^{s_i}}{\sum_j e^{s_j}}.$$

No weight updates, no hard mask that forces a single expert to run everywhere—just a tilted distribution over experts, then business as usual for the rest of the layer.
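The whole intervention fits in a few lines. A minimal sketch, assuming router logits are available per token; the \(\epsilon\) value and the promote/suppress sets are illustrative.

```python
import numpy as np

def steer_router(logits: np.ndarray, promote, suppress, eps: float = 1.0) -> np.ndarray:
    """logits: (E,) router logits z. Returns steered routing probabilities."""
    z = logits - logits.max()              # shift for numerical stability
    s = z - np.log(np.exp(z).sum())        # s = log softmax(z)
    s_max, s_min = s.max(), s.min()        # extremes of the original scores
    s = s.copy()
    s[list(promote)] = s_max + eps         # activate: push above current max
    s[list(suppress)] = s_min - eps        # deactivate: push below current min
    p = np.exp(s - s.max())
    return p / p.sum()                     # renormalize to a proper distribution

z = np.array([2.0, 0.5, -1.0, 0.0])
p = steer_router(z, promote=[2], suppress=[0])
# expert 2 now carries the largest routing weight; expert 0 the smallest
```

Note that the shift happens in log-probability space, as in the paper's formulation, so promoted experts land just above the previous maximum rather than at some arbitrary absolute value.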

Why this feels different from activation steering

Three consequences matter for how I think about it:

  • Weights stay frozen.
  • You are not forcing one expert to take over the full depth of the model.
  • You only change the probabilities over which experts get a say at this routing step.

Classic activation steering is a geometric move on a hidden state: \(h \leftarrow h + \alpha v\). Here the primitive is distributional: routing is \(\mathrm{route} \sim \mathrm{softmax}(z)\), and you implement steering as \(\mathrm{softmax}(z) \to \mathrm{softmax}(z + \delta)\) after the structured update to \(s\). Same broad recipe—small intervention, global behavioral effect—but applied one level up, at who computes, not only at what vector they see.
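The two primitives can be put side by side in a few lines. All values here (\(\alpha\), \(v\), \(\delta\)) are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Geometric primitive: perturb the representation itself.
h = np.array([0.2, -0.1, 0.4])      # hidden state
v = np.array([1.0, 0.0, -1.0])      # concept direction
alpha = 0.5
h_steered = h + alpha * v           # h <- h + alpha * v

# Distributional primitive: perturb who gets to compute.
z = np.array([1.0, 0.0, -2.0])      # router logits
delta = np.array([0.0, 2.0, 0.0])   # structured shift toward expert 1
p_steered = softmax(z + delta)      # softmax(z) -> softmax(z + delta)
```

Same shape of recipe in both cases: a small additive change, applied one level apart.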

A useful intuition

If a dense model is like one circuit you can poke, an MoE is closer to a committee: each token is handled by a sparse mixture, schematically

$$\text{output} = \sum_{i \in \mathcal{T}} p_i \cdot \mathrm{Expert}_i(h).$$

Steering is not rewriting what any single expert does. It is changing who gets a vote.
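A toy forward pass makes the committee picture concrete: top-k routing selects the set \(\mathcal{T}\), and the output is the probability-weighted sum of those experts' outputs. The experts here are random linear maps purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E, d, k = 4, 3, 2
experts = [rng.standard_normal((d, d)) for _ in range(E)]  # Expert_i(h) = W_i @ h

def moe_forward(h: np.ndarray, router_logits: np.ndarray, top_k: int = k) -> np.ndarray:
    top = np.argsort(router_logits)[-top_k:]   # selected set T (top-k experts)
    w = np.exp(router_logits[top])
    w = w / w.sum()                            # renormalized routing weights p_i over T
    return sum(p * (experts[i] @ h) for p, i in zip(w, top))

h = rng.standard_normal(d)
out = moe_forward(h, np.array([1.0, -0.5, 0.3, 2.0]))
```

Biasing the logits before the `argsort` is all the steering method does: change which rows of the committee get a vote, and with what weight.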

What I found most interesting

The headline results are strong. But the more interesting point is structural. It suggests that in MoE systems, behaviors like groundedness and refusal are not only properties of a global representation. They may also be properties of route selection.

Alignment, on this view, is not just about what the model knows. It is also about which experts the router chooses to consult.

That connects directly to the attention-guided steering post. The intervention itself is not ornate. The important move is identifying a behavior-linked subset of experts from contrastive data instead of treating all experts as interchangeable. The deeper contribution may be better measurement of where the behavior actually lives.

Why the faithfulness result matters more to me than persona steering

One thing I kept coming back to in the earlier posts was the difference between theatrical steering and epistemic steering. Personas are a good testbed because they make the effect visible. But the concepts I actually care about are things like source-groundedness, evidence-seeking, calibrated uncertainty, and resistance to bluffing.

In that sense, the faithfulness section matters more to me than the safety red-team charts. The document-groundedness setup is basically a routing-level intervention on whether the model follows retrieved evidence or falls back to parametric memory. That feels much closer to the kind of internal control I want these methods to reach.

There is also a clean conceptual symmetry with the concept-vector work. In the dense-model setting, the intervention is additive: modify the representation. Here the intervention is combinatorial: change who gets to process that representation. One is geometric control. The other is topological control.

What I want to try next

One of the unresolved questions from the attention-guided post was whether internal steering methods could move beyond vivid but shallow concepts and toward quieter epistemic ones. This paper makes me think the answer may be yes, at least in sparse architectures.

The recipe is straightforward: build contrastive pairs for the behavior you care about, find the experts whose routing changes, then test whether biasing those routes generalizes.

The obvious next experiments are calibrated uncertainty, citation-seeking, source attribution, and distinguishing evidence-following from bluffing.

The uncomfortable part

The dual-use point is not a side note. The same machinery that can make an MoE model more grounded or safer can also make it less safe, without changing the model weights.

This exposes a new failure mode: safety may be concentrated in a relatively small set of experts, while alternate unsafe pathways remain available and can be resurfaced by small routing shifts.

Some aligned MoE systems may not be aligned everywhere. They may be aligned along the routes the router usually takes. That turns safety from a global property into a path-dependent one.

Middle layers, again

A small result I liked is that the experts most responsible for safety and grounding cluster in the middle layers.

This rhymes with earlier observations that concept enrichment also concentrates in the middle of the network. The pattern keeps showing up: the middle layers look like the place where higher-level behavioral traits are easiest to grab.

What breaks

A few caveats seem important.

First, contrastive datasets can smuggle in template artifacts. Some detected experts may track superficial patterns rather than deep abstractions.

Second, evaluation relies on proxies. These are reasonable but do not fully resolve what the experts represent.

Third, this is not a full mechanistic account. It identifies usable handles, not complete circuits. But that may be enough to make progress.

Why this feels like a next post, not a separate topic

What I like most is that this makes the earlier steering posts feel less like isolated curiosities and more like pieces of one larger picture.

One piece is geometric: internal directions act like semantic controls.

One piece is measurement: those controls improve with better localization.

This paper adds a third piece: in sparse architectures, behavior may also live in routing decisions over specialized subnetworks.

The control surface may be larger than I thought. Not just vectors. Routes too.

What changed my mind

Before reading this paper, I mostly thought of MoE routing as an efficiency trick with interpretability side effects. After reading it, that frame feels too small.

Routing looks more like governance.

It decides which tiny committee inside the model gets to think.

And once you see that, it becomes hard not to ask a bigger question: if groundedness, refusal, and safety are partly routing phenomena, what else is hiding in the router?

Thanks for reading.

© Peter Flo