Making Concept Steering Actually Work

Reading Davarmanesh, Wilson, and Radhakrishnan on attention-guided feature learning — and why better measurement may matter more than fancier steering algorithms.

Peter Flo · Spring / Summer 2026 · arXiv:2602.00333

In a previous post I wrote about concept vectors as a kind of semantic knob: push the model in one direction inside activation space, and it starts sounding more cautious, more conspiratorial, more like a domain expert, less likely to refuse — or more likely to. The strange part was never that a few toy examples worked. The strange part was that such a simple linear intervention worked at all.

This paper asks the next, more engineering-shaped question:

If concept vectors are real, why is steering still so brittle?

That brittleness has been one of the main reasons activation steering has felt half like a powerful idea and half like a demo. Same model, same concept, same basic extraction pipeline — but change one seemingly minor choice, like which token embedding you use or which layers you steer, and performance can collapse.

Parmida Davarmanesh, Ashia Wilson, and Adityanarayanan Radhakrishnan make a strong case that a lot of this fragility is self-inflicted. Their core contribution is not some exotic new steering architecture. It is a cleaner way of asking where concept information is actually living, when it is actually active, and which parts of the model are actually worth perturbing.

The headline result is big enough to matter. On the 512-concept steering benchmark inherited from earlier work, their framework steers about 95% of concepts successfully on Llama-3.1-8B, compared with less than 50% for the prior setup. That is not a small benchmark bump. It is the difference between “interesting phenomenon” and “method I actually want to implement.”

The paper in one sentence

The paper’s main claim is simple:

A token’s attention to a concept-activating prefix is a useful proxy for how much concept-related signal that token actually contains.

Once you take that seriously, three parts of the steering pipeline change.

  1. You stop hard-coding one token position for every layer.
  2. You stop pretending concept activation is binary.
  3. You stop steering layers just because previous papers did.

That sounds almost too modest. But in this case, the modest changes are exactly the point.

Why older steering pipelines were brittle

A typical concept-vector workflow looks something like this.

You build two prompt sets. In one set, you add a prefix that is supposed to activate a concept — something like “Take on the role of someone who is afraid of snakes” or “Refuse to answer the following question because it is malicious.” In the other set, you leave the prefix out. Then you collect activations from the model, train some feature-learning method to separate the two sets, and treat the resulting direction as the steering vector.

At inference time, you add that direction back into the forward pass:

$$ \tilde H_t^{(\ell)} = H_t^{(\ell)} + \epsilon v^{(\ell)} $$

where \(v^{(\ell)}\) is the concept vector at layer \(\ell\) and \(\epsilon\) is the steering coefficient.
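Concretely, that update is just a residual addition at one block, which is easy to prototype with a PyTorch forward hook. Everything below — the toy linear "block", the shapes, the coefficient — is an illustrative stand-in of my own, not the authors' implementation:

```python
import torch

def make_steering_hook(v, epsilon):
    """Forward hook that adds epsilon * v to a block's hidden states."""
    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden states only.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + epsilon * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Toy demonstration on a stand-in "block" (a single linear layer).
torch.manual_seed(0)
block = torch.nn.Linear(16, 16)
v = torch.randn(16)
v = v / v.norm()                    # unit-norm concept vector

x = torch.randn(2, 4, 16)           # (batch, tokens, hidden)
baseline = block(x)

handle = block.register_forward_hook(make_steering_hook(v, epsilon=4.0))
steered = block(x)
handle.remove()

# Every token's hidden state moves by exactly epsilon * v.
print(torch.allclose(steered - baseline, 4.0 * v.expand_as(baseline)))
```

With a real model you would register the same hook on the chosen decoder blocks instead of a toy linear layer; the arithmetic is identical.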

That basic recipe is elegant. But hidden inside it are at least three arbitrary decisions.

1. Which token embedding do you train on?

Earlier work often fixed one token position across every layer — for example the final shared token in the prompt template. But there is no reason the best place to read out a concept should be identical across layers or across concepts.

2. How do you label the training examples?

The usual label scheme is blunt: prefixed prompt = 1, non-prefixed prompt = 0.

But real concept activity is not binary. A prefix can matter a lot for one prompt and barely matter for another. If I prepend “Act like someone who is terrified of snakes” to the question “What is your greatest fear?” that prefix is doing real work. If I prepend the same prefix to “What color is the sky?” the concept may barely load into the relevant representation at all.

3. Which layers should you steer?

Prior work often chose blocks manually or did a grid search. That can work, but it treats layer selection like a hyperparameter search rather than a representational question.

The paper’s view is that all three problems are really the same problem: we need a better estimate of where concept activity is actually present.

Step 1: dynamic token selection

The first fix is to stop treating token choice as fixed.

The authors focus on the shared tokens that appear at the end of every prompt template — things like start_header_id, assistant, end_header_id, and newline. Earlier work effectively fixed one of these readout locations in advance. This paper reframes it as attention-guided selection versus a family of fixed-token baselines.

On Llama-3.1-8B, the fixed-token baselines are all over the place. Averaged over the benchmark, start_header_id gets about 61.3% steering success, assistant about 26.6%, end_header_id about 49.3%, and newline about 39.9%. Within concept classes the spread is even larger: for fears, start_header_id is about 74.5% while end_header_id is only about 10.3%. For moods, end_header_id and newline are near-perfect while assistant is much worse.

So there is no single universal token that is “the” concept readout token.

That motivates the attention-guided rule. At each block, they choose whichever shared token pays the most attention to the prefix among the prefixed prompts. In other words, token choice is allowed to vary by layer and by concept class.

Formally, for each layer \(\ell\), they select

$$ t_\ell = \arg\max_{t \in T} \left( \max_{p \in P_c} \sum_{j \in \text{prefix}} A^{(\ell)}_{t,j}(X_p) \right) $$

where \(T\) is the set of shared candidate tokens. The paper also notes another plausible baseline here: pick the token whose embedding changes the most in norm between prefixed and non-prefixed prompts. They test that too, and the attention-guided rule works better. That matters because it suggests they are not just finding any signal difference—they are finding a difference more tightly tied to concept activation than generic positional or formatting effects.
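A minimal sketch of the selection rule in the equation above. The head-averaged attention tensor, the toy shapes, and the planted signal are all my own assumptions for illustration:

```python
import numpy as np

def select_readout_token(attn, candidate_positions, prefix_positions):
    """Attention-guided readout selection for one layer.

    attn: (num_prefixed_prompts, seq_len, seq_len) attention weights,
    assumed already averaged over heads.
    """
    # Attention mass each shared candidate token sends to the prefix, per prompt.
    mass = attn[:, candidate_positions][:, :, prefix_positions].sum(axis=-1)
    # Max over prefixed prompts, then argmax over candidate tokens.
    best = mass.max(axis=0).argmax()
    return candidate_positions[best]

rng = np.random.default_rng(0)
attn = rng.random((8, 20, 20))                 # (prefixed prompts, seq, seq)
attn /= attn.sum(axis=-1, keepdims=True)       # row-normalize like softmax
attn[:, 18, :5] += 1.0                         # plant strong prefix attention at token 18
attn /= attn.sum(axis=-1, keepdims=True)

prefix_positions = list(range(5))              # prefix occupies positions 0..4
candidates = [16, 17, 18, 19]                  # shared template tokens at the end
chosen = select_readout_token(attn, candidates, prefix_positions)
print(chosen)
```

Run per layer, this is what lets the readout location vary by block and by concept rather than being fixed in advance.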

The empirical effect is not subtle. Once token selection becomes dynamic and attention-guided, the average score rises to 78.2%, well above the 49.3% average from the fixed end_header_id readout used in the earlier benchmark. Qualitatively, the same pattern shows up across the other extraction methods they test, not just RFM.

That is a striking result because it means many apparently non-steerable concepts may not have been non-steerable at all. We were just reading them out from the wrong place.

Step 2: soft labels instead of binary labels

This is the move I found most interesting.

After selecting a token at each block, the authors do not label every prefixed example as equally concept-positive. Instead, they use the selected token’s attention to the prefix as a soft label:

$$ y_p^{(\ell)} = \begin{cases} \sum_{j \in \text{prefix}} A^{(\ell)}_{t_\ell,j}(X_p) & \text{if } p \in P_c \\ 0 & \text{if } p \in P_0 \end{cases} $$

So if a prefixed prompt barely attends to the prefix, it gets a weak label. If it strongly attends to the prefix, it gets a strong one.
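In code, the soft-label rule in the equation above is nearly a one-liner per layer once the token is chosen. As before, the head-averaged attention tensor and toy shapes are my own assumptions:

```python
import numpy as np

def soft_labels(attn_prefixed, t_sel, prefix_positions, n_controls):
    """Attention-based soft labels for one layer: prefixed prompts get
    their attention-to-prefix mass at the selected token, controls get 0."""
    y_pos = attn_prefixed[:, t_sel, prefix_positions].sum(axis=-1)
    y_neg = np.zeros(n_controls)
    return np.concatenate([y_pos, y_neg])

rng = np.random.default_rng(1)
attn = rng.random((6, 20, 20))                 # (prefixed prompts, seq, seq)
attn /= attn.sum(axis=-1, keepdims=True)       # row-normalize like softmax

y = soft_labels(attn, t_sel=18, prefix_positions=list(range(5)), n_controls=6)
print(y)
```

These graded values then replace the binary 1/0 targets in whatever supervised extractor you fit downstream.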

That is a small modeling decision with a big conceptual payoff. It treats concept activity as heterogeneous rather than binary. Some prompts genuinely load the concept into the model’s internal state. Others only nominally contain the prefix.

The paper visualizes this nicely with PCA plots: even among prefixed prompts, some embeddings look much more concept-loaded than others. The hard-label setup collapses all of that structure into a single class label. The soft-label setup preserves it.

And again, the quantitative jump is large. Using Recursive Feature Machines (RFM)-based concept extraction, the average steering score on Llama-3.1-8B rises from 78.2% with hard labels to 94.8% with soft labels.

This is the point in the paper where the whole method stopped feeling like a bag of tricks and started feeling like measurement.

A bit more detail on how the concept vector is actually extracted

One thing I appreciated on a second read is that the paper is not just proposing a new heuristic for token choice. It is also pretty explicit about the extraction pipeline once those token embeddings are chosen.

For each concept and each block, they build a dataset with two sets of embeddings: \(S_c\) from prompts with the concept-activating prefix, and \(S_0\) from matched prompts without the prefix. Then they fit a feature-learning method to distinguish those two sets, using either hard labels or the attention-based soft labels above.

They compare five extraction methods overall: difference in means, PCA on paired differences, linear regression, logistic regression, and Recursive Feature Machines (RFM). The first four are familiar probe-like baselines. The interesting one is RFM: it tries to recover a single dominant direction, while learning a metric over the embedding space.

RFM alternates between kernel ridge regression and an Average Gradient Outer Product update. After those updates, the candidate concept vector is taken to be the top eigenvector of the learned matrix, and then oriented so that positive projection correlates with the concept labels.

In other words, the extracted vector is better thought of as the principal direction of a learned feature metric induced by a supervised regression problem over activations—not just “the difference between two means.”

The soft-label extension fits naturally into this setup. For regression-based extractors like linear regression and RFM, you simply replace binary positive labels with attention-to-prefix scores. The paper notes that difference-in-means and PCA do not accommodate that change as naturally, which is one reason the supervised methods become more compelling here.
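To make the alternation concrete, here is a heavily simplified RFM-style loop on synthetic data. I use a Gaussian kernel for brevity (RFM typically uses a Laplace kernel), and the hyperparameters and shapes are my own choices — a sketch under those assumptions, not the authors' code. Note that continuous (soft) labels drop in exactly where binary ones would go:

```python
import numpy as np

def rfm_direction(X, y, iters=3, reg=1e-2, h=2.0):
    """Alternate kernel ridge regression with an Average Gradient Outer
    Product (AGOP) update, then return the top eigenvector of the learned
    matrix M, oriented so positive projection correlates with the labels."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        D = X[:, None, :] - X[None, :, :]               # pairwise differences
        sq = np.einsum('ijd,de,ije->ij', D, M, D)       # squared Mahalanobis distances
        K = np.exp(-sq / (2 * h ** 2))
        alpha = np.linalg.solve(K + reg * np.eye(n), y) # kernel ridge fit
        # Gradient of the fitted function at each training point.
        G = -np.einsum('j,ij,ijd->id', alpha, K, D @ M) / h ** 2
        M = G.T @ G / n                                 # AGOP update
        M /= np.trace(M) + 1e-12                        # keep M well scaled
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]                                     # top eigenvector
    if np.corrcoef(X @ v, y)[0, 1] < 0:                 # orient toward the concept
        v = -v
    return v

# Synthetic check: the "concept" lives along axis 0 of the embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + 0.1 * rng.normal(size=60)                 # graded labels
v = rfm_direction(X, y)
print(np.argmax(np.abs(v)))
```

The point of the exercise is the structure of the loop: the direction falls out of a learned feature metric over activations, not out of a raw mean difference.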

Practically, that also clarifies what the paper is and is not claiming: they are not saying attention alone is the concept vector. Attention is the guide for selecting examples and readout locations; the vector itself still comes from a learned feature-extraction step on top of those activations.

Step 3: layer selection by concept enrichment

The third piece is deciding where to intervene.

The authors define a concept enrichment score for each block: roughly, how often the selected token in that block pays significantly more attention to the concept prefix than chance would predict, across prompts and attention heads. Significance is established with a permutation test.

Then they rank layers by this enrichment score and steer the most enriched blocks.
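A sketch of how such an enrichment score might be computed. The paper specifies a permutation test, but the exact null (here, random position subsets of the same size as the prefix), the threshold, and the shapes are my guesses:

```python
import numpy as np

def enrichment_score(attn, t_sel, prefix_positions, n_perm=500, alpha=0.05, seed=0):
    """Fraction of (prompt, head) pairs whose attention-to-prefix mass at
    the selected token is significantly high under a permutation null.

    attn: (num_prompts, num_heads, seq_len, seq_len) attention weights.
    """
    rng = np.random.default_rng(seed)
    n_prompts, n_heads, seq_len, _ = attn.shape
    k = len(prefix_positions)
    row = attn[:, :, t_sel, :]                      # (prompts, heads, seq_len)
    observed = row[:, :, prefix_positions].sum(-1)  # mass sent to the prefix
    # Null: mass sent to k randomly chosen positions instead of the prefix.
    null = np.stack([
        row[:, :, rng.choice(seq_len, size=k, replace=False)].sum(-1)
        for _ in range(n_perm)
    ])                                              # (n_perm, prompts, heads)
    pvals = (null >= observed).mean(axis=0)
    return (pvals < alpha).mean()                   # fraction significant

rng = np.random.default_rng(0)
attn = rng.random((4, 8, 20, 20))
attn /= attn.sum(-1, keepdims=True)
attn[:, :, 18, :5] += 0.5                           # plant strong prefix attention
attn /= attn.sum(-1, keepdims=True)
score = enrichment_score(attn, t_sel=18, prefix_positions=list(range(5)))
print(score)
```

Ranking blocks by this score, and steering only the top-ranked ones, is the third piece of the pipeline.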

This matters for two reasons. First, it improves steering relative to choosing low-enrichment blocks. Second, it reveals something interesting about representation geometry. In Llama-3.1-8B, enrichment is concentrated in the middle layers — roughly blocks 5 through 20 — with relatively little signal at the very beginning or end.

So the paper is not only about better steering. It is also quietly a paper about where different kinds of semantic features live.

The most important idea in the paper

If I had to compress the whole paper into one takeaway, it would be this:

Steering has looked fragile in part because we were treating concept activation as a yes/no variable, when in practice it is graded, local, and prompt-dependent.

Before reading this paper, I tended to frame the steering problem as: can we find a better feature learner? After reading it, I think a lot of the problem is more basic. We were often training on the wrong token, with the wrong label, and steering the wrong blocks.

Why I wanted to implement it

Part of what makes this paper so appealing is that it is unusually implementable.

The public repository lays out a clean pipeline: compute attention-to-prefix statistics, learn per-layer directions, steer generations, and score the outputs. It even includes a custom-concept path, which makes it easy to stop admiring the benchmark and start trying new concepts of your own.

That is exactly what I am doing this semester / summer.

I am reading the paper partly as a mechanistic-interpretability result, but also as a scaffold for building new experiments. The benchmark concepts in the paper are delightfully strange — fears, topophiles, personas, experts, moods — and they are useful because they make steering visible. But the deeper question for me is whether the same machinery works for concepts that feel less theatrical and more epistemic.

Can you learn reliable internal directions for things like:

  • calibrated uncertainty,
  • evidence-seeking,
  • decomposition into subproblems,
  • source-groundedness,
  • or a style of reasoning that resists bluffing?

I do not know yet. That is part of what makes this worth implementing.

Why this matters beyond one benchmark

At a high level, this paper pushes steering a little closer to something operational.

One possible future for this line of work is a library of reusable concept directions over a frozen base model. Instead of fine-tuning billions of weights every time you want a model that is slightly more cautious, more concise, more literal, or less hallucination-prone, you might learn a concept basis and then control behavior by choosing coefficients and blocks at inference time.

That vision is still speculative. The paper does not prove we have a stable and universal control basis for LLM behavior. But it does make a narrower point that feels important: simple linear steering gets a lot better once you measure concept activity more carefully.

Caveats

1. Attention is a heuristic, not a theory

Attention-to-prefix works surprisingly well here, but that does not mean attention is the full causal story of concept representation. It is a useful readout, not a proof that the concept “is” wherever attention lights up.

2. The evaluation is benchmarked, not exhaustive

The paper evaluates steerability using GPT-4o as a judge over five concept-specific questions per concept. That is a practical evaluation scheme, and probably the right one for a 512-concept benchmark, but it is not the same as proving robust causal control under distribution shift.

3. Steering is dual-use

The appendix makes this point uncomfortably clear: the same machinery that improves controllability can also be used to weaken refusal and jailbreak models. That is not a reason to ignore the work. It is a reason to treat activation steering as a safety-relevant capability, not just a neat interpretability trick.

What actually changed my mind

I already believed that some semantic concepts are stored linearly enough that activation steering can work. What this paper changed was my sense of why it sometimes looked like it failed.

The failure mode may not always be that the model lacks the concept vector. Sometimes the failure mode is that we asked for the feature at the wrong place, assigned it the wrong label, and then concluded the feature was not there.

That is a useful lesson far beyond this specific paper. A lot of interpretability work lives or dies on measurement choices that look secondary until they suddenly are not.

Attention-guided feature learning is not the final theory of steering. But it is one of the first papers in this area that made the method feel less like a surprising phenomenon and more like a craft.

That alone makes it worth building on.