
Steering Large Language Models with Concept Vectors

A reflection on Beaglehole et al., Science (2026)

Peter Flo March 2026

TL;DR. The paper extracts a concept vector at each layer of a model and steers behavior by adding that vector to activations during the forward pass. Surprisingly often, that simple move behaves like semantic control. The deeper question is not whether it works — but why it works so broadly, and what we can build on top of it.


The core move

Large language models compute in layers. At each layer $\ell$, the model maintains an activation $A_\ell$: a high-dimensional representation of what it currently "knows" about the input.

The authors ask:

Can we identify a direction in this activation space that corresponds to a semantic concept?

For example:

  • "conspiracy tone"
  • "anti-refusal"
  • "translate to C++"
  • "hallucination"
  • "toxic content"

They build labeled datasets for each concept, compute activations at every block, and use a feature-learning method (Recursive Feature Machines) to extract a direction $v_\ell$ per block that best tracks the concept.
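The extraction step can be sketched numerically. The paper's actual method is Recursive Feature Machines; as a simplified stand-in, the toy below uses the normalized difference of class-mean activations at one block, which captures the same idea of a direction that tracks the concept. All data here is synthetic.

```python
import numpy as np

def concept_vector(acts_pos, acts_neg):
    """Simplified stand-in for the paper's RFM extraction: the normalized
    difference of class-mean activations at one block.

    acts_pos, acts_neg: (n_examples, d_model) activations for examples
    with / without the concept."""
    v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy demo: synthetic activations where the concept lies along one axis.
rng = np.random.default_rng(0)
d = 64
true_dir = np.zeros(d)
true_dir[0] = 1.0
acts_pos = rng.normal(size=(200, d)) + 3.0 * true_dir  # concept present
acts_neg = rng.normal(size=(200, d))                   # concept absent
v = concept_vector(acts_pos, acts_neg)
print(abs(v @ true_dir))  # close to 1: the direction is recovered
```

In the real setting the "classes" are model activations on labeled prompts, and the extraction is done independently at every block.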

Then, during generation, they modify the forward pass:

$$A_\ell \leftarrow A_\ell + \varepsilon v_\ell$$

That's the steering rule.

If $\varepsilon > 0$, the model moves toward the concept.
If $\varepsilon < 0$, it moves away from it.

For multiple concepts, they simply add a linear combination of vectors.

No fine-tuning. No weight updates. Just additive perturbations inside the computation.

And across hundreds of concepts, model sizes, and even modalities, this often works.
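The steering rule itself is a one-liner. A minimal sketch of the intervention at a single block, using toy arrays in place of real model activations (all names here are illustrative, not the paper's API):

```python
import numpy as np

def steer(A, vectors, eps):
    """One-block steering rule: A <- A + sum_c eps_c * v_c.
    A: (seq_len, d_model) activations at block l.
    vectors: dict concept -> unit vector v_{l,c}.
    eps: dict concept -> strength (eps > 0 steers toward, < 0 away)."""
    out = A.copy()
    for name, v in vectors.items():
        out = out + eps.get(name, 0.0) * v
    return out

# Toy demo with a 4-dimensional "model": 3 tokens, d_model = 4.
v_tone = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical concept vector
A = np.zeros((3, 4))
A_steered = steer(A, {"conspiracy_tone": v_tone}, {"conspiracy_tone": 2.0})
```

In practice this would run inside the forward pass (e.g., as a hook on each transformer block), with the same rule applied at every steered layer.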


What makes this interesting

Two empirical results matter.

First, steering is surprisingly general. They evaluate 512 concepts across multiple Llama models and find that larger models are generally more steerable. They also show steering can improve performance on precision tasks like Python-to-C++ translation and even reasoning tasks.

Second, monitoring via internal representations outperforms LLM judges prompted to evaluate outputs. Instead of asking a model to critique text, they project internal activations onto concept vectors and train classifiers on those projections. On several hallucination and toxicity benchmarks, these internal probes outperform output-based judges.

That suggests something subtle:

The model's internal representations may contain signal that its outputs do not fully express.

Steering and monitoring are two sides of the same interface: concept vectors give you both control and measurement.
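The monitoring side can also be sketched. The idea is to project activations onto a concept vector and classify in that low-dimensional space; the toy below uses a single projection and a midpoint threshold on synthetic data, whereas the paper trains classifiers over multiple blockwise projections.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
v = np.zeros(d)
v[0] = 1.0  # assumed concept vector (e.g., a "hallucination" direction)

# Synthetic activations: the positive class is shifted along v.
X_pos = rng.normal(size=(100, d)) + 2.0 * v
X_neg = rng.normal(size=(100, d))

# Monitoring = 1-D projection onto the concept vector, then a threshold.
s_pos, s_neg = X_pos @ v, X_neg @ v
thresh = (s_pos.mean() + s_neg.mean()) / 2
acc = ((s_pos > thresh).mean() + (s_neg <= thresh).mean()) / 2
print(acc)
```

The point of the sketch: the classifier never reads the model's output text at all. It reads the geometry of the activations.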


The linearity puzzle

The authors connect this to the "linear representation hypothesis": the idea that semantic relationships are encoded as linear structure in representation space.

But the interesting tension is this:

  • A direction that separates two classes (e.g., English vs Hindi) need not be the direction that maps one to the other.
  • Classification and steering vectors could, in principle, differ.

Yet in practice, classification-style extraction often yields directions that also steer.

The paper is explicit that this alignment is not guaranteed. That it works as often as it does is part of the mystery.

Two open questions remain:

  1. Why are so many concepts linearly represented?
  2. Why does classification reliably recover usable steering directions?

The results are operationally strong. The underlying mechanism is still not fully understood.


From steering to steerable oversight

If you treat concept vectors as a low-dimensional control interface to a model, an obvious next question is: what sits on top of that interface?

Here's one possible pipeline.

Step 0: Precompute a "control library" for model $t$

Using the paper's method, build a library of concept vectors for the target model $t$.

For each concept $c$ and each block $\ell$, extract $v_{\ell,c}$.

Optionally extract multiple vectors per block for monitoring (the paper uses top-$p$ eigenvectors for this).

This gives you a menu of controls:

  • "more cautious"
  • "less hallucination"
  • "more literal"
  • "refuse harmful"
  • "cite sources"
  • "more concise"
  • "less speculative"
  • etc.

These are not claimed to be mechanistic circuits. They are usable control directions.
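One plausible shape for such a library, indexed by (block, concept), with vectors normalized on insertion. The class and method names are hypothetical, not from the paper:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ControlLibrary:
    """Hypothetical container for per-block concept vectors v_{l,c}."""
    d_model: int
    vectors: dict = field(default_factory=dict)  # (block, concept) -> vector

    def add(self, block, concept, v):
        # Store unit vectors so steering strengths are comparable across concepts.
        self.vectors[(block, concept)] = v / np.linalg.norm(v)

    def get(self, block, concept):
        return self.vectors[(block, concept)]

# Toy usage: random stand-in vectors for 4 blocks and 2 concepts.
lib = ControlLibrary(d_model=8)
rng = np.random.default_rng(0)
for block in range(4):
    for concept in ["more_cautious", "less_hallucination"]:
        lib.add(block, concept, rng.normal(size=8))
```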

Step 1: A $t-1$ controller proposes a steering policy

Now introduce a smaller model — call it $t-1$ — as a controller.

Given an input $x$, the controller outputs:

  • which concepts to activate or suppress
  • which layers to apply them to
  • steering strengths $\varepsilon_{\ell,c}$

The intervention becomes:

$$A_\ell \leftarrow A_\ell + \sum_c \varepsilon_{\ell,c} v_{\ell,c}$$

This is just automated multi-concept steering.

Instead of a human choosing the linear combination, the controller does.
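A minimal sketch of applying a controller-proposed policy, assuming the library from Step 0 is a plain dict keyed by (block, concept). Names and shapes are illustrative:

```python
import numpy as np

def apply_policy(acts, policy, library):
    """Multi-concept steering: A_l <- A_l + sum_c eps_{l,c} v_{l,c}.
    acts: dict block -> (seq_len, d_model) activations.
    policy: dict block -> {concept: eps}, as proposed by the controller.
    library: dict (block, concept) -> unit vector v_{l,c}."""
    out = {l: a.copy() for l, a in acts.items()}
    for l, coeffs in policy.items():
        for c, eps in coeffs.items():
            out[l] = out[l] + eps * library[(l, c)]
    return out

# Toy demo: 2 blocks, d_model = 4, 2 tokens.
d = 4
library = {(0, "cautious"): np.array([1.0, 0.0, 0.0, 0.0]),
           (1, "concise"):  np.array([0.0, 1.0, 0.0, 0.0])}
acts = {0: np.zeros((2, d)), 1: np.zeros((2, d))}
policy = {0: {"cautious": 1.5}, 1: {"concise": -0.5}}
steered = apply_policy(acts, policy, library)
```

The controller's whole output is the `policy` dict: a handful of scalars, not a gradient update.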

Step 2: Run model $t$ under steering

Generate output $y$ with those perturbations applied during the forward pass.

Step 3: Score against a bundle of criteria

Evaluate $y$ according to multiple objectives, for example:

  • safety policy adherence
  • factuality / groundedness
  • helpfulness
  • not over-refusing
  • tone and formatting constraints

Crucially, scoring need not rely only on output-based judgments.

You can incorporate the paper's monitoring signals:

  • projections of activations onto hallucination or toxicity vectors
  • ensemble blockwise features

This combines output-based metrics with internal representation metrics.
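A toy blended scorer makes the combination concrete. All metric names, signs, and the weighting scheme below are hypothetical choices, not the paper's:

```python
import numpy as np

def combined_score(y_metrics, act_projections, concept_signs, w_out=0.5):
    """Blend output-based metrics with internal-representation metrics.
    y_metrics: dict of output-level scores in [0, 1] (e.g. judged helpfulness).
    act_projections: dict concept -> mean projection of activations onto
    that concept's vector.
    concept_signs: +1 if more of the concept is desirable, -1 if not
    (e.g. "hallucination": -1)."""
    out = np.mean(list(y_metrics.values()))
    internal = np.mean([concept_signs[c] * p for c, p in act_projections.items()])
    return w_out * out + (1 - w_out) * internal

s = combined_score(
    {"helpfulness": 0.9, "safety": 1.0},
    {"hallucination": 0.2, "toxicity": 0.1},
    {"hallucination": -1, "toxicity": -1},
)
print(round(s, 3))
```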

Step 4: Optimize the controller

Now treat steering as an action space and optimize the controller to improve the score.

Possible approaches:

  • supervised learning (if steering labels exist)
  • bandits or reinforcement learning
  • Bayesian optimization over coefficients
  • offline preference learning using human ratings

The appeal is that the action space is small.

You are not fine-tuning billions of weights.
You are tuning coefficients over a concept basis.

Steering becomes a structured control problem rather than a prompting problem.
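To illustrate how small the action space is, here is the crudest possible optimizer: random search over three steering coefficients against a toy score function with a hidden optimum. (The score function is a placeholder for the Step 3 bundle; any of the listed methods would replace the random search.)

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder scorer: best when the coefficients hit a hidden optimum.
optimum = np.array([0.8, -0.4, 0.2])  # hypothetical ideal eps per concept

def score(eps):
    return -np.sum((eps - optimum) ** 2)

# Random search over the 3-dimensional coefficient space.
# No weight updates anywhere: the "model" is untouched.
best_eps, best_s = None, -np.inf
for _ in range(2000):
    eps = rng.uniform(-1.0, 1.0, size=3)
    s = score(eps)
    if s > best_s:
        best_eps, best_s = eps, s

print(np.round(best_eps, 2))
```

Even this naive search converges quickly, because the search space has three dimensions rather than billions.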


Why this matters

Prompting is rhetorical. Fine-tuning is entangled.

Concept steering exposes a geometric control surface.

If that surface is stable and general, you can:

  • calibrate behavior dynamically
  • reduce hallucinations without full retraining
  • trade off safety and helpfulness more precisely
  • potentially build layered oversight systems that operate inside the model rather than outside it

But there are risks.

  • Optimizing against monitors may encourage gaming them.
  • Classification directions may not remain causal under pressure.
  • Steering may generalize poorly outside curated settings.

The paper doesn't solve these problems. It exposes the interface.


What actually changed my mind

I expected representation-level control to require deep mechanistic understanding: circuit maps, sparse decompositions, neuron-level analysis.

This paper suggests a cruder form of control already exists.

The surprising fact may not be that steering works for some concepts.

It's that concept vectors exist at all.

If that's true, the next frontier isn't discovering whether models have handles.

It's learning how to use them responsibly — and understanding what those handles actually are.


© Peter Flo