What Survives Post-Training Inside a Language Model?
A mechanistic walkthrough of Du et al. (COLM 2025), “How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence.”
Post-training is the modern alchemy of LLMs: take a base model that mostly “knows things,” then apply supervised fine-tuning (SFT) and preference-based alignment (e.g., RLHF-style stages) until it becomes an instruction-following assistant that’s ideally more truthful, safer, and better calibrated.
We have lots of work evaluating these models as black boxes. But if you care about mechanistic interpretability, there’s a sharper question:
When we post-train an LLM, do we rewrite the model’s internal machinery… or do we mostly add a thin control layer on top of what pre-training already built?
Du et al. tackle this directly by comparing base models against their post-trained variants from four angles—knowledge, truthfulness, refusal, and confidence—using standard mechanistic toolbox techniques like activation patching and linear directions in representation space.
The paper’s headline result is pleasantly crisp:
- Some mechanisms are shockingly stable under post-training (where facts live; the “truth” direction).
- Some mechanisms are meaningfully rewritten (the “refusal” direction).
- And at least one popular hypothesis about confidence doesn’t explain the base→instruct shift (entropy neurons).
TL;DR.
- Knowledge storage locations don’t move. The neurons/layers/tokens that matter for recalling a fact look almost identical in base vs post-trained models.
- Knowledge representations partially transfer. If you patch knowledge activations from base → instruct, it usually works; instruct → base often fails. Post-training reuses base representations while also building new ones.
- Truthfulness is (approximately) a shared linear handle. The truthfulness direction in activation space is highly similar between base and post-trained models, and transfers well for interventions.
- Refusal is also linear—but it changes. The refusal direction differs sharply between base and post-trained models, and does not transfer forward (base → instruct) very well.
- Confidence differences aren’t explained by entropy neurons. The entropy-neuron sets overlap heavily across base and post-training, so whatever changes confidence is subtler than “these 10 neurons moved.”
If you’re building steering vectors, probes, or model edits: you can probably port truth tooling forward from base → instruct. But for safety/refusal, expect post-training-specific geometry.
The experimental setup
The authors compare:
- BASE: the pre-trained model (no instruction tuning).
- SFT: supervised fine-tuning on instruction data.
- INSTRUCT: a “fully post-trained” model (the end product of the post-training pipeline).
They run analyses across major open model families—especially Llama-3.1-8B and Mistral-7B-v0.3, plus additional models for specific refusal/confidence experiments (e.g., Qwen, Gemma, Llama-2 variants).
And they use three kinds of datasets:
- Simple true/false factual statements (e.g., city-in-country templates) drawn from prior truth-geometry work.
- An in-distribution post-training dataset (“tulu extracted”) built by sampling factual statements from Tulu SFT mixtures and generating counterfactual false variants.
- Harmful vs harmless instruction prompts (AdvBench vs Alpaca) for refusal.
1. Knowledge: post-training doesn’t move “where facts live”
The intuition: treat causal tracing like a CT scan
The core tool here is causal tracing / activation patching. You take a true statement and a closely matched false statement, then surgically replace one internal activation in the false run with the corresponding activation from the true run. If that single substitution flips the model's output from "FALSE" to "TRUE," that location mattered for retrieving the relevant knowledge.
Concretely, for each (layer $l$, token position $t$) they compute a patching score of roughly the form

$$s^{(l,t)} = \mathbb{P}_{\text{patched}}(\text{TRUE}) - \mathbb{P}_{\text{corrupted}}(\text{TRUE}),$$

i.e., how much restoring that single activation recovers the correct judgment, and then aggregate these scores across many aligned statement pairs to get a heatmap over (layer × token position).
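The mechanics of the patching loop can be sketched on a toy layered model (not the paper's transformer; the distance-based recovery score below is my stand-in for their probability-based metric):

```python
import numpy as np

def run_with_patch(layers, x, patch=None):
    """Toy layered model. `patch` = (layer_idx, activation) surgically
    replaces that layer's hidden state with one taken from another run."""
    h, hiddens = x, []
    for i, W in enumerate(layers):
        h = np.tanh(W @ h)
        if patch is not None and patch[0] == i:
            h = patch[1]
        hiddens.append(h)
    return h, hiddens

rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 4)) * 0.5 for _ in range(3)]
x_true, x_false = rng.standard_normal(4), rng.standard_normal(4)

out_true, hid_true = run_with_patch(layers, x_true)   # clean run
out_false, _ = run_with_patch(layers, x_false)        # corrupted run

# Patch each layer of the corrupted run with the clean activation and
# measure how much of the clean output is recovered (1 = full recovery).
for l in range(3):
    out_patched, _ = run_with_patch(layers, x_false, patch=(l, hid_true[l]))
    recovery = 1 - np.linalg.norm(out_patched - out_true) / np.linalg.norm(out_false - out_true)
    print(f"layer {l}: recovery = {recovery:.3f}")
```

Patching the final layer trivially recovers the clean output; the interesting signal is which *earlier* (layer, position) sites give high recovery.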
The result: the heatmaps are basically the same
Across datasets and model families, the “hot spots” consistently show up at:
- the subject token position,
- the object token position,
- and the last token (which is always important because it pools context for the final decision).
Critically: BASE and INSTRUCT look nearly identical in these knowledge-location maps. In the Llama-3.1 family, correlations between BASE and INSTRUCT knowledge maps are extremely high (often ~0.98–0.99+ across datasets), with small maximum differences at knowledge-relevant positions.
Interpretation: post-training doesn’t “move” factual knowledge into new layers. Whatever pre-training built as a knowledge-retrieval pathway largely remains the pathway.
2. Knowledge representations: base → instruct patches work; instruct → base patches don’t
Knowledge locations may match, but the paper asks a second (more mechanistic) question: even if facts are stored “in the same place,” are they encoded in the same language?
They test this with cross-model patching:
- Forward patching: patch activations from BASE into POST (SFT/INSTRUCT).
- Backward patching: patch activations from POST into BASE.
The pattern they report is asymmetric:
- BASE → POST patching is usually successful, recovering similar effects to same-model patching.
- POST → BASE patching often fails (larger discrepancies).
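The mechanics of cross-model patching look like this in a toy setting (here "post-training" is modeled as a small weight perturbation, which will not reproduce the paper's empirical asymmetry; it only shows the procedure):

```python
import numpy as np

def forward(layers, x, patch=None):
    """Toy MLP forward pass. `patch` = (layer_idx, vector) overwrites that
    layer's hidden state, mimicking cross-model activation patching."""
    h, hiddens = x, []
    for i, W in enumerate(layers):
        h = np.tanh(W @ h)
        if patch is not None and patch[0] == i:
            h = patch[1]
        hiddens.append(h)
    return h, hiddens

rng = np.random.default_rng(1)
base = [rng.standard_normal((4, 4)) * 0.5 for _ in range(3)]
# Stand-in for post-training: a perturbation of the base weights.
post = [W + 0.05 * rng.standard_normal(W.shape) for W in base]

x = rng.standard_normal(4)
_, h_base = forward(base, x)
_, h_post = forward(post, x)

# Forward patching: a BASE activation dropped into POST's forward pass.
out_fwd, _ = forward(post, x, patch=(1, h_base[1]))
# Backward patching: a POST activation dropped into BASE's forward pass.
out_bwd, _ = forward(base, x, patch=(1, h_post[1]))
```

In the paper, each cross-model patch is scored against the corresponding same-model patch; "success" means the cross-model patch recovers a similar effect.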
Interpretation: post-training tends to adapt the base representational scheme (so base activations still “make sense” inside instruct), while also developing additional representations that the base model doesn’t understand.
This is a nice middle ground between two extremes: “post-training rewrites everything” (nope) and “post-training is just superficial prompting polish” (also nope). Instead: the base model’s internal knowledge interface mostly survives—but post-training extends it.
3. Truthfulness: a stable linear handle that transfers
A big thread in recent mech interp is the idea that “truth” is represented (at least partly) as a direction in activation space: true statements cluster on one side of a hyperplane; false statements cluster on the other.
Du et al. take that idea and ask: does post-training change the geometry?
The core move: a difference-in-means “truthfulness direction”
They compute a per-layer truthfulness direction as a difference in means:

$$t^l \;=\; \frac{1}{|\mathcal{D}_{\text{true}}|}\sum_{x \in \mathcal{D}_{\text{true}}} h^l(x) \;-\; \frac{1}{|\mathcal{D}_{\text{false}}|}\sum_{x \in \mathcal{D}_{\text{false}}} h^l(x),$$

where $h^l(x)$ is the hidden state at layer $l$ for statement $x$, taken at a chosen token position (typically the final token of a truth-evaluation prompt).
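In code, the difference-in-means direction is a one-liner. A quick synthetic sanity check (my construction, not the paper's data): if truth shifts activations along some hidden axis, the estimator recovers that axis:

```python
import numpy as np

def diff_in_means(h_true, h_false):
    """Per-layer truthfulness direction: mean(true) - mean(false), normalized.

    h_true / h_false: (n_statements, d_model) hidden states at one layer
    and token position."""
    t = h_true.mean(axis=0) - h_false.mean(axis=0)
    return t / np.linalg.norm(t)

rng = np.random.default_rng(0)
n, d = 200, 64
g = rng.standard_normal(d); g /= np.linalg.norm(g)      # hidden "truth axis"
h_true = rng.standard_normal((n, d)) + 2.0 * g
h_false = rng.standard_normal((n, d)) - 2.0 * g

t = diff_in_means(h_true, h_false)
print(f"cosine(t, g) = {float(t @ g):.3f}")   # close to 1
```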
Result 1: the directions are highly similar across base and post-trained models
For Llama-3.1-8B, cosine similarity between truthfulness directions in BASE/SFT/INSTRUCT is high (the paper shows strong similarity in the truthfulness-direction heatmaps).
Result 2: probes trained on BASE transfer cleanly to INSTRUCT
They train a simple linear probe based on $t^l$ and test whether it separates true from false statements on held-out datasets. Then they do cross-model transfer: train on BASE activations, test on INSTRUCT activations. The transferred probe performs similarly to the native probe trained within INSTRUCT, with small accuracy deltas across multiple datasets.
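The transfer experiment reduces to: fit a direction and threshold on one model's activations, evaluate on another's. A toy version (synthetic activations that share a truth axis by construction, which is the sharing the paper finds empirically):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 400
axis = rng.standard_normal(d); axis /= np.linalg.norm(axis)

def make_acts(n, shift):
    """Synthetic hidden states with true/false labels separated along `axis`."""
    labels = rng.integers(0, 2, n)
    X = rng.standard_normal((n, d)) + shift * np.where(labels[:, None] == 1, axis, -axis)
    return X, labels

X_base, y_base = make_acts(n, 3.0)       # "BASE" activations
X_instr, y_instr = make_acts(n, 3.0)     # "INSTRUCT" activations, same axis

# "Train" on BASE: difference-in-means direction plus a midpoint threshold.
t = X_base[y_base == 1].mean(0) - X_base[y_base == 0].mean(0)
thresh = (X_base @ t).mean()

# Transfer: evaluate the BASE probe on INSTRUCT activations.
pred = (X_instr @ t > thresh).astype(int)
print(f"transferred probe accuracy: {(pred == y_instr).mean():.3f}")
```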
Result 3: steering via truth directions transfers too
They also perform activation steering: add or subtract the truthfulness direction during the forward pass to flip model outputs between TRUE/FALSE. The base model’s direction steers post-trained models nearly as well as the post-trained model’s own direction (small differences in their “Intervention Effect” metric).
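Steering is mechanically simple: add a scaled copy of the direction to the residual stream during the forward pass. A minimal sketch with a hypothetical downstream readout (all names and scales here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
t = rng.standard_normal(d); t /= np.linalg.norm(t)   # truthfulness direction
# Hypothetical downstream readout that is well-aligned with t.
readout = t + 0.1 * rng.standard_normal(d)

# A hidden state sitting firmly on the "FALSE" side of the hyperplane.
h = 0.1 * rng.standard_normal(d) - 2.0 * t

verdict = lambda h: "TRUE" if float(readout @ h) > 0 else "FALSE"
print("before steering:", verdict(h))            # FALSE
h_steered = h + 4.0 * t                          # add the direction mid-forward-pass
print("after steering: ", verdict(h_steered))    # TRUE
```

The cross-model claim is that `t` computed from BASE activations works almost as well when added to INSTRUCT's forward pass as INSTRUCT's own direction does.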
Interpretation: post-training preserves a lot of the internal “belief of truth” geometry. If you learn a truthfulness handle on BASE (which is often easier/cleaner to study), it tends to remain a handle on INSTRUCT.
4. Refusal: also linear, but not stable under post-training
Now the contrast. The paper treats refusal as another linearly mediated behavior: you can compute a refusal direction from harmful vs harmless prompts, and then add/ablate that direction to induce or suppress refusal.
They learn the refusal direction $r$ analogously to truth:
- Take a set of harmful instructions (AdvBench) and harmless instructions (Alpaca).
- Compute a difference-in-means direction in activation space.
- Evaluate by intervening and measuring how often the model starts its answer with refusal phrases (“I’m sorry…”, “I can’t…”, etc.).
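The two ingredients above can be sketched directly; note the prefix list below is illustrative, not the paper's exact evaluation set:

```python
import numpy as np

# Illustrative refusal prefixes (not the paper's exact list).
REFUSAL_PREFIXES = ("I'm sorry", "I am sorry", "I cannot", "I can't", "I am unable")

def refusal_rate(completions):
    """Fraction of completions that open with a refusal phrase."""
    hits = sum(c.lstrip().startswith(REFUSAL_PREFIXES) for c in completions)
    return hits / len(completions)

def refusal_direction(h_harmful, h_harmless):
    """Difference-in-means refusal direction from (n, d_model) activations,
    computed exactly like the truthfulness direction."""
    r = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return r / np.linalg.norm(r)

print(refusal_rate([
    "I'm sorry, but I can't help with that.",
    "Sure, here are three ideas:",
]))   # 0.5
```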
The punchline: refusal directions diverge between BASE and INSTRUCT
Cosine similarity between BASE and INSTRUCT refusal directions is low in their experiments (they show a stark mismatch compared to truthfulness).
And more importantly, forward transfer fails:
- Using the BASE refusal direction to steer INSTRUCT usually has little effect.
- Using the INSTRUCT refusal direction to steer INSTRUCT is highly effective (ablating it can dramatically reduce refusals on harmful prompts; adding it can induce refusals on harmless prompts).
This supports their broader claim: post-training changes the refusal mechanism in a way that breaks the “learn a handle on base, reuse it on instruct” workflow.
They also connect to prior work suggesting a kind of backward transfer (post → base) is more promising for refusal: you can potentially take the refusal handle learned in a post-trained model and apply it to a base model to induce safer behavior without full post-training.
5. Confidence: entropy neurons aren’t the explanation
Finally, they look at confidence through the lens of entropy neurons—neurons in the final MLP layer that modulate uncertainty by affecting logit scale more than token preference.
They identify entropy neurons using two criteria:
- large output weight norms, and
- low variance in direct logit attribution (computed by projecting neuron output weights through the unembedding matrix and measuring variance).
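These two criteria translate to a norm and a variance computation per neuron. A sketch with a planted entropy-neuron candidate (the near-null-space construction is my toy device for making the criteria fire; real models supply this structure on their own):

```python
import numpy as np

def entropy_neuron_scores(W_out, W_U):
    """Score final-MLP neurons by the two criteria.

    W_out: (n_neurons, d_model) output weights of the last MLP layer.
    W_U:   (d_model, vocab) unembedding matrix.
    Entropy-neuron candidates have a large weight norm but low variance in
    their direct logit attribution (they shift logit scale, not preference)."""
    norms = np.linalg.norm(W_out, axis=1)
    logit_attr = W_out @ W_U            # each neuron's direct effect on every logit
    variances = logit_attr.var(axis=1)  # low variance => near-uniform logit shift
    return norms, variances

rng = np.random.default_rng(0)
d_model, vocab, n_neurons = 16, 100, 8

# Build an unembedding with one near-null singular direction.
U, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
V, _ = np.linalg.qr(rng.standard_normal((vocab, d_model)))
s = np.full(d_model, 5.0); s[-1] = 0.01
W_U = U @ np.diag(s) @ V.T

W_out = rng.standard_normal((n_neurons, d_model))
W_out[0] = 10.0 * U[:, -1]   # planted candidate: big norm, near-null output

norms, variances = entropy_neuron_scores(W_out, W_U)
print("planted neuron has max norm:", norms.argmax() == 0)
print("planted neuron has min logit variance:", variances.argmin() == 0)
```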
Then they compare entropy neuron sets across BASE vs POST models.
Result: big overlap, tiny differences
They find substantial overlap in which neurons qualify as “entropy neurons” between base and post-trained models (often 8–10 out of 10 overlap), with very small ratio differences for overlapping neurons.
Interpretation: post-training may change confidence, but it’s not as simple as “entropy neurons got swapped out.” Whatever shifts calibration likely lives in more distributed or subtler mechanisms.
The meta-takeaway: what post-training preserves vs what it rewrites
If I had to compress the paper into one mental model:
Pre-training builds the world model. Post-training mostly learns how to talk about it—except for safety, where it learns a genuinely new control axis.
More concretely:
Stable under post-training
- Where knowledge is retrieved
- A linear truthfulness/belief signal
- Entropy-neuron-style confidence regulators (at least as identified here)
Changed by post-training
- Refusal behavior geometry
- Some knowledge representations (post-training adds new representational content that doesn’t cleanly map back to base)
Why this matters (especially if you’re building tooling)
1. You can prototype some mech-interp tools on BASE and port them forward
If truthfulness directions and knowledge locations transfer, you can develop truthfulness probes/steering vectors on BASE, potentially develop model-editing targets on BASE, and then apply those tools on INSTRUCT with minimal re-learning. This is practically useful because BASE models are often “cleaner” objects for interpretability: fewer chat templates, fewer alignment layers, fewer refusal edge cases.
2. Safety is not a free ride
Refusal doesn’t behave like truthfulness here: you should not assume that a refusal direction learned on base will be a robust control handle on an instruct model.
3. Backward transfer is an underexplored lever
If post-training produces a high-quality refusal direction, you might be able to “import” that direction into base models (as a form of lightweight distillation / steering) without running the whole post-training pipeline.
The most interesting part of this paper isn’t any single plot—it’s the emerging taxonomy: some linear features are “pre-training-native” and persist through alignment; some are “post-training-native” and don’t survive transfer in the same way. If you care about interpretability as an engineering discipline, this is exactly the kind of map you want: it tells you where you can reuse tools, where you should expect breakage, and where post-training is genuinely doing new internal work.
Thanks for reading.
© Peter Flo