Open Problems in Mechanistic Interpretability
A summary of Sharkey, Chughtai et al. (2025) — a 36-page roadmap (plus 46 pages of references!) for understanding what neural networks actually learn, and why it matters for making AI safe.
Why this paper
I'm writing this as a deliberate entry point into alignment research. My background is in data science and economics — causal inference, mechanism design, the kind of work where you care deeply about why a model gives the answer it does, not just that it does.
That instinct maps well onto mechanistic interpretability: the study of how neural networks compute. But what really pulled me in is a structural parallel. A model that pursues some learned objective while being supervised by a principal who can't observe its internal reasoning — that's a textbook principal-agent problem. And right now, we're the principal with almost no monitoring tools. This paper is the most comprehensive map of what those tools could look like.
My goal is to build in the open, stake a research direction, and have a paper ready for a conference by September 2026.
What is mechanistic interpretability?
Here is the basic situation: neural networks learn capabilities that their developers did not design. Developers design the training process — the loss function, the data, the architecture — but the algorithms that emerge inside the network are learned, not programmed. And in almost all cases, nobody understands them.
Mechanistic interpretability aims to change that. The paper defines "understanding a neural network" as the ability to use knowledge about its internal mechanisms to predict its behavior on arbitrary inputs, or to accomplish practical goals like controlling its outputs or improving its design.
This is distinct from earlier interpretability work. The paper identifies three historical threads:
- Interpretable-by-design models — decision trees, linear models, GAMs. Small and transparent, but limited.
- "Why this decision?" methods — saliency maps, LIME, SHAP. Explain individual predictions, but often unreliable.
- "How does it generalize?" — the current thread. Instead of explaining one prediction, we ask how the model solves an entire class of problems. This is mechanistic interpretability.
The shift from thread 2 to thread 3 matters. A saliency map tells you which pixels mattered for this image. A mechanistic understanding tells you what algorithm the model learned — and lets you predict what it will do on inputs you've never tested.
Each thread emerged as models grew more capable and prior methods hit their limits.
The reverse engineering pipeline
The paper frames the field around two complementary approaches. Reverse engineering takes the network apart and asks "what does this piece do?" Concept-based interpretability starts with a concept and asks "where does the network represent this?"
Reverse engineering follows three steps, iterated until you're satisfied: decompose the network into components, describe what each component does, and validate those descriptions against the network's actual behavior.
Each step has deep open problems. The paper's 36 pages of content are primarily devoted to them. What follows is a detailed walk through each.
Decomposition: finding the atoms
The first question in reverse engineering is deceptively simple: what are the fundamental units of computation in a neural network? If we could identify them, we could study each one in isolation and build up a picture of the whole.
The naive candidates — individual neurons, attention heads, layers — don't work. The paper is clear about why.
Neurons are polysemantic. A single neuron might respond to cats, blue objects, and the letter Q. This was first observed in artificial networks and mirrors findings in neuroscience, where individual biological neurons also encode mixed signals. The "Neuron Doctrine" — that individual neurons are the fundamental functional unit — doesn't hold for either biological or artificial networks.
Attention heads are polysemantic too. Research shows that studying attention patterns can be actively misleading — the pattern of "who attends to whom" doesn't reliably tell you what information is being transmitted.
Layers are too coarse. Representations can span multiple layers. Intervening on a single layer — as in early model editing work like ROME — doesn't reliably carve the network at its joints.
So researchers turned to unsupervised methods. Early approaches used dimensionality reduction (PCA, SVD, non-negative matrix factorization) to find structure in activation vectors. These methods work well when the number of meaningful directions is smaller than the number of dimensions. But a crucial insight changed the game.
Sparse dictionary learning
The superposition hypothesis says that neural networks represent more features than they have dimensions. They can do this because each feature activates sparsely — most features are "off" for any given input — and high-dimensional spaces have room for many nearly-orthogonal directions. This means dimensionality reduction methods, which can only find as many directions as there are dimensions, fundamentally cannot recover the full set of features.
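A quick NumPy illustration (mine, not the paper's) of the geometric fact superposition relies on: random unit vectors in high dimensions are nearly orthogonal, so a 100-dimensional space can host hundreds of feature directions with little interference.

```python
import numpy as np

# Toy demonstration: pack 500 random "feature" directions into a
# 100-dimensional space and check how close to orthogonal they are.
rng = np.random.default_rng(0)
d, n = 100, 500
vecs = rng.normal(size=(n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

dots = vecs @ vecs.T
np.fill_diagonal(dots, 0.0)          # ignore each vector's self-similarity

print(f"max |cos|:  {np.abs(dots).max():.3f}")
print(f"mean |cos|: {np.abs(dots).mean():.3f}")  # close to 0: nearly orthogonal
```

The mean pairwise cosine similarity comes out around 0.08 — five times more directions than dimensions, yet almost no interference, which is exactly the room superposition exploits.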
Sparse dictionary learning (SDL) was developed to overcome this. The idea: train a small encoder-decoder network where the hidden layer is much wider than the input. The encoder maps activations into this overcomplete space, with a sparsity constraint that forces most latents to be zero on any given input. The decoder provides a dictionary of directions. If all goes well, each dictionary element corresponds to a single interpretable feature.
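The paper describes SDL abstractly; as a concrete sketch, here is a single forward pass of a TopK-style sparse autoencoder in plain NumPy. All sizes and weights are invented for illustration, and there is no training loop.

```python
import numpy as np

def topk_relu(z, k):
    """Keep only the k largest (post-ReLU) activations; zero the rest."""
    z = np.maximum(z, 0.0)
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]          # indices of the k largest entries
    out[idx] = z[idx]
    return out

rng = np.random.default_rng(0)
d_model, d_dict, k = 64, 512, 8       # overcomplete: 512 latents for 64 dims

W_enc = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_model, d_dict))  # columns = dictionary directions

x = rng.normal(size=d_model)                 # a model activation vector
latents = topk_relu(W_enc @ x + b_enc, k)    # sparse code: at most k nonzero
x_hat = W_dec @ latents                      # reconstruction from k dictionary elements

print("nonzero latents:", np.count_nonzero(latents))
```

If training goes well, each of the 512 dictionary columns ends up pointing along one interpretable feature direction, and any given activation is explained by only a handful of them.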
SDL rests on two assumptions. First, the linear representation hypothesis — that concepts map to linear directions in activation space. Second, the superposition hypothesis — that there are more of these directions than dimensions. Both have empirical support, but both have known exceptions.
SDL variants include Sparse Autoencoders (SAEs), Transcoders (which reconstruct the next layer instead of the same one), and Crosscoders (which can reconstruct across many layers simultaneously, capturing cross-layer features). SDL is currently the most popular decomposition method in mechanistic interpretability.
But the paper is remarkably candid about its limitations.
Eight problems with SDL
The paper catalogs eight distinct limitations — practical and conceptual — that constrain what SDL can tell us about neural networks. These aren't minor quibbles; they're fundamental challenges that the field needs to resolve or route around.
Why these limitations matter together
Individually, each limitation is a research challenge. Together, they paint a picture of a method that's useful but far from sufficient. SDL gives us activations, not mechanisms. It assumes linearity in nonlinear models. It's expensive, lossy, data-dependent, and ignores the geometric relationships between the features it finds. The paper's implicit message is that the field needs either fundamentally better methods or much stronger theoretical foundations to move beyond where we are now.
Theoretical foundations
The paper makes a striking admission: despite being the central object of study, we don't have a satisfying formal definition of a "feature." We don't know, in a rigorous sense, whether the superposition hypothesis is fundamentally true or merely pragmatically useful. And we don't have a principled theory that tells us how to carve a neural network at its joints.
Several theoretical frameworks have been proposed. Causal abstraction (Geiger et al.) tries to ground interpretability in formal causality, but hasn't yet yielded canonical causal mediators that could serve as a basis for decomposition. Singular learning theory (Watanabe) studies the geometry of loss landscapes and offers a view of how mechanistic structure emerges during training, but hasn't produced practical interpretability tools. Spline theory, neural tangent kernels, simplicity bias — all attempt to explain why networks generalize, but none have been successfully connected to approaches for interpreting them.
One promising direction is intrinsic interpretability: instead of interpreting opaque models post-hoc, train models that are interpretable by design. Approaches include sparse activation functions (TopK, SoLU), mixture-of-experts architectures with enough sparsely-activating experts that individual experts might become interpretable, weight sparsity via pruning, and modular architectures that encourage geometrically local connections. The challenge is that attempts so far have either sacrificed performance or allowed "superposition to sneak through."
Description and validation
Decomposition gives you components. Description gives you hypotheses about what those components do. The paper emphasizes a failure mode that has been "regrettably commonplace": conflating hypotheses with conclusions.
How to describe a component
Descriptions can focus on causes (what makes a component activate) or effects (what happens downstream when it fires). Each approach has tools and pitfalls:
Highly activating examples — show the inputs on which a component fires most, then eyeball commonalities. This is the simplest and most widely used method, but it's dangerously prone to interpretability illusions. Humans project familiar concepts onto alien representations. Bolukbasi et al. showed that human annotators identified dramatically different meanings for the same direction in BERT depending on which dataset they drew examples from. Worse, you can find plausible-looking explanations for arbitrary directions — not just the ones that actually matter.
Attribution methods — gradient-based (integrated gradients, grad-CAM) or perturbation-based (SHAP, ablation) measures of causal importance. Theoretically stronger, but gradient-based methods often identify only a first-order approximation, and Adebayo et al. showed that some are independent of both the model and the data. An adversary can train a model to produce any attribution map they want.
Logit lens and direct logit attribution — project intermediate representations into vocabulary space to see what a component "is thinking about." The logit lens measures direct effects. It's clean and cheap, but can't capture indirect effects — how a representation influences downstream layers.
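As a toy sketch of the logit-lens idea (vocabulary and matrices invented for illustration): project an intermediate residual-stream vector through the unembedding matrix and read off which tokens it currently favors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 256, 6
tokens = ["paris", "london", "cat", "blue", "the", "q"]

W_U = rng.normal(size=(d_model, vocab))   # unembedding matrix: d_model -> vocab

# Pretend this hidden state sits a few layers before the output; in a real
# model you'd read it off with a forward hook. Here it's constructed to
# point along the "paris" unembedding direction.
h = 2.0 * W_U[:, 0]

logits = h @ W_U                          # direct projection into vocab space
print("top token:", tokens[int(np.argmax(logits))])
```

This shows the direct effect only — which is precisely the limitation noted above: any influence routed through downstream layers is invisible to this projection.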
Causal interventions — the gold standard. Activation patching substitutes the value of a component with a value from a different input and observes what changes. Path patching isolates the effect of one component on one specific downstream component. Causal scrubbing generalizes this to test hypotheses about arbitrary component relationships. The downside: each intervention requires a full forward pass, making large-scale application expensive.
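Here is a minimal activation-patching sketch on a handcrafted two-layer network (my toy, not the paper's setup): cache a hidden activation on a clean input, then splice it into a run on a corrupted input and see how much of the clean output it restores.

```python
import numpy as np

# Handcrafted weights chosen so hidden unit 0 carries most of the signal.
W1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0]])
W2 = np.array([2.0, 0.0, 0.0, 1.0])

def forward(x, patch=None):
    h = np.maximum(W1 @ x, 0.0)       # hidden layer: the "components"
    if patch is not None:
        unit, value = patch
        h = h.copy()
        h[unit] = value               # intervene on one hidden unit
    return float(W2 @ h)

x_clean = np.array([1.0, 0.5, -0.2])
x_corrupt = np.array([-1.0, 0.1, 0.8])

h_clean = np.maximum(W1 @ x_clean, 0.0)   # cache clean activations
y_clean = forward(x_clean)                # 3.3
y_corrupt = forward(x_corrupt)            # 0.0

# Run the corrupted input, but patch in unit 0's clean value:
y_patched = forward(x_corrupt, patch=(0, h_clean[0]))   # 2.0
print(f"clean {y_clean:.2f}  corrupt {y_corrupt:.2f}  patched {y_patched:.2f}")
```

Patching one unit recovers most of the gap between the corrupted and clean outputs, which is the kind of evidence that the unit causally carries task-relevant information. Path patching and causal scrubbing refine the same move to individual edges and whole hypotheses.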
How to validate a description
The paper lists five levels of validation, roughly from weakest to strongest:
- Predict activations — use your explanation to predict the component's behavior on new inputs. Can be done by humans or AI.
- Explain failures — does your interpretation predict adversarial examples or hallucinations?
- Handcraft replacements — can you build a simple substitute that reproduces the component's behavior? Cammarata et al. did this for a curve detector in a CNN.
- Test on ground truth — use toy models with known algorithms to benchmark your tools.
- Achieve engineering goals competitively — the highest bar. Fair comparison against non-interpretability baselines on real, non-cherry-picked tasks.
Two pieces of infrastructure could help: establishing model organisms (the interpretability community's "Drosophila melanogaster") and building benchmarks with known ground-truth explanations. Currently, researchers mostly use GPT-2 or small transformers trained on modular addition, but there's no consensus on what the standard should be.
Circuit discovery
Circuit discovery is the most developed pipeline in mechanistic interpretability. It combines decomposition, description, and validation into a concrete workflow:
- Define a task — pick something the model can do (e.g., "complete the pattern: The capital of France is ___").
- Decompose into a graph — represent the network as a DAG where nodes are components (attention heads, MLP layers, or SDL latents) and edges are "abstract weights."
- Identify the relevant subgraph — use activation patching or attribution to find which nodes and edges matter for this task.
- Describe each component — the creative, labor-intensive step. Hypothesize functions, design experiments, iterate.
- Validate — check faithfulness (does the circuit approximate the full model?), minimality (are all components necessary?), and completeness (are we missing any?).
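The subgraph-identification step can be sketched with the crudest attribution method, zero-ablation: knock out each node and rank nodes by how much the output moves. The network and weights below are invented for illustration.

```python
import numpy as np

# Three "components" feeding one output; weights chosen so node 0 dominates.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
W2 = np.array([3.0, 0.1, 0.1])

def forward(x, ablate=None):
    h = np.maximum(W1 @ x, 0.0)
    if ablate is not None:
        h = h.copy()
        h[ablate] = 0.0               # zero-ablate one component
    return float(W2 @ h)

x = np.array([1.0, 1.0])
baseline = forward(x)
scores = {i: abs(baseline - forward(x, ablate=i)) for i in range(3)}
ranked = sorted(scores, key=scores.get, reverse=True)
print("importance scores:", scores)
print("candidate circuit (most important first):", ranked)
```

Real circuit discovery replaces zero-ablation with patching from a corrupted distribution (zero is often off-distribution) and scores edges as well as nodes, but the shape of the loop — intervene, measure, rank, keep the top subgraph — is the same.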
This pipeline has produced real insights — the induction head circuit, the indirect object identification circuit — but the paper is forthright about its shortcomings. Task definitions are human-imposed and may not match the model's internal task structure. Within-task variance is large: a circuit may approximate average performance well but individual datapoints poorly. Circuit faithfulness is low for complex end-to-end tasks. And the tasks studied so far were deliberately chosen to be simple, giving a misleading impression of tractability.
The paper calls this "streetlight interpretability" — studying what's easy to study rather than what matters most. Attempts to apply circuit discovery to arbitrary, complex tasks have been markedly less successful.
Applications
Section 3 of the paper maps interpretability methods onto concrete goals. A key insight: different goals require progress along different axes. Some need better decomposition, some need deeper descriptions, some need understanding of entire models, some need understanding of training dynamics.
One claim that stood out to me: the paper argues that monitoring for unsafe cognition is where interpretability has the strongest comparative advantage over other areas of ML. Most other subfields already focus on controlling input-output behavior. Only interpretability tries to understand the mechanisms of cognition itself. This means it's uniquely positioned to catch deception, sycophancy, and sandbagging — behaviors that are specifically designed to defeat behavioral evaluations.
Scalable oversight and why mech interp comes first
I'm increasingly drawn to scalable oversight as a research area — the problem of how humans can meaningfully supervise AI systems that are more capable than they are. But I've come to believe that mechanistic interpretability is a prerequisite, not a parallel track.
Here's why. Scalable oversight techniques — debate, recursive reward modeling, market-making — all assume you can eventually verify the output of an AI system, even if you can't produce it yourself. But verification gets harder as models get more capable. A model that can reason about concepts beyond human understanding creates outputs that humans can't meaningfully check.
Mechanistic interpretability offers a way to break this bottleneck. If you can inspect the model's internal reasoning — not just its final answer — you have a fundamentally richer signal for oversight. The paper's discussion of mechanistic anomaly detection illustrates this: you don't need to understand every mechanism in the model to notice when it's reasoning in unusual ways. Just as a doctor can flag an abnormal lab result without fully understanding the underlying biochemistry, a monitor could flag unusual internal activity without complete interpretability.
The connection to my economics background is direct. In mechanism design, a principal who can partially observe an agent's actions can design better contracts than one who can only see outcomes. White-box evaluations (using model internals) are the AI safety analog of partial observability — and the theory says partial observability should be worth a lot. But you need to know what to observe, which is exactly the question mechanistic interpretability is trying to answer.
The paper hints at this connection in its discussion of enumerative safety — the idea that if you can decompose a network and describe every component, you could verify that no component implements dangerous capabilities. That's the strongest possible monitoring contract: full observability. We're very far from it, but even partial progress — understanding some components, flagging anomalous ones — would dramatically strengthen oversight.
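A back-of-envelope illustration of the partial-observability point (my framing, with invented numbers, not from the paper): even an unreliable white-box monitor, layered on behavioral evals that it fails independently of, meaningfully raises the chance of catching misaligned cognition.

```python
# Assumed, purely illustrative probabilities of catching a deceptive model:
p_behavioral = 0.30   # behavioral evals alone catch it
p_whitebox = 0.40     # an imperfect internals monitor flags it

# If the two checks fail independently (a strong assumption), layering
# them gives:
p_combined = 1 - (1 - p_behavioral) * (1 - p_whitebox)

print(f"catch probability, evals only:      {p_behavioral:.2f}")
print(f"catch probability, evals + monitor: {p_combined:.2f}")
```

The independence assumption is doing real work here — a model optimizing against both checks would correlate their failures — but it shows why partial interpretability has value long before anything like full observability.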
Sociotechnical problems
The paper's final section is a clear-eyed reckoning with the field's position in society. Three points are especially sharp:
The governance gap. Current AI governance frameworks rarely specify how mechanistic understanding would be operationalized. OpenAI commits to "mitigating biological risks" but doesn't say how. The EU AI Act requires "model evaluation … to identify and mitigate systemic risk." Interpretability could help — but only if it produces tools that policymakers can actually use.
The competitiveness problem. Interpretability methods have not proven competitive with non-interpretability baselines on real tasks. The field evaluates tools on their own curve rather than against alternatives. Until interpretability demonstrates unique value — solving problems that are hard or impossible without it — there's a risk of developing techniques that look good in cherry-picked demos but don't generalize.
The lobbying risk. Modest progress in mech interp has already been used by industry actors to argue against AI regulation, claiming the "black box" problem is solved. Andreessen Horowitz submitted written evidence to UK Parliament making exactly this claim. The paper pushes back firmly: selective transparency can be used to actively mislead.
"The focus on interpretability artificially constrains the solution space by characterizing one possible solution as the problem itself." — Krishnan (2020)
This is a tension the field needs to sit with. Interpretability is likely necessary for some problems (detecting deceptive cognition) and unnecessary for others (behavioral control via RLHF). Intellectual honesty about where the comparative advantage lies is essential.
Where I'm headed
Reading this paper carefully, a few research threads pull at me:
1. The economics of alignment as monitoring. The principal-agent framing is underexplored in the mech interp literature. Mechanism design has deep results about when monitoring can substitute for control, and under what information conditions. I want to formalize the connection between white-box evaluations and optimal monitoring contracts. What's the value of partial interpretability? How does the cost of monitoring trade off against the cost of misalignment?
2. Scalable oversight via partial mechanistic understanding. Full enumerative safety is far away. But I think there's a useful middle ground: using imperfect interpretability tools to strengthen oversight protocols. What does an oversight scheme look like when you can observe 10% of a model's mechanisms with moderate confidence? How does that change the equilibrium of a principal-agent interaction?
3. Feature geometry and economic network theory. The paper flags that SDL ignores feature geometry — the relationships between features. I'm drawn to the parallel with economic network analysis, where the structure of relationships (not just the existence of agents) determines outcomes. Can we import tools from network economics to understand how features interact?
My plan: one of these becomes a paper for ICML, NeurIPS, or a safety workshop by September. I'll start with (1) because it's closest to a clean, novel contribution — the formalism exists in economics, the application to alignment is new, and it directly addresses a gap the paper identifies.
Thanks for reading.
© Peter Flo