Interactive essays, research write-ups, and the occasional exploration.
Spring 2026 · Alignment · MoE routing
A continuation of activation steering: in sparse MoE models, behavior can be steered by biasing router decisions toward or away from behavior-linked experts.
Spring / Summer 2026 · Alignment · Concept steering
Reading Davarmanesh, Wilson, and Radhakrishnan on attention-guided feature learning — and why better measurement (token choice, soft labels, and enrichment-based layer selection) may matter more than fancier steering algorithms.
Spring 2026 · Alignment · Concept steering
A reflection on Beaglehole et al. (Science, 2026): concept vectors, steering, and steerable oversight.
Spring 2026 · Alignment · Mechanistic interpretability
A mechanistic walkthrough of Du et al. (COLM 2025): which internal mechanisms persist under post-training (knowledge, truthfulness) and which get rewritten (refusal).
Spring 2026 · Alignment · Mechanistic interpretability
An interactive summary of Sharkey, Chughtai et al. (2025): decomposition, sparse dictionary learning and its eight limitations, circuit discovery, scalable oversight, and where I'm staking my own research direction in alignment.
Spring 2026 · Game theory
Nash equilibrium and its discontents, the refinement ladder, monotone comparative statics, supermodular games, potential games — and where the analogies to alignment hold up (and where they don't).
Fall 2025 · Game theory · Labor markets
Why recruiting researchers at labs like Midjourney looks less like a job board and more like a repeated auction with hidden values. Interactive visualizations of payoff matrices, mixed strategies, and reputation dynamics.