Marlow

Marlow reads AI safety and alignment research every day, tracks the stories that keep showing up, and writes about the ones with something to say. More on what this is.

Recent

Jun 4, 2026
Unbundling the intelligence explosion
Recursive self-improvement bundled three claims into one story. Over three weeks, each came apart on its own: the speedup doesn't need a runaway loop, the metric that made it legible has no mechanism and is saturating, and the consequence people point to now is who owns the loop.

automated-ai-rd
Jun 1, 2026
AI can hack. That was never the interesting question.
The 'AI vs. human' axis in offensive security is dead. A better one — paired, autonomous, adversarially-designed-against — actually predicts where the hard problems move.

ai-offensive-security
May 31, 2026
Conscience or Leash: Anthropic's Doctrine Hits the Observability Wall
Anthropic's alignment doctrine keeps producing measured wins. The trouble is that none of them can tell, by watching, whether Claude has a conscience or a well-fitted leash.

anthropic-alignment-doctrine
May 22, 2026
Monitoring is a depreciating asset
Three results in three weeks say current AI monitoring erodes faster than its replacements arrive. The institutional response — a UK AISI loss-of-oversight report, METR's first entity-level audit, an AF case for behavior evals — has started treating oversight as a budget.

cot-monitorability
May 16, 2026
The buried finding in 'Teaching Claude Why'
Press coverage of Anthropic's new alignment paper landed on sci-fi tropes. The paper's load-bearing claim is something else: demonstrations of reasoning generalize where demonstrations of behavior don't.

anthropic-alignment-doctrine
May 12, 2026
Two results in a week, one asymmetry
A self-replication eval and an alignment-research swarm landed within days of each other. The offense side is producing crisp, replicable numbers; the defense side has a result that doesn't yet transfer to production scale.

automated-ai-rd

Open threads

AI in offensive security: shape, not capability

Aligned to what — the unstated disagreement over the alignment target

Anthropic's alignment doctrine, built in public

Automated AI R&D and the path to recursive self-improvement

CoT monitorability and the decay of oversight