The agentic engineering Cambrian explosion; The radiologist who never saw an X-ray; Ten metaphors for AI
I've been using up my blogging time with actual AI projects recently (VoxieCam is one of many), so I'm a bit behind on AI news updates, hence a few more links than usual.
Agentic engineering, and the Cambrian explosion it's driving
Three great podcasts to recommend. Steve Yegge, whose Gas Town multi-agent architecture I wrote about in February, turns up on Gergely Orosz's The Pragmatic Engineer in From IDEs to AI Agents. His argument is that the IDE as we know it is dying: the interface of the future looks less like a code editor and more like a place to watch and steer several agents running in parallel. He offers a ladder of AI adoption for engineers, from those still avoiding the tools entirely up to people routinely orchestrating multiple agents at once.

On the same podcast, David Heinemeier Hansson ("DHH", creator of Ruby on Rails and co-founder of 37signals) explains in his inimitable style his recent journey from AI-coding skeptic to convert, and how he thinks it will reshape software careers.

And Simon Willison gives a comprehensive overview of the recent history of "agentic engineering" and where it is heading on Lenny Rachitsky's podcast. He pins November 2025 as the inflection point when coding agents crossed from "mostly works" to "actually works", walks through his daily patterns (red-green TDD, templates, what he calls "hoarding"), and introduces the unsettling "dark factory" framing where nobody writes the code, nobody reviews the code, and the agents handle their own QA. Simon's own highlights post is a good text companion if you'd rather read than listen, and you can keep up with his "non-book" of agentic engineering patterns as it develops.
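The red-green TDD pattern Simon mentions maps naturally onto agent steering: write a failing test first, then let the agent iterate until it goes green. Here's a minimal sketch of that loop, assuming any coding agent exposed as a callable; this is a generic illustration, not Simon's exact setup:

```python
import subprocess

def red_green_loop(agent, test_file="test_feature.py", max_rounds=5):
    """Drive a coding agent with a failing test: red first, iterate to green.

    `agent` is assumed to be any callable that takes an instruction and
    edits the working tree; swap in your coding tool of choice.
    Assumes pytest is installed and `test_file` already exists (the "red").
    """
    for _ in range(max_rounds):
        # Run the test; a nonzero exit code means we're still red.
        result = subprocess.run(["pytest", test_file, "-q"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # green: the agent's implementation passes
        # Feed the failure output back to the agent and let it try again.
        agent(f"Make this test pass without editing the test itself:\n{result.stdout}")
    return False
```

The point of the pattern is that the test, not the prompt, is the contract: the agent can thrash as much as it likes, but only a passing suite counts as done.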
We can see evidence of the resulting Cambrian explosion across the technology world (the original, 540M years ago, was the period when animal life rapidly went from a handful of simple forms to almost every major body plan we see today). GitHub's COO Kyle Daigle shared some huge numbers last week: the platform handled 1B commits across the whole of 2025, but is now running at 275M commits per week, on pace for 14B this year if the growth holds. GitHub Actions minutes have gone from 500M per week in 2023 to 2B per week now, a quadrupling in three years for a service already operating at enormous scale. Apple's App Store is going through something similar: 9to5Mac reports an 84% jump in new app submissions, with close to 600K new apps in the latest period, and Apple has started banning code-generation apps that break review rules. Kevin Roose at the New York Times has given the race at AI-native companies a name: tokenmaxxing. One engineer apparently processed 210B tokens in a single week. Generous token budgets are becoming something engineers look for in a new role.
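Those run rates are easy to sanity-check; a quick back-of-envelope in Python, using only the figures quoted above:

```python
# GitHub commits: current weekly rate vs. the whole of 2025
commits_per_week = 275e6
annual_run_rate = commits_per_week * 52
print(f"{annual_run_rate / 1e9:.1f}B commits/year")  # ~14.3B, vs 1B for all of 2025

# GitHub Actions minutes per week: 2023 vs. now
actions_2023, actions_now = 500e6, 2e9
print(f"{actions_now / actions_2023:.0f}x growth")   # 4x in three years
```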
AI radiology benchmarks aren't really testing vision
A surprising paper from a Stanford group including Fei-Fei Li, whose ImageNet kick-started modern computer vision: MIRAGE: The Illusion of Visual Understanding. The researchers trained a small text-only language model on the written questions from a standard chest X-ray benchmark, deliberately leaving the X-rays themselves out. It topped the leaderboard, beating the frontier systems and outperforming human radiologists. It had never seen an X-ray. The implication for the impressive-sounding AI radiology results that keep showing up in the literature is uncomfortable: a lot of what looked like visual understanding turns out to be the model picking up on textual cues that correlate with known diagnoses. This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The optimisation process silently routes around the hard part (actually looking at the picture) and exploits the easy part (the words next to it).
The paper's suggested way forward is simple: strip out any question a text-only model can already answer and re-run the benchmarks on what remains. When they did this on the major multimodal leaderboards, the top models' scores collapsed. Gemini 3 Pro's accuracy dropped from nearly 70% to 23%. The broader recommendation is that any new multimodal result should be reported alongside the improvement the image itself contributes, so we can tell how much the model is actually looking. Link via Rodney Brooks on Bluesky, always a good cut-through-the-hype voice on AI.
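The filtering protocol fits in a few lines. A minimal sketch of the idea, with function and parameter names of my own invention rather than the paper's:

```python
def text_only_filter(benchmark, text_model, n_samples=5):
    """Keep only questions a blind (text-only) model cannot already answer.

    benchmark:  list of (question, image, answer) triples
    text_model: any callable that answers from the question text alone
    """
    vision_required = []
    for question, image, answer in benchmark:
        # Sample the blind model several times; if it ever gets the right
        # answer without seeing the image, the question is leaking the
        # diagnosis through its wording, so drop it.
        blind_answers = [text_model(question) for _ in range(n_samples)]
        if answer not in blind_answers:
            vision_required.append((question, image, answer))
    return vision_required
```

By construction, anything the blind model can answer isn't testing vision, so whatever accuracy survives the filter is the part the image actually contributes.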
Ten metaphors for AI
John Naughton in The Observer collects the metaphors people have reached for to make sense of AI: from Alison Gopnik's cultural technology and the folk favourite enthusiastic intern, through the stochastic parrot and autocomplete on steroids, to the more alarming shoggoth and Cory Doctorow's asbestos. A nice piece with wonderful pen-and-ink illustrations by Chris Riddell, which are as much a draw as the text.
Wikis as AI agent companions
Two different takes this month on a vaguely similar idea, one that feels like the beginning of a good direction. Andrej Karpathy describes the "llm-wiki", a pattern for building personal knowledge bases that an AI agent maintains over time. Instead of re-deriving answers from raw documents every time (the retrieval-augmented generation, or RAG, approach), the agent compiles and updates a wiki as it goes (sketched below). The wiki becomes a structured, long-term artifact, something closer to how a human researcher would actually build up expertise on a topic. Separately, whoami.wiki is a lovely project that takes a more personal angle: a system that uses MediaWiki (the open-source wiki software powering Wikipedia) and Claude Code to turn all kinds of family information (old photos, location history, messages, bank transactions) into interconnected encyclopedia pages about your own life. The author started by organising 1,351 family photos and interviewing his grandmother about her wedding, and ended up with a self-maintaining personal archive that surfaces forgotten connections.
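The mechanics of Karpathy's llm-wiki pattern are easy to picture. Here's a minimal sketch, assuming a generic text-in, text-out `llm` callable and a one-markdown-file-per-topic layout of my own choosing, not anything Karpathy specifies:

```python
from pathlib import Path

WIKI = Path("wiki")  # one markdown page per topic, maintained by the agent

def update_wiki(topic: str, new_material: str, llm) -> None:
    """Fold new source material into a persistent wiki page.

    Unlike RAG, which retrieves raw chunks at question time, the agent
    distils what it learns into the page itself, so knowledge accumulates
    as a readable, long-term artifact.
    """
    page = WIKI / f"{topic}.md"
    existing = page.read_text() if page.exists() else f"# {topic}\n"
    prompt = (
        "Revise this wiki page to incorporate the new material. "
        "Keep it structured, deduplicated, and linked to related pages.\n\n"
        f"--- CURRENT PAGE ---\n{existing}\n"
        f"--- NEW MATERIAL ---\n{new_material}\n"
    )
    WIKI.mkdir(exist_ok=True)
    page.write_text(llm(prompt))  # llm: any text-in, text-out callable
```

Each update rewrites the page rather than appending to a transcript, which is what makes the result look like a researcher's notes rather than a chat log.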
François Chollet's new test for intelligence
François Chollet (creator of Keras, originator of the ARC Prize) is on the Y Combinator podcast talking about ARC-AGI-3, the latest version of his benchmark for measuring intelligence. The previous versions tested pattern recognition on static grids. This one is fully interactive: hundreds of turn-based environments, handcrafted by game designers, with no instructions, no rules and no stated goals. The agent has to explore each environment from scratch and figure out how it works (and the time taken to do that is part of the test), work out what "winning" means, and carry what it learns forward to harder levels. Every task is designed to be human-solvable, and scoring is measured relative to human efficiency, so humans get 100% by definition. Frontier AI currently scores 0.26%. Chollet's longstanding argument is that we are measuring the wrong thing when we test AI on memorisation-friendly benchmarks, and ARC-AGI-3 is his attempt to test what he thinks actually matters: the ability to learn new skills efficiently in genuinely unfamiliar situations.
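To make the setup concrete: the interaction model amounts to a gym-style loop with all the usual scaffolding removed. The sketch below is hypothetical (I haven't seen the real ARC-AGI-3 API, and every interface name here is my own): no instructions, no labelled reward, and every exploratory action counts against the agent's efficiency:

```python
def evaluate(agent, environments):
    """Score an agent on unseen interactive environments (illustrative only).

    Each environment exposes nothing but observations and actions: no
    rules, no reward signal, no stated goal. The agent must discover how
    the world works and what "winning" means, and exploration isn't free,
    since total actions taken feed the efficiency comparison with humans.
    """
    wins = total_actions = 0
    for env in environments:
        obs = env.reset()
        done = False
        while not done:
            action = agent.act(obs)       # agent's own exploration policy
            obs, done = env.step(action)  # new state only, no reward
            total_actions += 1
        wins += env.solved()              # did the agent reach the goal?
    return wins, total_actions            # normalised against human play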
Jargon Watch
Tokenmaxxing: maximising your personal AI token consumption as a workplace status signal. Coined by Kevin Roose at the New York Times, describing engineers at AI-native companies competing on internal leaderboards that track weekly token usage.
Related posts
- Agentic Engineering: 5 lessons from a 4-day AI experiment
- AI for radiology: do we need a new platform?
- Gas Town satire; AGI is already here; Consciousness explained; Claw on a Pi
- Seemingly conscious AI & emotional agents; AI as 4 kinds of cultural technology
- Vibe code as tech debt; Dancing robots; Detecting and changing AI personalities