<aside> 💡

The banner for the article was Notion’s first choice when I asked for a cover. Somehow it felt wrong to change it.

</aside>

This report about Claude Code was inspired by Geoff Huntley's post from March, a free Friday, and some hours spent trying to see how far we could push broad agentic intelligence.

After all the work, the report came to about 30K tokens of generated output from Opus 4. This post (about 10K tokens' worth) is entirely from yours truly, and writing it did more for my understanding of agentic systems than the original report did.

What follows is everything I’ve learned orchestrating subagents made of every flagship model, manually reviewing their output, trying different methods of coordination, and figuring out what works. If you liked the Claude Code report, this is how it was made. If you didn’t, this should help you figure out what I could’ve done better!

<aside> 🫣

For fun, here’s me painstakingly writing this post, character by character, unedited, false starts and everything. At this level of time-lapse I can pretend my tokens per second are pretty bearable. It’s interesting to see how many times I change my mind.

</aside>

[Video: Writing P2 16x.mov, a 16x time-lapse of writing this post]

So what did I do?


Here’s a quick list of the things I learned, before we cover them in detail:

  1. Chunking is important, but not for the reason you think.
  2. Agent hordes hallucinate in very different ways from individual models - the reason is agentic intermediates.
  3. Models are incredibly smart - almost superhuman - but comprehension is not the same thing as explanation. Intelligence is sometimes only as good as what can be expressed (same as this article).
  4. Context management is extremely important, perhaps the most important thing. Summarisation is a BAD IDEA.
    1. Agentic coordination is MapReduce with finite memory. Turns vs spread will always be an open trade-off: do you do more turns with fewer agents, or more agents with fewer turns? (See the sketch after this list.)
  5. The appropriate agentic flows will (for a long time) involve multiple different models. No one model is good at everything, often because the desirable characteristics compete with each other.
  6. Figuring out how and when to inject human preference (and knowledge) is hard.
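
To make point 4.1 concrete, here's a back-of-the-envelope sketch in Python. The numbers and the `plan` helper are invented for illustration; the only real claim is the arithmetic: under a fixed token budget, spread and depth trade off directly, much like sizing a MapReduce job against finite memory.

```python
# Toy sketch of the turns-vs-spread trade-off under a fixed token budget.
# All numbers are made up for illustration. The identity is simply:
#   budget >= num_agents * turns_per_agent * tokens_per_turn
# so going wider means each agent gets fewer turns to work with.

TOKEN_BUDGET = 1_000_000   # total tokens we're willing to spend
TOKENS_PER_TURN = 8_000    # rough cost of one agent turn (prompt + output)

def plan(num_agents: int) -> int:
    """Given a spread (number of agents), how many turns can each afford?"""
    return TOKEN_BUDGET // (num_agents * TOKENS_PER_TURN)

for spread in (2, 8, 32):
    print(f"{spread:>2} agents -> {plan(spread):>3} turns each")
# 2 agents ->  62 turns each  (deep, but little parallel coverage)
# 8 agents ->  15 turns each
# 32 agents ->   3 turns each  (wide, but each agent barely builds context)
```
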

However, what struck me (as with daedumi and ipgu) is how much of the work I did here could be turned into code and run unsupervised. I think we're leaving a lot on the table with current agentic loops.
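
As a rough illustration of what that could look like, here's a hypothetical skeleton of an unsupervised orchestration loop. This is not the pipeline behind the report; `dispatch` and `passes_checks` are stand-ins for a real model call and a real mechanical validator.

```python
# Hypothetical skeleton of an unsupervised agent loop: dispatch subagents,
# check their output mechanically, retry failures, no human in the loop.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    attempts: int = 0

def dispatch(task: Task) -> str:
    # Stand-in for a real model call; here it just echoes a draft.
    return f"draft {task.attempts + 1} for: {task.prompt}"

def passes_checks(output: str) -> bool:
    # Stand-in for mechanical review (length bounds, citations resolving, etc.).
    return output.startswith("draft")

def run_unsupervised(tasks: list[Task], max_attempts: int = 3) -> dict[str, str]:
    # Drain the queue with retries; a human only ever sees the final dict.
    done: dict[str, str] = {}
    while tasks:
        task = tasks.pop(0)
        output = dispatch(task)
        task.attempts += 1
        if passes_checks(output):
            done[task.prompt] = output
        elif task.attempts < max_attempts:
            tasks.append(task)  # requeue for another attempt
    return done

print(run_unsupervised([Task("section on context management")]))
```
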

Here’s the final process as a whole: