It’s possible I’m wrong about this. If I’m not, there are big things happening with large implications, and soon. It’s worth reading the next section, but skip it if you just want the answer 🫠

<aside> 💭

It’s weird when something happens just as you write things up. I want to emphasize something here, because it makes all the difference: THIS IS NOT the w/thinking model. This is a straight-up comparison between o1 and two non-thinking models, one of them a (likely) tiny one!

</aside>

Take the time

Before we start, I’d like you to compare three outputs on two tasks.

Comprehension + Writing

The first is writing a README. This is a harder task than it looks: a good README has to be instructive, easy to grasp, and steadily increasing in complexity, and it has to cover how something can be used, where, and how it works in a way that makes sense. For an open-source project, the README has to speak to a large audience, and do it well.

One of my favorite tasks for LLM testing has been to provide all of the code in a project (20k+ tokens), an example README, and some instructions on how to write one. In our case, we’re generating a README for the as-yet-unreleased zodsheriff (I guess you’ll hear about it here first).
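If you want a feel for how a prompt like this gets put together, here’s a minimal sketch. The helper name, file layout, and instruction text are all hypothetical (the real prompt is in the toggle below); the point is just: concatenate the whole codebase, an example README, and the writing instructions into one big string.

```python
# Minimal sketch of assembling a "write me a README" prompt.
# All paths and the instruction text are hypothetical, not the actual prompt.
from pathlib import Path

def build_readme_prompt(project_dir: str, example_readme: str, instructions: str) -> str:
    """Concatenate writing instructions, an example README, and every
    source file in a project into a single prompt string."""
    parts = [instructions, "\n--- EXAMPLE README ---\n", example_readme]
    # Hypothetical: a TypeScript project, so we grab every .ts file.
    for path in sorted(Path(project_dir).rglob("*.ts")):
        parts.append(f"\n--- FILE: {path} ---\n")
        parts.append(path.read_text(encoding="utf-8"))
    parts.append("\nNow write a README for this project.")
    return "".join(parts)

if __name__ == "__main__":
    prompt = build_readme_prompt(
        project_dir="zodsheriff/src",  # hypothetical layout
        example_readme=Path("example_readme.md").read_text(encoding="utf-8"),
        instructions="Write a README in the style of the example below.",
    )
    # Rough 4-characters-per-token estimate; real projects easily clear 20k tokens.
    print(f"Prompt length: ~{len(prompt) // 4} tokens")
```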

You can see the prompt here:

Prompt

Here are three outputs from three different models and providers. Which one do you like best?

1.pdf

2.pdf

3.pdf

(You don’t have to read the whole thing; just skim and form an opinion before you proceed.)

To me, 1 loses points immediately because it doesn’t cover the options and switches provided by the package. 3 loses fewer points, but still some, for launching into code before actually explaining the features. 2 seems like the one I’d pick.