o3 DR (12 minutes)

Gemini 2.5 DR (9 minutes)

Claude 3.7 DR (54 minutes)

Prompt

Review (from o3, but Sonnet and Gemini agreed)

1 Task-following

Aspect requested in the prompt | o3-DR | Gemini-2.5-DR | Sonnet-3.7-DR
Provider coverage (OpenAI, Anthropic, Gemini) | ✔ – but invents extra fictitious model names. | ✔ – sticks to doc-backed names. | ✔ – sticks to doc-backed names.
Model list & pricing table | ✔ – huge, but many figures wrong. | ✔ – broadly correct, cites tiers & caching. | ✔ – covers core prices but fewer variants.
“How to access” in TypeScript (see the sketch after this table) | Partial – OpenAI/Anthropic only; Gemini SDK never actually imported. | |
Reasoning vs non-reasoning models explained | ✔ – but relies on invented “o3/o4” taxonomy. | ✔ – explains chain-of-thought visibility vs hidden. | ◑ – mentions but shorter.
Thinking tokens / chain-of-thought | Mentions for Gemini & Claude but conflates with OpenAI. | ✔ – flags uncertainty. | ✔ – demonstrates Claude thinking, notes not exposed on OpenAI.
Idiosyncrasies / known issues from forums | Sparse. | ✔ – has dedicated section. | Minimal.
Streaming, JSON mode, schema, multimodal | Talks about all, few concrete examples. | ✔ – many TS examples. | ✔ – code for each.
Cost-calculation formula (see the helper sketch after this table) | Qualitative only. | ✔ – shows formulas & pitfalls. | ✔ – includes helper fn.
Contradictions or “what we could not find” | Rare. | ✔ – calls out thin docs & tokeniser gaps. | Few.
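
To make the “how to access” row concrete, here is a minimal TypeScript sketch covering all three SDKs. It illustrates the criterion; it is not code taken from any of the three reports, and the package choices (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`), model names, and env-var names are assumptions.

```ts
// Minimal "how to access" sketch for all three providers.
// Model names and env-var names are placeholders, not values from the reports.
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
import { GoogleGenerativeAI } from "@google/generative-ai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const google = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY ?? "");

async function main() {
  // OpenAI – Chat Completions
  const oai = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder
    messages: [{ role: "user", content: "Say hello." }],
  });
  console.log(oai.choices[0].message.content);

  // Anthropic – Messages API (max_tokens is required)
  const claude = await anthropic.messages.create({
    model: "claude-3-7-sonnet-latest", // placeholder
    max_tokens: 256,
    messages: [{ role: "user", content: "Say hello." }],
  });
  console.log(claude.content);

  // Google – the step the o3 report skipped: actually import and call the SDK
  const gemini = google.getGenerativeModel({ model: "gemini-1.5-pro" }); // placeholder
  const result = await gemini.generateContent("Say hello.");
  console.log(result.response.text());
}

main().catch(console.error);
```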

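Similarly, a sketch of the kind of cost-calculation helper the cost row refers to. All prices and model keys below are illustrative placeholders, not figures from the reports; check each provider’s pricing page before using them.

```ts
// Hypothetical cost-estimation helper of the kind the reports were judged on.
// All prices (USD per million tokens) and model keys are illustrative placeholders.
type Pricing = { inputPerMTok: number; outputPerMTok: number };

const PRICING: Record<string, Pricing> = {
  "example-openai-model": { inputPerMTok: 2.5, outputPerMTok: 10 },
  "example-anthropic-model": { inputPerMTok: 3, outputPerMTok: 15 },
  "example-gemini-model": { inputPerMTok: 1.25, outputPerMTok: 5 },
};

// cost = (inputTokens / 1e6) * inputPrice + (outputTokens / 1e6) * outputPrice
// Caveat: reasoning/"thinking" tokens are generally billed as output tokens,
// so billed output can be much larger than the visible answer.
function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) throw new Error(`No pricing entry for ${model}`);
  return (
    (inputTokens / 1_000_000) * p.inputPerMTok +
    (outputTokens / 1_000_000) * p.outputPerMTok
  );
}

// Example: 12k input tokens + 4k output tokens on the placeholder OpenAI entry
console.log(estimateCostUSD("example-openai-model", 12_000, 4_000).toFixed(4)); // 0.0700
```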
Winner on task-adherence: the Gemini-2.5-DR report


2 Detail depth

Rank order: o3-DR (highest volume) → Gemini-2.5-DR (high but curated) → Sonnet-3.7-DR (concise).

o3-DR is ~2× longer than the others, but much of that length is repetition or imagined specs; Gemini-2.5-DR’s detail is better targeted.


3 Wasted space


4 Correctness check (places they disagree)