We’ll analyze an earlier version of eloranker (~2k tokens), using the bf16 version of the model.

Source

Overall Rankings:

  1. new sonnet-3.5 (A): Best combination of finding real bugs and providing concrete fixes
  2. o1-preview (A-): Found the most critical bug but included some generic feedback
  3. 4o (B): Some good specific points but missed critical bugs
  4. qwen (C-): Almost entirely generic feedback with few specific issues identified

o1-preview: Grade A-

Strengths

Weaknesses

4o: Grade B

Strengths