ComputeBench: Instruction-following benchmarks for long, step-by-step arithmetic

I vibecoded this after seeing models do impressive things yet still drift on simple recursive steps. It tracks three metrics: exact match, answer accuracy, and prefix correctness. Feedback welcome.
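To make the three metrics concrete, here is a rough Python sketch of how they could be scored. This is only one plausible reading, not ComputeBench's actual code: the function names, the list-of-step-strings representation, and the exact definitions (especially "prefix correctness" as the fraction of expected steps matched before the first divergence) are all assumptions for illustration.

```python
from typing import List


def exact_match(predicted: List[str], expected: List[str]) -> bool:
    """Assumed definition: every step matches, including the total step count."""
    return predicted == expected


def answer_accuracy(predicted: List[str], expected: List[str]) -> bool:
    """Assumed definition: only the final step (the answer) has to match."""
    return bool(predicted) and bool(expected) and predicted[-1] == expected[-1]


def prefix_correctness(predicted: List[str], expected: List[str]) -> float:
    """Assumed definition: fraction of expected steps matched before the first divergence."""
    if not expected:
        return 1.0 if not predicted else 0.0
    matched = 0
    for p, e in zip(predicted, expected):
        if p != e:
            break
        matched += 1
    return matched / len(expected)


# Hypothetical running-sum task: the model drifts at the third step.
expected_steps = ["1", "3", "6", "10"]
predicted_steps = ["1", "3", "7", "11"]

print(exact_match(predicted_steps, expected_steps))         # False
print(answer_accuracy(predicted_steps, expected_steps))     # False
print(prefix_correctness(predicted_steps, expected_steps))  # 0.5
```

Under these assumed definitions, prefix correctness is the metric that shows where drift starts: a run can score 0 on exact match and answer accuracy while still getting credit for the steps it got right before going off track.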