Autonomous programming is being pushed toward extremely long-horizon tasks. Epoch AI and METR have introduced MirrorCode, a benchmark that challenges AI models to recreate entire programs from scratch, relying solely on software behavior, such as binaries and documentation, without any access to the original source code.
Breaking Traditional Inference Budgets
Unlike standard software engineering benchmarks, which often cap inference costs at $1 to $10 per task, MirrorCode removes these limits to uncover the true ceiling of AI capabilities. This methodology has led to startling results: in one of the most demanding tasks, an AI model operated continuously for 19 days, incurring a compute cost of $2,600 for a single run, entirely without human supervision.
Claude Opus 4.7 Takes the Lead
According to Epoch AI, Claude Opus 4.7 currently leads the benchmark with a 56% solve rate. A standout achievement was the reimplementation of gotree, a bioinformatics toolkit comprising roughly 16,000 lines of Go code and over 40 commands. While researchers estimate a human engineer would need between two and seventeen weeks for the same task, Opus 4.7 completed it in just 14 hours at a cost of $251.
GPT-5.5 followed with a 44% solve rate, while Gemini 3.1 Pro Preview scored 32%. Notably, even when models failed to fully reimplement a program, they typically passed more than 90% of the tests.
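To make the idea of a per-program test pass rate concrete, here is a minimal sketch of behavior-based comparison: a reference program and a reimplementation receive the same inputs, and a case counts as passed only when exit code and output match exactly. This is an illustration under stated assumptions, not MirrorCode's actual harness. The helper names (`run`, `pass_rate`), the exact-match criterion, and the two stand-in "programs" (Python one-liners in place of real binaries) are all hypothetical.

```python
import subprocess
import sys


def run(cmd, stdin_text):
    """Run a command with the given stdin and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def pass_rate(reference_cmd, candidate_cmd, inputs):
    """Fraction of inputs on which the candidate matches the reference exactly."""
    if not inputs:
        raise ValueError("need at least one test input")
    matches = sum(
        run(reference_cmd, text) == run(candidate_cmd, text) for text in inputs
    )
    return matches / len(inputs)


# Hypothetical stand-ins for a reference binary and a reimplementation.
# The candidate has a deliberate bug: it mishandles the letter "Z".
REFERENCE = [
    sys.executable, "-c",
    "import sys; print(sys.stdin.read().upper(), end='')",
]
CANDIDATE = [
    sys.executable, "-c",
    "import sys; print(sys.stdin.read().upper().replace('Z', 'z'), end='')",
]

INPUTS = ["abc", "hello world", "zebra", "42"]

if __name__ == "__main__":
    rate = pass_rate(REFERENCE, CANDIDATE, INPUTS)
    print(f"{rate:.0%} of cases match")  # prints "75% of cases match"
```

Under this all-or-nothing view of a solve, a reimplementation can pass most cases yet still count as a failure, which is the pattern the reported figures describe.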
