MirrorCode: AI models coding for 19 days straight - AlexTech

The boundaries of autonomous programming are being pushed toward extreme long-horizon tasks. Epoch AI and METR have introduced MirrorCode, a benchmark that challenges AI models to recreate entire programs from scratch, relying solely on the software's observable behavior, through artifacts such as binaries and documentation, without any access to the original source code.
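To make the idea of behavior-only reimplementation concrete, here is a minimal Python sketch of differential testing: the same command-line arguments are fed to a reference program and to a candidate rewrite, and their exit codes and outputs are compared. This is an illustration only, not MirrorCode's actual harness, which this article does not describe. The function names `run` and `pass_rate` and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def run(cmd, args, stdin_text=""):
    """Run a command and return the behavior visible from outside:
    its exit code and whatever it wrote to standard output."""
    proc = subprocess.run(
        list(cmd) + list(args),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference_cmd, candidate_cmd, cases):
    """Fraction of test cases where the candidate matches the reference
    on both exit code and standard output."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(
        run(reference_cmd, args) == run(candidate_cmd, args) for args in cases
    )
    return passed / len(cases)


if __name__ == "__main__":
    # Hypothetical stand-ins for an original binary and its rewrite:
    # two Python one-liners that upper-case their first argument.
    reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    candidate = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    print(pass_rate(reference, candidate, [["abc"], ["Hello"]]))
```

In a real setting the reference would be the original binary and the candidate the model's rewrite. A real harness would likely also compare standard error, output files, and behavior on invalid input.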

Breaking Traditional Inference Budgets

Unlike standard software engineering benchmarks that often cap inference costs at $1 to $10 per task, MirrorCode removes these limits to uncover the true ceiling of AI capabilities. This methodology has led to startling results: in one of the most demanding tasks, an AI model operated continuously for 19 days, incurring a compute cost of $2,600 for a single run, entirely without human supervision.

Claude Opus 4.7 Takes the Lead

According to Epoch AI, Claude Opus 4.7 currently leads the benchmark with a 56% solve rate. A standout achievement was the reimplementation of gotree, a bioinformatics toolkit featuring roughly 16,000 lines of Go code and over 40 commands. While researchers estimate a human engineer would need between two and seventeen weeks for the same task, Opus 4.7 completed it in just 14 hours at a cost of $251.
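The figures quoted in this article can be turned into rough hourly rates and a speedup estimate. The short Python sketch below does that arithmetic. It assumes a 40-hour human work week (an assumption, not a figure from the source) and treats the 19-day run as continuous wall-clock time, so the results are ballpark numbers only.

```python
HOURS_PER_WORK_WEEK = 40  # assumption; the source gives human effort in weeks only


def cost_per_hour(total_cost_usd, hours):
    """Average spend per wall-clock hour of a run."""
    return total_cost_usd / hours


def speedup_range(model_hours, human_weeks_low, human_weeks_high):
    """Ratio of estimated human working hours to model wall-clock hours."""
    low = human_weeks_low * HOURS_PER_WORK_WEEK / model_hours
    high = human_weeks_high * HOURS_PER_WORK_WEEK / model_hours
    return low, high


# gotree run: 14 hours, $251
print(f"gotree run: ${cost_per_hour(251, 14):.2f}/hour")          # $17.93/hour

# longest run: 19 days, $2,600
print(f"19-day run: ${cost_per_hour(2600, 19 * 24):.2f}/hour")    # $5.70/hour

# human estimate: 2 to 17 weeks
low, high = speedup_range(14, 2, 17)
print(f"speedup vs. human estimate: {low:.1f}x to {high:.1f}x")   # 5.7x to 48.6x
```

The speedup compares human working hours with model wall-clock hours, so it should be read as an order-of-magnitude indication rather than a precise ratio.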

GPT-5.5 followed with a 44% solve rate, while Gemini 3.1 Pro Preview scored 32%. Notably, even when models failed to fully reimplement a program, they typically passed more than 90% of the tests.

The Challenge of Large-Scale Software

MirrorCode categorizes tasks as small, medium, or large. While smaller programs are reliably handled by all tested models, the largest and most complex tasks remain unsolved across the board. This suggests that while agentic coding is advancing rapidly, a significant gap persists in managing massive software architectures autonomously.

Note: AI-Generated Content: This article was created with the support of AI tools and subsequently supervised by the site curator. There may be inaccuracies or missing updates; we recommend verifying original sources before making decisions based on the content.
