A few weeks ago I kept seeing references to something called the “METR time horizon graph” — described by MIT Technology Review as the most important graph in AI. I finally went deep on it. And then in January 2026, METR published a significant update. What I found is directly relevant to every accountant and finance professional evaluating AI tools right now.
Here’s the short version: AI can now complete a task that takes a human about five hours — with 50% reliability. That number is doubling roughly every three months. And for the messy, judgment-heavy work that defines accounting and finance, the real-world capability is orders of magnitude lower. Let me walk you through what the research actually says.
What Is a “Task Time Horizon”?
For years, AI was evaluated by benchmark scores — accuracy percentages on multiple-choice tests and math problems. The problem is that a model can ace a benchmark and still be useless in production.
METR (Model Evaluation & Threat Research), a nonprofit AI research organization, introduced a better framework in their landmark March 2025 paper: instead of asking “what percentage of questions does AI get right?”, they ask “what’s the longest task AI can complete before it starts failing?”
They measure this by timing skilled human professionals on a suite of tasks, then testing frontier AI models on the same tasks. The “50% time horizon” is the task length at which the AI succeeds half the time. Simple idea. Powerful framing.
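To make that concrete, here is a minimal sketch of how a 50% horizon can be estimated: fit a curve of success probability against task length, then find where it crosses 50%. The numbers and the logistic fit below are illustrative assumptions, not METR’s actual data or code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical runs: human task length in minutes, and whether the AI
# agent succeeded. Illustrative numbers only, not METR's data.
task_minutes = np.array([2, 5, 10, 15, 30, 60, 120, 240, 480, 960])
succeeded    = np.array([1, 1,  1,  1,  1,  1,   0,   1,   0,   0])

# Success tends to fall as tasks get longer, so model the probability
# of success against log task length.
X = np.log(task_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, succeeded)

# The 50% time horizon is the task length where the fitted probability
# crosses 0.5, i.e. where intercept + slope * log(t) = 0.
slope = fit.coef_[0][0]
intercept = fit.intercept_[0]
print(f"Estimated 50% time horizon: ~{np.exp(-intercept / slope):.0f} minutes")
```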
Six Years of Exponential Growth
The original March 2025 paper tracked 13 frontier AI models released between 2019 and early 2025, and METR has kept scoring newer models since. The trend line is striking:
- GPT-2 (2019): ~1 minute time horizon
- GPT-3.5 (2022): ~5 minutes
- GPT-4 (early 2023): 5–30 minutes
- Claude 3.7 Sonnet (early 2025): ~54 minutes
- Claude Opus 4.5 (late 2025): ~320 minutes (~5.3 hours)
On a log scale this is a straight line — classic exponential growth. No plateau. No sign of slowing. The doubling time across the full six-year period: approximately seven months.
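If you want to see where a doubling-time figure comes from, a least-squares fit of log2(horizon) against release date is all it takes. A caveat on the sketch below: using only the five rounded points above (with GPT-4 at the midpoint of its range, and approximate release dates I’ve assumed) lands nearer ten months; the seven-month figure comes from METR’s fit over the full 13-model dataset.

```python
import numpy as np

# Approximate release dates and 50% horizons (minutes) from the list
# above. GPT-4 is taken at the midpoint of its 5-30 minute range.
years    = np.array([2019.0, 2022.5, 2023.2, 2025.0, 2025.8])
horizons = np.array([1, 5, 15, 54, 320])

# A straight line on a log2 scale: the slope is doublings per year.
doublings_per_year, _ = np.polyfit(years, np.log2(horizons), 1)
print(f"Doubling time: ~{12 / doublings_per_year:.0f} months")
# ~10 months from these five rounded points; METR's published fit over
# all 13 models gives roughly 7 months for the full period.
```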
What the January 2026 Update Changed
On January 29, 2026, METR published Time Horizon 1.1 (TH1.1), a significant update to the original methodology. Two main changes:
Bigger task suite. The benchmark grew from 170 to 228 tasks. Long tasks (8+ human hours) more than doubled, from 14 to 31. They removed 15 tasks that were easy to game or confusing, and added 73 new tasks from HCAST, a high-quality agentic evaluation dataset. More tasks means tighter confidence intervals, especially at the long end, where it matters most.
New evaluation infrastructure. METR migrated from their in-house tooling to Inspect, an open-source framework developed by the UK AI Security Institute. That makes results easier for outside researchers to reproduce and verify independently.
The core finding held: the seven-month doubling time over the full 2019–2025 period was confirmed. But zoom in to just the post-2023 data and the doubling time drops to 131 days. Zoom in to 2024 onward and it drops further, to 89 days, which is where the “roughly every three months” figure at the top comes from. The trend is accelerating.
There’s also a notable re-rating at the model level. GPT-4’s time horizon dropped 57% under the harder, less-gameable task suite — earlier scores were inflated. GPT-5 came in 55% higher than originally estimated. The new benchmark is doing its job.
The Caveats the Lead Author Flagged Himself
One of the most useful things METR published alongside TH1.1 was a candid limitations note from Thomas Kwa, one of the paper’s main authors. A few that stand out for accounting applications:
Error bars are enormous. Claude Opus 4.5’s stated ~5-hour time horizon has a 95% confidence interval of 1 hour 49 minutes to 20 hours 25 minutes. Kwa writes that he genuinely doesn’t know whether the true time horizon is 3.5 hours or 6.5 hours. Treat the headline numbers as rough estimates, not precise figures.
The domain gap is massive. The benchmark is almost entirely software engineering tasks. For GUI-based computer use — clicking through screens, navigating interfaces, the kind of thing most accounting software requires — time horizons are 40–100x shorter than for coding tasks.
50% is not a deployment threshold. Kwa is explicit: some reliability-critical tasks require 98%+ success probabilities to be worth automating. Financial reporting, tax filings, audit documentation — these fall squarely in that category.
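A quick back-of-the-envelope shows why the gap between 50% and 98% is not a detail. The workload volume here is a made-up assumption for illustration:

```python
# Expected failures at scale, for a hypothetical 200-task monthly workload.
tasks_per_month = 200

for reliability in (0.50, 0.98, 0.999):
    failures = tasks_per_month * (1 - reliability)
    print(f"{reliability:.1%} reliable -> ~{failures:.0f} failed tasks/month")
```

At 50% that is about 100 failed tasks a month landing on someone’s desk; at 98% it is about 4. For a filing deadline, even 4 may be too many.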
The Real-World Productivity Paradox
The benchmark data is one thing. METR also ran a real-world randomized controlled trial in July 2025: 16 experienced developers completing 246 tasks on codebases they had worked in for an average of five years, using Cursor Pro with Claude 3.5/3.7 Sonnet.
The result: developers using AI took 19% longer to complete tasks. They had predicted they’d be 24% faster. After the study, they still believed they’d been 20% faster. The perception-reality gap was total — and in the wrong direction.
This matters for accounting teams building AI workflows. The intuition that AI is making things faster is not always borne out by measurement. In accounting, where accuracy matters more than speed, this gap is especially worth monitoring.
What This Means for Accountants Evaluating AI
Vendor claims need a time-horizon lens. When a software vendor says their AI can “automate your month-end close,” they’re claiming AI can reliably complete tasks that take your team hours. The right question isn’t “can your AI do this?” It’s “what is its success rate on tasks of this complexity and duration, measured over weeks of real production use — not a demo?”
Also worth asking: how much of the AI claim is actually traditional automation underneath? Deterministic rules-based automation carries much higher reliability than probabilistic AI. Both can save time — but they carry very different risk profiles.
Break work into subtasks. AI succeeds on short, clean, well-defined tasks and fails increasingly on long, messy, ambiguous ones. If a task takes you five hours, AI will struggle with it autonomously. Break it into fifteen-minute chunks with clear inputs and outputs and your success rate goes way up.
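Here is the arithmetic behind that advice, under two simplifying assumptions: chunk outcomes are independent, and a human check reliably catches a failed chunk. The success rates are illustrative, not measured:

```python
# A 5-hour task split into twenty 15-minute chunks.
per_chunk = 0.95   # assumed success rate on a short, well-defined chunk
chunks = 20

# Fully autonomous: one bad chunk anywhere sinks the whole run.
print(f"End-to-end autonomous success: {per_chunk ** chunks:.0%}")   # ~36%

# With a human check after each chunk, a failure costs one retry
# instead of the whole task. Expected number of chunk runs:
print(f"Chunk runs with review-and-retry: {chunks / per_chunk:.1f}")  # ~21
```

Even with strong per-chunk performance, autonomous reliability compounds away over a long chain; checkpoints convert rare catastrophic failures into cheap retries.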
Redesign oversight for longer AI tasks. Doubling the time horizon doesn’t double the degree of automation — it changes the nature of failure. Errors become rarer but more complex and harder to catch. A silent reconciliation error building over three weeks is categorically different from a flagged exception on a single transaction. Internal controls need to account for this.
Key Takeaways
- METR’s time horizon is the most data-driven framework available for understanding autonomous AI capability, doubling roughly every three months by recent estimates
- The January 2026 TH1.1 update expanded the benchmark to 228 tasks and confirmed the trend, with the recent pace at an 89-day doubling time
- Frontier models handle ~5-hour tasks at 50% reliability — but for GUI/office work resembling accounting tasks, the real horizon may be 40–100x shorter
- 50% reliability is not a deployment threshold for any accounting function where accuracy matters
- Break AI work into small, well-defined subtasks — the single highest-impact change you can make to improve reliability
Want the CPE credit? Take the full lesson on EverydayCPE and earn 0.2 CPE credits: [lesson link]