Reading the Fine Print on AI Claims

A Stanford study came out recently with a headline that spread fast: AI outperforms law professors. The number attached to it was 75% — AI won 75% of head-to-head matchups against professors in a blind evaluation. That kind of number travels. It ends up in partner meeting decks, vendor pitches, and client conversations within 48 hours.

Here’s the thing — the finding might be real. But the 75% number is doing a lot of work that the headline doesn’t explain. And as AI capability studies start showing up in accounting and finance — and they will — knowing how to read them is a genuinely useful professional skill.

I walked through the actual paper to pull out five calibration flags. These aren’t reasons to dismiss the study. They’re reasons to know how much weight to put on it.

The Genre Problem

“AI passes the [blank]” has been a content genre for a few years now. AI beats humans at chess, then Go, then Dota 2, then the bar exam, then radiology. Each headline is technically accurate within its conditions. Each one implies more than the conditions support.

The Pepsi Challenge is a useful reference here. Starting in the mid-1970s and running through the 1980s, Pepsi ran blind sip tests and consistently beat Coke. The results were real: in a single-sip controlled test, Pepsi rates higher because it’s sweeter upfront. But in real-world consumption of full cans and bottles, Coke came out ahead. The study was accurate. The generalization wasn’t.

The same dynamic applies to AI research. Controlled preference in a narrow setting and real-world performance in complex professional situations are different things.

Five Flags to Know

1. Small sample, high variance

The study involved 16 law professors. Individual professors’ win rates against the AI ranged from about 3% to 51%, with a pooled average of 24.67%, which is the flip side of the headline’s 75%. That spread is enormous for a sample of 16. The aggregate result may be real, but the uncertainty is much wider than a clean headline number suggests. Even the most AI-skeptical judge in the room still preferred the AI answer 56% of the time, so the result wasn’t driven by a few outliers. But one professor roughly breaking even with the AI at 51% tells a different story than the 75% average.
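To make the sample-size point concrete, here is a minimal Python sketch of a Wilson score interval, one standard way to put a range around an observed proportion. The counts are hypothetical: this article doesn’t give the number of matchups behind the 75%, so the three sample sizes below are only there to show how the same observed rate gets tighter or looser as the sample changes. It also treats every matchup as independent, which a study with 16 judges almost certainly isn’t, so the real uncertainty is likely wider than this suggests.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (roughly 95% when z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts (not from the study): the same 75% rate at three sample sizes.
for wins, n in [(12, 16), (75, 100), (750, 1000)]:
    low, high = wilson_interval(wins, n)
    print(f"{wins}/{n} wins: plausible range {low:.1%} to {high:.1%}")
```

The takeaway is the shape, not the specific numbers: a 75% result from a handful of observations is compatible with a much wider range of true rates than the same 75% from hundreds.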

2. Professors judged — not students

This one is underappreciated. The professors in the study were both the answer-writers and the judges. So the study measures what law professors prefer to deliver to a student — not what students actually learn from. Those are different questions. A professor might prefer a longer, more formally structured answer. A student who just left a confusing lecture might benefit more from a shorter, plainer response from someone who knows where they got lost. The study also evaluated static written answers with no back-and-forth — not the adaptive, responsive thing that actually happens in office hours.

3. Google-only human evaluation

The human evaluation, the part where actual professors made choices, only tested two models: Gemini 2.5 Pro and NotebookLM. Both are Google products. Other models like Claude and ChatGPT appear in the paper but were evaluated by an AI judge, not human professors. The headline says “AI” generically. The human-validated result is specifically about Google’s models, under a methodology selected because Google had optimized for education.

4. Funding and disclosure gap

The lead researcher is affiliated with Stanford’s Human-Centered AI institute, which receives significant funding from Google. The paper doesn’t include a conflict of interest disclosure. That doesn’t mean the findings are wrong — researchers produce valid work with institutional ties to industry all the time. But in the absence of a disclosure, a careful reader notes the connection and weighs how much independent corroboration they’d want before acting on the result.

5. Narrow domain, wide headline

The study evaluated 90-word written answers to first-year contracts law questions — all from the same casebook, all simulating office-hours exchanges. That’s a specific, well-bounded domain. Generalizing from that to “AI outperforms law professors” — full stop — is a leap the headline makes but the paper doesn’t. The study says so itself in the fine print. The headline does not.

What This Means for CPAs

The same genre of study is coming to accounting. “AI outperforms junior auditors.” “AI achieves passing scores on the CPA exam.” Each one will land the same way this Stanford study landed — with a clean number and a headline that implies a profession is under threat.

Knowing how to read these studies matters in three concrete situations. First, vendor pitches: AI tool vendors will cite research like this. Your job is to ask what was actually measured, by how many people, under what conditions, and who funded it. Second, client conversations: when a client says they read that AI does this better than a CPA, the right response isn’t defensiveness; it’s context. A controlled, narrow test doesn’t reflect complex, judgment-heavy, client-specific work. Third, internal governance: if your firm is evaluating AI tools based on benchmarks, someone needs to read past the abstract.

Key Takeaways

  • Preference isn’t accuracy, and professor preference isn’t student outcome — know which question the study is actually answering.
  • Sample size and variance both matter. 16 professors with a spread from 3% to 51% is a signal, not a settled generalization.
  • Who judged matters as much as sample size — are they a proxy for the actual end user?
  • Check model selection and funding. Not to dismiss findings — but to know how much corroboration you’d want before relying on them.
  • Calibration, not cynicism. Read AI research the way you’d read any evidence in a professional context — curious, appropriately skeptical, and clear-eyed about what’s claimed versus what the headline says.

Want the CPE credit? Take the full lesson on EverydayCPE and earn 0.2 CPE credits: [lesson link]
