The Man Out to Prove How Dumb AI Still Is


At first glance, Sam Altman and François Chollet might seem like kindred spirits. Both dream of artificial general intelligence (AGI)—machines that can think, reason, and innovate like humans. But while Altman has boldly claimed that OpenAI is on the brink of achieving AGI, Chollet, a renowned AI researcher and outspoken skeptic, dismisses such assertions as “absolutely clown shoes.”

Chollet, a former Google engineer and the creator of the widely used Keras deep-learning framework, has spent years warning that today’s AI models—no matter how impressive—are not truly intelligent. They excel at regurgitating patterns from vast datasets, acing standardized tests, and mimicking human language with eerie precision. But true intelligence, he argues, requires something far deeper: the ability to reason from first principles, adapt to novel problems, and think creatively—qualities that remain elusive for even the most advanced AI.

The ARC-AGI Test: A Reality Check for AI

In 2019, Chollet introduced the Abstraction and Reasoning Corpus (ARC-AGI), a test designed to expose the gap between human cognition and AI’s brittle, pattern-matching abilities. Unlike traditional benchmarks—which AI can brute-force with enough data—ARC-AGI presents entirely unfamiliar visual puzzles. Test-takers must deduce underlying rules from sparse examples, then apply them to solve new problems.
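
(For readers curious about the mechanics, here is a minimal sketch of the task format used by the public ARC dataset: each task is JSON with “train” and “test” pairs of grids, where a grid is a list of rows of color codes 0 through 9. The toy mirror rule and the checker below are illustrative inventions, not part of Chollet’s benchmark.)

    # A toy ARC-style task. Real tasks live in JSON files with the same
    # "train"/"test" shape; the rule here (mirror each row) is invented
    # for illustration.
    task = {
        "train": [
            {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
            {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
        ],
        "test": [
            {"input": [[3, 0], [0, 0]]},  # a solver must produce [[0, 3], [0, 0]]
        ],
    }

    def mirror(grid):
        """One hypothesized rule: reflect each row left to right."""
        return [row[::-1] for row in grid]

    def fits_training(rule, task):
        """A rule counts only if it explains every training pair."""
        return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

    if fits_training(mirror, task):
        for pair in task["test"]:
            print(mirror(pair["input"]))  # [[0, 3], [0, 0]]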

Humans, even those seeing ARC-AGI for the first time, typically score 60-70%. Early AI models, however, floundered spectacularly. GPT-3 scored 0%. Even OpenAI’s much-hyped GPT-4, touted for its “advanced reasoning,” managed only 5% in 2023. Google’s Gemini and Anthropic’s Claude fared slightly better, but none came close to human performance.

For Chollet, these failures were telling. AI’s reliance on memorization and statistical correlations, rather than true reasoning, left it helpless when faced with genuinely novel challenges. “If you were not intelligent, like the entire GPT series,” he quipped, “you would score basically zero.”

OpenAI’s Breakthrough—And the Catch

Then, in late 2024, OpenAI stunned the AI community with o3, a new reasoning model that scored 87% on ARC-AGI, matching human performance for the first time. Chollet called it a “genuine breakthrough.” The model could dynamically combine strategies, adjust its approach, and solve puzzles in ways that suggested real fluid intelligence.

But there was a catch.

To achieve that score, o3 took 14 minutes per puzzle, generating over 1,000 possible answers before settling on one—a brute-force method that would be impossible for a human. The computational cost? Potentially hundreds of thousands of dollars per run. As Melanie Mitchell, an AI researcher at the Santa Fe Institute, noted, this wasn’t elegant reasoning—it was trial and error at an inhuman scale.
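
(OpenAI has not published o3’s exact procedure. The sketch below is only a schematic of the sample-many-then-vote pattern Mitchell describes; the sample_answer stand-in is hypothetical, simulating one expensive model rollout.)

    import random
    from collections import Counter

    def sample_answer(puzzle, rng):
        # Hypothetical stand-in for one costly model rollout: the model
        # guesses a plausible answer, often but not always the right one.
        return rng.choice(["A", "A", "A", "B", "C"])

    def best_of_n(puzzle, n=1000, seed=0):
        """Draw n independent candidates, then keep the most common one,
        trading enormous compute for reliability."""
        rng = random.Random(seed)
        votes = Counter(sample_answer(puzzle, rng) for _ in range(n))
        answer, count = votes.most_common(1)[0]
        return answer, count / n

    print(best_of_n("some ARC-style puzzle"))  # e.g. ('A', 0.6)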

The Goalposts Move Again

Just as Silicon Valley began celebrating, Chollet unveiled ARC-AGI-2, a far more difficult version of the test. This time, even OpenAI’s best models crashed back to earth. A public version of o3 dropped from 30% to below 2%, while other leading AIs scored near 1%. Humans, meanwhile, still averaged 60%.

The message was clear: True reasoning remains out of reach.

The Bigger Problem: What Even Is AGI?

The debate over ARC-AGI highlights a deeper issue—no one agrees on what AGI really means.

  • For OpenAI and Microsoft, AGI might simply mean a bot profitable enough to generate $100 billion (a figure reportedly tied to their investor agreements).
  • For Chollet, it means machines that can think like humans—adapting, innovating, and reasoning without endless trial and error.
  • For the broader AI industry, AGI is often a marketing buzzword, slapped onto models that are merely better at pattern recognition.

And perhaps the most uncomfortable truth? Human intelligence itself is still a mystery. We solve problems in wildly different ways, blending logic, intuition, and creativity. Some people breeze through ARC-AGI; others struggle. But that diversity—not raw computational power—is what makes human thought so powerful.

The Real Question: Do We Even Want AGI?

As AI labs race toward an ever-shifting definition of AGI, Chollet’s work forces a crucial question: Should the goal be replicating human intelligence—or something else entirely?

If brute-forcing puzzles with massive compute is the answer, then AI may never truly “think.” But if the path forward requires understanding the messy, beautiful complexity of human cognition, then the industry has a long way to go.

For now, one thing is certain: As long as Chollet keeps designing harder tests, AI’s limitations will keep showing up. And that might be exactly what the field needs.


Final Thought:
“Intelligence isn’t just about solving puzzles—it’s about knowing which puzzles are worth solving.”

Frequently Asked Questions

1. What is ARC-AGI, and why does it matter?

ARC-AGI is a test designed by AI researcher François Chollet to measure fluid intelligence—the ability to solve unfamiliar problems using reasoning, not memorization. Unlike standard AI benchmarks (which models can “cheat” by training on vast data), ARC-AGI’s puzzles are unique, forcing AI to think from scratch. Most humans score ~60-70%, while early AI models (like GPT-3) scored 0%. Recent models have improved, but their high compute costs and inefficiency suggest they’re still faking intelligence rather than truly understanding.

2. Did OpenAI’s AI finally pass the test?

Sort of. In late 2024, OpenAI’s o3 model scored 87% on ARC-AGI, matching humans. But there was a catch:
  • It took 14 minutes per puzzle, far longer than a person needs.
  • It brute-forced answers by generating 1,000+ candidate guesses, rather than relying on human-style insight.
  • The compute cost was astronomical, likely in the hundreds of thousands of dollars per run.

3. Why can’t AI companies agree on what AGI means?

  • OpenAI and Microsoft reportedly tie AGI to profitability (roughly $100 billion in profits).
  • Chollet argues AGI requires human-like reasoning, not just pattern-matching.
  • Other labs treat AGI as a vague marketing term for “better chatbots.”
