Proof, Profits, and Overconfidence
A machine-verified Erdős solve, a Google-doc showdown on whether AI ROI is real, and evidence that instruction-tuning breaks model honesty.
GPT-5.2 Solves a Decades-Old Math Problem
Neel Somani prompted GPT-5.2 Pro to generate an original proof for Erdős Problem #397, a decades-old open problem posed by the renowned mathematician Paul Erdős. The proof was formally verified and accepted by Terence Tao.
However, this was not necessarily a massive breakthrough. Tao described the result as “low-hanging fruit” because the model solved the problem using standard, well-known techniques rather than novel theory. The key advance is reliability and autonomy, not creativity… yet.
Even incremental progress matters. When AI can autonomously close straightforward open problems and verify its own work, the bottleneck in many fields shifts from human time to problem specification. We can expect faster resolution of well-scoped challenges in mathematics, law, compliance, and engineering, but deep, insight-driven breakthroughs will remain human-led for now.
Capabilities are real, but the economic story is still unproven: Jack Clark (Anthropic co-founder), Dwarkesh Patel (AI podcaster), and Michael Burry (the man who predicted the 2008 crash and is now predicting the AI bubble bursting) largely agree on a surprising point: AI progress since 2017 has been real, discontinuous, and faster than expected. LLMs now outperform humans on many narrow cognitive tasks and feel qualitatively better every few months.
Where they diverge, though, is economics. Burry’s core skepticism is not that AI is fake, but that capability gains have not translated into durable profits or defensible margins. Infrastructure spending has exploded while application-layer revenue remains modest. If AI mainly compresses prices, automates existing tasks, and diffuses benefits to all competitors equally, then productivity gains may be real but returns on capital may collapse. The risk is not a tech bubble, but a capital cycle where everyone builds escalators and no one gets paid for them.
Productivity gains exist, but measurement is broken and often misleading: The group is unusually candid about how little we actually know. Internal surveys at Anthropic suggest large self-reported productivity boosts, yet rigorous external studies show neutral or even negative effects in complex, real-world codebases. Both can be true.
The key insight is that AI works best in “closed-loop” domains where outputs can be quickly verified. Coding crossed that threshold first. Other knowledge work is only now approaching it through better tooling, search integration, and context persistence. Until verification loops tighten, subjective usefulness will outrun objective output. This explains the paradox of massive adoption without clear labor displacement. The skeptical takeaway is not “AI doesn’t help,” but we are flying blind on true productivity, and markets are pricing in gains that have not yet been cleanly observed or proven to compound.
The biggest disagreement is on who captures value: The three authors accept that AI will matter, but the line is drawn between value capture and value diffusion. Burry argues that AI will resemble past general-purpose technologies, where customers benefit most and producers compete margins away. Nvidia’s dominance looks fragile once smaller models mature, and foundation models risk becoming expensive commodities unless recursive self-improvement or cost advantages create defensibility.
Dwarkesh and Clark are more open to upside if capabilities keep accelerating fast enough to justify today’s spend, but even they concede that competition remains intense and durable advantages are unclear.
The shared skeptical conclusion is this: AI can be transformative and still be a poor investment for much of the supply chain. If that happens, the winners are users, not shareholders. That would be economically deflationary, socially useful, and financially brutal for anyone betting that scale alone guarantees returns.
Key Takeaway: This is not an “AI is fake” conversation. It is a warning that technological progress, capital allocation, and profit are not the same thing.
Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs (Apple)
Base AI models are “naturally” honest about their knowledge: The most significant finding is that base models (AI that has only been trained on raw text from the internet) are remarkably good at judging their own confidence. Even though they were never explicitly taught to be “self-aware,” they “know what they don’t know” as a byproduct of their basic training.
Across multiple datasets, base models achieved extremely low “SmoothECE” (Calibration Error) scores, where a score of 0.00 is perfect.
Llama-3.1-70B on general facts: 0.030 error
Mistral-7B on trivia: 0.037 error
Qwen2.5-7B on math problems: 0.048 error
The significance: This means if a base model is 70% sure of a “concept” (like the capital of France), it is actually correct almost exactly 70% of the time.
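To make the metric concrete, here is a minimal sketch of a standard equal-width binned expected calibration error (ECE). The paper's smoothECE is a kernel-smoothed refinement of this same idea; the function below is an illustration under that assumption, not the authors' implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average gap between stated confidence and actual
    accuracy, weighted by how many answers fall in each confidence bin.
    (Illustrative; the paper reports smoothECE, a smoothed variant.)
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Collect answers whose confidence lands in this bin;
        # the final bin also includes confidence == 1.0 exactly.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if idx:
            avg_conf = sum(confidences[i] for i in idx) / len(idx)
            accuracy = sum(correct[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy case: answers given with 70% confidence
# that are right exactly 7 times out of 10 incur zero penalty.
conf = [0.7] * 10
hits = [1] * 7 + [0] * 3
print(round(expected_calibration_error(conf, hits), 3))  # → 0.0
```

An overconfident model, say one that reports 90% confidence while being right only half the time, would score a large ECE on the same scale, which is exactly the pattern the paper attributes to instruction-tuned models.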
“Polishing” AI for users destroys this honesty: We usually prefer using “Instruct” models (like the chat-ready versions of GPT or Llama) because they are more helpful and polite. However, the researchers found that instruction tuning and Reinforcement Learning (RLHF) systematically “break” the model’s ability to be honest about its confidence.
When models are trained to be helpful assistants, they often become overconfident. They might give a better-looking answer, but they lose the “internal compass” that tells them when they are guessing. In comparative tests, “Instruct” models consistently showed significantly higher calibration errors than their raw “Base” counterparts across all four major datasets tested.
“Thinking Step-by-Step” Makes AI Less Self-Aware: A common trick to make AI better at math is asking it to use Chain-of-Thought (CoT) reasoning, where it thinks step by step. While this often makes the AI more accurate, it ruins its ability to predict whether it will be right.
For an AI to be “calibrated” (honest), it needs to “know” its chance of success before it starts writing. When using CoT, the AI doesn’t know where its own “thoughts” will lead until it finishes the entire reasoning process, making its initial confidence estimate meaningless.
The study evaluated over 650 experiments and found that while base models giving concise answers were highly reliable, adding CoT shifted them into the “not predicted calibrated” category.
Key Takeaway: Large Language Models (LLMs) are naturally “honest” about their own knowledge as a side effect of their basic training, but the techniques we use to make them “helpful” for humans inadvertently break this honesty.
