ThermoQA: 293 Questions, 6 Models, 3 Tiers — The AI Thermodynamic Reasoning Report Card
293 questions, 3 tiers, 6 frontier models, 18 evaluations. Opus leads at 94.1% composite; MiniMax trails at 73.0%. A 21-point spread shows that thermodynamics discriminates between AI models. Memorization ≠ reasoning.
Olivenet Team
IoT & Automation Experts
We've completed all three tiers of the ThermoQA series: Tier 1 (property lookups), Tier 2 (component analysis), and Tier 3 (cycle analysis). Now the big picture: 293 questions, 6 frontier models, 3 independent runs — the AI thermodynamic reasoning report card.
Benchmark Overview
[Figure: ThermoQA benchmark statistics (full numerical summary of the three-tier system)]
ThermoQA uses CoolProp 7.2.0 (IAPWS-IF97 + Helmholtz EOS) as ground truth with ±2% tolerance and weighted step-level scoring. Each model was evaluated across 3 independent runs — 18 evaluations total.
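To make the grading concrete, here is a minimal sketch of the ±2% tolerance check, assuming a question that asks for the saturated-vapor enthalpy of water at 100 °C; `within_tolerance` is a hypothetical helper, not the benchmark's actual harness:

```python
# Minimal sketch of the ±2% check; `within_tolerance` is hypothetical,
# not ThermoQA's actual harness. Requires: pip install CoolProp
from CoolProp.CoolProp import PropsSI

# CoolProp reference: saturated-vapor enthalpy of water at 100 °C (J/kg)
reference = PropsSI("H", "T", 373.15, "Q", 1, "Water")

def within_tolerance(model_value: float, ref: float, tol: float = 0.02) -> bool:
    """True if the model's numeric answer lies within ±2% of the reference."""
    return abs(model_value - ref) <= tol * abs(ref)

print(within_tolerance(2.26e6, reference))  # ~2257 kJ/kg reference -> True
```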
Overall Leaderboard: Composite Scores
[Chart: composite leaderboard for all six models · 293 questions · 3 tiers · 3 independent runs · weighted by question count]
The composite score is a weighted average by question count per tier (T1: 110, T2: 101, T3: 82). Claude Opus 4.6 leads at 94.1% and is the only model showing stable performance across all tiers. GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%) follow closely. MiniMax M2.5 trails at 73.0%, a full 21-point gap behind the leader.
The top three are separated by just 1.6 points — but the telling number is the T1→T3 drop: Opus -2.8 pp, GPT -8.1 pp, Gemini -10.4 pp. The transition from simple property lookups to full cycle analysis reveals each model's true reasoning capacity.
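The composite itself is easy to reproduce; below is a minimal sketch using the tier weights from the Methodology section, with illustrative (not published) per-tier inputs:

```python
# Question-count-weighted composite, per the Methodology section.
WEIGHTS = {"T1": 110, "T2": 101, "T3": 82}  # questions per tier (293 total)

def composite(tier_scores: dict[str, float]) -> float:
    """Weighted average: (110*T1 + 101*T2 + 82*T3) / 293."""
    return sum(WEIGHTS[t] * tier_scores[t] for t in WEIGHTS) / sum(WEIGHTS.values())

# Illustrative inputs only, not any model's real per-tier scores:
print(round(composite({"T1": 96.0, "T2": 94.0, "T3": 92.0}), 1))  # 94.2
```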
Cross-Tier Performance Journey
[Chart: each model's T1→T2→T3 performance trajectory and rank movement]
The most critical finding: T1 rankings do NOT predict T3. Gemini is #1 in T1 at 97.9% but drops to #3 in T3 at 87.5%. T2→T3 correlation is ρ=1.0 (perfect) — the component analysis ranking is preserved exactly in cycle analysis. But T1→T3 correlation is only ρ=0.6. This reveals the fundamental difference between memorization (T1) and reasoning (T2/T3).
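The rank-correlation numbers are easy to sanity-check with scipy's Spearman implementation. In the sketch below, T1 ranks other than Opus (#3) and Gemini (#1) are placeholders chosen so the example reproduces the reported ρ=0.6, not the published rankings:

```python
from scipy.stats import spearmanr

# Models listed in T3 order: Opus, GPT, Gemini, then three placeholders.
t3_ranks = [1, 2, 3, 4, 5, 6]
# T1 ranks: Opus #3 and Gemini #1 per the post; the rest are placeholders.
t1_ranks = [3, 2, 1, 5, 6, 4]

rho, _ = spearmanr(t1_ranks, t3_ranks)
print(f"T1 -> T3 Spearman rho = {rho:.1f}")  # 0.6
```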
Opus's stability is remarkable: a #3→#1→#1 trajectory with the least degradation across tiers. This demonstrates that Opus doesn't just recall correct information; it sustains complex multi-step reasoning chains.
Each Tier's Discriminator
Each tier has a "discriminator" category that creates the largest performance spread between models:
- Tier 1 — Supercritical region: 45%–89.5% spread. In the supercritical region, pressure-temperature relationships change sharply near the critical point, creating edge cases where models can't rely on memorization.
- Tier 2 — R-134a and compressor: All models collapse on R-134a (44%–63%), and the compressor is the hardest component (55.5%–75.4%). The dominance of water and air in training data creates systematic failure on refrigerant fluids.
- Tier 3 — Variable cp and CCGT: BRY-RV shows a 63-point spread (35.8%–98.9%), CCGT shows a 58-point spread (31.8%–90.1%). Variable specific heat capacity and multi-fluid integration define the reasoning boundary.
Model Profiles
[Cards: each model's strengths, weaknesses, and 293-question report card (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4, MiniMax M2.5)]
8 Key Findings
Synthesis from 293 questions, 3 tiers, and 6 models
Memorization ≠ Reasoning
T1 rankings are misleading: Gemini is #1 in T1 but #3 in T3. T1→T3 correlation is only ρ=0.6. Property lookup success does NOT predict reasoning capacity.
Opus Is Most Stable
Only -2.8 pp drop (T1→T3) — the most stable model. #1 in both T2 and T3, composite 94.1%. The preferred model for real engineering reliability.
Variable cp Is the Ultimate Discriminator
BRY-RV shows a 63-point spread (35.8%–98.9%). Defaulting to the constant-cp assumption caps a model's thermodynamic reasoning depth; correct handling of NASA 7-coefficient polynomials is critical.
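For readers unfamiliar with the format, here is a minimal sketch of what variable-cp evaluation involves. The polynomial form is the standard NASA 7-coefficient one (the first five coefficients give cp); the coefficients shown are the widely used low-temperature (200–1000 K) set for N₂ and should be verified against your own thermo database:

```python
# NASA 7-coefficient polynomial: cp/R = a1 + a2*T + a3*T^2 + a4*T^3 + a5*T^4.
# Coefficients: classic CHEMKIN-style N2 set, valid roughly 200-1000 K.
R = 8.314462618  # J/(mol*K)
N2_LOW = (3.298677, 1.4082404e-3, -3.963222e-6, 5.641515e-9, -2.444854e-12)

def cp_molar(T: float, a: tuple = N2_LOW) -> float:
    """Molar heat capacity cp(T) in J/(mol*K) from the first five coefficients."""
    return R * (a[0] + a[1]*T + a[2]*T**2 + a[3]*T**3 + a[4]*T**4)

print(cp_molar(300.0))   # ~29.1 J/(mol*K)
print(cp_molar(1000.0))  # ~32.8 J/(mol*K): constant-cp assumptions break down
```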
R-134a Exposes Training Data Bias
All models collapse on R-134a (44%–63% in T2). Models trained on water and air show systematic failure on refrigerant fluids. Training data boundary exposed.
Deeper Analysis Scaffolds Reasoning
Models score higher on Depth C (exergy) questions than on Depth A (energy) questions. The paradox: more demanding analysis yields higher accuracy, which suggests that deep-analysis scaffolding improves reasoning quality.
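As a concrete example of what "Depth C" demands, here is a sketch of a specific flow-exergy calculation, ψ = (h − h₀) − T₀(s − s₀); the 500 °C / 10 MPa steam state and the 25 °C / 1 atm dead state are illustrative choices, not benchmark questions:

```python
# Specific flow exergy psi = (h - h0) - T0*(s - s0) relative to a dead state.
from CoolProp.CoolProp import PropsSI

T0, p0 = 298.15, 101325.0  # dead state: 25 degC, 1 atm
h0 = PropsSI("H", "T", T0, "P", p0, "Water")
s0 = PropsSI("S", "T", T0, "P", p0, "Water")

T, p = 773.15, 10e6  # illustrative state: 500 degC, 10 MPa steam
h = PropsSI("H", "T", T, "P", p, "Water")
s = PropsSI("S", "T", T, "P", p, "Water")

psi = (h - h0) - T0 * (s - s0)  # J/kg; roughly 1.4 MJ/kg for this state
print(f"Flow exergy: {psi/1e3:.1f} kJ/kg")
```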
Tool Use Changes Everything
With CoolProp access, 70% → 100% is achievable. A tool-augmented evaluation track is planned for future ThermoQA versions.
Multi-Run Evaluation Is Essential
Run-to-run σ ranges from 0.1% to 2.5%. Single-run evaluations are misleading; three independent runs should be the minimum standard.
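A sketch of the per-model statistics this implies, with made-up run scores:

```python
# Mean and spread over 3 independent runs; the scores below are placeholders.
import statistics

runs = {"model-a": [93.8, 94.1, 94.4], "model-b": [71.0, 73.2, 74.8]}
for name, scores in runs.items():
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population sigma over the 3 runs
    print(f"{name}: {mu:.1f}% ± {sigma:.1f}%")
```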
21-Point Spread Confirms Discriminative Power
Composite ranges from 73.0% (MiniMax) to 94.1% (Opus) — a 21-point spread. Thermodynamics creates meaningful discrimination between AI models.
Conclusion and Future Directions
ThermoQA is the first comprehensive, multi-tier benchmark measuring AI models' thermodynamic reasoning capacity, now complete at 293 questions. The results are clear:
- Opus is the preferred model for industrial reliability — most stable, lowest drop, highest composite.
- Single-tier evaluation is insufficient — T1 success does not predict T3. Multi-tier evaluation is essential.
- Training data boundaries are exposed — R-134a and variable cp ruthlessly reveal gaps in models' training data.
- Tool use is critical — CoolProp access can overcome current performance constraints. A tool-augmented track is planned for future ThermoQA versions.
Roadmap
- Tool-Augmented Track: Evaluation with CoolProp API access — measuring models' tool-use capacity
- Entropy Hunter Integration: Domain-specific 8B model performance on ThermoQA
- Expanded Fluid Set: Industrial refrigerants like ammonia, CO₂, and propane
- Real-World Scenarios: Case-study questions based on industrial plant data
Methodology
- Reference library: CoolProp 7.2.0 (IAPWS-IF97 + Helmholtz EOS)
- Tolerance: ±2% (industrial engineering standard)
- Total questions: 293 (T1: 110, T2: 101, T3: 82)
- Models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4, MiniMax M2.5
- Runs: 3 independent runs per model
- Scoring: Weighted step-level; each intermediate step is independently validated against the CoolProp reference (see the sketch after this list)
- Composite score: Question-count-weighted average (110×T1 + 101×T2 + 82×T3) / 293
- Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
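Here is a sketch of how weighted step-level scoring can work under these rules; the step names, weights, and values below are hypothetical, not the benchmark's actual rubric:

```python
# Hypothetical weighted step-level scorer: each intermediate step is checked
# against its CoolProp reference at ±2%, and the question score is the
# weight-normalized sum of passed steps.
TOL = 0.02

def step_score(steps: list[dict]) -> float:
    """steps: [{'value': model answer, 'ref': CoolProp value, 'weight': w}, ...]"""
    total_w = sum(s["weight"] for s in steps)
    earned = sum(
        s["weight"]
        for s in steps
        if abs(s["value"] - s["ref"]) <= TOL * abs(s["ref"])
    )
    return earned / total_w

# Example: a two-step question where the model nails h2 but misses w_net.
print(step_score([
    {"value": 3374.0, "ref": 3375.1, "weight": 0.4},  # h2 [kJ/kg], within 2%
    {"value": 1180.0, "ref": 1306.5, "weight": 0.6},  # w_net [kJ/kg], ~10% off
]))  # -> 0.4
```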
Resources
- Dataset: HuggingFace — olivenet/thermoqa
- Source code: GitHub — olivenet-iot/ThermoQA
- CoolProp: coolprop.org
- Tier 1 post: ThermoQA Tier 1 Results
- Tier 2 post: ThermoQA Tier 2 Results
- Tier 3 post: ThermoQA Tier 3 Results
- Entropy Hunter: HuggingFace — olivenet/entropy-hunter-v0.4
About the Author
Olivenet Team
IoT & Automation Experts
Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.