
ThermoQA: 293 Questions, 6 Models, 3 Tiers — The AI Thermodynamic Reasoning Report Card

293 questions, 3 tiers, 6 frontier models, 18 evaluations. Opus leads at 94.1% composite; MiniMax trails at 73.0%. The 21-point spread shows that thermodynamics genuinely discriminates between models. Memorization ≠ reasoning.

Olivenet Team

IoT & Automation Experts

2026-03-17 · 4 min read

We've completed all three tiers of the ThermoQA series: Tier 1 (property lookups), Tier 2 (component analysis), and Tier 3 (cycle analysis). Now the big picture: 293 questions, 6 frontier models, 3 independent runs — the AI thermodynamic reasoning report card.

Benchmark Overview

ThermoQA Benchmark Statistics

Full numerical summary of the three-tier benchmark system

  • 293 total questions (110 + 101 + 82)
  • 3 tiers (Property → Component → Cycle)
  • 6 frontier models (Opus, GPT, Gemini, DeepSeek, Grok, MiniMax)
  • 18 evaluations (6 models × 3 runs)
  • 4 fluids (water, air, R-134a, air + water)
  • 10 cycle types (3 Rankine, 4 Brayton, VCR, CCGT)
  • 7 component types (turbine, compressor, pump, HX, boiler, mixing, nozzle)
  • 3 analysis depths (Energy → Entropy → Exergy)

ThermoQA uses CoolProp 7.2.0 (IAPWS-IF97 + Helmholtz EOS) as ground truth with ±2% tolerance and weighted step-level scoring. Each model was evaluated across 3 independent runs — 18 evaluations total.
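
The grading scheme can be sketched in a few lines. This is a minimal illustration only: `within_tolerance` and `score_question` are hypothetical names, and the weights and step values below are made up, not ThermoQA's actual rubric.

```python
# Minimal sketch of ±2% relative-tolerance checking and weighted
# step-level scoring. Names, weights, and values are illustrative only.

def within_tolerance(predicted: float, reference: float, rel_tol: float = 0.02) -> bool:
    """True if the prediction is within ±2% of the CoolProp reference value."""
    return abs(predicted - reference) <= rel_tol * abs(reference)

def score_question(steps: list[tuple[float, float, float]]) -> float:
    """Each step is (predicted, reference, weight); returns a score in [0, 1]."""
    total = sum(w for _, _, w in steps)
    earned = sum(w for pred, ref, w in steps if within_tolerance(pred, ref))
    return earned / total

# A made-up 3-step question: two steps land inside ±2%, the final one misses.
steps = [
    (2776.0, 2778.1, 2),   # h1 [kJ/kg]: within 2% of reference
    (1890.0, 1902.5, 3),   # h2 [kJ/kg]: within 2% of reference
    (0.310,  0.342,  5),   # cycle efficiency: ~9% off, outside tolerance
]
print(score_question(steps))  # 0.5
```

Step-level scoring is what separates this from final-answer grading: a model that sets up the chain correctly but slips on the last step still earns partial credit.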

Overall Leaderboard: Composite Scores

293 questions · 3 tiers · 6 models · 3 independent runs · weighted by question count

🥇 Claude Opus 4.6 (Anthropic) · Composite 94.1% · T1 96.4% · T2 92.1% · T3 93.6% · T1→T3 -2.8 pp
🥈 GPT-5.4 (OpenAI) · Composite 93.1% · T1 97.8% · T2 90.8% · T3 89.7% · T1→T3 -8.1 pp
🥉 Gemini 3.1 Pro (Google) · Composite 92.5% · T1 97.9% · T2 90.8% · T3 87.5% · T1→T3 -10.4 pp
#4 DeepSeek-R1 (DeepSeek) · Composite 87.4% · T1 90.5% · T2 89.2% · T3 81.0% · T1→T3 -9.5 pp
#5 Grok 4 (xAI) · Composite 87.3% · T1 91.8% · T2 87.9% · T3 80.4% · T1→T3 -11.4 pp
#6 MiniMax M2.5 (MiniMax) · Composite 73.0% · T1 85.2% · T2 76.2% · T3 52.7% · T1→T3 -32.5 pp

The composite score is a weighted average by question count per tier (T1: 110, T2: 101, T3: 82). Claude Opus 4.6 leads at 94.1% — the only model showing stable performance across all tiers. GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%) follow closely. MiniMax M2.5 trails at 73.0%, a full 21-point gap.

The top three are separated by just 1.6 points — but the telling number is the T1→T3 drop: Opus -2.8 pp, GPT -8.1 pp, Gemini -10.4 pp. The transition from simple property lookups to full cycle analysis reveals each model's true reasoning capacity.
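
The composite is easy to reproduce from the reported tier scores; a quick sanity check in Python (scores taken from the leaderboard above):

```python
# Question-count-weighted composite: (110*T1 + 101*T2 + 82*T3) / 293,
# using the tier scores reported in the leaderboard above.

def composite(t1: float, t2: float, t3: float) -> float:
    return (110 * t1 + 101 * t2 + 82 * t3) / 293

print(round(composite(96.4, 92.1, 93.6), 1))  # Claude Opus 4.6: 94.1
print(round(composite(97.8, 90.8, 89.7), 1))  # GPT-5.4: 93.1
print(round(composite(85.2, 76.2, 52.7), 1))  # MiniMax M2.5: 73.0
```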

Cross-Tier Performance Journey

Each model's T1→T2→T3 performance trajectory and rank movement

Claude Opus 4.6: T1 96.4% → T2 92.1% → T3 93.6% (-2.8 pp) · ranks #3→#1→#1 · σ ±0.5%
GPT-5.4: T1 97.8% → T2 90.8% → T3 89.7% (-8.1 pp) · ranks #2→#2→#2 · σ ±0.5%
Gemini 3.1 Pro: T1 97.9% → T2 90.8% → T3 87.5% (-10.4 pp) · ranks #1→#3→#3 · σ ±1.1%
DeepSeek-R1: T1 90.5% → T2 89.2% → T3 81.0% (-9.5 pp) · ranks #5→#4→#4 · σ ±1.6%
Grok 4: T1 91.8% → T2 87.9% → T3 80.4% (-11.4 pp) · ranks #4→#5→#5 · σ ±0.9%
MiniMax M2.5: T1 85.2% → T2 76.2% → T3 52.7% (-32.5 pp) · ranks #6→#6→#6 · σ ±1.1%

T2→T3 rank correlation ρ=1.0 (perfect), but T1→T3 is only ρ=0.6. Tier 1 success does NOT predict later tiers — memorization ≠ reasoning.

Opus is the most stable model, dropping only 2.8 pp. Gemini falls from #1 in T1 to #3 in T3, losing 10.4 pp.

The most critical finding: T1 rankings do NOT predict T3. Gemini is #1 in T1 at 97.9% but drops to #3 in T3 at 87.5%. T2→T3 correlation is ρ=1.0 (perfect) — the component analysis ranking is preserved exactly in cycle analysis. But T1→T3 correlation is only ρ=0.6. This reveals the fundamental difference between memorization (T1) and reasoning (T2/T3).
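
The correlation figures use Spearman's ρ; with six distinct ranks and no ties, the textbook formula ρ = 1 − 6Σd²/(n(n²−1)) applies directly. A small sketch reproducing the perfect T2→T3 correlation from the rank trajectories:

```python
# Spearman's rho from two rank lists (no ties): rho = 1 - 6*sum(d^2) / (n*(n^2-1)).

def spearman(a: list[int], b: list[int]) -> float:
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks in model order Opus, GPT, Gemini, DeepSeek, Grok, MiniMax.
t2_ranks = [1, 2, 3, 4, 5, 6]
t3_ranks = [1, 2, 3, 4, 5, 6]  # identical ordering: component ranks carry over to cycles
print(spearman(t2_ranks, t3_ranks))  # 1.0
```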

Opus's stability is remarkable: a #3→#1→#1 trajectory with the least degradation across tiers. This demonstrates that Opus doesn't just recall correct information; it sustains complex multi-step reasoning chains.

Each Tier's Discriminator

Categories creating the largest performance spread between models

Tier 1: Supercritical
The supercritical region challenges all models; the biggest gap is here.
Supercritical Region (spread 45%–89.5%): Opus 70.5% · GPT 89.5% · Gemini 77.8% · DeepSeek 48.9% · Grok 52.8% · MiniMax 45.0%

Tier 2: R-134a & Compressor
R-134a collapses all models; the compressor is the hardest component.
R-134a Fluid (spread 44%–63.4%): Opus 54.1% · GPT 50.4% · Gemini 47.6% · DeepSeek 63.4% · Grok 44.0% · MiniMax 54.2%
Compressor (spread 55.5%–75.4%): Opus 75.4% · GPT 71.2% · Gemini 66.3% · DeepSeek 67.2% · Grok 64.8% · MiniMax 55.5%

Tier 3: Variable cp & CCGT
Variable cp and the combined cycle are the most challenging scenarios.
Variable cp BRY-RV (spread 35.8%–98.9%): Opus 98.9% · GPT 90.3% · Gemini 48.4% · DeepSeek 52.5% · Grok 63.8% · MiniMax 35.8%
CCGT Combined Cycle (spread 31.8%–90.1%): Opus 90.1% · GPT 82.0% · Gemini 74.0% · DeepSeek 72.2% · Grok 58.6% · MiniMax 31.8%

Each tier has a "discriminator" category that creates the largest performance spread between models:

  • Tier 1 — Supercritical region: 45%–89.5% spread. In the supercritical region, pressure-temperature relationships change sharply near the critical point, creating edge cases where models can't rely on memorization.
  • Tier 2 — R-134a and compressor: All models collapse on R-134a (44%–63%), compressor is the hardest component (55.5%–75.4%). The dominance of water and air in training data creates systematic failure on refrigerant fluids.
  • Tier 3 — Variable cp and CCGT: BRY-RV shows a 63-point spread (35.8%–98.9%), CCGT shows a 58-point spread (31.8%–90.1%). Variable specific heat capacity and multi-fluid integration define the reasoning boundary.
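
For reference, the "spread" quoted for each discriminator is simply the gap between the best and worst model score in that category; a quick check against two of the categories above:

```python
# Spread = best minus worst model score in a category (scores from the text above).
supercritical = {"Opus": 70.5, "GPT": 89.5, "Gemini": 77.8,
                 "DeepSeek": 48.9, "Grok": 52.8, "MiniMax": 45.0}
bry_rv = {"Opus": 98.9, "GPT": 90.3, "Gemini": 48.4,
          "DeepSeek": 52.5, "Grok": 63.8, "MiniMax": 35.8}

def spread(scores: dict[str, float]) -> float:
    return round(max(scores.values()) - min(scores.values()), 1)

print(spread(supercritical))  # 44.5 points
print(spread(bry_rv))         # 63.1 points
```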

Model Profiles

Each model's strengths, weaknesses, and 293-question report card

🥇 Claude Opus 4.6 · 94.1% composite
Best tier: T2 & T3 (92.1%, 93.6%) · Worst tier: T1 (96.4%, still 3rd)
Strength: Most stable model, only a -2.8 pp drop · Weakness: Supercritical 70.5%, compressor 75.4%
T1→T3: -2.8 pp · σ: ±0.5%

🥈 GPT-5.4 · 93.1% composite
Best tier: T1 (97.8%) · Worst tier: T3 (89.7%)
Strength: Most consistent results (σ=±0.5%), strong T1 · Weakness: Compressor 71.2%, VCR 76.6%
T1→T3: -8.1 pp · σ: ±0.5%

🥉 Gemini 3.1 Pro · 92.5% composite
Best tier: T1 (97.9%, highest T1) · Worst tier: T3 (87.5%)
Strength: Highest T1 score, surprise VCR recovery · Weakness: Variable cp collapse (97%→48%), -10.4 pp drop
T1→T3: -10.4 pp · σ: ±1.1%

#4 DeepSeek-R1 · 87.4% composite
Best tier: T1 (90.5%) · Worst tier: T3 (81.0%)
Strength: T2 R-134a leader (63.4%), deep-analysis advantage · Weakness: Supercritical 48.9%, VCR 66.8%
T1→T3: -9.5 pp · σ: ±1.6%

#5 Grok 4 · 87.3% composite
Best tier: T1 (91.8%) · Worst tier: T3 (80.4%)
Strength: Strong T2 depth analysis (94.3% at Depth C) · Weakness: Supercritical 52.8%, CCGT 58.6%, -11.4 pp drop
T1→T3: -11.4 pp · σ: ±0.9%

#6 MiniMax M2.5 · 73.0% composite
Best tier: T1 (85.2%) · Worst tier: T3 (52.7%)
Strength: T2 air-fluid surprise (96.3%) · Weakness: Catastrophic -32.5 pp collapse, CCGT 31.8%, VCR 32.5%
T1→T3: -32.5 pp · σ: ±1.1%

8 Key Findings

Synthesis from 293 questions, 3 tiers, and 6 models

#1
Memorization ≠ Reasoning

T1 rankings are misleading: Gemini is #1 in T1 but #3 in T3. T1→T3 correlation is only ρ=0.6. Property lookup success does NOT predict reasoning capacity.

#2
Opus Is Most Stable

Only -2.8 pp drop (T1→T3) — the most stable model. #1 in both T2 and T3, composite 94.1%. The preferred model for real engineering reliability.

#3
Variable cp Is the Ultimate Discriminator

BRY-RV shows a 98.9% vs 35.8% spread, a 63-point gap. The constant-cp assumption is exactly where models' thermodynamic reasoning depth breaks down; correctly handling variable specific heats (NASA 7-coefficient polynomials) is critical.
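
The effect is easy to demonstrate: with a temperature-dependent cp(T), Δh = ∫cp(T)dT grows faster than the constant-cp shortcut cp(T₁)·ΔT. The quadratic below is purely illustrative (made-up coefficients, not an actual NASA-7 fit):

```python
# Illustrative only: a made-up quadratic cp(T), NOT real NASA-7 coefficients.
def cp(T: float) -> float:
    """Specific heat in kJ/(kg*K), increasing with temperature."""
    return 1.0 + 2.0e-4 * T + 1.0e-8 * T * T

def dh_variable(T1: float, T2: float, n: int = 10_000) -> float:
    """dh = integral of cp(T) dT over [T1, T2], via the trapezoidal rule (kJ/kg)."""
    step = (T2 - T1) / n
    total = 0.5 * (cp(T1) + cp(T2)) + sum(cp(T1 + i * step) for i in range(1, n))
    return total * step

T1, T2 = 300.0, 1300.0
dh_const = cp(T1) * (T2 - T1)   # constant-cp shortcut, cp frozen at the inlet
dh_var = dh_variable(T1, T2)
print(round(dh_const, 1), round(dh_var, 1))  # the shortcut undershoots by ~9% here
```

A model that silently applies the constant-cp shortcut across a 1000 K Brayton temperature rise fails the ±2% tolerance by a wide margin, which is exactly what the BRY-RV spread captures.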

#4
R-134a Exposes Training Data Bias

All models collapse on R-134a (44%–63% in T2). Models trained on water and air show systematic failure on refrigerant fluids. Training data boundary exposed.

#5
Deeper Analysis Scaffolds Reasoning

Depth C (exergy) scores exceed Depth A (energy), a paradox: the more complex analysis yields higher accuracy. The extra structure of a deep analysis scaffolds the reasoning and improves its quality.

#6
Tool Use Changes Everything

With CoolProp access, 70% → 100% is achievable. A tool-augmented evaluation track is planned for future ThermoQA versions.
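
What tool augmentation might look like in practice: the model issues a CoolProp `PropsSI` query instead of recalling the property. The sketch below falls back to a stubbed value so it runs without CoolProp installed; the stub number is illustrative, not a reference value.

```python
# A tool-augmented lookup: call CoolProp's PropsSI when available,
# otherwise fall back to a stubbed value so this sketch runs offline.
try:
    from CoolProp.CoolProp import PropsSI
except ImportError:
    def PropsSI(output, name1, value1, name2, value2, fluid):
        # Stub: roughly the enthalpy of liquid water at 300 K, 1 atm (illustrative).
        return 112_650.0  # J/kg

# Specific enthalpy of water at T = 300 K, p = 101325 Pa.
h = PropsSI("H", "T", 300.0, "P", 101325.0, "Water")
print(f"h = {h / 1000:.1f} kJ/kg")
```

With the property step delegated to the library, the remaining error budget is purely in the reasoning chain, which is what the benchmark actually wants to measure.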

#7
Multi-Run Evaluation Is Essential

σ values range from 0.1% to 2.5%. Single-run evaluations are misleading — 3 independent runs should be the minimum standard.
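
Run-to-run variance is cheap to quantify; a sketch with hypothetical scores for one tier across three runs:

```python
# Mean and sample standard deviation over 3 independent runs (illustrative scores).
from statistics import mean, stdev

runs = [92.8, 93.6, 93.9]  # hypothetical tier scores from three runs
print(round(mean(runs), 1), round(stdev(runs), 2))  # 93.4 0.57
```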

#8
21-Point Spread Confirms Discriminative Power

Composite ranges from 73.0% (MiniMax) to 94.1% (Opus) — a 21-point spread. Thermodynamics creates meaningful discrimination between AI models.

Conclusion and Future Directions

ThermoQA is the first comprehensive, multi-tier benchmark measuring AI models' thermodynamic reasoning capacity, now complete at 293 questions. The results are clear:

  1. Opus is the preferred model for industrial reliability — most stable, lowest drop, highest composite.
  2. Single-tier evaluation is insufficient — T1 success does not predict T3. Multi-tier evaluation is essential.
  3. Training data boundaries are exposed — R-134a and variable cp ruthlessly reveal gaps in models' training data.
  4. Tool use is critical — CoolProp access can overcome current performance constraints. A tool-augmented track is planned for future ThermoQA versions.

Roadmap

  • Tool-Augmented Track: Evaluation with CoolProp API access — measuring models' tool-use capacity
  • Entropy Hunter Integration: Domain-specific 8B model performance on ThermoQA
  • Expanded Fluid Set: Industrial refrigerants like ammonia, CO₂, and propane
  • Real-World Scenarios: Case-study questions based on industrial plant data

Methodology

  • Reference library: CoolProp 7.2.0 (IAPWS-IF97 + Helmholtz EOS)
  • Tolerance: ±2% (industrial engineering standard)
  • Total questions: 293 (T1: 110, T2: 101, T3: 82)
  • Models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4, MiniMax M2.5
  • Runs: 3 independent runs per model
  • Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
  • Composite score: Question-count-weighted average (110×T1 + 101×T2 + 82×T3) / 293
  • Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6

About the Author

Olivenet Team

IoT & Automation Experts

Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.

LoRaWAN · ThingsBoard · Smart Farming · Energy Monitoring