Artificial Intelligence

ThermoQA: 293 Questions, 6 Models, 3 Tiers — The AI Thermodynamic Reasoning Report Card

293 questions, 3 tiers, 6 frontier models, 18 evaluations. Opus leads at 94.1% composite, MiniMax trails at 73.0%. A 21-point spread proves thermodynamics discriminates AI. Memorization ≠ reasoning.

Olivenet Team

IoT & Automation Experts

2026-03-174 min read

We've completed all three tiers of the ThermoQA series: Tier 1 (property lookups), Tier 2 (component analysis), and Tier 3 (cycle analysis). Now the big picture: 293 questions, 6 frontier models, 3 independent runs — the AI thermodynamic reasoning report card.

Benchmark Overview

ThermoQA Benchmark Statistics

Full numerical summary of the three-tier benchmark system

293

Total Questions

110 + 101 + 82

Tiers

Property → Component → Cycle

Frontier Models

Opus, GPT, Gemini, DeepSeek, Grok, MiniMax

Evaluations

6 models × 3 runs

Fluids

Water, Air, R-134a, Air+Water

Cycle Types

3 Rankine, 4 Brayton, VCR, CCGT

Component Types

Turbine, compressor, pump, HX, boiler, mixing, nozzle

Analysis Depths

Energy → Entropy → Exergy

ThermoQA uses CoolProp 7.2.0 (IAPWS-IF97 + Helmholtz EOS) as ground truth with ±2% tolerance and weighted step-level scoring. Each model was evaluated across 3 independent runs — 18 evaluations total.

Overall Leaderboard: Composite Scores

293 questions · 3 tiers · 6 models · 3 independent runs · weighted by question count

🥇

Claude Opus 4.6

Anthropic

Composite

94.1%

Tier 196.4%

Tier 292.1%

Tier 393.6%

-2.8 pp

🥈

GPT-5.4

OpenAI

Composite

93.1%

Tier 197.8%

Tier 290.8%

Tier 389.7%

-8.1 pp

🥉

Gemini 3.1 Pro

Google

Composite

92.5%

Tier 197.9%

Tier 290.8%

Tier 387.5%

-10.4 pp

DeepSeek-R1

DeepSeek

Composite

87.4%

Tier 190.5%

Tier 289.2%

Tier 381%

-9.5 pp

Grok 4

xAI

Composite

87.3%

Tier 191.8%

Tier 287.9%

Tier 380.4%

-11.4 pp

MiniMax M2.5

MiniMax

Composite

73%

Tier 185.2%

Tier 276.2%

Tier 352.7%

-32.5 pp

The composite score is a weighted average by question count per tier (T1: 110, T2: 101, T3: 82). Claude Opus 4.6 leads at 94.1% — the only model showing stable performance across all tiers. GPT-5.4 (93.1%) and Gemini 3.1 Pro (92.5%) follow closely. MiniMax M2.5 trails at 73.0%, a full 21-point gap.

The top three are separated by just 1.6 points — but the telling number is the T1→T3 drop: Opus -2.8 pp, GPT -8.1 pp, Gemini -10.4 pp. The transition from simple property lookups to full cycle analysis reveals each model's true reasoning capacity.

Cross-Tier Performance Journey

Each model's T1→T2→T3 performance trajectory and rank movement

Claude Opus 4.6

Tier 1

96.4%

Tier 2

92.1%

Tier 3

93.6%

-2.8 pp#3→#1→#1σ ±0.5%

GPT-5.4

Tier 1

97.8%

Tier 2

90.8%

Tier 3

89.7%

-8.1 pp#2→#2→#2σ ±0.5%

Gemini 3.1 Pro

Tier 1

97.9%

Tier 2

90.8%

Tier 3

87.5%

-10.4 pp#1→#3→#3σ ±1.1%

DeepSeek-R1

Tier 1

90.5%

Tier 2

89.2%

Tier 3

81%

-9.5 pp#5→#4→#4σ ±1.6%

Grok 4

Tier 1

91.8%

Tier 2

87.9%

Tier 3

80.4%

-11.4 pp#4→#5→#5σ ±0.9%

MiniMax M2.5

Tier 1

85.2%

Tier 2

76.2%

Tier 3

52.7%

-32.5 pp#6→#6→#6σ ±1.1%

T2→T3 rank correlation ρ=1.0 (perfect), but T1→T3 is only ρ=0.6. Tier 1 success does NOT predict later tiers — memorization ≠ reasoning.

Opus is the most stable with only -2.8 pp drop. Gemini drops from #1 in T1 to #3 in T3, losing -10.4 pp.

The most critical finding: T1 rankings do NOT predict T3. Gemini is #1 in T1 at 97.9% but drops to #3 in T3 at 87.5%. T2→T3 correlation is ρ=1.0 (perfect) — the component analysis ranking is preserved exactly in cycle analysis. But T1→T3 correlation is only ρ=0.6. This reveals the fundamental difference between memorization (T1) and reasoning (T2/T3).

Opus's stability is remarkable: Following a #3→#1→#1 trajectory with the least degradation across tiers. This demonstrates that Opus doesn't just recall correct information — it sustains complex multi-step reasoning chains.

Each Tier's Discriminator

Categories creating the largest performance spread between models

Tier 1

Tier 1: Supercritical

The supercritical region challenges all models — the biggest gap is here.

Supercritical RegionSpread: 45%–89.5%

Opus

70.5%

GPT

89.5%

Gemini

77.8%

DeepSeek

48.9%

Grok

52.8%

MiniMax

45%

Tier 2

Tier 2: R-134a & Compressor

R-134a collapses all models; compressor is the hardest component.

R-134a FluidSpread: 44%–63.4%

Opus

54.1%

GPT

50.4%

Gemini

47.6%

DeepSeek

63.4%

Grok

44%

MiniMax

54.2%

CompressorSpread: 55.5%–75.4%

Opus

75.4%

GPT

71.2%

Gemini

66.3%

DeepSeek

67.2%

Grok

64.8%

MiniMax

55.5%

Tier 3

Tier 3: Variable cp & CCGT

Variable cp and combined cycle are the most challenging scenarios.

Variable cp BRY-RVSpread: 35.8%–98.9%

Opus

98.9%

GPT

90.3%

Gemini

48.4%

DeepSeek

52.5%

Grok

63.8%

MiniMax

35.8%

CCGT Combined CycleSpread: 31.8%–90.1%

Opus

90.1%

GPT

82%

Gemini

74%

DeepSeek

72.2%

Grok

58.6%

MiniMax

31.8%

Each tier has a "discriminator" category that creates the largest performance spread between models:

Tier 1 — Supercritical region: 45%–89.5% spread. In the supercritical region, pressure-temperature relationships change sharply near the critical point, creating edge cases where models can't rely on memorization.
Tier 2 — R-134a and compressor: All models collapse on R-134a (44%–63%), compressor is the hardest component (55.5%–75.4%). The dominance of water and air in training data creates systematic failure on refrigerant fluids.
Tier 3 — Variable cp and CCGT: BRY-RV shows a 63-point spread (35.8%–98.9%), CCGT shows a 58-point spread (31.8%–90.1%). Variable specific heat capacity and multi-fluid integration define the reasoning boundary.

Model Profiles

Each model's strengths, weaknesses, and 293-question report card

🥇

Claude Opus 4.6

94.1%

Composite

Best Tier: T2 & T3 (92.1%, 93.6%)

Worst Tier: T1 (96.4% — still 3rd)

Strength: Most stable model, only -2.8 pp drop

Weakness: Supercritical 70.5%, compressor 75.4%

T1→T3: -2.8 ppσ: ±0.5%

🥈

GPT-5.4

93.1%

Composite

Best Tier: T1 (97.8%)

Worst Tier: T3 (89.7%)

Strength: Most consistent results (σ=±0.5%), strong T1

Weakness: Compressor 71.2%, VCR 76.6%

T1→T3: -8.1 ppσ: ±0.5%

🥉

Gemini 3.1 Pro

92.5%

Composite

Best Tier: T1 (97.9% — highest T1)

Worst Tier: T3 (87.5%)

Strength: Highest T1 score, VCR surprise recovery

Weakness: Variable cp collapse (97%→48%), -10.4 pp drop

T1→T3: -10.4 ppσ: ±1.1%

DeepSeek-R1

87.4%

Composite

Best Tier: T1 (90.5%)

Worst Tier: T3 (81.0%)

Strength: T2 R-134a leader (63.4%), deep analysis advantage

Weakness: Supercritical 48.9%, VCR 66.8%

T1→T3: -9.5 ppσ: ±1.6%

Grok 4

87.3%

Composite

Best Tier: T1 (91.8%)

Worst Tier: T3 (80.4%)

Strength: Strong T2 depth analysis (94.3% Depth C)

Weakness: Supercritical 52.8%, CCGT 58.6%, -11.4 pp drop

T1→T3: -11.4 ppσ: ±0.9%

MiniMax M2.5

73%

Composite

Best Tier: T1 (85.2%)

Worst Tier: T3 (52.7%)

Strength: T2 air fluid surprise (96.3%)

Weakness: Catastrophic collapse: -32.5 pp, CCGT 31.8%, VCR 32.5%

T1→T3: -32.5 ppσ: ±1.1%

8 Key Findings

Synthesis from 293 questions, 3 tiers, and 6 models

Memorization ≠ Reasoning

T1 rankings are misleading: Gemini is #1 in T1 but #3 in T3. T1→T3 correlation is only ρ=0.6. Property lookup success does NOT predict reasoning capacity.

Opus Is Most Stable

Only -2.8 pp drop (T1→T3) — the most stable model. #1 in both T2 and T3, composite 94.1%. The preferred model for real engineering reliability.

Variable cp Is the Ultimate Discriminator

BRY-RV shows 98.9% vs 35.8% spread — a 63-point gap. The constant cp assumption kills models' thermodynamic reasoning depth. NASA 7-coefficient polynomial usage is critical.

R-134a Exposes Training Data Bias

All models collapse on R-134a (44%–63% in T2). Models trained on water and air show systematic failure on refrigerant fluids. Training data boundary exposed.

Deeper Analysis Scaffolds Reasoning

Depth C (exergy) > Depth A (energy). Paradox: more complex analysis yields higher accuracy. Deep analysis scaffolding improves reasoning quality.

Tool Use Changes Everything

With CoolProp access, 70% → 100% is achievable. A tool-augmented evaluation track is planned for future ThermoQA versions.

Multi-Run Evaluation Is Essential

σ values range from 0.1% to 2.5%. Single-run evaluations are misleading — 3 independent runs should be the minimum standard.

21-Point Spread Confirms AI Discrimination

Composite ranges from 73.0% (MiniMax) to 94.1% (Opus) — a 21-point spread. Thermodynamics creates meaningful discrimination between AI models.

Conclusion and Future Directions

ThermoQA is the first comprehensive, multi-tier benchmark measuring AI models' thermodynamic reasoning capacity, now complete at 293 questions. The results are clear:

Opus is the preferred model for industrial reliability — most stable, lowest drop, highest composite.
Single-tier evaluation is insufficient — T1 success does not predict T3. Multi-tier evaluation is essential.
Training data boundaries are exposed — R-134a and variable cp ruthlessly reveal gaps in models' training data.
Tool use is critical — CoolProp access can overcome current performance constraints. A tool-augmented track is planned for future ThermoQA versions.

Roadmap

Tool-Augmented Track: Evaluation with CoolProp API access — measuring models' tool-use capacity
Entropy Hunter Integration: Domain-specific 8B model performance on ThermoQA
Expanded Fluid Set: Industrial refrigerants like ammonia, CO₂, and propane
Real-World Scenarios: Case-study questions based on industrial plant data

Methodology

Reference library: CoolProp 7.2.0 (IAPWS-IF97 + Helmholtz EOS)
Tolerance: ±2% (industrial engineering standard)
Total questions: 293 (T1: 110, T2: 101, T3: 82)
Models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4, MiniMax M2.5
Runs: 3 independent runs per model
Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
Composite score: Question-count-weighted average (110×T1 + 101×T2 + 82×T3) / 293
Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6

Resources

Dataset: HuggingFace — olivenet/thermoqa
Source code: GitHub — olivenet-iot/ThermoQA
CoolProp: coolprop.org
Tier 1 post: ThermoQA Tier 1 Results
Tier 2 post: ThermoQA Tier 2 Results
Tier 3 post: ThermoQA Tier 3 Results
Entropy Hunter: HuggingFace — olivenet/entropy-hunter-v0.4

About the Author

Olivenet Team

IoT & Automation Experts

Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.

LoRaWANThingsBoardSmart FarmingEnergy Monitoring