ThermoQA Tier 3: How Well Can AI Models Handle Full Thermodynamic Cycle Analysis?
82 questions, 9 cycle types, 4 fluids. Opus leads at 91.3%, MiniMax collapses to 40.2%, and variable cp breaks Gemini. CCGT is the hardest cycle — best score only 77.3%. T2→T3 rank correlation is perfect (ρ=1.0).
Olivenet Team
IoT & Automation Experts
In our Tier 1 post, we tested steam table lookups. In Tier 2, we tested component analysis. Now it's time for the most challenging tier: full thermodynamic cycle analysis.
ThermoQA Tier 3: Cycle Analysis tests scenarios where multiple components are interconnected, cycle efficiency is computed, and optimization decisions are made. 82 questions, 9 cycle types (3 Rankine, 4 Brayton, vapor compression refrigeration, combined cycle), 4 fluids (water, air, R-134a, air+water). A single component is no longer enough — models must analyze complete cycles end to end.
Overall Leaderboard
Tier 3 Overall Leaderboard (82 questions · 9 cycle types · 4 fluids · CoolProp 7.2.0 ground truth · ±2% tolerance):

| Rank | Model | Vendor | Score |
| --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | Anthropic | 91.3% |
| 2 | GPT-5.4 | OpenAI | 88.3% |
| 3 | Gemini 3.1 Pro | Google | 84.1% |
| 4 | DeepSeek-R1 | DeepSeek | (score not shown in chart) |
| 5 | MiniMax M2.5 | MiniMax | 40.2% |
Claude Opus 4.6 maintains its lead at 91.3% — only a 0.7-point drop from Tier 2, making it the most stable model. GPT-5.4 follows at 88.3%, Gemini 3.1 Pro at 84.1%. The most striking result: MiniMax M2.5 collapses to 40.2%, falling to unusable levels.
By fluid group, the air+water combination of the combined cycle (CCGT) is the hardest for all models: the best score is only 77.3% (Opus). Gemini's surprising 88.6% on R-134a (VCR-A) stands in sharp contrast to its Brayton collapse.
Why Is CCGT So Difficult?
A combined cycle (CCGT) integrates a Brayton gas turbine topping cycle with a Rankine steam turbine bottoming cycle. Hot exhaust gases from the gas turbine generate steam in a heat recovery steam generator (HRSG). This integration:
- Requires simultaneous management of two different fluids (air and water)
- Propagates topping-cycle errors directly into the bottoming cycle (error cascade)
- Requires both independent and combined efficiency analysis of the two cycles
- Adds pinch-point analysis and HRSG design constraints
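The efficiency coupling between the two cycles can be made concrete with the standard idealization in which all heat rejected by the Brayton topping cycle is recovered by the HRSG: η_cc = η_B + η_R(1 − η_B). The sketch below uses this formula; the component efficiencies are illustrative, not taken from the benchmark.

```python
def combined_cycle_efficiency(eta_brayton: float, eta_rankine: float) -> float:
    """Overall efficiency of an idealized combined cycle.

    Assumes ALL heat rejected by the Brayton topping cycle is recovered
    by the HRSG and supplied to the Rankine bottoming cycle (no stack
    losses) -- a simplification; real CCGT plants recover only part of it.
    """
    return eta_brayton + eta_rankine * (1.0 - eta_brayton)

# Illustrative (hypothetical) component efficiencies:
eta_gt = 0.38   # Brayton topping cycle
eta_st = 0.32   # Rankine bottoming cycle
print(f"Combined: {combined_cycle_efficiency(eta_gt, eta_st):.1%}")  # 57.8%
```

The formula also shows why errors cascade: any error in η_B contaminates both terms of the sum, so a model that mishandles the topping cycle cannot recover on the bottoming side.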
Per-Cycle Performance
(Chart: per-cycle performance, 9 cycle types × 5 models, grouped into three families: Rankine, Brayton, Other.)
A clear difficulty hierarchy emerges across the 9 cycles:
Rankine Family
Ideal (RNK-I) and actual (RNK-A) Rankine cycles see all frontier models scoring 90%+; these cycles can be considered solved. Reheat Rankine (RNK-RH) shows a slight decline (85-93%), but performance remains strong. Water remains the fluid models know best.
Brayton Family and the Variable cp Problem
Ideal (BRY-I) and actual (BRY-A) Brayton cycles are also in the solved category — 90%+ scores. But when variable specific heat capacity (cp) enters the picture, the landscape changes completely.
Gemini's collapse is dramatic: Constant cp Brayton at 97% → variable cp at 63% → regenerative + variable cp at 38%. Gemini hardcodes cp=1.005 kJ/(kg·K) and cannot compute temperature-dependent cp variation. Where it should use NASA 7-coefficient polynomials or air tables, it uses a fixed value.
Opus and GPT also decline on variable cp (to 88.5% and 85.0%), but not catastrophically. DeepSeek sits at a moderate 78.0%.
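The size of the hardcoded-cp error is easy to demonstrate. The sketch below compares a fixed cp = 1.005 kJ/(kg·K) against a cubic ideal-gas cp(T) fit for air of the kind found in standard thermodynamics tables (coefficients in kJ/(kmol·K), valid roughly 273-1800 K); the temperature range is an illustrative combustor inlet/outlet, not a benchmark question.

```python
# Cubic ideal-gas cp(T) fit for air (kJ/kmol*K), the kind of correlation
# found in standard thermodynamics tables (valid ~273-1800 K).
M_AIR = 28.97  # molar mass of air, kg/kmol

def cp_air(T: float) -> float:
    """Specific heat of air in kJ/(kg*K) at temperature T in kelvin."""
    cp_molar = 28.11 + 0.1967e-2*T + 0.4802e-5*T**2 - 1.966e-9*T**3
    return cp_molar / M_AIR

def dh_variable(T1: float, T2: float, n: int = 1000) -> float:
    """Enthalpy change by trapezoidal integration of cp(T) dT, kJ/kg."""
    dT = (T2 - T1) / n
    total = 0.0
    for i in range(n):
        Ta, Tb = T1 + i*dT, T1 + (i+1)*dT
        total += 0.5 * (cp_air(Ta) + cp_air(Tb)) * dT
    return total

T1, T2 = 300.0, 1300.0          # illustrative combustor inlet/outlet, K
dh_const = 1.005 * (T2 - T1)    # hardcoded cold-air cp
dh_var = dh_variable(T1, T2)
print(f"constant cp: {dh_const:.0f} kJ/kg")
print(f"variable cp: {dh_var:.0f} kJ/kg")
print(f"error: {abs(dh_var - dh_const)/dh_var:.1%}")
```

Over this range the constant-cp answer is low by roughly 8%, far outside the benchmark's ±2% tolerance, which is why a hardcoded cp fails every variable-cp question even when the cycle logic is otherwise right.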
Refrigeration and Combined Cycle
VCR-A surprise: Gemini scores 88.6%, ahead of Opus (88.0%). In contrast to its Brayton variable cp collapse, Gemini shows strong performance on R-134a-based refrigeration cycles. This suggests strong R-134a training data or refrigeration cycle experience.
CCGT is a universal challenge: Opus 77.3%, GPT 73.0%, Gemini 70.0%, DeepSeek 62.0%, MiniMax 18.0%. Multi-fluid integration and error cascades challenge all models.
Fluid Analysis
(Chart: fluid-based performance, 4 fluid groups × 5 models: water, air, R-134a, air+water.)
Cross-Tier Analysis: T1 → T2 → T3
(Chart: performance degradation and rank trajectory for the five models across all three tiers.)
T2→T3 rank correlation is ρ=1.0 — perfect correlation. The component analysis ranking is preserved exactly in cycle analysis. However, T1→T3 correlation is only ρ=0.6 — property lookup success does NOT predict cycle analysis performance.
Opus is the most stable model: Only -4.3 pp total drop, following a #3→#1→#1 trajectory. Gemini shows the largest degradation: -13.2 pp following #1→#3→#3. MiniMax is catastrophic: -44.3 pp, completely inadequate for industrial thermodynamic reasoning.
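For reference, Spearman's ρ for two untied rankings is 1 − 6Σd²/(n(n² − 1)), where d is the per-model rank difference. With the T2 and T3 orderings identical for all five models, every d is zero and ρ comes out exactly 1. A minimal sketch:

```python
def spearman_rho(ranks_a: list[int], ranks_b: list[int]) -> float:
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Tier 2 and Tier 3 produce the same ordering
# (Opus, GPT, Gemini, DeepSeek, MiniMax -> ranks 1..5 in both tiers):
t2_ranks = [1, 2, 3, 4, 5]
t3_ranks = [1, 2, 3, 4, 5]
print(spearman_rho(t2_ranks, t3_ranks))  # 1.0
```

With n = 5, even one pair of adjacent models swapping places drops ρ to 0.9, so a perfect 1.0 is a strong statement that component-level skill carries over to cycle-level skill.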
Key Findings
The six most important insights from the Tier 3 results:
Variable cp Breaks Gemini
Gemini leads constant-cp Brayton at 97% but collapses to 63% with variable cp and to 38% with regenerative variable cp: it hardcodes cp = 1.005 kJ/(kg·K) instead of using NASA 7-coefficient polynomials or air tables.
CCGT Is the Ultimate Test
Best score only 77.3% (Opus). Combined cycle requires integrating Brayton topping and Rankine bottoming cycles, multi-fluid management, and cross-cycle energy transfer.
MiniMax Catastrophic Collapse
From 84.5% in Tier 1 to 40.2% in Tier 3: -44.3 pp drop. 18% on CCGT, 25% on BRY-RV. Multi-step reasoning capacity completely inadequate for full cycle analysis.
VCR Surprise: Gemini Recovers
Gemini collapses on variable cp but shows surprise recovery at 88.6% on VCR-A — ahead of Opus. Suggests strong R-134a experience or robust refrigeration cycle training data.
Error Cascades Dominate
In cycle analysis, an error in one step propagates to all subsequent steps. In Rankine, a pump error corrupts the boiler and turbine calculations; in Brayton, a compressor error invalidates the combustion chamber and turbine results.
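The cascade is worth quantifying, because net work is a difference of two large numbers. The sketch below injects a 2% error into the compressor exit temperature of an ideal air-standard Brayton cycle; all state values are illustrative, not drawn from the benchmark.

```python
# Illustrative ideal air-standard Brayton cycle (constant cp = 1.005
# kJ/(kg*K), gamma = 1.4); state values are hypothetical, for demo only.
CP, GAMMA = 1.005, 1.4

def brayton(T1, T3, r, T2_override=None):
    """Return (net work, efficiency); T2_override injects an upstream error."""
    exp = (GAMMA - 1) / GAMMA
    T2 = T2_override if T2_override is not None else T1 * r ** exp
    T4 = T3 / r ** exp                         # isentropic turbine exit
    w_net = CP * (T3 - T4) - CP * (T2 - T1)    # turbine minus compressor work
    q_in = CP * (T3 - T2)                      # combustor heat addition
    return w_net, w_net / q_in

w0, eta0 = brayton(300.0, 1400.0, 10.0)
# Inject a +2% error in the compressor exit temperature T2:
T2_bad = 300.0 * 10.0 ** ((GAMMA - 1) / GAMMA) * 1.02
w1, eta1 = brayton(300.0, 1400.0, 10.0, T2_override=T2_bad)
print(f"net work error:   {abs(w1 - w0) / w0:.1%}")
print(f"efficiency error: {abs(eta1 - eta0) / eta0:.1%}")
```

Under these numbers the 2% upstream error becomes roughly a 3% error in net work, already outside the ±2% tolerance on its own, and it additionally shifts the heat input and therefore the efficiency.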
Opus: Most Stable Across Tiers
Only -4.3 pp total drop — the most stable model. #3 in T1, #1 in both T2 and T3. Gemini shows the largest degradation at -13.2 pp (#1 to #3 to #3).
Methodology
- Reference library: CoolProp 7.2.0 (IAPWS-IF97 + NIST reference data)
- Tolerance: ±2% (industrial engineering standard)
- Questions: 82 (Tier 3)
- Cycle types: 9 (RNK-I, RNK-A, RNK-RH, BRY-I, BRY-A, BRY-AV, BRY-RV, VCR-A, CCGT)
- Fluids: 4 (Water, Air, R-134a, Air+Water)
- Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
- Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
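The weighted step-level scoring described above can be sketched as follows. This is a simplification of the idea, not the actual ThermoQA harness; the step names, weights, and values are hypothetical.

```python
def step_score(predicted: dict[str, float],
               reference: dict[str, float],
               weights: dict[str, float],
               rel_tol: float = 0.02) -> float:
    """Weighted step-level score: each intermediate value earns its weight
    if it falls within rel_tol (here +/-2%) of the CoolProp reference.
    A sketch of the scoring idea, not the actual ThermoQA implementation."""
    total = sum(weights.values())
    earned = 0.0
    for step, ref in reference.items():
        pred = predicted.get(step)
        if pred is not None and abs(pred - ref) <= rel_tol * abs(ref):
            earned += weights[step]
    return earned / total

# Hypothetical Rankine-cycle steps (values illustrative, not from the dataset):
ref = {"w_pump": 8.1, "q_boiler": 3200.0, "w_turbine": 1100.0, "eta": 0.34}
pred = {"w_pump": 8.2, "q_boiler": 3190.0, "w_turbine": 1180.0, "eta": 0.36}
w = {"w_pump": 1.0, "q_boiler": 1.0, "w_turbine": 1.0, "eta": 2.0}
print(step_score(pred, ref, w))  # partial credit: early steps pass, later fail
```

Step-level scoring is what makes the error-cascade effect visible in the results: a model that fails only the turbine step still loses the downstream efficiency step, while a pure final-answer score would report both as a single miss.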
Conclusion
The ThermoQA three-tier benchmark system is now complete — 293 total questions (110 + 101 + 82):
- Tier 1: Property Lookups — 110 questions, steam table values ✅
- Tier 2: Component Analysis — 101 questions, 7 components, 3 fluids ✅
- Tier 3: Cycle Analysis (this post) — 82 questions, 9 cycles, 4 fluids ✅
A consistent picture emerges across all three tiers: Opus is the most stable and highest-performing model, GPT is a strong second, Gemini excels at simpler tasks but degrades with complexity, DeepSeek maintains its reasoning-oriented architecture advantage. MiniMax lacks sufficient capacity for industrial thermodynamic reasoning.
Resources
- Dataset: HuggingFace — olivenet/thermoqa
- Source code: GitHub — olivenet-iot/ThermoQA
- CoolProp: coolprop.org
- IAPWS-IF97: International Association for the Properties of Water and Steam industrial formulation
- Tier 1 post: ThermoQA Tier 1 Results
- Tier 2 post: ThermoQA Tier 2 Results
About the Author
Olivenet Team
IoT & Automation Experts
Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.