ThermoQA Tier 3: How Well Can AI Models Handle Full Thermodynamic Cycle Analysis?

82 questions, 9 cycle types, 4 fluids. Opus leads at 91.3%, MiniMax collapses to 40.2%, and variable cp breaks Gemini. CCGT is the hardest cycle — best score only 77.3%. T2→T3 rank correlation is perfect (ρ=1.0).

Olivenet Team

IoT & Automation Experts

2026-03-15 · 4 min read

In our Tier 1 post, we tested steam table lookups. In Tier 2, we tested component analysis. Now it's time for the most challenging tier: full thermodynamic cycle analysis.

ThermoQA Tier 3: Cycle Analysis tests scenarios where multiple components are interconnected, cycle efficiency is computed, and optimization decisions are made. 82 questions, 9 cycle types (3 Rankine, 4 Brayton, vapor compression refrigeration, combined cycle), 4 fluids (water, air, R-134a, air+water). A single component is no longer enough — models must analyze complete cycles end to end.

Overall Leaderboard

Tier 3 Overall Leaderboard

82 questions · 9 cycle types · 4 fluids · CoolProp 7.2.0 ground truth · ±2% tolerance

| Rank | Model | Developer | Overall | Water | Air | R-134a | Air+Water | Tokens per question |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 91.3% | 95.2% | 91.1% | 88% | 77.3% | 32,450 |
| 2 | GPT-5.4 | OpenAI | 88.3% | 93.3% | 88.6% | 84% | 73% | 9,540 |
| 3 | Gemini 3.1 Pro | Google | 84.1% | 95.5% | 74% | 88.6% | 70% | 1,480 |
| 4 | DeepSeek-R1 | DeepSeek | 81.2% | 88.8% | 82.3% | 76% | 62% | 15,280 |
| 5 | MiniMax M2.5 | MiniMax | 40.2% | 48% | 40.5% | 28% | 18% | 12,100 |

Claude Opus 4.6 maintains its lead at 91.3% — only a 0.7-point drop from Tier 2, making it the most stable model. GPT-5.4 follows at 88.3%, Gemini 3.1 Pro at 84.1%. The most striking result: MiniMax M2.5 collapses to 40.2%, falling to unusable levels.

By fluid, CCGT (combined cycle) is the hardest for all models: best score only 77.3% (Opus). Gemini's surprise 88.6% on VCR-A is notable in contrast to its Brayton collapse.

Why Is CCGT So Difficult?

A combined cycle (CCGT) integrates a Brayton gas turbine topping cycle with a Rankine steam turbine bottoming cycle. Hot exhaust gases from the gas turbine generate steam in a heat recovery steam generator (HRSG). This integration is demanding for four reasons:

  • Two different fluids (air and water) must be managed simultaneously
  • Errors in the topping cycle propagate directly to the bottoming cycle (error cascade)
  • The overall efficiency calculation requires analyzing each cycle on its own and then the two together
  • Pinch point analysis and HRSG design constraints add complexity
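To make the cascade concrete, here is a minimal sketch of a simplified combined-cycle efficiency relation. It assumes that all heat rejected by the gas turbine leaves in the exhaust and that the HRSG passes a fixed fraction of it to the steam cycle. The function name, the `hrsg_recovery` parameter, and the numeric inputs are illustrative assumptions, not values from the benchmark.

```python
def combined_cycle_efficiency(eta_gt, eta_st, hrsg_recovery=1.0):
    """Simplified CCGT efficiency per unit of fuel heat input.

    Assumption: every unit of heat the gas turbine does not convert to work
    leaves in the exhaust, and the HRSG transfers the fraction
    `hrsg_recovery` of that heat to the steam cycle.
    """
    heat_to_steam = (1.0 - eta_gt) * hrsg_recovery  # bottoming-cycle heat input
    return eta_gt + heat_to_steam * eta_st


# Hypothetical operating point (illustrative numbers only).
reference = combined_cycle_efficiency(eta_gt=0.38, eta_st=0.33, hrsg_recovery=0.85)

# Same problem, but the topping-cycle efficiency is overestimated by 5% (relative).
perturbed = combined_cycle_efficiency(eta_gt=0.38 * 1.05, eta_st=0.33, hrsg_recovery=0.85)

relative_error = abs(perturbed - reference) / reference
print(f"reference={reference:.4f}  perturbed={perturbed:.4f}  error={relative_error:.2%}")
```

In this sketch the topping-cycle mistake changes both the gas turbine term and the heat handed to the steam cycle. The overall error is smaller than the 5% input error, but it still lands outside a ±2% tolerance.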

Per-Cycle Performance

9 cycle types × 5 models, grouped into 3 families

| Family | Cycle | Questions | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | DeepSeek-R1 | MiniMax M2.5 |
|---|---|---|---|---|---|---|---|
| Rankine | RNK-I (Ideal Rankine) | 10 | 97.5% | 95% | 97.5% | 91.8% | 52.5% |
| Rankine | RNK-A (Actual Rankine) | 10 | 95% | 93.5% | 96% | 89.5% | 48% |
| Rankine | RNK-RH (Reheat Rankine) | 8 | 93% | 91.5% | 93% | 85% | 43.5% |
| Brayton | BRY-I (Ideal Brayton) | 10 | 98% | 96.5% | 98% | 93% | 55% |
| Brayton | BRY-A (Actual Brayton) | 10 | 96% | 94.5% | 97% | 90% | 50% |
| Brayton | BRY-AV (Variable cp) | 8 | 88.5% | 85% | 63% | 78% | 32% |
| Brayton | BRY-RV (Regen. + Var. cp) | 8 | 82% | 78.5% | 38% | 68% | 25% |
| Other | VCR-A (Vapor Compression) | 9 | 88% | 84% | 88.6% | 76% | 28% |
| Other | CCGT (Combined Cycle) | 9 | 77.3% | 73% | 70% | 62% | 18% |

Variable cp breaks Gemini: its score falls from 97% (BRY-A) to 63% (BRY-AV) to 38% (BRY-RV). It assumes a hardcoded cp=1.005 kJ/(kg·K) instead of using the NASA 7-coefficient polynomial.

CCGT is the hardest cycle: the best score is 77.3% (Opus). It requires multi-fluid, multi-component integration.

A clear difficulty hierarchy emerges across the 9 cycles:

Rankine Family

For the ideal (RNK-I) and actual (RNK-A) Rankine cycles, the top four models score between 89.5% and 97.5%, so these cycles can be considered largely solved. Reheat Rankine (RNK-RH) shows a slight decline (85-93%), but performance remains strong. Water remains the fluid models know best.

Brayton Family and the Variable cp Problem

Ideal (BRY-I) and actual (BRY-A) Brayton cycles are also in the solved category — 90%+ scores. But when variable specific heat capacity (cp) enters the picture, the landscape changes completely.

Gemini's collapse is dramatic: Constant cp Brayton at 97% → variable cp at 63% → regenerative + variable cp at 38%. Gemini hardcodes cp=1.005 kJ/(kg·K) and cannot compute temperature-dependent cp variation. Where it should use NASA 7-coefficient polynomials or air tables, it uses a fixed value.
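The size of the constant-cp error can be sketched with a few lines of Python. This example does not use the NASA 7-coefficient polynomial. As a simpler stand-in, it uses a cubic fit for the ideal-gas cp of air of the kind tabulated in undergraduate thermodynamics textbooks. The coefficients and the 800 K to 1400 K temperature range are assumptions chosen for illustration and should be checked against a property table before any real use.

```python
M_AIR = 28.97  # kg/kmol, molar mass of air

# Assumed cubic fit: cp = A + B*T + C*T^2 + D*T^3 in kJ/(kmol·K), T in kelvin.
# Illustrative textbook-style coefficients, not the NASA 7-coefficient form.
A, B, C, D = 28.11, 0.1967e-2, 0.4802e-5, -1.966e-9

CP_CONST = 1.005  # kJ/(kg·K), the fixed value cited in the post


def cp_air(t_kelvin):
    """Temperature-dependent cp of air in kJ/(kg·K) from the cubic fit."""
    molar_cp = A + B * t_kelvin + C * t_kelvin**2 + D * t_kelvin**3
    return molar_cp / M_AIR


def delta_h_variable(t1, t2):
    """Enthalpy change in kJ/kg, integrating the cubic fit analytically."""
    def antiderivative(t):
        return A * t + B * t**2 / 2 + C * t**3 / 3 + D * t**4 / 4
    return (antiderivative(t2) - antiderivative(t1)) / M_AIR


def delta_h_constant(t1, t2):
    """Enthalpy change in kJ/kg with the fixed cp value."""
    return CP_CONST * (t2 - t1)


# Hypothetical turbine-like temperature span.
t_low, t_high = 800.0, 1400.0
dh_var = delta_h_variable(t_low, t_high)
dh_const = delta_h_constant(t_low, t_high)
rel_gap = abs(dh_var - dh_const) / dh_var
print(f"variable cp: {dh_var:.1f} kJ/kg, constant cp: {dh_const:.1f} kJ/kg, gap: {rel_gap:.1%}")
```

Near room temperature the fit agrees closely with 1.005 kJ/(kg·K), which is why the fixed value works for low-temperature steps. Over a hot-section temperature span, the gap in this sketch is several times larger than a ±2% tolerance.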

Opus and GPT also decline on variable cp (88.5% and 85.0% on BRY-AV), but not catastrophically. DeepSeek sits at a moderate 78.0%.

Refrigeration and Combined Cycle

VCR-A surprise: Gemini scores 88.6%, ahead of Opus (88.0%). In contrast to its Brayton variable cp collapse, Gemini shows strong performance on R-134a-based refrigeration cycles. This suggests strong R-134a training data or refrigeration cycle experience.

CCGT is a universal challenge: Opus 77.3%, GPT 73.0%, Gemini 70.0%, DeepSeek 62.0%, MiniMax 18.0%. Multi-fluid integration and error cascades challenge all models.

Fluid Analysis

Fluid-Based Performance

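4 fluid groups × 5 models: water, air, R-134a, air+water (CCGT)

| Fluid (cycle family) | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | DeepSeek-R1 | MiniMax M2.5 |
|---|---|---|---|---|---|
| Water (Rankine) | 95.2% | 93.3% | 95.5% | 88.8% | 48% |
| Air (Brayton) | 91.1% | 88.6% | 74% | 82.3% | 40.5% |
| R-134a (VCR) | 88% | 84% | 88.6% | 76% | 28% |
| Air+Water (CCGT) | 77.3% | 73% | 70% | 62% | 18% |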

CCGT is the hardest for all models: requires multi-fluid integration and cross-cycle energy transfer. Gemini shows surprise recovery at 88.6% in VCR-A.

Cross-Tier Analysis: T1 → T2 → T3

Performance degradation and rank trajectory across all three tiers

| Model | Tier 1 | Tier 2 | Tier 3 | Change T1→T3 | Rank T1→T2→T3 |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 95.6% | 92% | 91.3% | -4.3 pp | #3 → #1 → #1 |
| GPT-5.4 | 96.9% | 91% | 88.3% | -8.6 pp | #2 → #2 → #2 |
| Gemini 3.1 Pro | 97.3% | 89.5% | 84.1% | -13.2 pp | #1 → #3 → #3 |
| DeepSeek-R1 | 89.5% | 86.9% | 81.2% | -8.3 pp | #4 → #4 → #4 |
| MiniMax M2.5 | 84.5% | 73.4% | 40.2% | -44.3 pp | #5 → #5 → #5 |

T2→T3 rank correlation is ρ=1.0 (perfect), but T1→T3 correlation is only ρ=0.6. Property lookup success does NOT predict cycle analysis performance.

MiniMax collapses catastrophically by -44.3 pp: from 84.5% in Tier 1 to 40.2% in Tier 3, which points to insufficient multi-step reasoning capacity.

T2→T3 rank correlation is ρ=1.0 — perfect correlation. The component analysis ranking is preserved exactly in cycle analysis. However, T1→T3 correlation is only ρ=0.6 — property lookup success does NOT predict cycle analysis performance.
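Both correlation values can be reproduced from the tier scores reported in this post. The sketch below assumes that ρ refers to Spearman's rank correlation computed over the five models with no tied scores. The helper names are illustrative.

```python
def ranks(scores):
    """Map each model to its rank (1 = highest score); assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Overall scores per tier, taken from the cross-tier figures in this post.
tier1 = {"Opus": 95.6, "GPT": 96.9, "Gemini": 97.3, "DeepSeek": 89.5, "MiniMax": 84.5}
tier2 = {"Opus": 92.0, "GPT": 91.0, "Gemini": 89.5, "DeepSeek": 86.9, "MiniMax": 73.4}
tier3 = {"Opus": 91.3, "GPT": 88.3, "Gemini": 84.1, "DeepSeek": 81.2, "MiniMax": 40.2}

rho_t2_t3 = spearman_rho(tier2, tier3)
rho_t1_t3 = spearman_rho(tier1, tier3)
print(f"T2→T3: {rho_t2_t3:.1f}, T1→T3: {rho_t1_t3:.1f}")
```

With only five models, a single swap between two positions moves ρ noticeably, so these values describe the ordering of this particular leaderboard and are not a statistically robust estimate.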

Opus is the most stable model: only -4.3 pp total drop, following a #3→#1→#1 trajectory. Among the top four models, Gemini shows the largest degradation: -13.2 pp, following #1→#3→#3. MiniMax is catastrophic: -44.3 pp, completely inadequate for industrial thermodynamic reasoning.

Key Findings

6 Key Findings

The most important insights from Tier 3 results

#1
Variable cp Breaks Gemini

Gemini leads constant cp Brayton at 97% but collapses to 63% with variable cp and 38% with regenerative variable cp. Hardcoded cp=1.005 kJ/(kg·K) assumption; missing NASA 7-coefficient polynomial.

#2
CCGT Is the Ultimate Test

Best score only 77.3% (Opus). Combined cycle requires integrating Brayton topping and Rankine bottoming cycles, multi-fluid management, and cross-cycle energy transfer.

#3
MiniMax Catastrophic Collapse

From 84.5% in Tier 1 to 40.2% in Tier 3: -44.3 pp drop. 18% on CCGT, 25% on BRY-RV. Multi-step reasoning capacity completely inadequate for full cycle analysis.

#4
VCR Surprise: Gemini Recovers

Gemini collapses on variable cp but shows surprise recovery at 88.6% on VCR-A — ahead of Opus. Suggests strong R-134a experience or robust refrigeration cycle training data.

#5
Error Cascades Dominate

In cycle analysis, an error in one step propagates to all subsequent steps. In Rankine, a pump error corrupts boiler and turbine calculations; in Brayton, a compressor error invalidates combustion chamber and turbine results.

#6
Opus: Most Stable Across Tiers

Only -4.3 pp total drop, the smallest of any model. #3 in T1, #1 in both T2 and T3. Among the top four models, Gemini shows the largest degradation at -13.2 pp (#1 to #3 to #3).
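The error cascade described in finding #5 can be illustrated with an ideal, cold-air-standard Brayton cycle. This is a sketch under stated assumptions: constant cp=1.005 kJ/(kg·K), k=1.4, and hypothetical inlet conditions. The `t2_error` parameter is an illustrative way to inject a relative mistake into the compressor exit temperature. It is not how the benchmark models errors.

```python
K = 1.4      # specific heat ratio of air (cold-air-standard assumption)
CP = 1.005   # kJ/(kg·K), constant cp assumption


def ideal_brayton(t1, t3, pressure_ratio, t2_error=0.0):
    """Ideal Brayton cycle; `t2_error` is a relative error injected into T2."""
    tau = pressure_ratio ** ((K - 1) / K)  # isentropic temperature ratio
    t2 = t1 * tau * (1.0 + t2_error)       # compressor exit temperature, K
    t4 = t3 / tau                          # turbine exit temperature, K
    w_comp = CP * (t2 - t1)                # compressor work input, kJ/kg
    q_in = CP * (t3 - t2)                  # heat added in the combustor, kJ/kg
    w_net = CP * (t3 - t4) - w_comp        # turbine work minus compressor work
    return {"w_comp": w_comp, "q_in": q_in, "w_net": w_net, "eta": w_net / q_in}


# Hypothetical cycle: 300 K inlet, 1400 K turbine inlet, pressure ratio 10.
base = ideal_brayton(300.0, 1400.0, 10.0)
off = ideal_brayton(300.0, 1400.0, 10.0, t2_error=0.03)  # T2 is 3% too high

deviation = {key: abs(off[key] - base[key]) / base[key] for key in base}
for key, value in deviation.items():
    print(f"{key}: {value:.1%} off")
```

In this sketch the 3% mistake in one temperature is not confined to the compressor step. It shifts the compressor work, the heat input, the net work, and the efficiency, and the compressor work deviates by more than the original 3%.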

Methodology

  • Reference library: CoolProp 7.2.0 (IAPWS-IF97 + NIST reference data)
  • Tolerance: ±2% (industrial engineering standard)
  • Questions: 82 (Tier 3)
  • Cycle types: 9 (RNK-I, RNK-A, RNK-RH, BRY-I, BRY-A, BRY-AV, BRY-RV, VCR-A, CCGT)
  • Fluids: 4 (Water, Air, R-134a, Air+Water)
  • Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
  • Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
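The tolerance and step-level scoring described above can be sketched as follows. The step names, weights, and numeric values are hypothetical, since the post does not publish the actual weighting scheme. Only the ±2% relative tolerance comes from the methodology.

```python
TOLERANCE = 0.02  # ±2% relative tolerance, as stated in the methodology


def within_tolerance(predicted, reference, tol=TOLERANCE):
    """True if `predicted` is within a relative tolerance of `reference`."""
    if reference == 0:
        # Relative error is undefined at zero; fall back to an absolute check (assumption).
        return abs(predicted) <= tol
    return abs(predicted - reference) <= tol * abs(reference)


def score_question(steps):
    """Weighted fraction of steps that pass.

    `steps` is a list of (name, predicted, reference, weight) tuples.
    A predicted value of None counts as a failed extraction.
    """
    total_weight = sum(weight for _, _, _, weight in steps)
    earned = sum(
        weight
        for _, predicted, reference, weight in steps
        if predicted is not None and within_tolerance(predicted, reference)
    )
    return earned / total_weight


# Hypothetical Rankine question with three graded steps (illustrative values).
example_steps = [
    ("h1_kJ_per_kg", 191.8, 191.8, 1.0),        # exact match, passes
    ("w_turbine_kJ_per_kg", 1010.0, 1050.0, 2.0),  # about 3.8% off, fails
    ("thermal_efficiency", 0.358, 0.360, 3.0),     # about 0.6% off, passes
]
score = score_question(example_steps)
print(f"question score: {score:.3f}")
```

Validating each step independently means a model can still earn credit for a correct final efficiency even when one intermediate value misses the tolerance, as in the example above.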

Conclusion

The ThermoQA three-tier benchmark system is now complete — 293 total questions (110 + 101 + 82):

  • Tier 1: Property Lookups — 110 questions, steam table values ✅
  • Tier 2: Component Analysis — 101 questions, 7 components, 3 fluids ✅
  • Tier 3: Cycle Analysis (this post) — 82 questions, 9 cycles, 4 fluids ✅

A consistent picture emerges across all three tiers. Opus is the most stable model and the top performer in Tiers 2 and 3. GPT is a strong second. Gemini excels at simpler tasks but degrades with complexity. DeepSeek, with its reasoning-oriented architecture, holds fourth place in every tier. MiniMax lacks sufficient capacity for industrial thermodynamic reasoning.

About the Author

Olivenet Team

IoT & Automation Experts

Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.
