ThermoQA Tier 3: How Well Can AI Models Handle Full Thermodynamic Cycle Analysis?
82 questions, 9 cycle types, 4 fluids. Opus leads at 91.3%, MiniMax collapses to 40.2%, and variable cp breaks Gemini. CCGT is the hardest cycle — best score only 77.3%. T2→T3 rank correlation is perfect (ρ=1.0).
Olivenet Team
IoT & Automation Experts
In our Tier 1 post, we tested steam table lookups. In Tier 2, we tested component analysis. Now it's time for the most challenging tier: full thermodynamic cycle analysis.
ThermoQA Tier 3: Cycle Analysis tests scenarios where multiple components are interconnected, cycle efficiency is computed, and optimization decisions are made. 82 questions, 9 cycle types (3 Rankine, 4 Brayton, vapor compression refrigeration, combined cycle), 4 fluids (water, air, R-134a, air+water). A single component is no longer enough — models must analyze complete cycles end to end.
Overall Leaderboard
Tier 3 Overall Leaderboard (82 questions · 9 cycle types · 4 fluids · CoolProp 7.2.0 ground truth · ±2% tolerance):

| Rank | Model | Vendor | Score |
| --- | --- | --- | --- |
| 1 | Claude Opus 4.6 | Anthropic | 91.3% |
| 2 | GPT-5.4 | OpenAI | 88.3% |
| 3 | Gemini 3.1 Pro | Google | 84.1% |
| 4 | DeepSeek-R1 | DeepSeek | (score not shown in chart) |
| 5 | MiniMax M2.5 | MiniMax | 40.2% |
Claude Opus 4.6 maintains its lead at 91.3% — only a 0.7-point drop from Tier 2, making it the most stable model. GPT-5.4 follows at 88.3%, Gemini 3.1 Pro at 84.1%. The most striking result: MiniMax M2.5 collapses to 40.2%, falling to unusable levels.
By fluid group, the air+water combination of the combined cycle (CCGT) is the hardest for all models: the best score is only 77.3% (Opus). Gemini's surprising 88.6% on R-134a (VCR-A) stands in sharp contrast to its Brayton collapse.
Why Is CCGT So Difficult?
A combined cycle (CCGT) integrates a Brayton gas turbine topping cycle with a Rankine steam turbine bottoming cycle. Hot exhaust gases from the gas turbine generate steam in a heat recovery steam generator (HRSG). This integration:
- Requires simultaneous management of two different fluids (air and water)
- Propagates topping-cycle errors directly into the bottoming cycle (error cascade)
- Requires both independent and combined efficiency analysis of the two cycles
- Adds pinch-point analysis and HRSG design constraints
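The efficiency coupling between the two cycles can be made concrete with the standard idealization in which all heat rejected by the Brayton topping cycle is recovered by the HRSG: η_cc = η_B + η_R(1 − η_B). The sketch below uses this formula; the component efficiencies are illustrative, not taken from the benchmark.

```python
def combined_cycle_efficiency(eta_brayton: float, eta_rankine: float) -> float:
    """Overall efficiency of an idealized combined cycle.

    Assumes ALL heat rejected by the Brayton topping cycle is recovered
    by the HRSG and supplied to the Rankine bottoming cycle (no stack
    losses) -- a simplification; real CCGT plants recover only part of it.
    """
    return eta_brayton + eta_rankine * (1.0 - eta_brayton)

# Illustrative (hypothetical) component efficiencies:
eta_gt = 0.38   # Brayton topping cycle
eta_st = 0.32   # Rankine bottoming cycle
print(f"Combined: {combined_cycle_efficiency(eta_gt, eta_st):.1%}")  # 57.8%
```

The formula also shows why errors cascade: any error in η_B contaminates both terms of the sum, so a model that mishandles the topping cycle cannot recover on the bottoming side.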
Per-Cycle Performance
(Chart: per-cycle performance, 9 cycle types × 5 models, grouped into three families: Rankine, Brayton, Other.)
A clear difficulty hierarchy emerges across the 9 cycles:
Rankine Family
Ideal (RNK-I) and actual (RNK-A) Rankine cycles see all frontier models scoring 90%+; these cycles can be considered solved. Reheat Rankine (RNK-RH) shows a slight decline (85-93%), but performance remains strong. Water remains the fluid models know best.
Brayton Family and the Variable cp Problem
Ideal (BRY-I) and actual (BRY-A) Brayton cycles are also in the solved category — 90%+ scores. But when variable specific heat capacity (cp) enters the picture, the landscape changes completely.
Gemini's collapse is dramatic: Constant cp Brayton at 97% → variable cp at 63% → regenerative + variable cp at 38%. Gemini hardcodes cp=1.005 kJ/(kg·K) and cannot compute temperature-dependent cp variation. Where it should use NASA 7-coefficient polynomials or air tables, it uses a fixed value.
Opus and GPT also decline on variable cp (to 88.5% and 85.0%), but not catastrophically. DeepSeek sits at a moderate 78.0%.
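The size of the hardcoded-cp error is easy to demonstrate. The sketch below compares a fixed cp = 1.005 kJ/(kg·K) against a cubic ideal-gas cp(T) fit for air of the kind found in standard thermodynamics tables (coefficients in kJ/(kmol·K), valid roughly 273-1800 K); the temperature range is an illustrative combustor inlet/outlet, not a benchmark question.

```python
# Cubic ideal-gas cp(T) fit for air (kJ/kmol*K), the kind of correlation
# found in standard thermodynamics tables (valid ~273-1800 K).
M_AIR = 28.97  # molar mass of air, kg/kmol

def cp_air(T: float) -> float:
    """Specific heat of air in kJ/(kg*K) at temperature T in kelvin."""
    cp_molar = 28.11 + 0.1967e-2*T + 0.4802e-5*T**2 - 1.966e-9*T**3
    return cp_molar / M_AIR

def dh_variable(T1: float, T2: float, n: int = 1000) -> float:
    """Enthalpy change by trapezoidal integration of cp(T) dT, kJ/kg."""
    dT = (T2 - T1) / n
    total = 0.0
    for i in range(n):
        Ta, Tb = T1 + i*dT, T1 + (i+1)*dT
        total += 0.5 * (cp_air(Ta) + cp_air(Tb)) * dT
    return total

T1, T2 = 300.0, 1300.0          # illustrative combustor inlet/outlet, K
dh_const = 1.005 * (T2 - T1)    # hardcoded cold-air cp
dh_var = dh_variable(T1, T2)
print(f"constant cp: {dh_const:.0f} kJ/kg")
print(f"variable cp: {dh_var:.0f} kJ/kg")
print(f"error: {abs(dh_var - dh_const)/dh_var:.1%}")
```

Over this range the constant-cp answer is low by roughly 8%, far outside the benchmark's ±2% tolerance, which is why a hardcoded cp fails every variable-cp question even when the cycle logic is otherwise right.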
Refrigeration and Combined Cycle
VCR-A surprise: Gemini scores 88.6%, ahead of Opus (88.0%). In contrast to its Brayton variable cp collapse, Gemini shows strong performance on R-134a-based refrigeration cycles. This suggests strong R-134a training data or refrigeration cycle experience.
CCGT is a universal challenge: Opus 77.3%, GPT 73.0%, Gemini 70.0%, DeepSeek 62.0%, MiniMax 18.0%. Multi-fluid integration and error cascades challenge all models.
Fluid Analysis
(Chart: fluid-based performance, 4 fluid groups × 5 models: water, air, R-134a, air+water.)
Cross-Tier Analysis: T1 → T2 → T3
(Chart: performance degradation and rank trajectory for the five models across all three tiers.)
T2→T3 rank correlation is ρ=1.0 — perfect correlation. The component analysis ranking is preserved exactly in cycle analysis. However, T1→T3 correlation is only ρ=0.6 — property lookup success does NOT predict cycle analysis performance.
Opus is the most stable model: Only -4.3 pp total drop, following a #3→#1→#1 trajectory. Gemini shows the largest degradation: -13.2 pp following #1→#3→#3. MiniMax is catastrophic: -44.3 pp, completely inadequate for industrial thermodynamic reasoning.
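For reference, Spearman's ρ for two untied rankings is 1 − 6Σd²/(n(n² − 1)), where d is the per-model rank difference. With the T2 and T3 orderings identical for all five models, every d is zero and ρ comes out exactly 1. A minimal sketch:

```python
def spearman_rho(ranks_a: list[int], ranks_b: list[int]) -> float:
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Tier 2 and Tier 3 produce the same ordering
# (Opus, GPT, Gemini, DeepSeek, MiniMax -> ranks 1..5 in both tiers):
t2_ranks = [1, 2, 3, 4, 5]
t3_ranks = [1, 2, 3, 4, 5]
print(spearman_rho(t2_ranks, t3_ranks))  # 1.0
```

With n = 5, even one pair of adjacent models swapping places drops ρ to 0.9, so a perfect 1.0 is a strong statement that component-level skill carries over to cycle-level skill.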
Key Findings
The six most important insights from the Tier 3 results:
Variable cp Breaks Gemini
Gemini leads constant-cp Brayton at 97% but collapses to 63% with variable cp and to 38% with regenerative variable cp: it hardcodes cp = 1.005 kJ/(kg·K) instead of using NASA 7-coefficient polynomials or air tables.
CCGT Is the Ultimate Test
Best score only 77.3% (Opus). Combined cycle requires integrating Brayton topping and Rankine bottoming cycles, multi-fluid management, and cross-cycle energy transfer.
MiniMax Catastrophic Collapse
From 84.5% in Tier 1 to 40.2% in Tier 3: -44.3 pp drop. 18% on CCGT, 25% on BRY-RV. Multi-step reasoning capacity completely inadequate for full cycle analysis.
VCR Surprise: Gemini Recovers
Gemini collapses on variable cp but shows surprise recovery at 88.6% on VCR-A — ahead of Opus. Suggests strong R-134a experience or robust refrigeration cycle training data.
Error Cascades Dominate
In cycle analysis, an error in one step propagates to all subsequent steps. In Rankine, a pump error corrupts the boiler and turbine calculations; in Brayton, a compressor error invalidates the combustion chamber and turbine results.
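The cascade is worth quantifying, because net work is a difference of two large numbers. The sketch below injects a 2% error into the compressor exit temperature of an ideal air-standard Brayton cycle; all state values are illustrative, not drawn from the benchmark.

```python
# Illustrative ideal air-standard Brayton cycle (constant cp = 1.005
# kJ/(kg*K), gamma = 1.4); state values are hypothetical, for demo only.
CP, GAMMA = 1.005, 1.4

def brayton(T1, T3, r, T2_override=None):
    """Return (net work, efficiency); T2_override injects an upstream error."""
    exp = (GAMMA - 1) / GAMMA
    T2 = T2_override if T2_override is not None else T1 * r ** exp
    T4 = T3 / r ** exp                         # isentropic turbine exit
    w_net = CP * (T3 - T4) - CP * (T2 - T1)    # turbine minus compressor work
    q_in = CP * (T3 - T2)                      # combustor heat addition
    return w_net, w_net / q_in

w0, eta0 = brayton(300.0, 1400.0, 10.0)
# Inject a +2% error in the compressor exit temperature T2:
T2_bad = 300.0 * 10.0 ** ((GAMMA - 1) / GAMMA) * 1.02
w1, eta1 = brayton(300.0, 1400.0, 10.0, T2_override=T2_bad)
print(f"net work error:   {abs(w1 - w0) / w0:.1%}")
print(f"efficiency error: {abs(eta1 - eta0) / eta0:.1%}")
```

Under these numbers the 2% upstream error becomes roughly a 3% error in net work, already outside the ±2% tolerance on its own, and it additionally shifts the heat input and therefore the efficiency.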
Opus: Most Stable Across Tiers
Only -4.3 pp total drop — the most stable model. #3 in T1, #1 in both T2 and T3. Gemini shows the largest degradation at -13.2 pp (#1 to #3 to #3).
Methodology
- Reference library: CoolProp 7.2.0 (IAPWS-IF97 + NIST reference data)
- Tolerance: ±2% (industrial engineering standard)
- Questions: 82 (Tier 3)
- Cycle types: 9 (RNK-I, RNK-A, RNK-RH, BRY-I, BRY-A, BRY-AV, BRY-RV, VCR-A, CCGT)
- Fluids: 4 (Water, Air, R-134a, Air+Water)
- Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
- Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
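The weighted step-level scoring described above can be sketched as follows. This is a simplification of the idea, not the actual ThermoQA harness; the step names, weights, and values are hypothetical.

```python
def step_score(predicted: dict[str, float],
               reference: dict[str, float],
               weights: dict[str, float],
               rel_tol: float = 0.02) -> float:
    """Weighted step-level score: each intermediate value earns its weight
    if it falls within rel_tol (here +/-2%) of the CoolProp reference.
    A sketch of the scoring idea, not the actual ThermoQA implementation."""
    total = sum(weights.values())
    earned = 0.0
    for step, ref in reference.items():
        pred = predicted.get(step)
        if pred is not None and abs(pred - ref) <= rel_tol * abs(ref):
            earned += weights[step]
    return earned / total

# Hypothetical Rankine-cycle steps (values illustrative, not from the dataset):
ref = {"w_pump": 8.1, "q_boiler": 3200.0, "w_turbine": 1100.0, "eta": 0.34}
pred = {"w_pump": 8.2, "q_boiler": 3190.0, "w_turbine": 1180.0, "eta": 0.36}
w = {"w_pump": 1.0, "q_boiler": 1.0, "w_turbine": 1.0, "eta": 2.0}
print(step_score(pred, ref, w))  # partial credit: early steps pass, later fail
```

Step-level scoring is what makes the error-cascade effect visible in the results: a model that fails only the turbine step still loses the downstream efficiency step, while a pure final-answer score would report both as a single miss.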
Conclusion
The ThermoQA three-tier benchmark system is now complete — 293 total questions (110 + 101 + 82):
- Tier 1: Property Lookups — 110 questions, steam table values ✅
- Tier 2: Component Analysis — 101 questions, 7 components, 3 fluids ✅
- Tier 3: Cycle Analysis (this post) — 82 questions, 9 cycles, 4 fluids ✅
A consistent picture emerges across all three tiers: Opus is the most stable and highest-performing model, GPT is a strong second, Gemini excels at simpler tasks but degrades with complexity, DeepSeek maintains its reasoning-oriented architecture advantage. MiniMax lacks sufficient capacity for industrial thermodynamic reasoning.
Resources
- Dataset: HuggingFace — olivenet/thermoqa
- Source code: GitHub — olivenet-iot/ThermoQA
- CoolProp: coolprop.org
- IAPWS-IF97: International Association for the Properties of Water and Steam industrial formulation
- Tier 1 post: ThermoQA Tier 1 Results
- Tier 2 post: ThermoQA Tier 2 Results
About the Author
Olivenet Team
IoT & Automation Experts
Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.