
ThermoQA Tier 2: How Well Can AI Models Analyze Thermodynamic Components?

101 questions, 7 components, 3 fluids, 3 analysis depths. In Tier 2, rankings completely reshuffle: Opus rises to #1 while Gemini drops to #3. R-134a collapses all models, compressor is the hardest component — and deeper analysis paradoxically yields higher accuracy.

Olivenet Team

IoT & Automation Experts

2026-03-10 · 5 min read

In our Tier 1 post, we tested 5 AI models on steam table lookups — 110 questions, single fluid (water), single level (property lookup). Gemini led at 97.3%. But does that lead hold under more complex tasks?

ThermoQA Tier 2: Component Analysis tests multi-step thermodynamic reasoning. 101 questions, 7 components (turbine, compressor, pump, heat exchanger, boiler, mixing chamber, nozzle), 3 fluids (water, air, R-134a), 3 analysis depths (energy, entropy, exergy). Reading tables is no longer enough — models must set up energy balances, compute isentropic efficiencies, and perform irreversibility analysis.
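To make the jump in difficulty concrete, here is a sketch of the kind of calculation a Tier 2 turbine question asks for, in plain Python. The enthalpy values are illustrative placeholders, not taken from the benchmark.

```python
def turbine_work(h_in: float, h_out: float) -> float:
    """Specific work output of an adiabatic turbine (kJ/kg), neglecting KE/PE changes."""
    return h_in - h_out

def isentropic_efficiency(h_in: float, h_out: float, h_out_s: float) -> float:
    """Turbine isentropic efficiency: actual work divided by ideal (isentropic) work."""
    return (h_in - h_out) / (h_in - h_out_s)

# Illustrative values in kJ/kg (placeholders, not benchmark data):
h1, h2, h2s = 3230.9, 2682.4, 2560.0
w_actual = turbine_work(h1, h2)             # 548.5 kJ/kg
eta_s = isentropic_efficiency(h1, h2, h2s)  # ≈ 0.818
```

Each of h1, h2s, and often h2 requires its own property lookup, so a single question chains several table reads before the balance is even set up.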

Overall Leaderboard

101 questions · 3 fluids · 7 components · CoolProp 7.2.0 ground truth · ±2% tolerance

| Rank | Model | Provider | Overall | Water | Air | R-134a | Tokens/Q |
|------|-------|----------|---------|-------|-----|--------|----------|
| 🥇 | Claude Opus 4.6 | Anthropic | 92.0% | 96.5% | 95.6% | 53.0% | 30,371 |
| 🥈 | GPT-5.4 | OpenAI | 91.0% | 95.2% | 95.8% | 52.0% | 8,986 |
| 🥉 | Gemini 3.1 Pro | Google | 89.5% | 97.4% | 81.3% | 44.6% | 1,310 |
| #4 | DeepSeek-R1 | DeepSeek | 86.9% | 88.5% | 86.5% | 57.6% | 14,053 |
| #5 | MiniMax M2.5 | MiniMax | 73.4% | 61.5% | 76.5% | 35.5% | 11,659 |

Claude Opus 4.6 rises to #1 with 92.0%, up from third place in Tier 1. GPT-5.4 holds steady at #2 with 91.0%. Gemini 3.1 Pro could not hold its Tier 1 lead, dropping to #3 at 89.5%.

The most striking finding comes from the fluid breakdown: R-134a collapses every model. While water (62-97%) and air (77-96%) span reasonable ranges, the best R-134a score is only 57.6% (DeepSeek). Even frontier models struggle badly with refrigerant properties.

Why Is R-134a So Difficult?

R-134a (1,1,1,2-tetrafluoroethane) is a refrigerant widely used in cooling systems. Unlike water and air, R-134a's thermodynamic properties have limited coverage in standard Çengel textbook tables. Models need specialized databases like NIST/REFPROP or CoolProp for this fluid — but these sources are far less prevalent in training data than water.

Additionally, R-134a's critical temperature (101.06°C) and pressure (40.59 bar) are very different from water's. Its two-phase behavior misleads the intuitions models learned from water.

Per-Component Performance

7 components × 5 models: compressor is hardest, pump is solved

| Component (Qs) | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | DeepSeek-R1 | MiniMax M2.5 |
|----------------|-----------------|---------|----------------|-------------|--------------|
| Turbine (18) | 96.9% | 91.2% | 93.5% | 86.8% | 55.6% |
| Compressor (14) | 76.3% | 73.4% | 58.5% | 61.4% | 48.6% |
| Pump (10) | 100% | 100% | 100% | 93.3% | 88.5% |
| Heat Exchanger (19) | 88.7% | 84.9% | 88.5% | 89.9% | 66.6% |
| Boiler (14) | 98.2% | 97.1% | 100% | 93.5% | 73.0% |
| Mixing Chamber (12) | 92.0% | 98.6% | 97.8% | 95.7% | 59.6% |
| Nozzle (14) | 94.1% | 97.9% | 91.4% | 77.0% | 45.7% |


A clear difficulty hierarchy emerges across the 7 components:

Compressor is the hardest — best score only 76.3% (Opus). Compressor analysis requires formula reversal: outlet enthalpy is computed as h₂ = h₁ + (h₂s − h₁) / ηs. Here ηs is isentropic efficiency and the sign convention is critical — is work input positive or negative? Errors in this convention completely invalidate the result. Additionally, determining the isentropic outlet state (h₂s) requires a separate steam table lookup.
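The reversal the models trip over fits in a few lines of plain Python. The numbers below are illustrative placeholders, not benchmark data.

```python
def compressor_outlet_enthalpy(h_in: float, h_out_s: float, eta_s: float) -> float:
    """Actual outlet enthalpy of a compressor. The real enthalpy RISE exceeds the
    isentropic rise, so the ideal rise is divided by eta_s; for a turbine the
    actual work is smaller than ideal, so one multiplies instead."""
    return h_in + (h_out_s - h_in) / eta_s

# Illustrative refrigerant-like values in kJ/kg (placeholders):
h1, h2s, eta = 250.0, 280.0, 0.80
h2 = compressor_outlet_enthalpy(h1, h2s, eta)   # 287.5 kJ/kg
w_in = h2 - h1                                   # 37.5 kJ/kg of work input (taken positive here)
```

Multiplying by ηs instead of dividing, the classic turbine-habit error, would give 274.0 kJ/kg, an outlet enthalpy below the isentropic value, which is physically impossible for an adiabatic compressor.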

Pump is solved — the top three models scored 100%. Pump analysis is relatively straightforward: operates in the liquid phase, specific volume is approximately constant, and energy balance applies directly.
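The pump's "solved" status follows from how little machinery the calculation needs. A sketch with illustrative saturated-liquid values (placeholders, not benchmark data):

```python
def pump_work(v: float, p_in_kpa: float, p_out_kpa: float) -> float:
    """Ideal pump work w = v * (P2 - P1), valid because liquid specific volume
    is approximately constant. Returns kJ/kg when v is in m^3/kg and pressures in kPa."""
    return v * (p_out_kpa - p_in_kpa)

# Illustrative: liquid water near 50 kPa pumped to 3 MPa (placeholder values)
w = pump_work(0.001030, 50.0, 3000.0)   # ≈ 3.04 kJ/kg
```

One property lookup (specific volume), one multiplication, no phase-change or efficiency subtleties, hence the near-perfect scores.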

DeepSeek stands out on heat exchanger questions: at 89.9% it is the top-scoring model for that component. On the nozzle, GPT-5.4 (97.9%) opens a notable lead over Opus (94.1%).

Tier 1 → Tier 2 Performance Shift

| Model | Tier 1 | Tier 2 | Change | Rank |
|-------|--------|--------|--------|------|
| Claude Opus 4.6 | 95.6% | 92.0% | −3.6 pp | #3 → #1 |
| GPT-5.4 | 96.9% | 91.0% | −5.9 pp | #2 → #2 |
| Gemini 3.1 Pro | 97.3% | 89.5% | −7.8 pp | #1 → #3 |
| DeepSeek-R1 | 89.5% | 86.9% | −2.6 pp | #4 → #4 |
| MiniMax M2.5 | 84.5% | 73.4% | −11.1 pp | #5 → #5 |

Property lookup accuracy does NOT predict component analysis performance — rankings completely reshuffle.

The ranking shift is dramatic. Gemini led Tier 1 at 97.3% but drops to 89.5% in Tier 2, a 7.8-point loss and the steepest decline among the top three. Opus falls only 3.6 points, the smallest drop of those three, climbing from #3 to #1.

What does this mean? Steam table lookup (Tier 1) and multi-step component analysis (Tier 2) measure different capabilities. Tier 1 rewards fast, accurate table access, while Tier 2 requires setting up energy/mass balances, chaining multiple property lookups, and manipulating formulas. Gemini's efficient Tier 1 approach (823 tokens/question) becomes a disadvantage in multi-step reasoning.

DeepSeek-R1 notably shows the smallest drop (−2.6 pp). Its reasoning-heavy architecture appears more resilient on multi-step problems.

Analysis Depth vs Accuracy

Counterintuitive: deeper analysis = higher accuracy

| Depth | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | DeepSeek-R1 | MiniMax M2.5 |
|-------|-----------------|---------|----------------|-------------|--------------|
| A (Energy) | 90.2% | 89.8% | 87.8% | 81.3% | 71.8% |
| B (+Entropy) | 91.5% | 89.2% | 88.8% | 87.9% | 76.0% |
| C (+Exergy) | 94.8% | 94.7% | 92.2% | 92.4% | 72.3% |


Perhaps the most surprising finding: deeper analysis = higher accuracy. The expectation was that exergy (Depth C) questions would be harder and accuracy would drop. The reality is the opposite:

  • Depth A (Energy): 71.8% – 90.2%
  • Depth B (+Entropy): 76.0% – 91.5%
  • Depth C (+Exergy): 72.3% – 94.8%

For frontier models (Opus, GPT, Gemini, DeepSeek), Depth C consistently outperforms Depth A. Why?

Exergy formulas provide structured reasoning scaffolds. Exergy analysis offers a specific formula framework combining energy and entropy (ψ = h − h₀ − T₀(s − s₀)). This formula explicitly defines which steps the model should follow. In energy analysis, how to set up balance equations is more ambiguous, leading to more errors.
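The scaffold itself is tiny. With dead-state values supplied (the h₀, s₀, T₀ below are placeholders for illustration), the flow exergy computation is a single expression:

```python
def flow_exergy(h: float, s: float, h0: float, s0: float, T0: float) -> float:
    """Specific flow exergy psi = (h - h0) - T0*(s - s0), in kJ/kg
    when h, h0 are in kJ/kg, s, s0 in kJ/(kg*K), and T0 in K.
    Kinetic and potential terms neglected."""
    return (h - h0) - T0 * (s - s0)

# Placeholder state and dead-state values (not benchmark data):
psi = flow_exergy(h=3230.9, s=6.9212, h0=104.83, s0=0.3672, T0=298.15)
```

Every symbol in the formula names a quantity the model must fetch or carry forward, which is exactly the step-by-step structure that energy balances alone do not impose.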

The exception is MiniMax: at 72.3% on Depth C, nearly the same as Depth A (71.8%). Structured scaffolding only benefits models with sufficient base capability.

Key Findings

The six most important insights from the Tier 2 results:

#1
Rankings Reshuffle

Tier 1 leader Gemini (97.3%) drops to #3 (89.5%) in Tier 2, while Opus rises from #3 to #1 (92.0%). Property lookup ≠ multi-step reasoning.

#2
R-134a Is the Discriminator

While water and air range 62-97%, R-134a collapses all models to 35-57%. Refrigerant properties beyond Çengel tables are the biggest weakness.

#3
Compressor Is Hardest

Best score only 76.3% (Opus). Formula reversal (h₂ = h₁ + (h₂s − h₁) / ηs) and isentropic efficiency sign convention challenge all models.

#4
Depth C > A

Counterintuitive: exergy (deepest) accuracy exceeds energy (simplest). Exergy formulas provide structured reasoning scaffolds for frontier models.

#5
Pump Is Solved

Top three models scored 100% on pump questions. Simple energy balance and low complexity — LLMs have this component fully solved.

#6
Three Performance Tiers

Opus/GPT (91-92%), Gemini/DeepSeek (87-90%), MiniMax (73%). The tight Tier 1 race separates into clear tiers in Tier 2.

Methodology

  • Reference library: CoolProp 7.2.0 (IAPWS-IF97 + NIST reference data)
  • Tolerance: ±2% (industrial engineering standard)
  • Questions: 101 (Tier 2)
  • Components: 7 (Turbine, Compressor, Pump, Heat Exchanger, Boiler, Mixing Chamber, Nozzle)
  • Fluids: 3 (Water, Air, R-134a)
  • Analysis depths: 3 (Energy, +Entropy, +Exergy)
  • Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
  • Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
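The ±2% gate reduces to a one-line relative-error check. This is a sketch of the idea, not the benchmark's actual scoring code:

```python
def within_tolerance(value: float, reference: float, rel_tol: float = 0.02) -> bool:
    """True if value is within rel_tol (default ±2%) of the CoolProp reference."""
    return abs(value - reference) <= rel_tol * abs(reference)

within_tolerance(2682.4, 2700.0)   # True  (about 0.65% off)
within_tolerance(2560.0, 2700.0)   # False (about 5.2% off)
```

Under step-level scoring, a check like this would run per intermediate step, so an early lookup error can fail several downstream steps at once.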

What's Next?

ThermoQA is a three-tier benchmark system:

  • Tier 1: Property Lookups — 110 questions, steam table values ✅
  • Tier 2: Component Analysis (this post) — 101 questions, 7 components, 3 fluids ✅
  • Tier 3: Cycle Analysis — Full Rankine, Brayton, refrigeration cycles (in development)

In Tier 3 we'll analyze complete thermodynamic cycles — scenarios where multiple components are interconnected, cycle efficiency is computed, and optimization decisions are made. Will models strong in component analysis also succeed in full cycle analysis? Stay tuned.


About the Author

Olivenet Team

IoT & Automation Experts

Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.

LoRaWAN · ThingsBoard · Smart Farming · Energy Monitoring