ThermoQA Tier 2: How Well Can AI Models Analyze Thermodynamic Components?
101 questions, 7 components, 3 fluids, 3 analysis depths. In Tier 2 the rankings reshuffle completely: Opus rises to #1 while Gemini drops to #3, R-134a collapses every model, the compressor proves hardest, and deeper analysis paradoxically yields higher accuracy.
Olivenet Team
IoT & Automation Experts
In our Tier 1 post, we tested 5 AI models on steam table lookups — 110 questions, single fluid (water), single level (property lookup). Gemini led at 97.3%. But does that lead hold under more complex tasks?
ThermoQA Tier 2: Component Analysis tests multi-step thermodynamic reasoning. 101 questions, 7 components (turbine, compressor, pump, heat exchanger, boiler, mixing chamber, nozzle), 3 fluids (water, air, R-134a), 3 analysis depths (energy, entropy, exergy). Reading tables is no longer enough — models must set up energy balances, compute isentropic efficiencies, and perform irreversibility analysis.
Overall Leaderboard
101 questions · 3 fluids · 7 components · CoolProp 7.2.0 ground truth · ±2% tolerance
1. Claude Opus 4.6 (Anthropic)
2. GPT-5.4 (OpenAI)
3. Gemini 3.1 Pro (Google)
4. DeepSeek-R1 (DeepSeek)
5. MiniMax M2.5 (MiniMax)
Claude Opus 4.6 rises to #1 with 92.0% — up from third place in Tier 1. GPT-5.4 holds steady at #2 with 91.0%. Gemini 3.1 Pro could not maintain its Tier 1 lead, dropping to #3 at 89.5%.
The most striking finding comes from the per-fluid breakdown: R-134a collapses all models. While water (62-97%) and air (77-96%) span reasonable ranges, the best R-134a score is only 57.6% (DeepSeek). Even frontier models struggle significantly with refrigerant properties.
Why Is R-134a So Difficult?
R-134a (1,1,1,2-tetrafluoroethane) is a refrigerant widely used in cooling systems. Unlike water and air, R-134a's thermodynamic properties have limited coverage in standard Çengel textbook tables. Models need specialized databases like NIST/REFPROP or CoolProp for this fluid — but these sources are far less prevalent in training data than water.
Additionally, R-134a's critical temperature (101.06°C) and pressure (40.59 bar) are very different from water's. Its two-phase behavior misleads the intuitions models learned from water.
Per-Component Performance
7 components × 5 models — Compressor is hardest, pump is solved
Compressor: Hardest component. Best score 76.3% (Opus). Formula reversal (h₂ = h₁ + (h₂s − h₁) / ηs) and sign convention challenges.
Pump: Solved component. Top three models scored 100%. Simple energy balance suffices.
A clear difficulty hierarchy emerges across the 7 components:
Compressor is the hardest — best score only 76.3% (Opus). Compressor analysis requires formula reversal: outlet enthalpy is computed as h₂ = h₁ + (h₂s − h₁) / ηs. Here ηs is isentropic efficiency and the sign convention is critical — is work input positive or negative? Errors in this convention completely invalidate the result. Additionally, determining the isentropic outlet state (h₂s) requires a separate steam table lookup.
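The formula reversal above can be made concrete with a short sketch. This is not a ThermoQA question — it assumes cold-air-standard analysis (ideal gas, constant cp) with illustrative numbers, precisely so no property database is needed:

```python
# Sketch: isentropic efficiency applied to a compressor.
# Assumes cold-air-standard analysis (ideal gas, constant cp);
# values are illustrative, not taken from the benchmark.
cp = 1.005              # kJ/(kg*K), air
k = 1.4                 # specific heat ratio
T1, P1 = 300.0, 100.0   # inlet: K, kPa
P2 = 800.0              # outlet pressure, kPa
eta_s = 0.80            # isentropic efficiency

# Isentropic outlet state: T2s/T1 = (P2/P1)^((k-1)/k)
T2s = T1 * (P2 / P1) ** ((k - 1) / k)
h1, h2s = cp * T1, cp * T2s

# For a compressor, efficiency DIVIDES: actual work exceeds ideal work.
# h2 = h1 + (h2s - h1) / eta_s  -- multiplying instead (the turbine
# direction) is exactly the reversal error the benchmark catches.
h2 = h1 + (h2s - h1) / eta_s
w_in = h2 - h1          # kJ/kg, positive = work input (sign convention!)
print(round(T2s, 1), round(w_in, 1))  # -> 543.4 305.8
```

Note that the turbine uses the opposite form, w_actual = eta_s * w_ideal, which is why a single memorized pattern fails on compressor questions.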
Pump is solved — the top three models scored 100%. Pump analysis is relatively straightforward: operates in the liquid phase, specific volume is approximately constant, and energy balance applies directly.
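For contrast, the entire pump analysis reduces to a one-liner. A minimal sketch with approximate textbook values (not a benchmark question):

```python
# Sketch of why pump questions are "solved": for an incompressible
# liquid the energy balance reduces to w = v * (P2 - P1).
# Values are approximate textbook figures, for illustration only.
v = 0.001017            # m^3/kg, sat. liquid water near 40 degC
P1, P2 = 10.0, 5000.0   # kPa
w_pump = v * (P2 - P1)  # kJ/kg (kPa * m^3/kg = kJ/kg)
print(round(w_pump, 2))  # -> 5.07
```

One table lookup (specific volume), one multiplication, no sign ambiguity — there is simply nothing for a model to get wrong.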
DeepSeek stands out on heat exchanger questions: at 89.9% it is the highest-scoring model for that component. On the nozzle, GPT (97.9%) opens a notable gap over Opus (94.1%).
Tier 1 → Tier 2 Performance Shift
Property lookup accuracy does NOT predict component analysis performance
(Chart: Tier 1 vs Tier 2 scores for Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, and MiniMax M2.5.)
The ranking shift is dramatic. Gemini led Tier 1 at 97.3% but drops to 89.5% in Tier 2 — a 7.8-point loss, the largest degradation. Opus loses only 3.6 points, the smallest drop among the top three, and climbs from #3 to #1.
What does this mean? Steam table lookup (Tier 1) and multi-step component analysis (Tier 2) measure different capabilities. Tier 1 rewards fast, accurate table access, while Tier 2 requires setting up energy/mass balances, chaining multiple property lookups, and manipulating formulas. Gemini's efficient Tier 1 approach (823 tokens/question) becomes a disadvantage in multi-step reasoning.
DeepSeek-R1 shows the smallest drop overall (−2.6 pp): its reasoning-heavy architecture appears more resilient on multi-step problems.
Analysis Depth vs Accuracy
Counterintuitive: deeper analysis = higher accuracy
Counterintuitive finding: Deeper analysis (exergy) yields higher accuracy for frontier models. Exergy formulas act as structured reasoning scaffolds.
Perhaps the most surprising finding: deeper analysis = higher accuracy. The expectation was that exergy (Depth C) questions would be harder and accuracy would drop. The reality is the opposite:
- Depth A (Energy): 71.8% – 90.2%
- Depth B (+Entropy): 76.0% – 91.5%
- Depth C (+Exergy): 72.3% – 94.8%
For frontier models (Opus, GPT, Gemini, DeepSeek), Depth C consistently outperforms Depth A. Why?
Exergy formulas provide structured reasoning scaffolds. Exergy analysis offers a specific formula framework combining energy and entropy (ψ = h − h₀ − T₀(s − s₀)). This formula explicitly defines which steps the model should follow. In energy analysis, how to set up balance equations is more ambiguous, leading to more errors.
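The scaffolding effect is easy to see in code: once the formula is written down, every input is an explicit slot to fill. A minimal sketch, using approximate steam-table values for illustration (not a benchmark question):

```python
def flow_exergy(h, s, h0, s0, T0):
    """Specific flow exergy: psi = (h - h0) - T0 * (s - s0), in kJ/kg.

    Each argument is a slot the formula forces the model to fill --
    two state lookups (h, s), two dead-state lookups (h0, s0), and T0.
    """
    return (h - h0) - T0 * (s - s0)

# Illustrative state: superheated steam at 1 MPa, 300 degC;
# dead state at 25 degC, 100 kPa (approximate textbook values).
h, s = 3051.2, 7.1229     # kJ/kg, kJ/(kg*K)
h0, s0 = 104.89, 0.3674   # sat. liquid at 25 degC
T0 = 298.15               # K
psi = flow_exergy(h, s, h0, s0, T0)
print(round(psi, 1))      # -> 932.2 kJ/kg
```

An energy balance, by contrast, has no fixed argument list: the model must first decide which terms (kinetic, potential, heat loss, multiple streams) belong in the equation at all.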
The exception is MiniMax: at 72.3% on Depth C, nearly the same as Depth A (71.8%). Structured scaffolding only benefits models with sufficient base capability.
Key Findings
The most important insights from Tier 2 results
Rankings Reshuffle
Tier 1 leader Gemini (97.3%) drops to #3 (89.5%) in Tier 2, while Opus rises from #3 to #1 (92.0%). Property lookup ≠ multi-step reasoning.
R-134a Is the Discriminator
While water and air range 62-97%, R-134a collapses all models to 35-57%. Refrigerant properties beyond Çengel tables are the biggest weakness.
Compressor Is Hardest
Best score only 76.3% (Opus). Formula reversal (h₂ = h₁ + (h₂s − h₁) / ηs) and isentropic efficiency sign convention challenge all models.
Depth C > A
Counterintuitive: exergy (deepest) accuracy exceeds energy (simplest). Exergy formulas provide structured reasoning scaffolds for frontier models.
Pump Is Solved
Top three models scored 100% on pump questions. Simple energy balance and low complexity — LLMs have this component fully solved.
Three Performance Tiers
Opus/GPT (91-92%), Gemini/DeepSeek (87-90%), MiniMax (73%). The tight Tier 1 race separates into clear tiers in Tier 2.
Methodology
- Reference library: CoolProp 7.2.0 (IAPWS-IF97 + NIST reference data)
- Tolerance: ±2% (industrial engineering standard)
- Questions: 101 (Tier 2)
- Components: 7 (Turbine, Compressor, Pump, Heat Exchanger, Boiler, Mixing Chamber, Nozzle)
- Fluids: 3 (Water, Air, R-134a)
- Analysis depths: 3 (Energy, +Entropy, +Exergy)
- Scoring: Weighted step-level — each intermediate step independently validated against CoolProp reference
- Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
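The step-level check described above could look like the sketch below. This is a hypothetical helper for illustration, not the benchmark's actual harness, and the steam-enthalpy numbers are only examples:

```python
def within_tolerance(model_value, reference, tol=0.02):
    """Pass if the relative error against the CoolProp reference
    value is within tol (default +/-2%). Hypothetical helper --
    not the benchmark's actual scoring code."""
    if reference == 0:
        return abs(model_value) <= tol
    return abs(model_value - reference) / abs(reference) <= tol

# Example: reference enthalpy 3051.2 kJ/kg.
print(within_tolerance(3097.0, 3051.2))  # +1.5% deviation -> True
print(within_tolerance(3142.7, 3051.2))  # +3.0% deviation -> False
```

Checking every intermediate step independently (rather than only the final answer) is what lets the benchmark distinguish a wrong table lookup from a wrong energy balance.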
What's Next?
ThermoQA is a three-tier benchmark system:
- Tier 1: Property Lookups — 110 questions, steam table values ✅
- Tier 2: Component Analysis (this post) — 101 questions, 7 components, 3 fluids ✅
- Tier 3: Cycle Analysis — Full Rankine, Brayton, refrigeration cycles (in development)
In Tier 3 we'll analyze complete thermodynamic cycles — scenarios where multiple components are interconnected, cycle efficiency is computed, and optimization decisions are made. Will models strong in component analysis also succeed in full cycle analysis? Stay tuned.
Resources
- Dataset: HuggingFace — olivenet/thermoqa
- Source code: GitHub — olivenet-iot/ThermoQA
- CoolProp: coolprop.org
- IAPWS-IF97: International Association for the Properties of Water and Steam industrial formulation
About the Author
Olivenet Team
IoT & Automation Experts
Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.