
ThermoQA Tier 1: How Well Can AI Models Read Steam Tables?

We tested 5 large language models with 110 thermodynamics questions. Gemini 3.1 Pro leads with 97.3%, but all models struggle in the supercritical region. CoolProp 7.2.0 ground truth, ±2% tolerance — here are the results.

Olivenet Team

IoT & Automation Experts

2026-03-07 · 5 min read

LLMs now write code, summarize articles, and generate images. But how reliable are they for engineering calculations? ThermoQA is a benchmark we designed to answer exactly this question: measuring AI models' thermodynamic reasoning ability.

In this post we share Tier 1: Property Lookups results — 110 questions, 5 models, 8 categories, CoolProp 7.2.0 ground truth. The questions are at the level of reading values from steam tables in an engineering thermodynamics course: finding properties like enthalpy, entropy, and specific volume given temperature and pressure.

Why Thermodynamics?

Thermodynamics is the cornerstone of engineering calculations. The efficiency of a steam turbine, the COP of a refrigeration cycle, the outlet temperature of a heat exchanger — all depend on correct values from steam tables. If an LLM can't perform these basic lookups accurately, it cannot be trusted for more complex cycle analyses.

CoolProp 7.2.0 is an open-source thermodynamic property library using the IAPWS-IF97 formulation. All reference values in the benchmark are computed with CoolProp. Tolerance is set at ±2% — consistent with industrial engineering standards.
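As a concrete illustration, the ±2% rule can be sketched in a few lines of Python (the helper name and example values are ours, not part of the benchmark code):

```python
# Minimal sketch of the benchmark's ±2% pass/fail rule (assumption:
# symmetric relative tolerance against the CoolProp reference value).
def within_tolerance(predicted: float, reference: float, tol: float = 0.02) -> bool:
    """Return True if the model's answer is within ±tol of the reference."""
    return abs(predicted - reference) <= tol * abs(reference)

# Saturated-vapor enthalpy of water at 100 °C is roughly 2675.6 kJ/kg.
# A model answer of 2700 kJ/kg (~0.9% off) would be graded correct:
print(within_tolerance(2700.0, 2675.6))  # True
```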

Overall Leaderboard

110 questions · Water/steam · CoolProp 7.2.0 ground truth (IAPWS-IF97) · ±2% tolerance

Rank  Model            Vendor     Overall  Easy    Medium  Hard
🥇    Gemini 3.1 Pro   Google     97.3%    100%    98.9%   87.5%
🥈    GPT-5.4          OpenAI     96.9%    100%    93.9%   94.4%
🥉    Claude Opus 4.6  Anthropic  95.6%    88.5%   94.4%   75.0%
#4    DeepSeek-R1      DeepSeek   89.5%    97.4%   96.1%   67.6%
#5    MiniMax M2.5     MiniMax    84.5%    90.1%   78.9%   70.8%

Gemini 3.1 Pro leads at 97.3%. 100% on easy questions, 98.9% on medium, 87.5% on hard. GPT-5.4 is a close second (96.9%) and notably outperforms Gemini on hard questions (94.4% vs 87.5%). Claude Opus 4.6 ranks third (95.6%) but makes unexpected errors on easy questions (88.5%).

Further down are DeepSeek-R1 (89.5%) and MiniMax M2.5 (84.5%). DeepSeek is strong on easy questions (97.4%) but drops sharply on hard ones (67.6%).

Difficulty Analysis

Easy questions (standard steam table values) don't discriminate among frontier models: the top three range from 88.5% to 100%. The real differentiation happens on hard questions: interpolation, boundary conditions, and the supercritical region. GPT-5.4's 94.4% on hard questions stands out, roughly 7 points above Gemini's 87.5%.

Per-Category Performance

8 categories × 5 models. The supercritical region is the key discriminator.

Category             Q    Gemini 3.1 Pro  GPT-5.4  Claude Opus 4.6  DeepSeek-R1  MiniMax M2.5
Subcooled Liquid     10   100%            100%     80%              100%         76.7%
Saturated Liquid     12   100%            100%     100%             91.7%        97.9%
Wet Steam            18   100%            100%     90.7%            90.7%        92.6%
Saturated Vapor      10   100%            100%     100%             100%         87.5%
Superheated Vapor    20   98.3%           98.3%    78.3%            95.0%        83.3%
Supercritical        10   76.7%           86.7%    48.3%            48.3%        43.3%
Phase Determination  15   100%            100%     93.3%            86.7%        100%
Inverse Lookups      15   100%            88.3%    96.7%            95.0%        63.3%

Supercritical region: All models struggled. Best score 86.7% (GPT-5.4)

In 4 of 8 categories (Saturated Liquid, Saturated Vapor, Phase Determination, Subcooled Liquid), at least 3 models score 100%. But the supercritical region is dramatically different: even the best score is only 86.7% (GPT-5.4).

Supercritical Region: Why So Difficult?

Water's critical point lies at T = 373.95°C and P = 22.064 MPa. Beyond this point there is no clear distinction between liquid and vapor phases; properties are continuous but change rapidly. Standard steam tables don't cover this region in detail, and accurate values require evaluating the IAPWS-IF97 equations of state.
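A trivial region check against these critical-point values can be written as follows (a sketch only; reliable classification near the boundary requires the full IAPWS-IF97 formulation):

```python
# Phase-region check against water's critical point (IAPWS: 373.95 °C, 22.064 MPa).
T_CRIT_C = 373.95
P_CRIT_MPA = 22.064

def is_supercritical(temp_c: float, pressure_mpa: float) -> bool:
    """True when both temperature and pressure exceed the critical point."""
    return temp_c > T_CRIT_C and pressure_mpa > P_CRIT_MPA

print(is_supercritical(402.0, 25.3))  # True
print(is_supercritical(350.0, 25.3))  # False (below the critical temperature)
```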

LLMs may have memorized textbook steam tables, but computing values near the critical point that aren't in the tables requires equation-solving capability. This reveals the gap between memorization and actual computation.

Example: Claude Opus 4.6 at 402°C / 25.3 MPa

At this supercritical condition, Claude Opus returned enthalpy as h = 1887 kJ/kg. CoolProp reference: h = 2585.77 kJ/kg. Error: 27%. The model estimated a value close to the liquid side in a region where the phase boundary is blurred — a completely incorrect interpolation just above the critical point.
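The reported error is easy to verify from the values above (the PropsSI call in the comment is the standard CoolProp API for recomputing the reference):

```python
# Reproducing the error figure for the 402 °C / 25.3 MPa case.
# With CoolProp installed, the reference could be recomputed as:
#   from CoolProp.CoolProp import PropsSI
#   h_ref = PropsSI('H', 'T', 402 + 273.15, 'P', 25.3e6, 'Water') / 1000  # kJ/kg
model_answer = 1887.0   # kJ/kg, Claude Opus 4.6
reference = 2585.77     # kJ/kg, CoolProp ground truth
rel_error = abs(model_answer - reference) / reference
print(f"{rel_error:.0%}")  # 27%
```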

Token Usage Analysis

Tier 1: mean output tokens per model, with overall accuracy for reference.

Model            Mean output tokens  Overall
Gemini 3.1 Pro      823              97.3%
GPT-5.4          10,798              96.9%
Claude Opus 4.6  12,981              95.6%
DeepSeek-R1       7,476              89.5%
MiniMax M2.5      7,551              84.5%

Gemini used 16× fewer tokens than Opus but scored higher.

Efficiency metrics (tokens per accuracy point):

Model            Tokens / %
Gemini 3.1 Pro      8.5
GPT-5.4           111.4
Claude Opus 4.6   135.8
DeepSeek-R1        83.5
MiniMax M2.5       89.4

Token usage varies dramatically across models. Gemini 3.1 Pro uses an average of just 823 tokens per question, while Claude Opus 4.6 uses 12,981 tokens per question — a 16x difference. Yet spending more tokens doesn't yield better results: Gemini scored higher with fewer tokens.

This shows that for well-defined property lookups, the assumption "more thinking = better results" doesn't hold. Gemini's efficient approach appears to be the optimal strategy for this type of structured question.
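The "Tokens / %" efficiency figures are simply mean output tokens divided by overall accuracy, which is straightforward to reproduce:

```python
# Recomputing the tokens-per-accuracy-point efficiency metric
# (mean output tokens / overall accuracy in percent), from the data above.
results = {
    "Gemini 3.1 Pro":  (823,    97.3),
    "GPT-5.4":         (10_798, 96.9),
    "Claude Opus 4.6": (12_981, 95.6),
    "DeepSeek-R1":     (7_476,  89.5),
    "MiniMax M2.5":    (7_551,  84.5),
}
for model, (tokens, accuracy) in results.items():
    print(f"{model:<16} {tokens / accuracy:6.1f} tokens / %")
```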

Key Findings

The five most important insights from Tier 1 results:

#1
Supercritical Is the Discriminator

All models struggled in the supercritical region. Best score 86.7% (GPT-5.4). LLMs can memorize steam tables but can't solve IAPWS-IF97 equations near the critical point.

#2
Reasoning Mode Is Critical

GPT-5.4 without reasoning: 81%. With reasoning: 96.9%. A 16-point jump. Reasoning enables cross-checking and self-correction.

#3
Token Efficiency ≠ Accuracy

Gemini scored 97.3% with 823 tokens/question. Opus used 12,981 tokens (16×) and scored 95.6%. More thinking doesn't always mean better answers.

#4
Tool Use Changes Everything

The same model that scores 48% on supercritical without tools scores 100% with Python/CoolProp. The gap isn't knowledge — it's methodology.

#5
No Model Is Perfect Everywhere

Each model has unique weaknesses: GPT-5.4 on inverse lookups (88.3%), Opus on supercritical (48.3%), MiniMax on inverse lookups (63.3%), DeepSeek on hard problems (67.6%).

Impact of Tool Use

One of the benchmark's most striking findings is the impact of tool use. The same model that scores 48% on supercritical questions without tools scores 100% when it can run the CoolProp library via Python. The missing piece isn't thermodynamic knowledge — models know they need equation-of-state solvers, but can't run them without tool access.

This clearly defines the scenario where LLMs are strongest in engineering calculations: access to the right tools + reasoning capability.

The Importance of Reasoning Mode

GPT-5.4's score without reasoning_effort: 81%. With reasoning_effort=high: 96.9%. Reasoning mode enables the model to cross-check, validate intermediate results, and correct interpolation errors. All models in the leaderboard were tested with their best reasoning modes.

Methodology

  • Reference library: CoolProp 7.2.0 (IAPWS-IF97 formulation)
  • Tolerance: ±2% (industrial engineering standard)
  • Questions: 110 (Tier 1)
  • Categories: 8 (Subcooled Liquid, Saturated Liquid, Wet Steam, Saturated Vapor, Superheated Vapor, Supercritical, Phase Determination, Inverse Lookups)
  • Difficulty levels: Easy, Medium, Hard
  • Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
  • Scoring: Each sub-question independently validated against CoolProp reference
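Based on the methodology above, per-question scoring might look roughly like this (the function and sample values are illustrative, not the benchmark's actual code):

```python
# Illustrative per-question scorer: each sub-answer is validated
# independently against its CoolProp reference at ±2%
# (assumed helper, not the benchmark's real implementation).
def score_question(predictions: dict, references: dict, tol: float = 0.02) -> float:
    """Fraction of sub-answers within relative tolerance of the reference."""
    correct = sum(
        key in predictions and abs(predictions[key] - ref) <= tol * abs(ref)
        for key, ref in references.items()
    )
    return correct / len(references)

# Approximate saturated-vapor properties of water at 100 °C:
refs  = {"h": 2675.6, "s": 7.3545}   # kJ/kg and kJ/(kg·K)
preds = {"h": 2676.0, "s": 7.36}
print(score_question(preds, refs))   # 1.0
```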

What's Next?

ThermoQA is a three-tier benchmark system:

  • Tier 1: Property Lookups (this post) — 110 questions, steam table values
  • Tier 2: Component Analysis — 101 questions, thermodynamic analysis of components like turbines, compressors, pumps (coming soon)
  • Tier 3: Cycle Analysis — Full Rankine, Brayton, refrigeration cycles (in development)

In Tier 2 results, the rankings completely reshuffle — Gemini's Tier 1 lead doesn't hold up under multi-step reasoning. We'll share this analysis in our next post.


About the Author

Olivenet Team

IoT & Automation Experts

Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.

LoRaWAN · ThingsBoard · Smart Farming · Energy Monitoring