ThermoQA Tier 1: How Well Can AI Models Read Steam Tables?
We tested 5 large language models with 110 thermodynamics questions. Gemini 3.1 Pro leads with 97.3%, but all models struggle in the supercritical region. CoolProp 7.2.0 ground truth, ±2% tolerance — here are the results.
Olivenet Team
IoT & Automation Experts
LLMs now write code, summarize articles, and generate images. But how reliable are they for engineering calculations? ThermoQA is a benchmark we designed to answer exactly this question: measuring AI models' thermodynamic reasoning ability.
In this post we share the Tier 1 (Property Lookups) results: 110 questions, 5 models, 8 categories, with CoolProp 7.2.0 as ground truth. The questions are at the level of reading values from steam tables in an engineering thermodynamics course: finding properties like enthalpy, entropy, and specific volume given temperature and pressure.
Why Thermodynamics?
Thermodynamics is the cornerstone of engineering calculations. The efficiency of a steam turbine, the COP of a refrigeration cycle, the outlet temperature of a heat exchanger — all depend on correct values from steam tables. If an LLM can't perform these basic lookups accurately, it cannot be trusted for more complex cycle analyses.
CoolProp 7.2.0 is an open-source thermodynamic property library using the IAPWS-IF97 formulation. All reference values in the benchmark are computed with CoolProp. Tolerance is set at ±2% — consistent with industrial engineering standards.
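For a concrete sense of how a reference value and the ±2% check fit together, here is a minimal sketch, not the benchmark's actual grading code: it uses CoolProp's PropsSI interface (SI units: Pa, K, J/kg), and the within_tolerance helper is illustrative only.

```python
from CoolProp.CoolProp import PropsSI

def within_tolerance(model_value, reference, tol=0.02):
    """Accept the model's answer if it is within ±2% of the CoolProp reference."""
    return abs(model_value - reference) / abs(reference) <= tol

# Reference: enthalpy of saturated steam (quality Q = 1) at 1 atm, converted to kJ/kg
h_ref = PropsSI('H', 'P', 101325, 'Q', 1, 'Water') / 1000.0

print(f"h_g(1 atm) = {h_ref:.1f} kJ/kg")
print(within_tolerance(2675.6, h_ref))  # a typical steam-table value, well inside ±2%
```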
Overall Leaderboard
110 questions · Water/steam · CoolProp 7.2.0 ground truth (IAPWS-IF97) · ±2% tolerance

| Rank | Model | Provider | Overall accuracy |
|------|-------|----------|------------------|
| 1 | Gemini 3.1 Pro | Google | 97.3% |
| 2 | GPT-5.4 | OpenAI | 96.9% |
| 3 | Claude Opus 4.6 | Anthropic | 95.6% |
| 4 | DeepSeek-R1 | DeepSeek | 89.5% |
| 5 | MiniMax M2.5 | MiniMax | 84.5% |
Gemini 3.1 Pro leads at 97.3%, with 100% on easy questions, 98.9% on medium, and 87.5% on hard. GPT-5.4 is a close second (96.9%) and notably outperforms Gemini on hard questions (94.4% vs 87.5%). Claude Opus 4.6 ranks third (95.6%) but makes unexpected errors on easy questions (88.5%).
Further down the table are DeepSeek-R1 (89.5%) and MiniMax M2.5 (84.5%). DeepSeek is strong on easy questions (97.4%) but drops significantly on hard ones (67.6%).
Difficulty Analysis
Easy questions (standard steam table values) don't discriminate among frontier models; the top three range from 88% to 100%. The real differentiation happens on hard questions: interpolation, boundary conditions, and the supercritical region. GPT-5.4's 94.4% on hard questions stands out, roughly 7 points above Gemini.
Per-Category Performance
8 categories × 5 models; the supercritical region is the key discriminator, with a best score of just 86.7% (GPT-5.4).
In 4 of 8 categories (Saturated Liquid, Saturated Vapor, Phase Determination, Subcooled Liquid), at least 3 models score 100%. But the supercritical region is dramatically different: even the best score is only 86.7% (GPT-5.4).
Supercritical Region: Why So Difficult?
Water's critical point is at 373.95°C and 22.064 MPa. Beyond it (T > 373.95°C, P > 22.064 MPa) there is no clear distinction between liquid and vapor phases; properties are continuous but change rapidly. Standard steam tables don't cover this region in detail, and accurate values require solving the IAPWS-IF97 equations of state.
LLMs may have memorized textbook steam tables, but computing values near the critical point that aren't in the tables requires equation-solving capability. This reveals the gap between memorization and actual computation.
Example: Claude Opus 4.6 at 402°C / 25.3 MPa
At this supercritical condition, Claude Opus returned enthalpy as h = 1887 kJ/kg. CoolProp reference: h = 2585.77 kJ/kg. Error: 27%. The model estimated a value close to the liquid side in a region where the phase boundary is blurred — a completely incorrect interpolation just above the critical point.
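The size of the miss is easy to quantify; the short check below, using the values quoted above, computes the relative error against the CoolProp reference.

```python
h_model = 1887.0    # kJ/kg, Claude Opus 4.6's answer at 402 °C / 25.3 MPa
h_ref   = 2585.77   # kJ/kg, CoolProp reference quoted above

rel_error = abs(h_model - h_ref) / h_ref
print(f"relative error = {rel_error:.1%}")   # ~27%, far outside the ±2% tolerance
```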
Token Usage Analysis
Tier 1: mean output tokens per model. Gemini used roughly 16× fewer tokens than Opus yet scored higher.
Efficiency Metrics
Token usage varies dramatically across models. Gemini 3.1 Pro uses an average of just 823 tokens per question, while Claude Opus 4.6 uses 12,981 tokens per question — a 16x difference. Yet spending more tokens doesn't yield better results: Gemini scored higher with fewer tokens.
This shows that for well-defined property lookups, the assumption "more thinking = better results" doesn't hold. Gemini's efficient approach appears to be the optimal strategy for this type of structured question.
Key Findings
The five most important insights from the Tier 1 results:

1. Supercritical is the discriminator. All models struggled in the supercritical region; the best score is 86.7% (GPT-5.4). LLMs can memorize steam tables but can't solve the IAPWS-IF97 equations near the critical point.
2. Reasoning mode is critical. GPT-5.4 scores 81% without reasoning and 96.9% with it, a 16-point jump. Reasoning enables cross-checking and self-correction.
3. Token efficiency ≠ accuracy. Gemini scored 97.3% with 823 tokens per question; Opus used 12,981 tokens (16×) and scored 95.6%. More thinking doesn't always mean better answers.
4. Tool use changes everything. The same model that scores 48% on supercritical questions without tools scores 100% with Python/CoolProp. The gap isn't knowledge; it's methodology.
5. No model is perfect everywhere. Each model has its own weakness: GPT-5.4 on inverse lookups (88.3%), Opus on supercritical (48.3%), MiniMax on inverse lookups (63.3%), DeepSeek on hard problems (67.6%).
Impact of Tool Use
One of the benchmark's most striking findings is the impact of tool use. The same model that scores 48% on supercritical questions without tools scores 100% when it can run the CoolProp library via Python. The missing piece isn't thermodynamic knowledge — models know they need equation-of-state solvers, but can't run them without tool access.
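As an illustration (not the benchmark's actual tool-use harness), this is the kind of snippet a tool-enabled model can run for the supercritical example above instead of guessing, using CoolProp's Python bindings. The default 'Water' fluid is used here; the post reports its reference values with the IAPWS-IF97 formulation.

```python
from CoolProp.CoolProp import PropsSI

# Supercritical water at 402 °C and 25.3 MPa (the example discussed earlier)
T = 402 + 273.15   # K
P = 25.3e6         # Pa

h = PropsSI('H', 'T', T, 'P', P, 'Water') / 1000.0   # specific enthalpy, kJ/kg
s = PropsSI('S', 'T', T, 'P', P, 'Water') / 1000.0   # specific entropy, kJ/(kg·K)
v = 1.0 / PropsSI('D', 'T', T, 'P', P, 'Water')      # specific volume, m³/kg

print(f"h = {h:.2f} kJ/kg, s = {s:.4f} kJ/(kg*K), v = {v:.6f} m^3/kg")
```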
This clearly defines the scenario where LLMs are strongest in engineering calculations: access to the right tools + reasoning capability.
The Importance of Reasoning Mode
GPT-5.4's score without reasoning_effort: 81%. With reasoning_effort=high: 96.9%. Reasoning mode enables the model to cross-check, validate intermediate results, and correct interpolation errors. All models in the leaderboard were tested with their best reasoning modes.
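For reference, this is roughly how a high reasoning effort run might be requested with an OpenAI-style Python client; the model name comes from the post, and exact parameter support for it is an assumption on our part.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical request: model name taken from the post; reasoning_effort mirrors
# the reasoning_effort=high setting described above. Parameter support varies by
# provider and model, so treat this as a sketch rather than a drop-in call.
response = client.chat.completions.create(
    model="gpt-5.4",
    reasoning_effort="high",
    messages=[{
        "role": "user",
        "content": "Find the specific enthalpy of water at 402 °C and 25.3 MPa.",
    }],
)
print(response.choices[0].message.content)
```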
Methodology
- Reference library: CoolProp 7.2.0 (IAPWS-IF97 formulation)
- Tolerance: ±2% (industrial engineering standard)
- Questions: 110 (Tier 1)
- Categories: 8 (Subcooled Liquid, Saturated Liquid, Wet Steam, Saturated Vapor, Superheated Vapor, Supercritical, Phase Determination, Inverse Lookups)
- Difficulty levels: Easy, Medium, Hard
- Value extraction: Automated LLM-based extraction via Claude Sonnet 4.6
- Scoring: Each sub-question independently validated against CoolProp reference
What's Next?
ThermoQA is a three-tier benchmark system:
- Tier 1: Property Lookups (this post) — 110 questions, steam table values
- Tier 2: Component Analysis — 101 questions, thermodynamic analysis of components like turbines, compressors, pumps (coming soon)
- Tier 3: Cycle Analysis — Full Rankine, Brayton, refrigeration cycles (in development)
In the Tier 2 results the rankings reshuffle completely: Gemini's Tier 1 lead doesn't hold up under multi-step reasoning. We'll share this analysis in our next post.
Resources
- Dataset: HuggingFace — olivenet/thermoqa
- Source code: GitHub — olivenet-iot/ThermoQA
- CoolProp: coolprop.org
- IAPWS-IF97: International Association for the Properties of Water and Steam industrial formulation
About the Author
Olivenet Team
IoT & Automation Experts
Technology team providing industrial IoT, smart farming, and energy monitoring solutions in Northern Cyprus and Turkey.