All 44 Benchmarked AI Models — Complete Results

IWV Digital Solutions LLC | 44-model benchmark sweep | June 2026

Reference data. This page tabulates complete benchmark results for all 44 models evaluated during the NotesXML AI model selection process. The 11 models selected for the shipping catalog (v2026.06.03.01) are marked with ★. All benchmarks were run on Ryzen 7 7800X3D + RTX 5070 Ti 16 GB, Linux, GPU-accelerated inference, ctx=4096, identical sampling parameters per model. Models are sorted by average quality score (descending).


Performance Scatter Plot

Each point represents one of the 44 benchmarked models. The X-axis shows the token generation interval (1/TPS in milliseconds, log scale — lower is faster). The Y-axis shows the average quality score across the seven scored benchmarks (Code excluded). Catalog models are highlighted in green.

[Figure: Token Generation Interval vs Average Performance, 44 Models. Scatter plot. X-axis: token generation interval (ms/token, log scale, lower is faster), with ticks from 1 ms (1000 TPS) to 100 ms (10 TPS). Y-axis: average quality score, 0% to 100%. Legend: catalog model (11, green), non-catalog model (33). Each point is labeled with its model name.]

Complete Benchmark Table

All 44 models, sorted by average quality score descending. ★ = selected for the 11-model shipping catalog. Model Bench = Model Benchmark structural-conformance score (chat template handling, stop tokens, instruction adherence, output formatting). Avg Score is the unweighted mean of the seven normalized benchmark scores (Code excluded). HMS = Harmonic Mean Score combining speed and quality.

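Model | Params (B) | TPS | ms/tok | Model Bench | HEE | PLE | PhD | SDB | LongCtx | Tool | Avg Score | HMS
★ Gemma 4 26B-A4B (MoE) | 26 | 51.6 | 19.4 | 100% | 100% | 99% | 100% | 80% | 85% | 94% | 94.0% | 0.405
Granite 4.0 H-Small | 32 | 16.4 | 61.0 | 100% | 98% | 92% | 99% | 70% | 79% | 94% | 90.0% | 0.150
★ Gemma 4 12B | 12 | 76.9 | 13.0 | 96% | 92% | 84% | 96% | 54% | 84% | 97% | 86.0% | 0.530
Mistral Small 3.2 24B | 24 | 48.0 | 20.8 | 94% | 94% | 97% | 100% | 60% | 71% | 88% | 86.0% | 0.375
★ Ministral 3 8B | 8 | 111 | 9.0 | 96% | 96% | 92% | 94% | 50% | 86% | 83% | 85.0% | 0.672
Devstral Small 2 24B Instruct | 24 | 48.4 | 20.7 | 100% | 94% | 93% | 100% | 36% | 69% | 94% | 84.0% | 0.376
★ Gemma 4 E4B | 4.5 | 144 | 6.9 | 100% | 96% | 92% | 96% | 24% | 77% | 97% | 83.0% | 0.771
Gemma 3 12B | 12 | 68.4 | 14.6 | 100% | 92% | 91% | 93% | 28% | 82% | 93% | 83.0% | 0.484
Ministral 3 14B | 14 | 65.8 | 15.2 | 73% | 98% | 95% | 96% | 50% | 67% | 78% | 80.0% | 0.466
★ Ministral 3 3B | 3 | 206 | 4.9 | 100% | 84% | 80% | 83% | 39% | 83% | 84% | 79.0% | 0.883
★ Gemma 4 E2B | 2.3 | 207 | 4.8 | 96% | 94% | 87% | 94% | 25% | 81% | 71% | 78.0% | 0.876
Phi-4 Mini 3.8B | 3.8 | 199 | 5.0 | 92% | 82% | 81% | 87% | 60% | 68% | 79% | 78.0% | 0.874
Granite 4.1 3B Instruct | 3 | 196 | 5.1 | 89% | 94% | 89% | 95% | 50% | 72% | 60% | 78.0% | 0.869
Devstral Small 250 | 24 | 46.7 | 21.4 | 71% | 98% | 94% | 99% | 40% | 51% | 91% | 78.0% | 0.359
Granite 4.1 8B Instruct | 8 | 119 | 8.4 | 86% | 94% | 89% | 95% | 30% | 66% | 78% | 77.0% | 0.671
Phi-4 14B | 14 | 74.8 | 13.4 | 71% | 96% | 94% | 98% | 60% | 31% | 91% | 77.0% | 0.503
★ Granite 3.0 2B Instruct | 2 | 225 | 4.4 | 92% | 76% | 76% | 87% | 50% | 72% | 60% | 73.0% | 0.844
Gemma 3n E4B | 4 | 84.3 | 11.9 | 71% | 96% | 84% | 96% | 48% | 45% | 75% | 73.0% | 0.534
Gemma 3 27B | 27 | 12.7 | 78.7 | 61% | 98% | 93% | 97% | 46% | 0% | 94% | 70.0% | 0.116
★ Llama 3.2 3B Instruct | 3.2 | 242 | 4.1 | 100% | 80% | 71% | 72% | 30% | 62% | 67% | 69.0% | 0.817
Gemma 3n E2B | 2 | 112 | 8.9 | 66% | 86% | 83% | 84% | 40% | 44% | 67% | 67.0% | 0.610
Gemma 3 4B | 4 | 112 | 8.9 | 65% | 90% | 80% | 86% | 35% | 82% | 31% | 67.0% | 0.610
★ Granite 4.0 H-1B (Hybrid) | 1 | 260 | 3.8 | 100% | 72% | 61% | 55% | 32% | 81% | 56% | 65.0% | 0.788
Ministral 3 3B Instruct | 3 | 195 | 5.1 | 64% | 84% | 81% | 82% | 39% | 20% | 84% | 65.0% | 0.780
Granite 3.3 8B Instruct | 8 | 56.8 | 17.6 | 83% | 90% | 84% | 90% | 28% | 0% | 83% | 65.0% | 0.395
Granite 3.2 8B (Thinking) | 8 | 103 | 9.7 | 94% | 46% | 17% | 92% | 50% | 72% | 60% | 62.0% | 0.563
Codestral 22B | 22 | 12.0 | 83.3 | 94% | 80% | 68% | 83% | 28% | 0% | 72% | 61.0% | 0.109
Granite 3.0 8B | 8 | 104 | 9.6 | 62% | 45% | 17% | 87% | 50% | 72% | 60% | 56.0% | 0.539
★ SmolLM2 1.7B Instruct | 1.7 | 391 | 2.6 | 58% | 64% | 65% | 62% | 28% | 16% | 82% | 54.0% | 0.701
★ Llama 3.2 1B Instruct | 1.2 | 503 | 2.0 | 96% | 42% | 42% | 38% | 25% | 18% | 48% | 44.0% | 0.611
Granite 3.0 1B-A400M | 1 | 443 | 2.3 | 39% | 24% | 25% | 36% | 31% | 77% | 56% | 41.0% | 0.582
GPT-OSS 20B | 21 | 197 | 5.1 | 38% | 8% | 16% | 11% | 60% | 68% | 79% | 40.0% | 0.569
Mistral Nemo 12B Instruct | 12 | 79.8 | 12.5 | 100% | 88% | 89% | 0% | 0% | 0% | 0% | 40.0% | 0.399
Hermes 3 Llama 3.1 8B | 8 | 122 | 8.2 | 100% | 84% | 83% | 0% | 0% | 0% | 0% | 38.0% | 0.468
Gemma 3 1B | 1 | 152 | 6.6 | 41% | 44% | 40% | 38% | 10% | 49% | 37% | 37.0% | 0.498
Granite Vision 3.2 2B | 2 | 210 | 4.8 | 68% | 80% | 71% | 0% | 0% | 0% | 0% | 31.0% | 0.473
MythoMax L2 13B | 13 | 69.2 | 14.5 | 79% | 70% | 68% | 0% | 0% | 0% | 0% | 31.0% | 0.327
BioMistral 7B | 7 | 95.7 | 10.4 | 58% | 74% | 63% | 0% | 0% | 0% | 0% | 28.0% | 0.353
SmolLM2 360M Instruct | 0.36 | 487 | 2.1 | 25% | 26% | 38% | 26% | 25% | 16% | 27% | 26.0% | 0.413
Phi-4 Reasoning Plus | 14 | 82.4 | 12.1 | 25% | 8% | 16% | 9% | 20% | 0% | 98% | 25.0% | 0.311
SmolLM2 135M Instruct | 0.135 | 960 | 1.0 | 16% | 12% | 15% | 18% | 30% | 15% | 7% | 16.0% | 0.276
H2O-Danube 3 500M | 0.5 | 634 | 1.6 | 12% | 6% | 16% | 14% | 30% | 0% | 0% | 11.0% | 0.198
Granite 4.0 350M (Dense) | 0.35 | 1502 | 0.7 | 29% | 0% | 0% | 0% | 0% | 10% | 0% | 6.0% | 0.113
Granite 4.0 1B (Dense) | 1 | 138 | 7.2 | 35% | 0% | 0% | 0% | 1% | 4% | 0% | 6.0% | 0.110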

TPS = tokens per second on the benchmark hardware. ms/tok = 1000/TPS (token generation interval). Model Bench = Model Benchmark structural-conformance score. HEE, PLE, PhD, SDB, LongCtx, Tool = normalized scores (0–100%). Avg Score = unweighted mean of the seven normalized benchmark scores: Model Bench plus the six listed above (Code excluded). HMS = 2 × min(TPS/200, 1) × AvgScore / (min(TPS/200, 1) + AvgScore), with AvgScore expressed as a fraction (0–1). Because the speed term is capped at 1, throughput above 200 TPS does not raise HMS further.
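As a worked check of these definitions, the following minimal sketch recomputes ms/tok, Avg Score, and HMS for the Gemma 4 26B-A4B (MoE) row. The function names and the `tps_cap` parameter are illustrative assumptions, not part of the benchmark tooling; the formulas themselves are taken from the definitions above.

```python
def ms_per_token(tps: float) -> float:
    """Token generation interval in milliseconds (1000 / TPS)."""
    return 1000.0 / tps


def avg_score(scores: list[float]) -> float:
    """Unweighted mean of the normalized scores, each given as a fraction in 0..1."""
    return sum(scores) / len(scores)


def hms(tps: float, avg: float, tps_cap: float = 200.0) -> float:
    """Harmonic mean of capped speed (TPS / tps_cap, at most 1) and average score."""
    speed = min(tps / tps_cap, 1.0)
    if speed + avg == 0:
        return 0.0  # avoid division by zero when both terms are zero
    return 2 * speed * avg / (speed + avg)


# Gemma 4 26B-A4B (MoE): Model Bench, HEE, PLE, PhD, SDB, LongCtx, Tool
gemma_scores = [1.00, 1.00, 0.99, 1.00, 0.80, 0.85, 0.94]
gemma_avg = avg_score(gemma_scores)

print(round(ms_per_token(51.6), 1))   # 19.4
print(round(gemma_avg, 3))            # 0.94
print(round(hms(51.6, gemma_avg), 3)) # 0.405
```

The three printed values match the ms/tok, Avg Score, and HMS columns for that row. The same function reproduces the capped case: Llama 3.2 1B Instruct at 503 TPS and 44% yields 0.611, since its speed term saturates at 1.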


Benchmark Descriptions

HEE (50 pts): Human Educational Equivalency — broad factual knowledge across STEM, humanities, health.
PLE (200 pts): Professional License Exam — 10 licensing domains including law, medicine, engineering.
PhD (100 pts): Doctoral-level logic and philosophy — the hardest single benchmark in the suite.
SDB-100 (10 tiers): Synthetic Deduction Benchmark — contamination-resistant structural reasoning.
LongCtx (100 pts): Long Context — retention and reasoning over extended input sequences.
Tool (100 pts): Tool Calling — structured function-call invocation and multi-step chains.
Code: Code generation and comprehension. Excluded from quality average (NotesXML is a note-taking app).
Model Bench: Model Benchmark structural-conformance score — chat template handling, stop tokens, instruction adherence, output formatting.

Key Takeaways

Gemma 4 26B-A4B (MoE) leads on average quality at 94.0%, with a 100% PhD score, 100% on HEE, and the highest SDB-100 score (80%) of any model in the sweep. At 51.6 TPS it ships as the only Desktop Pro model in the catalog and represents the quality ceiling among shipping catalog models.

The 2–4B parameter class competes with much larger models. Gemma 4 E2B, Ministral 3 3B, Granite 4.1 3B Instruct, and Phi-4 Mini 3.8B score 78–79% average quality, on par with several 8B–14B models (Granite 4.1 8B Instruct and Phi-4 14B at 77%, Ministral 3 14B at 80%), while running at roughly 196–207 TPS. This is the sweet spot for interactive local inference; Gemma 4 E2B and Ministral 3 3B were selected for the shipping catalog at this footprint (alongside Granite 3.0 2B Instruct, Granite 4.0 H-1B, and Llama 3.2 3B in the broader Lightweight/Ultra-Light bands).

Granite 4.0 H-1B (Hybrid) is the new Ultra-Light frontier point. At 260 TPS and 65% average quality the hybrid Mamba-2 + transformer architecture outperforms every other ~1B-class entry on quality while staying inside the 1.5 GB RAM envelope — making it a Pareto-frontier addition to the Ultra-Light tier alongside Llama 3.2 1B.
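The Pareto-frontier claim can be checked with a short sketch. The grouping below (the five table rows listed at roughly 1B parameters) is an assumption made for illustration and is not necessarily how the Ultra-Light tier was defined; `pareto_front` is a hypothetical helper, and the TPS and Avg Score values are copied from the table.

```python
def pareto_front(models: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models not dominated on (TPS, quality).

    A model is dominated if another model is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, tps, quality in models:
        dominated = any(
            o_tps >= tps and o_q >= quality and (o_tps > tps or o_q > quality)
            for _, o_tps, o_q in models
        )
        if not dominated:
            front.append(name)
    return front


# (model, TPS, Avg Score %) for the ~1B-class rows in the table
one_b_class = [
    ("Granite 4.0 H-1B (Hybrid)", 260, 65),
    ("Llama 3.2 1B Instruct", 503, 44),
    ("Granite 3.0 1B-A400M", 443, 41),
    ("Gemma 3 1B", 152, 37),
    ("Granite 4.0 1B (Dense)", 138, 6),
]

print(pareto_front(one_b_class))
# ['Granite 4.0 H-1B (Hybrid)', 'Llama 3.2 1B Instruct']
```

Within this subset, only Granite 4.0 H-1B (highest quality) and Llama 3.2 1B Instruct (highest throughput) are undominated; every other row is beaten on both axes by one of the two.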

Models below 1B parameters, along with the 1B Granite 4.0 Dense, show steep quality drops. Granite 4.0 350M Dense (6%), Granite 4.0 1B Dense (6%), H2O-Danube 3 500M (11%), SmolLM2 135M Instruct (16%), and SmolLM2 360M Instruct (26%) sit at or below the utility floor for general-purpose assistant tasks. They may still serve narrow roles such as classification or simple extraction, but none is viable as a primary note-taking assistant.


© 2026 IWV Digital Solutions LLC. All rights reserved.
