All 44 Benchmarked AI Models — Complete Results
IWV Digital Solutions LLC | 44-model benchmark sweep | June 2026
Reference data. This page tabulates complete benchmark results for all 44 models evaluated during the NotesXML AI model selection process. The 11 models selected for the shipping catalog (v2026.06.03.01) are marked with ★. All benchmarks were run on Ryzen 7 7800X3D + RTX 5070 Ti 16 GB, Linux, GPU-accelerated inference, ctx=4096, identical sampling parameters per model. Models are sorted by average quality score (descending).
Performance Scatter Plot
Each point represents one of the 44 benchmarked models. The X-axis shows the token generation interval (1000/TPS, in milliseconds, on a log scale; lower is faster). The Y-axis shows the average quality score across the seven scored benchmarks (Code excluded). Catalog models are highlighted in green.
Complete Benchmark Table
All 44 models, sorted by average quality score descending. ★ = selected for the 11-model shipping catalog. Model Bench = Model Benchmark structural-conformance score (chat template handling, stop tokens, instruction adherence, output formatting). Avg Score is the unweighted mean of the seven normalized benchmark scores (Code excluded). HMS = Harmonic Mean Score combining speed and quality.
| Model | Params (B) | TPS | ms/tok | Model Bench | HEE | PLE | PhD | SDB | LongCtx | Tool | Avg Score | HMS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ★ Gemma 4 26B-A4B (MoE) | 26 | 51.6 | 19.4 | 100% | 100% | 99% | 100% | 80% | 85% | 94% | 94.0% | 0.405 |
| Granite 4.0 H-Small | 32 | 16.4 | 61.0 | 100% | 98% | 92% | 99% | 70% | 79% | 94% | 90.0% | 0.150 |
| ★ Gemma 4 12B | 12 | 76.9 | 13.0 | 96% | 92% | 84% | 96% | 54% | 84% | 97% | 86.0% | 0.530 |
| Mistral Small 3.2 24B | 24 | 48.0 | 20.8 | 94% | 94% | 97% | 100% | 60% | 71% | 88% | 86.0% | 0.375 |
| ★ Ministral 3 8B | 8 | 111 | 9.0 | 96% | 96% | 92% | 94% | 50% | 86% | 83% | 85.0% | 0.672 |
| Devstral Small 2 24B Instruct | 24 | 48.4 | 20.7 | 100% | 94% | 93% | 100% | 36% | 69% | 94% | 84.0% | 0.376 |
| ★ Gemma 4 E4B | 4.5 | 144 | 6.9 | 100% | 96% | 92% | 96% | 24% | 77% | 97% | 83.0% | 0.771 |
| Gemma 3 12B | 12 | 68.4 | 14.6 | 100% | 92% | 91% | 93% | 28% | 82% | 93% | 83.0% | 0.484 |
| Ministral 3 14B | 14 | 65.8 | 15.2 | 73% | 98% | 95% | 96% | 50% | 67% | 78% | 80.0% | 0.466 |
| ★ Ministral 3 3B | 3 | 206 | 4.9 | 100% | 84% | 80% | 83% | 39% | 83% | 84% | 79.0% | 0.883 |
| ★ Gemma 4 E2B | 2.3 | 207 | 4.8 | 96% | 94% | 87% | 94% | 25% | 81% | 71% | 78.0% | 0.876 |
| Phi-4 Mini 3.8B | 3.8 | 199 | 5.0 | 92% | 82% | 81% | 87% | 60% | 68% | 79% | 78.0% | 0.874 |
| Granite 4.1 3B Instruct | 3 | 196 | 5.1 | 89% | 94% | 89% | 95% | 50% | 72% | 60% | 78.0% | 0.869 |
| Devstral Small 250 | 24 | 46.7 | 21.4 | 71% | 98% | 94% | 99% | 40% | 51% | 91% | 78.0% | 0.359 |
| Granite 4.1 8B Instruct | 8 | 119 | 8.4 | 86% | 94% | 89% | 95% | 30% | 66% | 78% | 77.0% | 0.671 |
| Phi-4 14B | 14 | 74.8 | 13.4 | 71% | 96% | 94% | 98% | 60% | 31% | 91% | 77.0% | 0.503 |
| ★ Granite 3.0 2B Instruct | 2 | 225 | 4.4 | 92% | 76% | 76% | 87% | 50% | 72% | 60% | 73.0% | 0.844 |
| Gemma 3n E4B | 4 | 84.3 | 11.9 | 71% | 96% | 84% | 96% | 48% | 45% | 75% | 73.0% | 0.534 |
| Gemma 3 27B | 27 | 12.7 | 78.7 | 61% | 98% | 93% | 97% | 46% | 0% | 94% | 70.0% | 0.116 |
| ★ Llama 3.2 3B Instruct | 3.2 | 242 | 4.1 | 100% | 80% | 71% | 72% | 30% | 62% | 67% | 69.0% | 0.817 |
| Gemma 3n E2B | 2 | 112 | 8.9 | 66% | 86% | 83% | 84% | 40% | 44% | 67% | 67.0% | 0.610 |
| Gemma 3 4B | 4 | 112 | 8.9 | 65% | 90% | 80% | 86% | 35% | 82% | 31% | 67.0% | 0.610 |
| ★ Granite 4.0 H-1B (Hybrid) | 1 | 260 | 3.8 | 100% | 72% | 61% | 55% | 32% | 81% | 56% | 65.0% | 0.788 |
| Ministral 3 3B Instruct | 3 | 195 | 5.1 | 64% | 84% | 81% | 82% | 39% | 20% | 84% | 65.0% | 0.780 |
| Granite 3.3 8B Instruct | 8 | 56.8 | 17.6 | 83% | 90% | 84% | 90% | 28% | 0% | 83% | 65.0% | 0.395 |
| Granite 3.2 8B (Thinking) | 8 | 103 | 9.7 | 94% | 46% | 17% | 92% | 50% | 72% | 60% | 62.0% | 0.563 |
| Codestral 22B | 22 | 12.0 | 83.3 | 94% | 80% | 68% | 83% | 28% | 0% | 72% | 61.0% | 0.109 |
| Granite 3.0 8B | 8 | 104 | 9.6 | 62% | 45% | 17% | 87% | 50% | 72% | 60% | 56.0% | 0.539 |
| ★ SmolLM2 1.7B Instruct | 1.7 | 391 | 2.6 | 58% | 64% | 65% | 62% | 28% | 16% | 82% | 54.0% | 0.701 |
| ★ Llama 3.2 1B Instruct | 1.2 | 503 | 2.0 | 96% | 42% | 42% | 38% | 25% | 18% | 48% | 44.0% | 0.611 |
| Granite 3.0 1B-A400M | 1 | 443 | 2.3 | 39% | 24% | 25% | 36% | 31% | 77% | 56% | 41.0% | 0.582 |
| GPT-OSS 20B | 21 | 197 | 5.1 | 38% | 8% | 16% | 11% | 60% | 68% | 79% | 40.0% | 0.569 |
| Mistral Nemo 12B Instruct | 12 | 79.8 | 12.5 | 100% | 88% | 89% | 0% | 0% | 0% | 0% | 40.0% | 0.399 |
| Hermes 3 Llama 3.1 8B | 8 | 122 | 8.2 | 100% | 84% | 83% | 0% | 0% | 0% | 0% | 38.0% | 0.468 |
| Gemma 3 1B | 1 | 152 | 6.6 | 41% | 44% | 40% | 38% | 10% | 49% | 37% | 37.0% | 0.498 |
| Granite Vision 3.2 2B | 2 | 210 | 4.8 | 68% | 80% | 71% | 0% | 0% | 0% | 0% | 31.0% | 0.473 |
| MythoMax L2 13B | 13 | 69.2 | 14.5 | 79% | 70% | 68% | 0% | 0% | 0% | 0% | 31.0% | 0.327 |
| BioMistral 7B | 7 | 95.7 | 10.4 | 58% | 74% | 63% | 0% | 0% | 0% | 0% | 28.0% | 0.353 |
| SmolLM2 360M Instruct | 0.36 | 487 | 2.1 | 25% | 26% | 38% | 26% | 25% | 16% | 27% | 26.0% | 0.413 |
| Phi-4 Reasoning Plus | 14 | 82.4 | 12.1 | 25% | 8% | 16% | 9% | 20% | 0% | 98% | 25.0% | 0.311 |
| SmolLM2 135M Instruct | 0.135 | 960 | 1.0 | 16% | 12% | 15% | 18% | 30% | 15% | 7% | 16.0% | 0.276 |
| H2O-Danube 3 500M | 0.5 | 634 | 1.6 | 12% | 6% | 16% | 14% | 30% | 0% | 0% | 11.0% | 0.198 |
| Granite 4.0 350M (Dense) | 0.35 | 1502 | 0.7 | 29% | 0% | 0% | 0% | 0% | 10% | 0% | 6.0% | 0.113 |
| Granite 4.0 1B (Dense) | 1 | 138 | 7.2 | 35% | 0% | 0% | 0% | 1% | 4% | 0% | 6.0% | 0.110 |
TPS = tokens per second on the benchmark hardware. ms/tok = 1000/TPS (token generation interval). Model Bench = Model Benchmark structural-conformance score. HEE, PLE, PhD, SDB, LongCtx, Tool = normalized scores (0–100%). Avg Score = unweighted mean of the seven normalized benchmark scores (Code excluded). HMS = 2 × min(TPS/200, 1) × AvgScore / (min(TPS/200, 1) + AvgScore).
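The derived columns can be recomputed from the raw ones. The following is a minimal sketch of the ms/tok, Avg Score, and HMS definitions given above, not the benchmark harness itself; the function names and the `tps_cap` parameter are illustrative assumptions, and recomputed values for other rows may differ from the table in the last digit because the published inputs are rounded.

```python
def ms_per_token(tps: float) -> float:
    """Token generation interval in milliseconds (1000 / TPS)."""
    return 1000.0 / tps


def avg_score(scores_pct: list[float]) -> float:
    """Unweighted mean of the seven benchmark scores, returned as a 0-1 fraction."""
    return sum(scores_pct) / len(scores_pct) / 100.0


def hms(tps: float, avg: float, tps_cap: float = 200.0) -> float:
    """Harmonic mean of capped speed and quality.

    speed = min(TPS / 200, 1); avg is the Avg Score as a 0-1 fraction.
    tps_cap is a hypothetical parameter name for the 200 TPS cap in the formula.
    """
    speed = min(tps / tps_cap, 1.0)
    if speed + avg == 0:
        return 0.0  # avoid division by zero when both terms are zero
    return 2 * speed * avg / (speed + avg)


# Gemma 4 26B-A4B (MoE) row: Model Bench, HEE, PLE, PhD, SDB, LongCtx, Tool
tps = 51.6
avg = avg_score([100, 100, 99, 100, 80, 85, 94])
print(f"{ms_per_token(tps):.1f} ms/tok, avg {avg:.1%}, HMS {hms(tps, avg):.3f}")
# prints: 19.4 ms/tok, avg 94.0%, HMS 0.405
```

Because speed is capped at 1, any model at or above 200 TPS gets no further HMS credit for extra throughput; above that point HMS depends on Avg Score alone.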
Benchmark Descriptions
Key Takeaways
Gemma 4 26B-A4B (MoE) leads on average quality at 94.0%, with a 100% PhD score, 100% on HEE, and the highest SDB-100 score (80%) of any model in the sweep. At 51.6 TPS it ships as the only Desktop Pro model in the catalog and represents the quality ceiling among shipping catalog models.
The 2–4B parameter class is remarkably competitive. Gemma 4 E2B, Ministral 3 3B, Granite 4.1 3B Instruct, and Phi-4 Mini 3.8B all average 78–79%, on par with 8B–14B models such as Granite 4.1 8B Instruct (77%), Phi-4 14B (77%), and Ministral 3 14B (80%), while running at roughly 196–207 TPS. This is the sweet spot for interactive local inference; Gemma 4 E2B and Ministral 3 3B were selected for the shipping catalog at this footprint (alongside Granite 3.0 2B Instruct, Granite 4.0 H-1B, and Llama 3.2 3B in the broader Lightweight/Ultra-Light bands).
Granite 4.0 H-1B (Hybrid) is the new Ultra-Light frontier point. At 260 TPS and 65% average quality the hybrid Mamba-2 + transformer architecture outperforms every other ~1B-class entry on quality while staying inside the 1.5 GB RAM envelope — making it a Pareto-frontier addition to the Ultra-Light tier alongside Llama 3.2 1B.
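The Pareto-frontier claim can be checked against the table. The sketch below assumes a two-axis frontier over TPS and Avg Score only (the RAM envelope is not tabulated here, so it is left out), and uses a hand-picked subset of rows; `dominates` and `pareto_frontier` are illustrative names, not part of any published tooling.

```python
# (name, TPS, Avg Score %) copied from the benchmark table; subset chosen for illustration.
MODELS = [
    ("Llama 3.2 3B Instruct", 242, 69.0),
    ("Granite 4.0 H-1B (Hybrid)", 260, 65.0),
    ("SmolLM2 1.7B Instruct", 391, 54.0),
    ("Llama 3.2 1B Instruct", 503, 44.0),
    ("Granite 3.0 1B-A400M", 443, 41.0),
    ("Gemma 3 1B", 152, 37.0),
    ("Granite 4.0 1B (Dense)", 138, 6.0),
]


def dominates(a, b) -> bool:
    """True if a is at least as fast and as accurate as b, and strictly better on one axis."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


def pareto_frontier(models):
    """Keep every model that no other model in the list dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


for name, tps, avg in sorted(pareto_frontier(MODELS), key=lambda m: m[1]):
    print(f"{name}: {tps} TPS, {avg:.0f}%")
```

On this subset the frontier keeps Llama 3.2 3B Instruct, Granite 4.0 H-1B, SmolLM2 1.7B Instruct, and Llama 3.2 1B Instruct; the other ~1B entries are each dominated by a model that is both faster and higher scoring.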
Models below 1B parameters show steep quality drops. Granite 4.0 350M Dense (6%), H2O-Danube 3 500M (11%), SmolLM2 135M Instruct (16%), and SmolLM2 360M Instruct (26%) sit at or below the utility floor for general-purpose assistant tasks, and Granite 4.0 1B Dense (6%) fares no better despite its 1B size. These models may still serve narrow roles such as classification or simple extraction, but none is viable as a primary note-taking assistant.
© 2026 IWV Digital Solutions LLC. All rights reserved.