June 3, 2026 · IWV Digital Solutions · Reference

All 44 Benchmarked AI Models — Complete Results

IWV Digital Solutions LLC | 44-model benchmark sweep | June 2026

Reference data: original candidate sweep. This page tabulates the original NotesXML model-selection sweep. All 44 candidate models were evaluated on a Ryzen 7 7800X3D with an RTX 5070 Ti 16 GB under Linux, using GPU-accelerated inference, ctx=4096, and identical sampling parameters for every model. Results are sorted by average quality, descending. ★ marks the models from this sweep that are in the current 10-model shipping catalog (v2026.07.08.03); all 10 appear in this sweep. The June 2026 refresh swapped in Ministral 3 3B and 8B; the July 2026 update re-added Gemma 4 E2B and added Ministral 3 14B (the large-model option for Intel iGPU/GPU systems, where the Gemma 4 models run CPU-only). Figures here come from this sweep’s hardware and an earlier scoring methodology; the catalog’s latest per-model numbers live in the main benchmark article.

Performance Scatter Plot

Each point represents one of the 44 benchmarked models. The X-axis shows the token generation interval (1/TPS in milliseconds, log scale — lower is faster). The Y-axis shows the average quality score across the seven scored benchmarks (Code excluded). Models in the current 10-model catalog are highlighted in green (all 10 appear in this sweep).
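The plot coordinates can be recomputed from the table on this page. The sketch below is illustrative only: it uses three rows copied from the table, and the function name `scatter_point` and the tuple layout are assumptions, not part of any NotesXML tooling.

```python
import math

# (model, TPS, Avg Score %, in current catalog), copied from the results table.
ROWS = [
    ("Gemma 4 26B-A4B (MoE)", 51.6, 94.0, True),
    ("Granite 4.0 H-Small", 16.4, 90.0, False),
    ("Ministral 3 3B", 206.0, 79.0, True),
]


def scatter_point(tps: float, avg_score_pct: float) -> tuple[float, float, float]:
    """Return (ms per token, log10 of ms per token, quality %) for one model."""
    ms_per_token = 1000.0 / tps  # token generation interval, lower is faster
    return ms_per_token, math.log10(ms_per_token), avg_score_pct


for name, tps, quality, in_catalog in ROWS:
    ms, log_ms, y = scatter_point(tps, quality)
    colour = "green" if in_catalog else "default"
    print(f"{name}: x={ms:.1f} ms/tok (log10={log_ms:.2f}), y={y:.1f}%, {colour}")
```

On a log-scaled X-axis, equal horizontal distances correspond to equal speed ratios, which keeps the very fast sub-1B models and the slow 20B+ models readable on the same plot.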

Complete Benchmark Table

All 44 models, sorted by average quality score descending. ★ = in the current 10-model shipping catalog (v2026.07.08.03); all 10 appear in this sweep. Model Bench = Model Benchmark score: the in-app task suite (six text tasks plus a vision OCR task), graded A–F. Avg Score here is the unweighted mean of the seven normalized benchmark scores with Code excluded, which was the sweep’s original methodology. The main article now reports an updated eight-benchmark quality mean (Code included) and a 50-TPS speed cap, so per-model figures there differ from this reference table. HMS = Harmonic Mean Score, which combines speed and quality.
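As a worked example of the Avg Score definition, the sketch below averages the seven scored columns for one row. Rounding to a whole percent is an assumption inferred from the table, where every Avg Score ends in .0; because the displayed columns are themselves rounded, a recomputed mean can differ from the table by a point in some rows. The function name `avg_score` is hypothetical.

```python
def avg_score(scores_pct: list[float]) -> float:
    """Unweighted mean of the seven scored benchmarks, as a whole percent."""
    if len(scores_pct) != 7:
        raise ValueError("expected seven scores (Code is excluded)")
    return float(round(sum(scores_pct) / len(scores_pct)))


# Model Bench, HEE, PLE, PhD, SDB, LongCtx, Tool for Gemma 4 26B-A4B (MoE).
gemma_4_26b = [100, 100, 99, 100, 80, 85, 94]
print(avg_score(gemma_4_26b))  # 94.0
```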

Model	Params (B)	TPS	ms/tok	Model Bench	HEE	PLE	PhD	SDB	LongCtx	Tool	Avg Score	HMS
★ Gemma 4 26B-A4B (MoE)	26	51.6	19.4	100%	100%	99%	100%	80%	85%	94%	94.0%	0.405
Granite 4.0 H-Small	32	16.4	61.0	100%	98%	92%	99%	70%	79%	94%	90.0%	0.150
★ Gemma 4 12B	12	76.9	13.0	96%	92%	84%	96%	54%	84%	97%	86.0%	0.530
Mistral Small 3.2 24B	24	48.0	20.8	94%	94%	97%	100%	60%	71%	88%	86.0%	0.375
★ Ministral 3 8B	8	111	9.0	96%	96%	92%	94%	50%	86%	83%	85.0%	0.672
Devstral Small 2 24B Instruct	24	48.4	20.7	100%	94%	93%	100%	36%	69%	94%	84.0%	0.376
★ Gemma 4 E4B	4.5	144	6.9	100%	96%	92%	96%	24%	77%	97%	83.0%	0.771
Gemma 3 12B	12	68.4	14.6	100%	92%	91%	93%	28%	82%	93%	83.0%	0.484
★ Ministral 3 14B	14	65.8	15.2	73%	98%	95%	96%	50%	67%	78%	80.0%	0.466
★ Ministral 3 3B	3	206	4.9	100%	84%	80%	83%	39%	83%	84%	79.0%	0.883
★ Gemma 4 E2B	2.3	207	4.8	96%	94%	87%	94%	25%	81%	71%	78.0%	0.876
★ Phi-4 Mini 3.8B	3.8	199	5.0	92%	82%	81%	87%	60%	68%	79%	78.0%	0.874
Granite 4.1 3B Instruct	3	196	5.1	89%	94%	89%	95%	50%	72%	60%	78.0%	0.869
Devstral Small 250	24	46.7	21.4	71%	98%	94%	99%	40%	51%	91%	78.0%	0.359
Granite 4.1 8B Instruct	8	119	8.4	86%	94%	89%	95%	30%	66%	78%	77.0%	0.671
Phi-4 14B	14	74.8	13.4	71%	96%	94%	98%	60%	31%	91%	77.0%	0.503
Granite 3.0 2B Instruct	2	225	4.4	92%	76%	76%	87%	50%	72%	60%	73.0%	0.844
Gemma 3n E4B	4	84.3	11.9	71%	96%	84%	96%	48%	45%	75%	73.0%	0.534
Gemma 3 27B	27	12.7	78.7	61%	98%	93%	97%	46%	0%	94%	70.0%	0.116
★ Llama 3.2 3B Instruct	3.2	242	4.1	100%	80%	71%	72%	30%	62%	67%	69.0%	0.817
Gemma 3n E2B	2	112	8.9	66%	86%	83%	84%	40%	44%	67%	67.0%	0.610
Gemma 3 4B	4	112	8.9	65%	90%	80%	86%	35%	82%	31%	67.0%	0.610
Granite 4.0 H-1B (Hybrid)	1	260	3.8	100%	72%	61%	55%	32%	81%	56%	65.0%	0.788
Ministral 3 3B Instruct	3	195	5.1	64%	84%	81%	82%	39%	20%	84%	65.0%	0.780
Granite 3.3 8B Instruct	8	56.8	17.6	83%	90%	84%	90%	28%	0%	83%	65.0%	0.395
Granite 3.2 8B (Thinking)	8	103	9.7	94%	46%	17%	92%	50%	72%	60%	62.0%	0.563
Codestral 22B	22	12.0	83.3	94%	80%	68%	83%	28%	0%	72%	61.0%	0.109
Granite 3.0 8B	8	104	9.6	62%	45%	17%	87%	50%	72%	60%	56.0%	0.539
SmolLM2 1.7B Instruct	1.7	391	2.6	58%	64%	65%	62%	28%	16%	82%	54.0%	0.701
★ Llama 3.2 1B Instruct	1.2	503	2.0	96%	42%	42%	38%	25%	18%	48%	44.0%	0.611
Granite 3.0 1B-A400M	1	443	2.3	39%	24%	25%	36%	31%	77%	56%	41.0%	0.582
GPT-OSS 20B	21	197	5.1	38%	8%	16%	11%	60%	68%	79%	40.0%	0.569
Mistral Nemo 12B Instruct	12	79.8	12.5	100%	88%	89%	0%	0%	0%	0%	40.0%	0.399
Hermes 3 Llama 3.1 8B	8	122	8.2	100%	84%	83%	0%	0%	0%	0%	38.0%	0.468
Gemma 3 1B	1	152	6.6	41%	44%	40%	38%	10%	49%	37%	37.0%	0.498
Granite Vision 3.2 2B	2	210	4.8	68%	80%	71%	0%	0%	0%	0%	31.0%	0.473
MythoMax L2 13B	13	69.2	14.5	79%	70%	68%	0%	0%	0%	0%	31.0%	0.327
BioMistral 7B	7	95.7	10.4	58%	74%	63%	0%	0%	0%	0%	28.0%	0.353
SmolLM2 360M Instruct	0.36	487	2.1	25%	26%	38%	26%	25%	16%	27%	26.0%	0.413
Phi-4 Reasoning Plus	14	82.4	12.1	25%	8%	16%	9%	20%	0%	98%	25.0%	0.311
SmolLM2 135M Instruct	0.135	960	1.0	16%	12%	15%	18%	30%	15%	7%	16.0%	0.276
H2O-Danube 3 500M	0.5	634	1.6	12%	6%	16%	14%	30%	0%	0%	11.0%	0.198
Granite 4.0 350M (Dense)	0.35	1502	0.7	29%	0%	0%	0%	0%	10%	0%	6.0%	0.113
Granite 4.0 1B (Dense)	1	138	7.2	35%	0%	0%	0%	1%	4%	0%	6.0%	0.110

TPS = tokens per second on the benchmark hardware. ms/tok = 1000/TPS (token generation interval). Model Bench = Model Benchmark score (in-app task suite: six text tasks + one vision OCR, graded A–F). HEE, PLE, PhD, SDB, LongCtx, Tool = normalized scores (0–100%). Avg Score = unweighted mean of the seven normalized benchmark scores (Code excluded). HMS = 2 × S × Q / (S + Q), where S = min(TPS/200, 1) is the speed term, which saturates at 200 TPS, and Q = Avg Score expressed as a fraction (0–1).
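The ms/tok and HMS definitions above can be checked against individual table rows. This is a minimal sketch; the function names and the `tps_cap` parameter are illustrative, and Avg Score is passed as the rounded percentage shown in the table.

```python
def ms_per_token(tps: float) -> float:
    """Token generation interval in milliseconds."""
    return 1000.0 / tps


def hms(tps: float, avg_score_pct: float, tps_cap: float = 200.0) -> float:
    """Harmonic mean of a capped speed term and the quality fraction."""
    speed = min(tps / tps_cap, 1.0)   # saturates at the cap
    quality = avg_score_pct / 100.0   # percent to fraction
    if speed + quality == 0:
        return 0.0                    # avoid dividing by zero
    return 2 * speed * quality / (speed + quality)


# Gemma 4 26B-A4B (MoE): 51.6 TPS, 94.0% Avg Score.
print(round(ms_per_token(51.6), 1))  # 19.4
print(round(hms(51.6, 94.0), 3))     # 0.405

# Ministral 3 3B: 206 TPS is above the cap, so the speed term is 1.
print(round(hms(206, 79.0), 3))      # 0.883
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model needs both speed and quality to score well: Gemma 4 26B-A4B has the highest Avg Score in the sweep but an HMS of only 0.405, since its speed term is about 0.26.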

Benchmark Descriptions

HEE (50 pts): Human Educational Equivalency — broad factual knowledge across STEM, humanities, health.

PLE (200 pts): Professional License Exam — 10 licensing domains including law, medicine, engineering.

PhD (100 pts): Doctoral-level logic and philosophy — the hardest single benchmark in the suite.

SDB-100 (10 tiers): Synthetic Deduction Benchmark — contamination-resistant structural reasoning.

LongCtx (100 pts): Long Context — retention and reasoning over extended input sequences.

Tool (100 pts): Tool Calling — structured function-call invocation and multi-step chains.

Code: Code generation and comprehension. Excluded from the quality average on this page because NotesXML is a note-taking app; the table above has no Code column.

Model Bench: Model Benchmark score — the in-app task suite: six text tasks (summarization, auto-titling, grammar polish, translation, action-item extraction, Markdown/XML) plus one vision OCR task, each graded A–F.

Key Takeaways

Gemma 4 26B-A4B (MoE) leads on average quality at 94.0%, with a 100% PhD score, 100% on HEE, and the highest SDB-100 score (80%) of any model in the sweep. At 51.6 TPS it ships as the only Desktop Pro model in the catalog and represents the quality ceiling among shipping catalog models.

The 2–4B parameter class is competitive with much larger models. Gemma 4 E2B, Phi-4 Mini 3.8B, Granite 4.1 3B Instruct, and Ministral 3 3B each average 78–79%, level with or ahead of 8B–14B models such as Granite 4.1 8B Instruct and Phi-4 14B (both 77.0%), while running at roughly 195–210 TPS. This is the sweet spot for interactive local inference, and the current catalog draws its Lightweight tier from this band: Ministral 3 3B (vision-capable, the Free-tier vision model and Electron default), Phi-4 Mini 3.8B (MIT-licensed text), and Llama 3.2 3B Instruct. Llama 3.2 1B Instruct anchors the Ultra-Light tier.
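A comparison like this can be reproduced by parsing table rows and grouping them by parameter count. The sketch below assumes the tab-separated layout used on this page and embeds four rows copied from the table; the `Row` type and `parse_row` helper are hypothetical names.

```python
from typing import NamedTuple

# Four rows copied from the table (13 tab-separated fields each).
TABLE = (
    "★ Ministral 3 3B\t3\t206\t4.9\t100%\t84%\t80%\t83%\t39%\t83%\t84%\t79.0%\t0.883\n"
    "★ Phi-4 Mini 3.8B\t3.8\t199\t5.0\t92%\t82%\t81%\t87%\t60%\t68%\t79%\t78.0%\t0.874\n"
    "Granite 4.1 8B Instruct\t8\t119\t8.4\t86%\t94%\t89%\t95%\t30%\t66%\t78%\t77.0%\t0.671\n"
    "Phi-4 14B\t14\t74.8\t13.4\t71%\t96%\t94%\t98%\t60%\t31%\t91%\t77.0%\t0.503\n"
)


class Row(NamedTuple):
    name: str
    in_catalog: bool
    params_b: float
    tps: float
    avg_score_pct: float


def parse_row(line: str) -> Row:
    """Parse one tab-separated table row; a leading star marks a catalog model."""
    fields = line.split("\t")
    return Row(
        name=fields[0].lstrip("★ ").strip(),
        in_catalog=fields[0].startswith("★"),
        params_b=float(fields[1]),
        tps=float(fields[2]),
        avg_score_pct=float(fields[11].rstrip("%")),  # Avg Score column
    )


rows = [parse_row(line) for line in TABLE.splitlines()]
small = [r for r in rows if 2 <= r.params_b <= 4]
large = [r for r in rows if 8 <= r.params_b <= 14]

for label, group in (("2-4B", small), ("8-14B", large)):
    for r in group:
        print(f"{label}: {r.name} {r.avg_score_pct:.1f}% at {r.tps:g} TPS")
```

In this four-row sample, both small models score at least as high as both larger ones while generating tokens well over 1.5 times as fast.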

Catalog revisions since this sweep. The current shipping catalog (v2026.07.08.03) is a 10-model set. The June 2026 refresh added Ministral 3 3B and 8B (both vision-capable, and both candidates in this sweep), promoted Phi-4 Mini 3.8B (MIT-licensed), and retired Llama 3.1 8B Instruct. The July 2026 update then re-added Gemma 4 E2B (a compact, fast, vision-capable option for light hardware) and added Ministral 3 14B — the large-model option for Intel iGPU/GPU systems, where it runs on the Intel GPU while the Gemma 4 models fall back to CPU-only. The catalog retains the other three Gemma 4 models (E4B, 12B, 26B-A4B) and both Llama 3.2 models shown above. Several earlier candidates were not carried forward (Granite 3.0 2B, Granite 4.0 H-1B, SmolLM2 1.7B). For the current catalog’s latest per-model benchmarks, see the main benchmark article.

The smallest models show steep quality drops. The four sub-1B models, Granite 4.0 350M Dense (6%), H2O-Danube 3 500M (11%), SmolLM2 135M Instruct (16%), and SmolLM2 360M Instruct (26%), sit at or below the utility floor for general-purpose assistant tasks, as does Granite 4.0 1B Dense (6%) despite its larger size. They may still serve narrow roles such as classification or simple extraction, but none is viable as a primary note-taking assistant.
