
NotesXML AI Model Benchmark Results & Recommendations

IWV Digital Solutions LLC | 11-model catalog v2026.06.03.01

Catalog v2026.06.03.01 is an 11-model catalog with refreshed benchmarks. The shipping catalog spans four hardware tiers, and all 11 models sit on the Pareto efficiency frontier (speed vs quality): no catalog model is strictly dominated by another. The sweep covers eight benchmarks (Long Context, Tool Calling, Model Benchmark, HEE, PLE, PhD Philosophy, SDB-100, Code) and was run on a Ryzen 7 7800X3D + RTX 5070 Ti under Linux. The HMS leader is Ministral 3 3B (HMS 0.883), with Gemma 4 E2B a close second at HMS 0.876.


Overview

NotesXML's Professional tier ships with 11 local AI models spanning every hardware tier from phones and low-RAM devices to enthusiast workstations with discrete GPUs. This article presents complete benchmark results for all 11 catalog models, the Harmonic Mean Score (HMS) ranking, and the efficiency frontier analysis.

Catalog v2026.06.03.01 (June 2026): The 11-model catalog was built by running the full 8-benchmark suite against 44 candidate models, computing the Pareto frontier on TPS vs average quality, and applying a utility floor to exclude models below viable capability. Every model in the shipping catalog is on the strict Pareto frontier. Vision is a Professional-tier feature delivered by Ministral 3 3B, Gemma 4 E2B, Gemma 4 E4B, and Gemma 4 12B (cross-platform) and Gemma 4 26B-A4B (Desktop Pro only). Free tier: Llama 3.2 1B Instruct (Android default, lower) and Ministral 3 3B (vision-capable, upper). Electron default: Gemma 4 E4B.

All AI inference in NotesXML runs entirely on your device. No internet connection is required. No data is transmitted.


Benchmark Deep-Dives

Each benchmark in the NotesXML suite is documented in a dedicated article covering methodology, full question bank or task list, scoring rubric, and references to the academic literature that informed its design. The results on this page summarise scores; the deep-dive articles explain exactly what those scores mean and how they were produced.

HEE — 50 pts
Human Educational Equivalency
50-question breadth exam across Linguistics, History, STEM, and Health. Maps score to an educational equivalency level (High School → Post-Graduate).
PLE — 200 pts
Professional License Exam
200 questions drawn from 10 real licensing exams, including MBE, USMLE, PE, EPPP, NCLEX-RN, Real Estate, CPA, ARE, and CISSP.
PhD — 100 pts
Logic & Philosophy
100 doctoral-level questions in modal logic, philosophy of language, Kripke semantics, Gödel's theorems, epistemology, and metaphysics. The hardest benchmark in the suite.
SDB-100 — 100 pts
Synthetic Deduction Benchmark
Procedurally generated at runtime from 10 logic templates × 10 semantic themes. Training-data contamination impossible by design. Pure structural reasoning.
LongCtx — 100 pts
Long Context Benchmark
20 tasks across 4K–128K context tiers. Tests needle retrieval, position bias, multi-fact recall, and long-range reasoning. VRAM cascade detection built in.
ToolCall — 100 pts
Tool Calling Benchmark
25 tasks across schema compliance, tool selection, parameter extraction, multi-step sequences, error recovery, and appropriate refusal. JSON schema graded only — no tools executed.

NotesXML Benchmark Suite

NotesXML uses eight purpose-designed benchmarks to evaluate models as they actually run on local hardware — testing the combination of model capability and local inference performance that users experience in practice.

Long Context Benchmark (100 points)

Evaluates a model's ability to process, retain, and reason over extended input sequences. Passages with embedded facts at varying context depths are presented; the model is queried on details requiring genuine retention rather than positional pattern-matching. Critical for note-taking applications where users ask AI to summarize or analyze multi-page documents.

Score Range | Level
80–100 (80–100%) | Strong Retention
60–79 (60–79%) | Functional Retention
40–59 (40–59%) | Partial Retention
Below 40 (< 40%) | Limited Retention
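To make the idea of "embedded facts at varying context depths" concrete, here is a minimal sketch of how a needle-retrieval prompt could be assembled. The function name `build_needle_prompt`, the filler text, and the sample fact are all hypothetical; they are invented for illustration and are not taken from the NotesXML benchmark.

```python
def build_needle_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a fractional `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Index in the filler list at which the needle sentence is inserted.
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


# Hypothetical filler and fact, invented for this example.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The access code for the archive is 7421."

# One prompt per depth: the same fact placed at the start, middle, and end.
prompts = {depth: build_needle_prompt(filler, needle, depth) for depth in (0.0, 0.5, 1.0)}
```

Querying the model for the access code at each depth would then show whether recall depends on where the fact sits in the context, which is the position-bias effect this benchmark is described as probing.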

Tool Calling Benchmark (100 points)

Evaluates a model's ability to correctly invoke structured function calls. Tests cover simple single-tool invocations, multi-step chains, parallel calls, and error recovery — all capabilities underpinning NotesXML's AI action system.

Score Range | Level
90–100 (90–100%) | Expert Tool Use
80–89 (80–89%) | Advanced Tool Use
70–79 (70–79%) | Reliable Tool Use
Below 70 (< 70%) | Basic Tool Use
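Because tool calls in this benchmark are graded against their schema rather than executed, a grader only needs to inspect the JSON the model emits. The sketch below shows one way such a check might look using only the standard library. The `create_note` tool, its parameters, and the `grade_tool_call` function are hypothetical and do not reflect the actual NotesXML grader.

```python
import json

# Hypothetical tool definition, invented for this example.
TOOLS = {
    "create_note": {
        "required": {"title": str, "body": str},
        "optional": {"tags": list},
    }
}


def grade_tool_call(raw_output):
    """Return True if the model output is a well-formed call to a known tool."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return False
    spec = TOOLS[name]
    allowed = {**spec["required"], **spec["optional"]}
    # Every required parameter must be present, and unknown parameters are rejected.
    if not set(spec["required"]) <= set(args) or not set(args) <= set(allowed):
        return False
    # Each supplied argument must have the declared type.
    return all(isinstance(value, allowed[key]) for key, value in args.items())


good = '{"name": "create_note", "arguments": {"title": "Groceries", "body": "Milk, eggs"}}'
bad = '{"name": "create_note", "arguments": {"title": 42}}'
print(grade_tool_call(good), grade_tool_call(bad))
```

Nothing is ever invoked here: the call is accepted or rejected purely on its structure, which keeps grading deterministic and side-effect free.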

Model Benchmark — Structural Conformance (Grade A/B/C + Harmonic)

Evaluates chat template handling, stop-token behavior, instruction adherence, and output formatting. Grade A = fully conformant; Grade B = minor issues. The Harmonic score combines conformance quality with inference speed, rewarding models that are both correct and fast.

HEE — Human Education Evaluation (50 points)

Evaluates broad factual knowledge across mathematics, science, history, literature, geography, and reasoning — drawn from secondary school through postgraduate level, weighted toward the upper end. The closest NotesXML benchmark to the academic MMLU family.

Score Range | Level
45–50 (90–100%) | Post-Graduate / Mastery
40–44 (80–89%) | Undergraduate Level
30–39 (60–79%) | High School Graduate
Below 30 (< 60%) | Below Standard
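The band table above amounts to a simple threshold lookup. As a small sketch, the function below converts a raw HEE score into its percentage and level label; the names `HEE_BANDS` and `hee_level` are illustrative only.

```python
HEE_MAX = 50

# (minimum score, level) pairs copied from the table above, highest band first.
HEE_BANDS = [
    (45, "Post-Graduate / Mastery"),
    (40, "Undergraduate Level"),
    (30, "High School Graduate"),
    (0, "Below Standard"),
]


def hee_level(score):
    """Map a raw HEE score (0-50) to its percentage and level label."""
    if not 0 <= score <= HEE_MAX:
        raise ValueError("HEE score must be between 0 and 50")
    percent = 100 * score / HEE_MAX
    # The first band whose minimum the score meets is the matching level.
    level = next(label for minimum, label in HEE_BANDS if score >= minimum)
    return percent, level


print(hee_level(42))  # (84.0, 'Undergraduate Level')
```

The same pattern applies to the other score-range tables in this article by swapping in their thresholds and maximum score.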

PLE — Professional License Exam (200 points)

200 questions across ten professional fields, including Law (MBE), Medicine (USMLE), Engineering (PE), Psychology (EPPP), Nursing (NCLEX-RN), Real Estate, Finance (CPA), Architecture (ARE), and Cybersecurity (CISSP). It tests not only factual recall but also multi-step logical deduction and contextual synthesis under professional-exam constraints.

Score Range | Level
190–200 (95–100%) | Expert / Mastery
170–189 (85–94%) | Expert / Mastery
160–169 (80–84%) | Proficient
140–159 (70–79%) | Developing
Below 140 (< 70%) | Below Competency

PhD Philosophy — PhD-Level Logic & Philosophy Comprehensive Exam (100 points)

100 questions spanning metalogic, modal logic, philosophy of language and mind, epistemology, metaphysics, and continental phenomenology. Designed so that correctly answering requires understanding how theories mechanically function, not merely pattern-matching to familiar names. The most intellectually demanding single benchmark in the suite.

Score Range | Level
90–100 (90–100%) | PhD Mastery / Expert
80–89 (80–89%) | Advanced
70–79 (70–79%) | Proficient
Below 70 (< 70%) | Below PhD Level

SDB-100 — Synthetic Deduction Benchmark (100 points, reported as /10 tiers)

Questions test multi-step deductive reasoning using synthetic ontologies that cannot exist in any training corpus: novel, self-contained universes governed by arbitrary but logically absolute rules. By stripping away semantic priors, the benchmark forces genuine structural reasoning rather than pattern completion. The most discriminating and contamination-resistant benchmark in the suite.

Score Range | Level
8–10 (80–100%) | Advanced Deductive Reasoning
6–7 (60–79%) | Proficient
4–5 (40–59%) | Developing
Below 4 (< 40%) | Below Standard
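As a rough illustration of what "synthetic ontologies" can mean in practice, the sketch below generates a transitive deduction question over made-up category names, so the answer follows only from the stated rules. The vocabulary, the single template, and the function `make_chain_question` are hypothetical; the actual SDB-100 templates and themes are not reproduced here.

```python
import random

# Hypothetical nonsense vocabulary, invented for this sketch.
NONSENSE_NOUNS = ["blorp", "zint", "kavel", "murn", "tosk", "fenn"]


def make_chain_question(rng, length=3):
    """Build a transitive 'All A are B' chain over made-up categories."""
    # Sample distinct terms so the chain never loops back on itself.
    terms = rng.sample(NONSENSE_NOUNS, length + 1)
    premises = [f"All {a}s are {b}s." for a, b in zip(terms, terms[1:])]
    question = f"Is every {terms[0]} a {terms[-1]}?"
    # By transitivity of the premises, the correct answer is always "yes".
    return " ".join(premises + [question]), "yes"


rng = random.Random(0)  # fixed seed so the example is reproducible
prompt, answer = make_chain_question(rng)
print(prompt)
```

Because the terms carry no real-world meaning, a model cannot lean on memorized facts and has to follow the chain of premises.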

Full Results

NotesXML Benchmark Results

Complete 8-benchmark sweep for all 11 catalog models, run on Ryzen 7 7800X3D + RTX 5070 Ti 16 GB (Linux, GPU-accelerated inference). All benchmarks executed at ctx=4096, GPU offload enabled, identical sampling parameters per model.

Model | Tier | TPS | LongCtx /100 | ToolCall /100 | Grade | HEE /50 | PLE /200 | PhD /100 | SDB /10
Llama 3.2 1B Instruct ★ | Ultra-Light | 503 | 18 (18%) | 48 (48%) | A | 21 (42%) | 84 (42%) | 38 (38%) | 2 (25%)
SmolLM2 1.7B Instruct ★ | Lightweight | 391 | 16 (16%) | 82 (82%) | B | 32 (64%) | 130 (65%) | 62 (62%) | 3 (28%)
Granite 4.0 H-1B (Hybrid) ★ | Ultra-Light | 260 | 81 (81%) | 56 (56%) | A | 36 (72%) | 122 (61%) | 55 (55%) | 3 (32%)
Llama 3.2 3B Instruct ★ | Lightweight | 242 | 62 (62%) | 67 (67%) | A | 40 (80%) | 142 (71%) | 72 (72%) | 3 (30%)
Granite 3.0 2B Instruct ★ | Lightweight | 225 | 72 (72%) | 60 (60%) | A | 38 (76%) | 152 (76%) | 87 (87%) | 5 (50%)
Gemma 4 E2B ★ | Lightweight | 207 | 81 (81%) | 71 (71%) | A | 47 (94%) | 174 (87%) | 94 (94%) | 2 (25%)
Ministral 3 3B ★ | Lightweight | 206 | 83 (83%) | 84 (84%) | A | 42 (84%) | 160 (80%) | 83 (83%) | 4 (39%)
Gemma 4 E4B ★ | Enhanced | 144 | 77 (77%) | 97 (97%) | A | 48 (96%) | 184 (92%) | 96 (96%) | 2 (24%)
Ministral 3 8B ★ | Enhanced | 111 | 86 (86%) | 83 (83%) | A | 48 (96%) | 184 (92%) | 94 (94%) | 5 (50%)
Gemma 4 12B ★ | Enhanced | 76.9 | 84 (84%) | 97 (97%) | A | 46 (92%) | 168 (84%) | 96 (96%) | 5 (54%)
Gemma 4 26B-A4B (MoE) ★ | Desktop Pro | 51.6 | 85 (85%) | 94 (94%) | A | 50 (100%) | 198 (99%) | 100 (100%) | 8 (80%)

TPS = tokens per second. All benchmarks executed at ctx=4096 with GPU offload enabled. ★ = on Pareto efficiency frontier. Grade = Model Benchmark structural conformance (A = fully conformant, B = minor issues). The Code Benchmark is excluded from the quality calculation because NotesXML is a note-taking application.

Performance vs Quality — The Efficiency Frontier

The dashed green line below connects the eleven catalog models on the Pareto efficiency frontier — the set of models for which no other catalog model is both faster and higher-quality. Every model in the current 11-model catalog sits on the strict frontier.

[Figure: Efficiency Frontier, 1/TPS vs Avg Score, 11 catalog models. Scatter plot with 1/TPS (ms per token, lower is faster) on the x-axis and Average Quality Score on the y-axis. All 11 catalog models are on the Pareto frontier. Source: 44-model benchmark sweep, Jun 2026.]

Model | TPS | 1/TPS (ms/token) | Avg Score | Frontier
Llama 3.2 1B Instruct | 503 | 1.99 | 44.0% | ★ YES
SmolLM2 1.7B Instruct | 391 | 2.56 | 54.0% | ★ YES
Granite 4.0 H-1B (Hybrid) | 260 | 3.85 | 65.0% | ★ YES
Llama 3.2 3B Instruct | 242 | 4.13 | 69.0% | ★ YES
Granite 3.0 2B Instruct | 225 | 4.44 | 73.0% | ★ YES
Gemma 4 E2B | 207 | 4.83 | 78.0% | ★ YES
Ministral 3 3B | 206 | 4.85 | 79.0% | ★ YES
Gemma 4 E4B | 144 | 6.94 | 83.0% | ★ YES
Ministral 3 8B | 111 | 9.01 | 85.0% | ★ YES
Gemma 4 12B | 76.9 | 13.00 | 86.0% | ★ YES
Gemma 4 26B-A4B (MoE) | 51.6 | 19.38 | 94.0% | ★ YES

Sorted by TPS descending (fastest first). All eleven catalog models are on the strict Pareto efficiency frontier. Quality is the 7-benchmark qualityMean (Code excluded).
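The frontier claim can be checked directly from the TPS and average-score figures listed in this section. The sketch below uses the standard Pareto test: a model is dominated if another is at least as good on both axes and strictly better on one. The function name `pareto_frontier` and the extra "hypothetical" model are illustrative and not part of the NotesXML tooling.

```python
# (model, TPS, average quality %) copied from the frontier table in this section.
CATALOG = [
    ("Llama 3.2 1B Instruct", 503, 44.0),
    ("SmolLM2 1.7B Instruct", 391, 54.0),
    ("Granite 4.0 H-1B (Hybrid)", 260, 65.0),
    ("Llama 3.2 3B Instruct", 242, 69.0),
    ("Granite 3.0 2B Instruct", 225, 73.0),
    ("Gemma 4 E2B", 207, 78.0),
    ("Ministral 3 3B", 206, 79.0),
    ("Gemma 4 E4B", 144, 83.0),
    ("Ministral 3 8B", 111, 85.0),
    ("Gemma 4 12B", 76.9, 86.0),
    ("Gemma 4 26B-A4B (MoE)", 51.6, 94.0),
]


def pareto_frontier(models):
    """Return the names of models not dominated on both speed and quality."""
    frontier = []
    for name, tps, quality in models:
        dominated = any(
            other_tps >= tps
            and other_quality >= quality
            and (other_tps > tps or other_quality > quality)
            for _, other_tps, other_quality in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(CATALOG)
print(f"{len(frontier)} of {len(CATALOG)} models are on the frontier")

# A hypothetical slower, weaker model is dominated (for example by Ministral 3 8B).
with_extra = CATALOG + [("Hypothetical slow model", 100, 70.0)]
print("Hypothetical slow model" in pareto_frontier(with_extra))
```

Since quality rises strictly as TPS falls across the eleven rows, no catalog model dominates another, which is why every row is marked as a frontier model.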

Harmonic Mean Score — Single-Number Speed/Quality Ranking

The Harmonic Mean Score (HMS) combines normalized inference speed and average quality into one value where higher is better. The harmonic mean punishes imbalance: a model that is fast but weak, or strong but slow, scores lower than one that is reasonably good at both. A model that hits 200+ TPS has reached the point of diminishing returns for streaming output in a chat UI, so speed is capped at that threshold.

Methodology

  1. Speed component: speed_norm = min(TPS / 200, 1.0)
  2. Quality component: quality = qualityMean — the unweighted mean of seven normalized benchmark scores. Code Benchmark is excluded.
  3. Harmonic mean: HMS = 2 × speed_norm × quality / (speed_norm + quality)
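
The three steps above can be sketched in a few lines. The function name `hms` and the constant `SPEED_CAP_TPS` are illustrative; the inputs are the TPS and qualityMean figures reported in this article for Ministral 3 3B and Gemma 4 26B-A4B (MoE).

```python
SPEED_CAP_TPS = 200  # speed cap from step 1


def hms(tps, quality):
    """Harmonic Mean Score from raw TPS and qualityMean (as a 0-1 fraction)."""
    speed_norm = min(tps / SPEED_CAP_TPS, 1.0)
    if speed_norm + quality == 0:
        return 0.0  # avoid division by zero for a degenerate input
    return 2 * speed_norm * quality / (speed_norm + quality)


print(round(hms(206, 0.790), 3))   # Ministral 3 3B: 0.883
print(round(hms(51.6, 0.940), 3))  # Gemma 4 26B-A4B (MoE): 0.405
```

Any speed above 200 TPS contributes the same speed_norm of 1.0, so among the fastest models the ranking is decided by quality alone.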

HMS Ranking (all 11 catalog models)

Rank | Model | Tier | TPS | Avg % | speed_norm | quality | HMS
1 | Ministral 3 3B ★ | Lightweight | 206 | 79.0% | 1.000 | 0.790 | 0.883
2 | Gemma 4 E2B ★ | Lightweight | 207 | 78.0% | 1.000 | 0.780 | 0.876
3 | Granite 3.0 2B Instruct ★ | Lightweight | 225 | 73.0% | 1.000 | 0.730 | 0.844
4 | Llama 3.2 3B Instruct ★ | Lightweight | 242 | 69.0% | 1.000 | 0.690 | 0.817
5 | Granite 4.0 H-1B (Hybrid) ★ | Ultra-Light | 260 | 65.0% | 1.000 | 0.650 | 0.788
6 | Gemma 4 E4B ★ | Enhanced | 144 | 83.0% | 0.720 | 0.830 | 0.771
7 | SmolLM2 1.7B Instruct ★ | Lightweight | 391 | 54.0% | 1.000 | 0.540 | 0.701
8 | Ministral 3 8B ★ | Enhanced | 111 | 85.0% | 0.555 | 0.850 | 0.672
9 | Llama 3.2 1B Instruct ★ | Ultra-Light | 503 | 44.0% | 1.000 | 0.440 | 0.611
10 | Gemma 4 12B ★ | Enhanced | 76.9 | 86.0% | 0.385 | 0.860 | 0.531
11 | Gemma 4 26B-A4B (MoE) ★ | Desktop Pro | 51.6 | 94.0% | 0.258 | 0.940 | 0.405

Reading the HMS Ranking

Ministral 3 3B leads at HMS 0.883. At 206 TPS and 79.0% average quality, it delivers the best balance of speed and quality in the catalog. Gemma 4 E2B follows closely at HMS 0.876, with nearly identical speed (207 TPS) and slightly lower quality (78.0%). All seven small catalog models (Llama 3.2 1B/3B, SmolLM2 1.7B, Granite 4.0 H-1B, Granite 3.0 2B Instruct, Gemma 4 E2B, Ministral 3 3B) are excellent general-purpose choices in the Ultra-Light and Lightweight tiers.

Small models dominate the top of the HMS ranking because of the speed cap: every model at or above 200 TPS receives the full speed score of 1.0, while slower models lose speed score in proportion to their shortfall. A fast 1–3B model with solid quality can therefore outscore a high-quality 26B model on HMS. This is intentional: HMS measures the best default interactive choice, not the best overall model. For tasks where quality matters most, look at the quality column directly.

Quality leaders sit lower in the HMS ranking. Gemma 4 26B-A4B (MoE) (94.0% quality, the catalog leader) ranks last on HMS because its 51.6 TPS is well below the 200 TPS speed cap. For batch workflows, research synthesis, or high-stakes writing where speed is secondary, the quality column is the right metric.

Gemma 4 26B-A4B (MoE) has the best SDB-100 score in the catalog (80%) — structured deductive reasoning is its standout strength. SDB-100 is the hardest contamination-resistant reasoning benchmark in the suite.


Key Observations

The Efficiency Sweet Spot (1–3B parameters): The top five HMS positions are all sub-4B models. Ministral 3 3B (HMS 0.883), Gemma 4 E2B (0.876), Granite 3.0 2B Instruct (0.844), Llama 3.2 3B (0.817), and Granite 4.0 H-1B (0.788) form a dense cluster on the frontier. For interactive note-taking, this band delivers the best return.

Ministral 3 3B is the standout of the catalog. A 3B-parameter model, it matches Gemma 4 E2B on speed (206 vs 207 TPS), slightly leads on overall quality (79.0% vs 78.0%), is vision-capable, and exposes a 256K-token native context window, tied for the longest in the catalog with Ministral 3 8B, Gemma 4 12B, and Gemma 4 26B-A4B. It ships as the Free-tier upper model.

Ministral 3 8B leads on Long Context. At 86% on the Long Context Benchmark, Ministral 3 8B scores the highest in the catalog on this dimension — meaningful for users who work with multi-page notes and need the AI to retain context across a full document.

Gemma 4 E4B co-leads on Tool Calling. At 97% (Expert Tool Use), Gemma 4 E4B ties with Gemma 4 12B as the strongest model in the catalog for structured AI actions. It is the cross-platform vision flagship in the Enhanced tier and the Electron default. Vision is a Professional-tier feature, also provided by Ministral 3 3B and Gemma 4 E2B at the Lightweight footprint, Gemma 4 12B in the Enhanced tier, and Gemma 4 26B-A4B at Desktop Pro.

Gemma 4 26B-A4B (MoE) leads on quality, HEE, PLE, PhD, and SDB-100. At 94.0% average quality with 100% on HEE, 99% on PLE, 100% on PhD Philosophy, and 80% on SDB-100, the Gemma 4 26B-A4B Mixture-of-Experts architecture takes the top spot on every knowledge and reasoning benchmark in the suite. Combined with vision capability and a 256K-token native context, it is the highest-quality single model in the catalog.

Llama 3.2 1B Instruct's knowledge benchmarks reflect its scale. At 1.2B parameters, HEE 42%, PLE 42%, and PhD 38% are expected. The model is the fastest in the catalog at 503 TPS and is the Android Free-tier default, designed for fast general-purpose tasks — auto-titling, summarization, grammar polish — under severe memory constraints. For knowledge-intensive workflows, any model at the 3B tier or above delivers substantially better results.


Model Recommendations by Hardware Tier

Ultra-Light — Up to ~1.5 GB RAM Available for AI

Free-tier default: Llama 3.2 1B Instruct (HMS 0.611) — the Android Free-tier default and fastest model in the catalog at 503 TPS. 128K native context. Best for instant note titling, short summaries, transcription polish, and basic text cleanup under severe memory constraints. Llama 3.2 Community License.

Stronger Ultra-Light alternative: Granite 4.0 H-1B (Hybrid) (HMS 0.788) — Apache 2.0, 260 TPS, LongCtx 81%, Tool 56%, HEE 72%. The hybrid Mamba-2 + transformer architecture delivers materially stronger long-context and tool-calling performance than Llama 3.2 1B at a comparable 1.5 GB footprint, while remaining cross-platform (Android included). 128K native context.


Lightweight — ~1.5–4.5 GB RAM Available for AI

Ministral 3 3B (HMS 0.883) — 206 TPS, HEE 84%, PLE 80%, PhD 83%, LongCtx 83%, Tool 84%. Vision-capable (Professional tier) with a 256K native context window. Ships as the Free-tier upper model on all platforms. 40+ language support.

Gemma 4 E2B (HMS 0.876) — 207 TPS, HEE 94%, PLE 87%, PhD 94%, LongCtx 81%, Tool 71%. Vision-capable (Professional tier). 2.3B effective parameters, 128K native context.

Granite 3.0 2B Instruct (HMS 0.844) — 225 TPS, HEE 76%, PLE 76%, PhD 87%, LongCtx 72%, Tool 60%. The Granite 3.0 dense transformer at 2.6B parameters. 4K native context (the smallest in the catalog — suited to action tasks rather than multi-page summarization).

Llama 3.2 3B Instruct (HMS 0.817) — 242 TPS, HEE 80%, PLE 71%, PhD 72%, LongCtx 62%, Tool 67%. Llama 3.2 Community license. 128K native context. The Llama-family choice in the Lightweight tier.

SmolLM2 1.7B Instruct (HMS 0.701) — 391 TPS, HEE 64%, PLE 65%, PhD 62%, LongCtx 16%, Tool 82%. 8K context. The best choice when RAM is tight and a 3B model is too large.


Enhanced — ~5–9 GB RAM Available for AI

Electron default: Gemma 4 E4B (HMS 0.771) — Apache 2.0, 144 TPS, Tool 97% (tied catalog leader), HEE 96%, PLE 92%, PhD 96%. Vision-capable (Professional tier). 128K native context. The go-to for AI Chat, structured actions, and image analysis.

Long-context leader: Ministral 3 8B (HMS 0.672) — Apache 2.0, 111 TPS, Long Context 86% (catalog leader), HEE 96%, PLE 92%, PhD 94%, Tool 83%. Vision-capable. 256K native context.

Highest-quality cross-platform model: Gemma 4 12B (HMS 0.53) — Apache 2.0, 76.9 TPS, Tool 97% (tied catalog leader), PhD 96%, HEE 92%, PLE 84%, LongCtx 84%, SDB 54%. 86.0% average quality — the catalog's second-highest, behind only the desktop-only 26B MoE. Vision-capable (Professional tier). 12B dense, 256K native context. The top-quality choice that still runs cross-platform (not desktop-only), for quality-sensitive work on Enhanced-tier hardware where the 26B MoE is out of reach.


Desktop Pro — ~22 GB RAM or 16+ GB GPU VRAM

Highest overall quality: Gemma 4 26B-A4B (MoE) (HMS 0.405) — Apache 2.0, desktopOnly, 51.6 TPS. The catalog's top quality model: 94.0% average, HEE 100%, PLE 99%, PhD 100%, SDB 80% (catalog leader), Tool 94%. 26B total / 4B active Mixture-of-Experts, vision-capable, 256K native context. The highest-quality choice for research, synthesis, and agentic workflows where speed is secondary.


Summary Recommendation Table

Your Hardware / Use Case | Recommended Model | HMS | Primary Strength
Phone / very low-RAM device (<1.5 GB) | Llama 3.2 1B Instruct | 0.611 | Android Free-tier default; 503 TPS; smallest viable footprint
Phone / low-RAM, stronger Ultra-Light alternative | Granite 4.0 H-1B (Hybrid) | 0.788 | LongCtx 81%; Tool 56%; hybrid Mamba-2 + transformer; Apache 2.0
Light device, best balance, vision & Free-tier upper (catalog HMS leader) | Ministral 3 3B | 0.883 | HMS catalog leader; 256K context; vision-capable; Free Tier; Apache 2.0
Light device, near-balanced second pick (vision) | Gemma 4 E2B | 0.876 | HEE 94%, PhD 94%; vision-capable; Apache 2.0
Light device, action-task focused | Granite 3.0 2B Instruct | 0.844 | Tool 60%; SDB 50%; 4K context; dense 2.6B transformer; Apache 2.0
Light device, Llama family | Llama 3.2 3B Instruct | 0.817 | Llama-family architecture; 242 TPS
~1.7 GB budget (between 1B and 3B) | SmolLM2 1.7B Instruct | 0.701 | 391 TPS; Tool 82%; bridge model
Enhanced desktop, vision + tool calling (Pro) | Gemma 4 E4B | 0.771 | Electron default; Tool 97%; vision-capable; Apache 2.0
Enhanced desktop, long context | Ministral 3 8B | 0.672 | LongCtx 86% (catalog leader); 256K context; vision-capable
Enhanced desktop, highest quality cross-platform (vision) | Gemma 4 12B | 0.53 | Quality 86% (2nd in catalog); Tool 97%; PhD 96%; 256K context; vision-capable; Apache 2.0
Desktop Pro, highest quality + vision | Gemma 4 26B-A4B (MoE) | 0.405 | Quality leader; SDB 80%; vision-capable; 256K context; Apache 2.0
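
As a rough sketch, the tier boundaries and headline picks from the recommendations above could be turned into a simple lookup. The thresholds in this article are approximate (each is given with a "~"), so the exact cut-offs below are assumptions, as is the choice to map memory amounts that fall between two stated ranges to the lower tier. The function names `pick_tier` and `recommend` are hypothetical.

```python
# Headline pick per tier, following the recommendations in this article.
TIER_DEFAULTS = {
    "Ultra-Light": "Llama 3.2 1B Instruct",
    "Lightweight": "Ministral 3 3B",
    "Enhanced": "Gemma 4 E4B",
    "Desktop Pro": "Gemma 4 26B-A4B (MoE)",
}


def pick_tier(ram_gb, vram_gb=0.0):
    """Pick the highest tier whose approximate memory floor is met (assumed cut-offs)."""
    if ram_gb >= 22 or vram_gb >= 16:
        return "Desktop Pro"
    if ram_gb >= 5:
        return "Enhanced"
    if ram_gb > 1.5:
        return "Lightweight"
    return "Ultra-Light"


def recommend(ram_gb, vram_gb=0.0):
    """Return (tier, headline model) for the memory available for AI."""
    tier = pick_tier(ram_gb, vram_gb)
    return tier, TIER_DEFAULTS[tier]


print(recommend(3.0))
print(recommend(8.0, vram_gb=16.0))
```

A real selector would also weigh use case (long context, tool calling, vision) and platform, as the table above does; this lookup only covers the memory dimension.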

Complete Benchmark Data

Full benchmark results for all 44 models evaluated in the sweep — including models not selected for the catalog — are available in a dedicated reference article.


Notes on Hardware Acceleration

Benchmarks above were collected on a Ryzen 7 7800X3D + RTX 5070 Ti 16 GB system, Linux, GPU-accelerated. NotesXML supports GPU acceleration on NVIDIA CUDA/Vulkan, AMD Vulkan (RADV), and Intel iGPU (Vulkan / SYCL). CPU-only inference remains fully supported on every platform.


About NotesXML AI

All model inference runs locally on your device using llama.cpp. No subscription is required for AI beyond the one-time Professional license. Models are downloaded once and stored locally — no internet connection is needed at inference time. Your notes, queries, and AI responses remain entirely private.

© 2026 IWV Digital Solutions LLC. All rights reserved.

