NotesXML AI Model Benchmark Results & Recommendations
IWV Digital Solutions LLC | 11-model catalog v2026.06.03.01
11-model catalog with refreshed benchmarks. The shipping catalog contains 11 models spanning four hardware tiers, and all 11 sit on the Pareto efficiency frontier (speed vs quality) — no catalog model is strictly dominated by another. The sweep covers eight benchmarks (Long Context, Tool Calling, Model Benchmark, HEE, PLE, PhD Philosophy, SDB-100, Code) on Ryzen 7 7800X3D + RTX 5070 Ti, Linux. The HMS leader is Ministral 3 3B (HMS 0.883), with Gemma 4 E2B a close second at HMS 0.876.
Overview
NotesXML's Professional tier ships with 11 local AI models spanning every hardware tier from phones and low-RAM devices to enthusiast workstations with discrete GPUs. This article presents complete benchmark results for all 11 catalog models, the Harmonic Mean Score (HMS) ranking, and the efficiency frontier analysis.
Catalog v2026.06.03.01 (June 2026): The 11-model catalog was built by running the full 8-benchmark suite against 44 candidate models, computing the Pareto frontier on TPS vs average quality, and applying a utility floor to exclude models below viable capability. Every model in the shipping catalog is on the strict Pareto frontier. Vision is a Professional-tier feature delivered by Ministral 3 3B, Gemma 4 E2B, Gemma 4 E4B, and Gemma 4 12B (cross-platform) and Gemma 4 26B-A4B (Desktop Pro only). Free tier: Llama 3.2 1B Instruct (Android default, lower) and Ministral 3 3B (vision-capable, upper). Electron default: Gemma 4 E4B.
All AI inference in NotesXML runs entirely on your device. No internet connection is required. No data is transmitted.
Benchmark Deep-Dives
Each benchmark in the NotesXML suite is documented in a dedicated article covering methodology, full question bank or task list, scoring rubric, and references to the academic literature that informed its design. The results on this page summarise scores; the deep-dive articles explain exactly what those scores mean and how they were produced.
NotesXML Benchmark Suite
NotesXML uses eight purpose-designed benchmarks to evaluate models as they actually run on local hardware — testing the combination of model capability and local inference performance that users experience in practice.
Long Context Benchmark (100 points)
Evaluates a model's ability to process, retain, and reason over extended input sequences. Passages with embedded facts at varying context depths are presented; the model is queried on details requiring genuine retention rather than positional pattern-matching. Critical for note-taking applications where users ask AI to summarize or analyze multi-page documents.
| Score Range | Level |
|---|---|
| 80–100 (80–100%) | Strong Retention |
| 60–79 (60–79%) | Functional Retention |
| 40–59 (40–59%) | Partial Retention |
| Below 40 (< 40%) | Limited Retention |
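As a rough illustration, the score bands above can be read as a threshold lookup. The sketch below is not NotesXML code: `LONG_CONTEXT_BANDS` and `band_for` are hypothetical names, and placing a fractional score such as 79.5 in the lower band is an assumption, since the table only lists whole-number ranges.

```python
from bisect import bisect_right

# Lower bounds (percent) and labels taken from the Long Context table above.
LONG_CONTEXT_BANDS = [
    (0, "Limited Retention"),
    (40, "Partial Retention"),
    (60, "Functional Retention"),
    (80, "Strong Retention"),
]


def band_for(percent, bands=LONG_CONTEXT_BANDS):
    """Return the label of the highest band whose lower bound is <= percent."""
    thresholds = [lower for lower, _ in bands]
    index = bisect_right(thresholds, percent) - 1
    return bands[max(index, 0)][1]


# Long Context scores from the results table on this page.
print(band_for(86))  # Ministral 3 8B -> Strong Retention
print(band_for(62))  # Llama 3.2 3B Instruct -> Functional Retention
print(band_for(18))  # Llama 3.2 1B Instruct -> Limited Retention
```

The same helper would work for the other benchmarks by passing a different `bands` list built from their tables.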
Tool Calling Benchmark (100 points)
Evaluates a model's ability to correctly invoke structured function calls. Tests cover simple single-tool invocations, multi-step chains, parallel calls, and error recovery — all capabilities underpinning NotesXML's AI action system.
| Score Range | Level |
|---|---|
| 90–100 (90–100%) | Expert Tool Use |
| 80–89 (80–89%) | Advanced Tool Use |
| 70–79 (70–79%) | Reliable Tool Use |
| Below 70 (< 70%) | Basic Tool Use |
Model Benchmark — Structural Conformance (Grade A/B/C + Harmonic)
Evaluates chat template handling, stop-token behavior, instruction adherence, and output formatting. Grade A = fully conformant; Grade B = minor issues. The Harmonic score combines conformance quality with inference speed, rewarding models that are both correct and fast.
HEE — Human Education Evaluation (50 points)
Evaluates broad factual knowledge across mathematics, science, history, literature, geography, and reasoning — drawn from secondary school through postgraduate level, weighted toward the upper end. The closest NotesXML benchmark to the academic MMLU family.
| Score Range | Level |
|---|---|
| 45–50 (90–100%) | Post-Graduate / Mastery |
| 40–44 (80–89%) | Undergraduate Level |
| 30–39 (60–79%) | High School Graduate |
| Below 30 (< 60%) | Below Standard |
PLE — Professional License Exam (200 points)
200 questions across ten professional fields: Law (MBE), Medicine (USMLE), Engineering (PE), Psychology (EPPP), Nursing (NCLEX-RN), Real Estate, Finance (CPA), Architecture (ARE), and Cybersecurity (CISSP). Tests not only factual recall but multi-step logical deduction and contextual synthesis under professional-exam constraints.
| Score Range | Level |
|---|---|
| 190–200 (95–100%) | Expert / Mastery |
| 170–189 (85–94%) | Expert / Mastery |
| 160–169 (80–84%) | Proficient |
| 140–159 (70–79%) | Developing |
| Below 140 (< 70%) | Below Competency |
PhD Philosophy — PhD-Level Logic & Philosophy Comprehensive Exam (100 points)
100 questions spanning metalogic, modal logic, philosophy of language and mind, epistemology, metaphysics, and continental phenomenology. Designed so that correctly answering requires understanding how theories mechanically function, not merely pattern-matching to familiar names. The most intellectually demanding single benchmark in the suite.
| Score Range | Level |
|---|---|
| 90–100 (90–100%) | PhD Mastery / Expert |
| 80–89 (80–89%) | Advanced |
| 70–79 (70–79%) | Proficient |
| Below 70 (< 70%) | Below PhD Level |
SDB-100 — Synthetic Deduction Benchmark (100 points, reported as /10 tiers)
10 questions testing multi-step deductive reasoning using synthetic ontologies that cannot exist in any training corpus — novel self-contained universes governed by arbitrary but logically absolute rules. By stripping away semantic priors, the benchmark forces genuine structural reasoning rather than pattern completion. The most discriminating and contamination-resistant benchmark in the suite.
| Score Range | Level |
|---|---|
| 8–10 (80–100%) | Advanced Deductive Reasoning |
| 6–7 (60–79%) | Proficient |
| 4–5 (40–59%) | Developing |
| Below 4 (< 40%) | Below Standard |
Full Results
NotesXML Benchmark Results
Complete 8-benchmark sweep for all 11 catalog models, run on Ryzen 7 7800X3D + RTX 5070 Ti 16 GB (Linux, GPU-accelerated inference). All benchmarks executed at ctx=4096, GPU offload enabled, identical sampling parameters per model.
| Model | Tier | TPS | LongCtx /100 | ToolCall /100 | Grade | HEE /50 | PLE /200 | PhD /100 | SDB /10 |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.2 1B Instruct ★ | Ultra-Light | 503 | 18 (18%) | 48 (48%) | A | 21 (42%) | 84 (42%) | 38 (38%) | 2 (25%) |
| SmolLM2 1.7B Instruct ★ | Lightweight | 391 | 16 (16%) | 82 (82%) | B | 32 (64%) | 130 (65%) | 62 (62%) | 3 (28%) |
| Granite 4.0 H-1B (Hybrid) ★ | Ultra-Light | 260 | 81 (81%) | 56 (56%) | A | 36 (72%) | 122 (61%) | 55 (55%) | 3 (32%) |
| Llama 3.2 3B Instruct ★ | Lightweight | 242 | 62 (62%) | 67 (67%) | A | 40 (80%) | 142 (71%) | 72 (72%) | 3 (30%) |
| Granite 3.0 2B Instruct ★ | Lightweight | 225 | 72 (72%) | 60 (60%) | A | 38 (76%) | 152 (76%) | 87 (87%) | 5 (50%) |
| Gemma 4 E2B ★ | Lightweight | 207 | 81 (81%) | 71 (71%) | A | 47 (94%) | 174 (87%) | 94 (94%) | 2 (25%) |
| Ministral 3 3B ★ | Lightweight | 206 | 83 (83%) | 84 (84%) | A | 42 (84%) | 160 (80%) | 83 (83%) | 4 (39%) |
| Gemma 4 E4B ★ | Enhanced | 144 | 77 (77%) | 97 (97%) | A | 48 (96%) | 184 (92%) | 96 (96%) | 2 (24%) |
| Ministral 3 8B ★ | Enhanced | 111 | 86 (86%) | 83 (83%) | A | 48 (96%) | 184 (92%) | 94 (94%) | 5 (50%) |
| Gemma 4 12B ★ | Enhanced | 76.9 | 84 (84%) | 97 (97%) | A | 46 (92%) | 168 (84%) | 96 (96%) | 5 (54%) |
| Gemma 4 26B-A4B (MoE) ★ | Desktop Pro | 51.6 | 85 (85%) | 94 (94%) | A | 50 (100%) | 198 (99%) | 100 (100%) | 8 (80%) |
TPS = tokens per second. All benchmarks executed at ctx=4096 with GPU offload enabled. ★ = on Pareto efficiency frontier. Grade = Model Benchmark structural conformance (A = fully conformant, B = minor issues). The Code Benchmark is excluded from the quality calculation because NotesXML is a note-taking application.
Performance vs Quality — The Efficiency Frontier
The dashed green line in the efficiency-frontier chart connects the eleven catalog models on the Pareto efficiency frontier, the set of models for which no other catalog model is both faster and higher-quality; the table below lists the same models. Every model in the current 11-model catalog sits on the strict frontier.
| Model | TPS | ms/token (1000/TPS) | Avg Score | Frontier |
|---|---|---|---|---|
| Llama 3.2 1B Instruct | 503 | 1.99 | 44.0% | ★ YES |
| SmolLM2 1.7B Instruct | 391 | 2.56 | 54.0% | ★ YES |
| Granite 4.0 H-1B (Hybrid) | 260 | 3.85 | 65.0% | ★ YES |
| Llama 3.2 3B Instruct | 242 | 4.13 | 69.0% | ★ YES |
| Granite 3.0 2B Instruct | 225 | 4.44 | 73.0% | ★ YES |
| Gemma 4 E2B | 207 | 4.83 | 78.0% | ★ YES |
| Ministral 3 3B | 206 | 4.85 | 79.0% | ★ YES |
| Gemma 4 E4B | 144 | 6.94 | 83.0% | ★ YES |
| Ministral 3 8B | 111 | 9.01 | 85.0% | ★ YES |
| Gemma 4 12B | 76.9 | 13.00 | 86.0% | ★ YES |
| Gemma 4 26B-A4B (MoE) | 51.6 | 19.38 | 94.0% | ★ YES |
Sorted by TPS descending (fastest first). All eleven catalog models are on the strict Pareto efficiency frontier. Quality is the 7-benchmark qualityMean (Code excluded).
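The frontier claim can be checked directly from the table. The sketch below is a minimal illustration, not the pipeline used to build the catalog: `pareto_frontier` is a hypothetical helper, and it uses the common "at least as good on both axes, strictly better on one" dominance test, which agrees with the "both faster and higher-quality" wording here because no two models tie on TPS or quality. The extra candidate is invented purely to show a dominated model dropping out.

```python
def pareto_frontier(models):
    """Return the names of models that no other model dominates.

    Each entry is (name, tps, quality). Model B dominates model A when B is at
    least as fast and at least as good as A, and strictly better on one axis.
    """
    frontier = []
    for name, tps, quality in models:
        dominated = any(
            other_tps >= tps
            and other_quality >= quality
            and (other_tps > tps or other_quality > quality)
            for _, other_tps, other_quality in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (name, TPS, average quality %) copied from the table above.
catalog = [
    ("Llama 3.2 1B Instruct", 503, 44.0),
    ("SmolLM2 1.7B Instruct", 391, 54.0),
    ("Granite 4.0 H-1B (Hybrid)", 260, 65.0),
    ("Llama 3.2 3B Instruct", 242, 69.0),
    ("Granite 3.0 2B Instruct", 225, 73.0),
    ("Gemma 4 E2B", 207, 78.0),
    ("Ministral 3 3B", 206, 79.0),
    ("Gemma 4 E4B", 144, 83.0),
    ("Ministral 3 8B", 111, 85.0),
    ("Gemma 4 12B", 76.9, 86.0),
    ("Gemma 4 26B-A4B (MoE)", 51.6, 94.0),
]

# Hypothetical candidate, not from the sweep: slower and weaker than
# Granite 3.0 2B Instruct (225 TPS, 73.0%), so it is dominated.
candidate = ("Hypothetical candidate", 200, 70.0)

print(len(pareto_frontier(catalog)))  # 11
print("Hypothetical candidate" in pareto_frontier(catalog + [candidate]))  # False
```

Because the table is sorted by falling TPS and quality rises on every row, each model trades speed for quality against its neighbours, which is why all eleven survive.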
Harmonic Mean Score — Single-Number Speed/Quality Ranking
The Harmonic Mean Score (HMS) combines normalized inference speed and average quality into one value where higher is better. The harmonic mean punishes imbalance: a model that is fast but weak, or strong but slow, scores lower than one that is reasonably good at both. A model that hits 200+ TPS has reached the point of diminishing returns for streaming output in a chat UI, so speed is capped at that threshold.
Methodology
- Speed component: speed_norm = min(TPS / 200, 1.0)
- Quality component: quality = qualityMean, the unweighted mean of seven normalized benchmark scores. Code Benchmark is excluded.
- Harmonic mean: HMS = 2 × speed_norm × quality / (speed_norm + quality)
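The three steps above can be sketched in a few lines of Python. This is an illustration of the stated formula only, not the code used to produce the ranking; the function name `harmonic_mean_score` and the `tps_cap` parameter are assumptions, and the zero guard is added just to avoid a division by zero.

```python
def harmonic_mean_score(tps, quality, tps_cap=200.0):
    """HMS = 2 * speed_norm * quality / (speed_norm + quality)."""
    # Speed is normalized against the cap and clamped at 1.0.
    speed_norm = min(tps / tps_cap, 1.0)
    if speed_norm + quality == 0:
        return 0.0
    return 2 * speed_norm * quality / (speed_norm + quality)


# (TPS, qualityMean as a fraction) copied from the tables on this page.
models = {
    "Ministral 3 3B": (206, 0.790),
    "Gemma 4 E4B": (144, 0.830),
    "Gemma 4 26B-A4B (MoE)": (51.6, 0.940),
}

for name, (tps, quality) in models.items():
    print(f"{name}: {harmonic_mean_score(tps, quality):.3f}")
# Ministral 3 3B: 0.883
# Gemma 4 E4B: 0.771
# Gemma 4 26B-A4B (MoE): 0.405
```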
HMS Ranking (all 11 catalog models)
| Rank | Model | Tier | TPS | Avg % | speed_norm | quality | HMS |
|---|---|---|---|---|---|---|---|
| 1 | Ministral 3 3B ★ | Lightweight | 206 | 79.0% | 1.000 | 0.790 | 0.883 |
| 2 | Gemma 4 E2B ★ | Lightweight | 207 | 78.0% | 1.000 | 0.780 | 0.876 |
| 3 | Granite 3.0 2B Instruct ★ | Lightweight | 225 | 73.0% | 1.000 | 0.730 | 0.844 |
| 4 | Llama 3.2 3B Instruct ★ | Lightweight | 242 | 69.0% | 1.000 | 0.690 | 0.817 |
| 5 | Granite 4.0 H-1B (Hybrid) ★ | Ultra-Light | 260 | 65.0% | 1.000 | 0.650 | 0.788 |
| 6 | Gemma 4 E4B ★ | Enhanced | 144 | 83.0% | 0.720 | 0.830 | 0.771 |
| 7 | SmolLM2 1.7B Instruct ★ | Lightweight | 391 | 54.0% | 1.000 | 0.540 | 0.701 |
| 8 | Ministral 3 8B ★ | Enhanced | 111 | 85.0% | 0.555 | 0.850 | 0.672 |
| 9 | Llama 3.2 1B Instruct ★ | Ultra-Light | 503 | 44.0% | 1.000 | 0.440 | 0.611 |
| 10 | Gemma 4 12B ★ | Enhanced | 76.9 | 86.0% | 0.380 | 0.860 | 0.530 |
| 11 | Gemma 4 26B-A4B (MoE) ★ | Desktop Pro | 51.6 | 94.0% | 0.258 | 0.940 | 0.405 |
Reading the HMS Ranking
Ministral 3 3B leads at HMS 0.883. At 206 TPS and 79.0% average quality, it delivers the best balance of speed and quality in the catalog. Gemma 4 E2B follows at HMS 0.876, a near-tie, with comparable speed (207 TPS) and similar quality (78.0%). All seven small catalog models (Llama 3.2 1B/3B, SmolLM2 1.7B, Granite 4.0 H-1B, Granite 3.0 2B Instruct, Gemma 4 E2B, Ministral 3 3B) are excellent general-purpose choices in the Ultra-Light and Lightweight tiers.
Small models dominate the top of the HMS ranking because models at or above 200 TPS have their normalized speed capped at 1.0, so a fast 1–3B model with solid quality can outscore a high-quality 26B model on HMS. This is intentional: HMS measures the best default interactive choice, not the best overall model. For tasks where quality matters most, look at the quality column directly.
Quality leaders sit lower in the HMS ranking. Gemma 4 26B-A4B (MoE) (94.0% quality, the catalog leader) ranks last on HMS because its 51.6 TPS is well below the 200 TPS speed cap. For batch workflows, research synthesis, or high-stakes writing where speed is secondary, the quality column is the right metric.
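One way to see how strongly the speed term constrains slow models is to compute the highest HMS a model could reach at a given TPS, that is, the HMS it would get with a perfect quality of 1.0. The sketch below follows from the formula in the Methodology section; `hms_ceiling` is a hypothetical helper name, not something NotesXML exposes.

```python
def hms_ceiling(tps, tps_cap=200.0):
    """Upper bound on HMS at a given TPS: the formula evaluated at quality = 1.0.

    HMS rises with quality for any fixed positive speed, so no quality score
    can lift a model above this value.
    """
    speed_norm = min(tps / tps_cap, 1.0)
    return 2 * speed_norm / (speed_norm + 1.0)


print(f"{hms_ceiling(51.6):.3f}")  # 0.410
print(f"{hms_ceiling(206):.3f}")   # 1.000
```

At 51.6 TPS the ceiling is about 0.410, below the 0.611 of the tenth-ranked Llama 3.2 1B Instruct, so Gemma 4 26B-A4B's position in the HMS ranking reflects its speed rather than any weakness in quality.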
Gemma 4 26B-A4B (MoE) has the best SDB-100 score in the catalog (80%) — structured deductive reasoning is its standout strength. SDB-100 is the hardest contamination-resistant reasoning benchmark in the suite.
Key Observations
The Efficiency Sweet Spot (1–3B parameters): Five of the top six HMS positions are sub-4B models. Ministral 3 3B (HMS 0.883), Gemma 4 E2B (0.876), Granite 3.0 2B Instruct (0.844), Llama 3.2 3B (0.817), and Granite 4.0 H-1B (0.788) form a dense cluster on the frontier. For interactive note-taking, this band delivers the best return.
Ministral 3 3B is the standout of the catalog. A 3B-parameter model, it matches Gemma 4 E2B on speed (206 vs 207 TPS), slightly leads on overall quality (79.0% vs 78.0%), is vision-capable, and exposes a 256K-token native context window, tied for the longest in the catalog with Ministral 3 8B, Gemma 4 12B, and Gemma 4 26B-A4B. It ships as the Free-tier upper model.
Ministral 3 8B leads on Long Context. At 86% on the Long Context Benchmark, Ministral 3 8B scores the highest in the catalog on this dimension — meaningful for users who work with multi-page notes and need the AI to retain context across a full document.
Gemma 4 E4B co-leads on Tool Calling. At 97% (Expert Tool Use), Gemma 4 E4B shares the catalog's top Tool Calling score with Gemma 4 12B and reaches it at nearly twice the speed (144 vs 76.9 TPS). It is the cross-platform vision flagship in the Enhanced tier and is the Electron default. Vision is a Professional-tier feature, also provided by Ministral 3 3B and Gemma 4 E2B at the Lightweight footprint, Gemma 4 12B in the Enhanced tier, and Gemma 4 26B-A4B at Desktop Pro.
Gemma 4 26B-A4B (MoE) leads on quality, HEE, PLE, PhD, and SDB-100. At 94.0% average quality with 100% on HEE, 99% on PLE, 100% on PhD Philosophy, and 80% on SDB-100, the Gemma 4 26B-A4B Mixture-of-Experts architecture takes the top spot on every knowledge and reasoning benchmark in the suite. Combined with vision capability and a 256K-token native context, it is the highest-quality single model in the catalog.
Llama 3.2 1B Instruct's knowledge benchmarks reflect its scale. At 1.2B parameters, HEE 42%, PLE 42%, and PhD 38% are expected. The model is the fastest in the catalog at 503 TPS and is the Android Free-tier default, designed for fast general-purpose tasks — auto-titling, summarization, grammar polish — under severe memory constraints. For knowledge-intensive workflows, any model at the 3B tier or above delivers substantially better results.
Model Recommendations by Hardware Tier
Ultra-Light — Up to ~1.5 GB RAM Available for AI
Free-tier default: Llama 3.2 1B Instruct (HMS 0.611) — the Android Free-tier default and fastest model in the catalog at 503 TPS. 128K native context. Best for instant note titling, short summaries, transcription polish, and basic text cleanup under severe memory constraints. Llama 3.2 Community License.
Stronger Ultra-Light alternative: Granite 4.0 H-1B (Hybrid) (HMS 0.788) — Apache 2.0, 260 TPS, LongCtx 81%, Tool 56%, HEE 72%. The hybrid Mamba-2 + transformer architecture delivers materially stronger long-context and tool-calling performance than Llama 3.2 1B at a comparable 1.5 GB footprint, while remaining cross-platform (Android included). 128K native context.
Lightweight — ~1.5–4.5 GB RAM Available for AI
Ministral 3 3B (HMS 0.883) — 206 TPS, HEE 84%, PLE 80%, PhD 83%, LongCtx 83%, Tool 84%. Vision-capable (Professional tier) with a 256K native context window. Ships as the Free-tier upper model on all platforms. 40+ language support.
Gemma 4 E2B (HMS 0.876) — 207 TPS, HEE 94%, PLE 87%, PhD 94%, LongCtx 81%, Tool 71%. Vision-capable (Professional tier). 2.3B effective parameters, 128K native context.
Granite 3.0 2B Instruct (HMS 0.844) — 225 TPS, HEE 76%, PLE 76%, PhD 87%, LongCtx 72%, Tool 60%. The Granite 3.0 dense transformer at 2.6B parameters. 4K native context (the smallest in the catalog — suited to action tasks rather than multi-page summarization).
Llama 3.2 3B Instruct (HMS 0.817) — 242 TPS, HEE 80%, PLE 71%, PhD 72%, LongCtx 62%, Tool 67%. Llama 3.2 Community license. 128K native context. The Llama-family choice in the Lightweight tier.
SmolLM2 1.7B Instruct (HMS 0.701) — 391 TPS, HEE 64%, PLE 65%, PhD 62%, LongCtx 16%, Tool 82%. 8K context. The best choice when a 3B model is too large for the available RAM but a 1B model is too limited.
Enhanced — ~5–9 GB RAM Available for AI
Electron default: Gemma 4 E4B (HMS 0.771) — Apache 2.0, 144 TPS, Tool 97% (tied catalog leader with Gemma 4 12B), HEE 96%, PLE 92%, PhD 96%. Vision-capable (Professional tier). 128K native context. The go-to for AI Chat, structured actions, and image analysis.
Long-context leader: Ministral 3 8B (HMS 0.672) — Apache 2.0, 111 TPS, Long Context 86% (catalog leader), HEE 96%, PLE 92%, PhD 94%, Tool 83%. Vision-capable. 256K native context.
Highest-quality cross-platform model: Gemma 4 12B (HMS 0.530) — Apache 2.0, 76.9 TPS, Tool 97% (tied catalog leader), PhD 96%, HEE 92%, PLE 84%, LongCtx 84%, SDB 54%. At 86.0% average quality it is the catalog's second-highest, behind only the desktop-only 26B MoE. Vision-capable (Professional tier). 12B dense, 256K native context. The top-quality choice that still runs cross-platform (not desktop-only), for quality-sensitive work on Enhanced-tier hardware where the 26B MoE is out of reach.
Desktop Pro — ~22 GB RAM or 16+ GB GPU VRAM
Highest overall quality: Gemma 4 26B-A4B (MoE) (HMS 0.405) — Apache 2.0, desktopOnly, 51.6 TPS. The catalog's top quality model: 94.0% average, HEE 100%, PLE 99%, PhD 100%, SDB 80% (catalog leader), Tool 94%. 26B total / 4B active Mixture-of-Experts, vision-capable, 256K native context. The highest-quality choice for research, synthesis, and agentic workflows where speed is secondary.
Summary Recommendation Table
| Your Hardware / Use Case | Recommended Model | HMS | Primary Strength |
|---|---|---|---|
| Phone / very low-RAM device (<1.5 GB) | Llama 3.2 1B Instruct | 0.611 | Android Free-tier default; 503 TPS; smallest viable footprint |
| Phone / low-RAM, stronger Ultra-Light alternative | Granite 4.0 H-1B (Hybrid) | 0.788 | LongCtx 81%; Tool 56%; hybrid Mamba-2 + transformer; Apache 2.0 |
| Light device, best balance, vision & Free-tier upper — catalog HMS leader | Ministral 3 3B | 0.883 | HMS catalog leader; 256K context; vision-capable; Free Tier; Apache 2.0 |
| Light device, near-balanced second pick (vision) | Gemma 4 E2B | 0.876 | HEE 94%, PhD 94%; vision-capable; Apache 2.0 |
| Light device, action-task focused | Granite 3.0 2B Instruct | 0.844 | Tool 60%; SDB 50%; 4K context; dense 2.6B transformer; Apache 2.0 |
| Light device, Llama family | Llama 3.2 3B Instruct | 0.817 | Llama-family architecture; 242 TPS |
| ~1.7 GB budget (between 1B and 3B) | SmolLM2 1.7B Instruct | 0.701 | 391 TPS; Tool 82%; bridge model |
| Enhanced desktop — vision + tool calling (Pro) | Gemma 4 E4B | 0.771 | Electron default; Tool 97%; vision-capable; Apache 2.0 |
| Enhanced desktop — long context | Ministral 3 8B | 0.672 | LongCtx 86% (catalog leader); 256K context; vision-capable |
| Enhanced desktop — highest quality cross-platform (vision) | Gemma 4 12B | 0.530 | Quality 86% (2nd in catalog); Tool 97%; PhD 96%; 256K context; vision-capable; Apache 2.0 |
| Desktop Pro — highest quality + vision | Gemma 4 26B-A4B (MoE) | 0.405 | Quality leader; SDB 80%; vision-capable; 256K context; Apache 2.0 |
Complete Benchmark Data
Full benchmark results for all 44 models evaluated in the sweep — including models not selected for the catalog — are available in a dedicated reference article.
Notes on Hardware Acceleration
Benchmarks above were collected on a Ryzen 7 7800X3D + RTX 5070 Ti 16 GB system, Linux, GPU-accelerated. NotesXML supports GPU acceleration on NVIDIA CUDA/Vulkan, AMD Vulkan (RADV), and Intel iGPU (Vulkan / SYCL). CPU-only inference remains fully supported on every platform.
- NVIDIA CUDA / Vulkan: Full support. RTX 4060 class and above can run all 11 catalog models. The single Desktop Pro model (Gemma 4 26B-A4B) is best on 16+ GB VRAM at Q4_K_M.
- AMD Vulkan (RADV): Full support. APU systems (Ryzen 7000 with Radeon 780M) successfully run all catalog models up through the Enhanced tier.
- Intel iGPU: Vulkan is supported on Linux and Windows; SYCL is supported on platforms where the Intel oneAPI runtime is available. CPU-only inference is also fully supported.
About NotesXML AI
All model inference runs locally on your device using llama.cpp. No subscription is required for AI beyond the one-time Professional license. Models are downloaded once and stored locally — no internet connection is needed at inference time. Your notes, queries, and AI responses remain entirely private.
© 2026 IWV Digital Solutions LLC. All rights reserved.