
NotesXML AI Model Benchmark Results & Recommendations

IWV Digital Solutions LLC | Beta 5.60.4 | April 16–18, 2026


The Bottom Line: Local AI on Your Terms

All AI in NotesXML runs locally on your device. No internet is required, and your data never leaves your hardware. We test models against real-world hardware — from phones to high-end desktops — to ensure you get the performance you expect.


Understanding the Benchmarks

We use four custom tests to measure how a model "thinks" and "knows," rather than just how fast it types.

Each benchmark has a four-level scoring rubric, ranging from Below Standard up to Expert/Mastery — see the Full Results section for the detailed rubrics.


Desktop Performance Results

Tested on: Ryzen 7 7800X3D + RTX 5070 Ti (Windows 11)

Model Tier Model Name Speed (TPS) Knowledge (HEE) Reasoning (PLE) Philosophy (PhD) Logic (SDB-1)
Ultra-Light Gemma 3 1B 218.9 44% 41% 34% 20%
Lightweight Phi-4 Mini 3.8B 176.8 90% 89% 93% 60%
Standard Gemma 4 E2B 148.9 94% 81% 91% 0%
Enhanced Ministral 3 8B 83.1 96% 94% 96% 50%
Advanced Phi-4 14B 71.1 96% 94% 98% 80%
Desktop Pro Mistral Small 3.1 24B 40.7 96% 96.5% 100% 50%

Representative model from each tier. See the Full Results section below for all 12 catalog models and full absolute scores (HEE /50, PLE /200, PhD /100, SDB-1 /10).

Key takeaway: Phi-4 14B is the logic champion, while the new Mistral Small 3.1 24B is the first model in the catalog to achieve a perfect 100/100 on our PhD Philosophy exam.


Hardware Comparison: Speed by Device

Not every model runs on every device. Speed is measured in Tokens Per Second (TPS) — higher is smoother.

Model Desktop (GPU) Mini PC (CPU) Laptop (CPU) Phone (S24 Ultra) Tablet (S9 FE+)
Gemma 3 1B 218.9 65.1 14.7 13.0 1.7
Phi-4 Mini 3.8B 176.8 26.2 5.2 6.3 —
Ministral 3 8B 83.1 12.8 2.4 — —
Phi-4 14B 71.1 7.9 1.6 — —

"—" indicates the model exceeds the device's available memory. On a memory-constrained tablet such as the Samsung Galaxy Tab S9 FE+ (2.5 GB available RAM), only the Ultra-Light tier (Gemma 3 1B) fits — which is precisely why NotesXML ships an Ultra-Light tier. See the Full Results section for all 12 models across all five tested platforms.


Which Model Should You Use?

1. The "Speed Demon" — Gemma 3 1B

Best for: phones and older tablets.

The only model that runs on almost anything. It's incredibly fast (200+ TPS on desktop) and perfect for quick tasks like naming a note or fixing typos.

2. The "All-Rounder" — Phi-4 Mini 3.8B

Best for: laptops and general daily use.

This model punches way above its weight class. The best balance of high-speed performance and professional-level reasoning — 90% HEE, 89% PLE, 93% PhD Philosophy, 60% SDB-1 at 176.8 TPS on desktop.

3. The "Expert Writer" — Ministral 3 14B

Best for: complex drafting and summarization.

If your work involves heavy writing, editing, or multi-step professional analysis, the Ministral family offers the best "language feel" and instruction following. Ministral 3 14B tops the HEE benchmark catalog-wide (49/50, 98%).

4. The "Logic Specialist" — Phi-4 14B

Best for: coding, math, and logical analysis.

The highest score in the catalog on the SDB-1 logic test (80%, Advanced Deductive Reasoning) — the only benchmarked model to exceed 60%. Choose this when accuracy and "thinking through" a problem matter more than speed.

5. The "Deep Researcher" — Mistral Small 3.1 24B

Best for: Desktop Pro users with 16 GB+ VRAM.

The highest-quality model in the catalog on knowledge and professional reasoning. Master of philosophy (100/100 PhD, first in the catalog), professional exams (96.5% PLE, catalog leader), and vision-based OCR (100% extraction). Slower, but its answers are the most profound.


Pro Tip: Bring Your Own Model

While we curate 12 specific models, NotesXML is an open system. We support any GGUF-format model. Simply drop your preferred model file into the shared directory, and it will appear in your settings.

Note: Models outside the curated catalog have not been tested against the NotesXML benchmark suite. Performance, quality, prompt compatibility, and memory requirements may vary. Vision features require a model with a supported multimodal projection file (mmproj).
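As a sketch of what "any GGUF file is an ordinary llama.cpp model" means in practice, the snippet below loads a user-supplied GGUF with the separate llama-cpp-python bindings. This is illustrative only: NotesXML drives llama.cpp internally through its own loading path, the file path and prompt are hypothetical, and the sampling settings are placeholders.

```python
# Illustrative only: NotesXML drives llama.cpp internally; this sketch uses the
# llama-cpp-python bindings to show that a user-supplied GGUF file loads like
# any other llama.cpp model. The path and prompt are hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/shared/my-custom-model.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload all layers when a supported GPU is present
    n_ctx=4096,        # context window; keep within the model's trained limit
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Suggest a short title for this note: ..."}],
    temperature=0.15,
    max_tokens=64,
)
print(reply["choices"][0]["message"]["content"])
```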


Quality vs. Speed — The Two-Axis View

Before diving into the full results, here is a single visual summary: all 12 catalog models mapped on two axes. Vertical is average benchmark score as a percentage (arithmetic mean of HEE, PLE, PhD Philosophy, and SDB-1, each normalized to 0–100). Horizontal is time per token in milliseconds (1000 / TPS) — the practical measure of "how long does a user wait for each token." Each model appears once; shapes and colors are distinct per model. The four Mistral-family values reflect the post-CR-2026-415 re-benchmark; see the full tables below the plot for all per-model numbers.

[Figure: Quality vs. Speed: 12 NotesXML AI Models. Scatter plot of average benchmark score (y, 0–100%) against time per token (x, 0–28 ms). Phi-4 14B is the overall quality leader at 91.9% and 14.1 ms/token; Gemma 3 1B is the fastest at 4.6 ms/token but the lowest quality at 34.6%; Mistral Small 3.1 24B is the slowest at 24.6 ms/token with 85.6% quality. Per-model points (avg %, ms/token): Gemma 3 1B (35, 4.6), SmolLM3 3B (65, 5.2), Ministral 3 3B (72, 6.8), Phi-4 Mini 3.8B (83, 5.7), Gemma 4 E2B (67, 6.7), Gemma 3n E4B (76, 11.4), Gemma 4 E4B (79, 10.0), Ministral 3 8B (84, 12.0), Gemma 3 12B (77, 17.5), Ministral 3 14B (85, 17.5), Phi-4 14B (92, 14.1), Mistral 3.1 24B (86, 24.6). Upper-left = fast and high-quality (the ideal); lower-right = slow and weaker (to be avoided).]

Average benchmark score is the arithmetic mean of HEE (/50), PLE (/200), PhD Philosophy (/100), and SDB-1 (/10), each normalized to 0–100 before averaging. Time per token is calculated from the Desktop TPS values in the Full Results table (t = 1000 / TPS). All values shown are post-CR-2026-415 canonical numbers for the four Mistral-family models; the other eight models' values are from the 2026-04-16/17 multi-platform benchmark run.
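Both axes are simple to reproduce from the Full Results table. A minimal sketch, with values shown for two models (the same arithmetic applies to all 12):

```python
# Reproducing the two plot axes from the Full Results table below.

def avg_score(hee_50, ple_200, phd_100, sdb_10):
    """Arithmetic mean of the four benchmarks, each normalized to 0-100."""
    parts = [hee_50 / 50, ple_200 / 200, phd_100 / 100, sdb_10 / 10]
    return 100 * sum(parts) / len(parts)

def ms_per_token(tps):
    """Time per token in milliseconds: t = 1000 / TPS."""
    return 1000 / tps

# Phi-4 14B: HEE 48/50, PLE 187/200, PhD 98/100, SDB-1 8/10, 71.1 TPS
print(round(avg_score(48, 187, 98, 8), 1), round(ms_per_token(71.1), 2))   # 91.9 14.06

# Mistral Small 3.1 24B: 48/50, 193/200, 100/100, 5/10, 40.7 TPS
print(round(avg_score(48, 193, 100, 5), 1), round(ms_per_token(40.7), 2))  # 85.6 24.57
```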

What the plot shows

The Pareto frontier. The best model at any given latency budget is the highest point on the vertical axis among the models fast enough to meet that budget. Following the upper-left edge of the scatter gives the practical frontier for this catalog: Gemma 3 1B (for the lowest-end hardware), Phi-4 Mini 3.8B (the best all-rounder under 6 ms/tok), Ministral 3 8B (the best sub-13 ms/tok option), and Phi-4 14B (the overall quality leader at 91.9%). With one marginal exception (SmolLM3 3B is about 0.4 ms/tok faster than Phi-4 Mini but trails it by roughly 18 quality points, so it is technically undominated), everything else in the catalog is dominated by one of these four on both axes: for any such model, one of the frontier models offers both higher quality and lower latency.
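The dominance relation is easy to check mechanically. A minimal sketch, reusing average-quality and ms-per-token values derived from the Full Results table with the same arithmetic as the previous sketch:

```python
# Sketch of the dominance test described above. A model is on the frontier if
# no other model is at least as good on both axes and strictly better on one.
# Points are (average quality %, ms per token) recomputed from the Full Results table.

models = {
    "Gemma 3 1B":            (34.6,  4.57),
    "SmolLM3 3B":            (64.6,  5.22),
    "Ministral 3 3B":        (71.5,  6.79),
    "Phi-4 Mini 3.8B":       (82.9,  5.66),
    "Gemma 4 E2B":           (66.5,  6.72),
    "Gemma 3n E4B":          (76.0, 11.42),
    "Gemma 4 E4B":           (78.5, 10.00),
    "Ministral 3 8B":        (84.0, 12.03),
    "Gemma 3 12B":           (76.5, 17.45),
    "Ministral 3 14B":       (85.0, 17.54),
    "Phi-4 14B":             (91.9, 14.06),
    "Mistral Small 3.1 24B": (85.6, 24.57),
}

def dominates(a, b):
    (qa, ta), (qb, tb) = models[a], models[b]
    return qa >= qb and ta <= tb and (qa > qb or ta < tb)

frontier = [m for m in models
            if not any(dominates(o, m) for o in models if o != m)]
print(frontier)
# -> the four models named above plus SmolLM3 3B, the marginal technical
#    exception noted in the text (slightly faster than Phi-4 Mini, far lower quality).
```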

Phi-4 Mini 3.8B is the catalog's star sub-6-ms model. It sits at 82.9% average quality and 5.66 ms/tok — closer to the 14B-parameter models in quality than to the other sub-6-ms models like Gemma 3 1B (34.6%) or SmolLM3 3B (64.6%). Phi-4 Mini's published industry profile (MMLU 67.3%, MMLU-Pro 52.8%, ARC-C 83.7%, TruthfulQA 66.4%) is the most comprehensive of any sub-4B model, and its SDB-1 score of 60% is the highest of any sub-10B model in the catalog (the next best, Ministral 3 8B, scores 50%; the other Lightweight-tier models score 10–20%). If you have to pick one catalog model for a mid-range laptop or desktop without a dedicated GPU, Phi-4 Mini is it.

Ministral 3 14B and Gemma 3 12B reveal a quality-per-latency gap. Both models land at ~17.5 ms/tok (57.0 TPS and 57.3 TPS respectively), but Ministral 3 14B scores 85% to Gemma 3 12B's 76.5% — an 8.5-point quality gap at identical latency. For users on hardware that can run a 12B-class model, Ministral 3 14B is the clear choice. Gemma 3 12B retains a niche for users who specifically need Google's Gemma license terms or Gemma-family tooling compatibility.

Phi-4 14B is the overall quality leader. At 91.9% average across all four benchmarks — the highest of any catalog model — Phi-4 14B runs at 14.06 ms/tok, notably faster than Ministral 3 14B and Mistral Small 3.1 24B despite comparable depth. Its 80% SDB-1 deductive-reasoning score is the single strongest differentiator in the benchmark suite; no other model exceeds 60%. If you can spare ~14 ms/tok and 8.4 GB of RAM, Phi-4 14B gives the strongest all-round quality profile in the catalog.

Mistral Small 3.1 24B sits off the frontier on average score — but that understates what it does. On the plot above, Mistral's 85.6% average at 24.57 ms/tok is dominated by Phi-4 14B (91.9% at 14.06 ms/tok). The reason is SDB-1: Mistral's 50% on deductive reasoning drags its average down, while its knowledge scores are actually best-in-catalog — a perfect 100/100 on PhD Philosophy (the only model to hit that), 96.5% on PLE (catalog leader), and 96% on HEE (tied for second in the catalog, one point behind Ministral 3 14B). If you plot only HEE+PLE+PhD and drop SDB-1, Mistral 3.1 24B jumps to 97.5% and lands at or near the top of the catalog. The question isn't "is Mistral good?" — it's "does your workload lean on deductive reasoning?" For structured-reasoning tasks (logic puzzles, multi-step deduction, code reasoning), Phi-4 14B is the better pick. For knowledge-heavy tasks (summarization, translation, document analysis, Q&A, vision), Mistral 3.1 24B is the depth ceiling of the catalog and the reason the Desktop Pro tier exists.

Gemma 3 1B is an outlier in the lower-left — and that's by design. At 34.6% average quality it's the lowest in the catalog, but at 4.57 ms/tok (219 TPS) it's by far the fastest. Gemma 3 1B exists to guarantee that every NotesXML user — including those on low-RAM phones and aging devices — has functional on-device AI. Its sub-40% score on deductive reasoning (SDB-1 at 20%) reflects the hard reality that 1B-parameter models can't do multi-step reasoning reliably; they can, however, handle quick summarization, titling, and text polish instantly. Free-tier users with no hardware budget are not forced into a slow or cloud-dependent experience — they get Gemma 3 1B, and it responds in real time.

The shape of the frontier has a practical implication. The Pareto frontier rises steeply from Gemma 3 1B to Phi-4 Mini 3.8B (35% → 83% for an extra 1 ms/tok), then flattens — from 5.7 to 14 ms/tok we only gain about 9 points of average quality (Phi-4 Mini 82.9% → Phi-4 14B 91.9%). Spending more tokens-per-second beyond Phi-4 Mini gives diminishing quality returns on our benchmark suite. This is why NotesXML's default model on Electron is Phi-4 Mini 3.8B and on Android is Gemma 3 1B — they sit at or near the inflection points of the frontier where further investment yields less benefit.


Full Results

Detailed Benchmark Rubrics

NotesXML uses four purpose-designed benchmarks, detailed below, to evaluate models as they actually run on local hardware — testing the combination of model capability and local inference performance that users experience in practice. These benchmarks complement published academic scores by measuring real-world usability, not just raw accuracy under idealized server conditions.

HEE — Human Education Evaluation (50 points)

HEE evaluates broad factual knowledge and analytical competence across subject areas spanning mathematics, science, history, literature, geography, and reasoning. Questions are drawn from a representative curriculum ranging in difficulty from secondary school through postgraduate level, weighted toward the upper end.

Score Range Level
45–50 (90–100%) Post-Graduate / Mastery
40–44 (80–89%) Undergraduate Level
30–39 (60–79%) High School Graduate
Below 30 (< 60%) Below Standard

HEE is the closest NotesXML benchmark to the academic MMLU family — it tests the breadth of a model's trained knowledge. A Post-Graduate score indicates a model that can reliably answer complex factual questions, explain technical concepts, and assist with research-level tasks across a wide variety of domains.

PLE — Professional License Exam (200 points)

PLE is a 200-question benchmarking suite designed to rigorously evaluate multidisciplinary reasoning and domain-specific knowledge across ten distinct professional fields: Law (MBE), Medicine (USMLE), Electrical Engineering (PE), Chemical Engineering (PE), Psychology (EPPP), Nursing (NCLEX-RN), Real Estate, Finance (CPA), Architecture (ARE), and Cybersecurity (CISSP). Each field contributes 20 multiple-choice questions modeled after official licensure examinations, covering a broad spectrum of challenges ranging from quantitative mass balances and circuit analysis to ethical dilemmas, clinical diagnostics, and spatial reasoning.

Unlike general knowledge tests, PLE measures not only factual recall of industry standards and legal precedents but also a model's capacity for multi-step logical deduction, formula application, and contextual synthesis under the kind of high-stakes constraints that characterize real professional licensing exams. The result is a cross-domain stress test of reliability, accuracy, and expert-level reasoning across diverse professional standards.

Score Range Level
190–200 (95–100%) Expert / Mastery
170–189 (85–94%) Expert / Mastery
160–169 (80–84%) Proficient
140–159 (70–79%) Developing
Below 140 (< 70%) Below Competency

A model scoring Expert/Mastery on PLE has demonstrated the ability to reason at a level comparable to a licensed professional across multiple high-stakes disciplines — making it a strong indicator of real-world reliability for research, analysis, and complex writing tasks in NotesXML.

PhD Philosophy — PhD-Level Logic & Philosophy Comprehensive Exam (100 points)

Evaluating advanced philosophical knowledge requires more than testing rote memorization — it demands an assessment of the ability to navigate complex conceptual frameworks, synthesize foundational theorems, and parse the nuanced logical implications of historical and contemporary debates. The PhD-Level Logic & Philosophy Comprehensive Examination is a rigorous, 100-question assessment spanning the furthest reaches of analytic and continental thought, serving as a formidable benchmark for doctoral-level reasoning.

The examination is structured as a sweeping survey of the highest-level discourse in logic, epistemology, metaphysics, and value theory, deliberately crossing sub-disciplinary boundaries to test holistic understanding. Its core domains include:

Metalogic and Formal Systems — Candidates must grapple with the philosophical consequences of Gödel's Incompleteness Theorems, Tarski's Undefinability Theorem, the Löwenheim-Skolem Theorem, and Lindström's Theorem, demonstrating mastery of both syntax and semantics.

Modal and Non-Classical Logics — Proficiency in Kripke semantics, system S4 and S5 axioms, paraconsistent logics (including Priest's Dialetheism), intuitionistic logic, and relevance logics designed to resolve the paradoxes of material implication.

Philosophy of Language and Mind — Semantic externalism (Putnam's Twin Earth), Two-Dimensional Semantics (Chalmers), rigid designators (Kripke), and the Eliminative Materialism of the Churchlands, connecting theories of meaning directly to theories of consciousness.

Epistemology and Philosophy of Science — The evolution of justification and truth from Gettier's foundational disruption through Sosa's Virtue Epistemology and Bayesian Conditionalization, alongside the demarcation problem, Kuhn's incommensurability, and van Fraassen's Constructive Empiricism.

Metaphysics and Ontology — Extreme ontological positions including David Lewis's Modal Realism, Mereological Nihilism, Ontological Pluralism, and the A-series versus B-series of time.

Continental Phenomenology and Metaethics — Husserl's epoché and Lebenswelt, Heidegger's Dasein, and Derrida's Deconstruction alongside advanced metaethical debates including the Frege-Geach problem and Cornell Realism.

The multiple-choice format is engineered not to offer easy eliminations, but to present highly plausible distractors representing genuine opposing schools of thought. A model cannot simply know who wrote a theory — it must understand how that theory mechanically functions. Rather than asking what the Downward Löwenheim-Skolem Theorem is, the exam asks for its primary philosophical consequence regarding uncountable sets.

Score Range Level
90–100 (90–100%) PhD Mastery / Expert
80–89 (80–89%) Advanced
70–79 (70–79%) Proficient
Below 70 (< 70%) Below PhD Level

Strong performance on this benchmark indicates a model capable of genuine philosophical reasoning — not pattern-matching to familiar names, but tracing the mechanical implications of ideas across logic, language, mind, and reality. It is the most intellectually demanding single benchmark in the NotesXML suite.

SDB-1 — Synthetic Deduction Benchmark (10 points)

Standard assessments like the LSAT, human IQ tests, or classic logic puzzles are inevitably contaminated for AI evaluation — a model has likely encountered the answers, or at least the exact semantic structures, during training. Furthermore, mapping AI capabilities to a human psychometric scale is fundamentally flawed: an AI might master a 15-variable constraint satisfaction problem in milliseconds yet fail to deduce the outcome of a dropped coffee mug. The SDB-1 was designed to solve this problem.

The Synthetic Deduction Benchmark is a specialized 10-question evaluation testing multi-step deductive reasoning, branching logic, and counterfactual elimination. Its defining feature is synthetic ontology: the benchmark completely strips away standard human scenarios, relying instead on novel, self-contained universes governed by arbitrary but logically absolute rules — for example, "The Axiom vibrates unless the Cipher is heavier than the Benthos." By decoupling the logical topology from familiar concepts, the model is forced to build a structural representation from scratch based purely on the provided syntax. There are no semantic priors to rely on.

The benchmark is structured across three escalating tiers of computational complexity:

Level 1 — Linear Conditional Cascades: Tests basic rule adherence and fundamental logical operations such as Modus Tollens (denying the consequent). Questions present isolated systems with strict if/then and if-and-only-if rules. The model must correctly chain conditions without skipping steps or making unauthorized assumptions.

Level 2 — Multi-Dimensional Constraint Satisfaction: Dramatically increases cognitive load by testing variable binding and elimination. Scenarios require tracking multiple independent variables — sequential ordering, spatial relationships, dominance hierarchies — simultaneously. Success demonstrates the ability to maintain a stable context and accurately map how one constrained variable restricts the possibilities of all others.

Level 3 — Recursive Logic and Counterfactuals: Evaluates non-standard metamathematics and Boolean network branching. Questions introduce nodes that lie or tell the truth based on the outputs of other nodes, recursive set exclusions, temporal paradox loops, and counterfactual machines. To solve Level 3, a model must assume a hypothetical state, track its implications to a contradiction, and eliminate that branch of logic to find the single necessary truth — and recognize when a scenario is intentionally logically impossible.

The SDB-1 does not output a human IQ equivalent. It is a pure stress test for logical fidelity. A perfect score indicates an uncompromised ability to perform independent variable binding, recursive truth-value assignment, and counterfactual validation without hallucinating or relying on linguistic familiarity — proving the system can actually deduce, not merely predict the most statistically likely next word.
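To make the synthetic-ontology idea concrete, here is a small hypothetical example in the spirit of a Level 1 cascade (it is not an actual SDB-1 item, and the rule set is invented for illustration). The rules admit exactly one consistent world, reached by chained Modus Tollens:

```python
# Hypothetical, illustrative only; not an actual SDB-1 item. A Level-1-style
# "linear conditional cascade": from "if the Axiom vibrates then the Cipher
# glows", "if the Cipher glows then the Benthos hums", and the observation
# that the Benthos does not hum, Modus Tollens forces every other fact.
from itertools import product

def consistent(axiom_vibrates, cipher_glows, benthos_hums):
    rules = [
        (not axiom_vibrates) or cipher_glows,   # Axiom vibrates -> Cipher glows
        (not cipher_glows) or benthos_hums,     # Cipher glows  -> Benthos hums
        not benthos_hums,                       # observed: the Benthos does not hum
    ]
    return all(rules)

# Enumerate every assignment and keep only those consistent with the rules:
worlds = [w for w in product([True, False], repeat=3) if consistent(*w)]
print(worlds)   # [(False, False, False)] -- the only consistent world
# Chained Modus Tollens: Benthos silent => Cipher dark => Axiom still.
```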

Score Range Level
8–10 (80–100%) Advanced Deductive Reasoning
6–7 (60–79%) Proficient
4–5 (40–59%) Developing
Below 4 (< 40%) Below Standard

SDB-1 proved the most discriminating benchmark in this suite, with scores ranging from 0% to 80% across the tested models. It is the benchmark most resistant to training data contamination, and the one most predictive of a model's ability to perform genuine logical analysis rather than fluent pattern completion.
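For reference, the four rubrics above reduce to a simple threshold lookup. A sketch using the thresholds exactly as printed (absolute scores, not percentages); the two printed PLE rows share the same label and are merged into a single 170-point threshold here:

```python
# Sketch of the four scoring rubrics, using the printed thresholds.

RUBRICS = {
    # benchmark: [(minimum score, level)] checked top-down
    "HEE":  [(45, "Post-Graduate / Mastery"), (40, "Undergraduate Level"),
             (30, "High School Graduate"), (0, "Below Standard")],
    "PLE":  [(170, "Expert / Mastery"), (160, "Proficient"),
             (140, "Developing"), (0, "Below Competency")],
    "PhD":  [(90, "PhD Mastery / Expert"), (80, "Advanced"),
             (70, "Proficient"), (0, "Below PhD Level")],
    "SDB1": [(8, "Advanced Deductive Reasoning"), (6, "Proficient"),
             (4, "Developing"), (0, "Below Standard")],
}

def level(benchmark: str, score: float) -> str:
    for minimum, label in RUBRICS[benchmark]:
        if score >= minimum:
            return label
    raise ValueError("score below zero")

# Phi-4 14B desktop results (see the Full Results table below):
print(level("HEE", 48), "|", level("PLE", 187), "|",
      level("PhD", 98), "|", level("SDB1", 8))
```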


Industry Reference Benchmarks

The following widely-used academic benchmarks are included to allow published scores from official model cards and independent research to be compared against NotesXML's own results. Scores are sourced from official technical reports and independent evaluations.

Important note on benchmark conditions: Published academic scores are typically obtained using full-precision or lightly quantized models on high-performance server hardware with controlled evaluation pipelines. NotesXML's desktop benchmarks use GGUF-quantized models with GPU offload on a Ryzen 7 7800X3D + RTX 5070 Ti system; all other platforms (laptop, mini PC, phone, tablet) use CPU-only inference. Real-world local inference performance can differ meaningfully from published reference scores — a distinction explored further in the results section.

MMLU — Massive Multitask Language Understanding

MMLU presents 14,000+ multiple-choice questions across 57 academic subjects — STEM, humanities, law, medicine, social sciences, and more — with four answer choices per question, evaluated in a 5-shot setting. MMLU is primarily a knowledge breadth and recall test. Scores above 70% are considered strong; above 80% indicates professional-grade knowledge breadth. Most models above 7B parameters now saturate this benchmark, which is why MMLU-Pro was developed as a more discriminating successor.

MMLU-Pro — Massive Multitask Language Understanding, Professional

MMLU-Pro extends MMLU with 12,000 rigorously curated questions across 14 domains, with answer choices expanded from 4 to 10 and a focus on reasoning-intensive problems over pure factual recall. Evaluated in a chain-of-thought zero-shot setting, MMLU-Pro requires models to reason through problems rather than pattern-match to memorized answers. Scores typically run 16–33% lower than MMLU for the same model. It is currently the most widely adopted single benchmark for comparing general reasoning capability across model families.

ARC-Challenge — AI2 Reasoning Challenge

ARC-Challenge contains science questions specifically selected because they were answered incorrectly by simple retrieval methods — meaning they require genuine reasoning, not pattern matching. Evaluated in a 10-shot setting. Strong performance above 85% indicates a model that reliably handles reasoning-based questions requiring applied knowledge over simple recall.

TruthfulQA — Truthfulness and Reliability

TruthfulQA measures how often a model generates truthful responses to questions that humans commonly answer incorrectly due to misconceptions, myths, or biases. The MC2 variant (multiple correct answers) is reported here. Higher scores indicate models less likely to confidently assert false information — particularly relevant for a productivity assistant used in research and writing.

HellaSwag — Commonsense Sentence Completion

HellaSwag tests commonsense reasoning through sentence completion, where wrong options are designed to sound plausible but are contextually implausible in the real world. Modern large models have largely saturated HellaSwag (90%+ is common for 7B+ models), making it most useful for differentiating capability in the sub-7B parameter range.


NotesXML Benchmark Results — Desktop (Ryzen 7 7800X3D + RTX 5070 Ti, Windows 11)

All 12 catalog models. The first 11 were tested 2026-04-16/17 in a single batch; Mistral Small 3.1 24B was added and tested 2026-04-18. All Mistral-family rows reflect the post-CR-2026-415 re-benchmark at --temp 0.15.

Model Tier Grade TPS HEE /50 PLE /200 PhD /100 SDB-1 /10
Gemma 3 1B Ultra-Light B 218.9 22 (44%) 81 (41%) 34 (34%) 2 (20%)
SmolLM3 3B Lightweight B 191.6 43 (86%) 155 (78%) 85 (85%) 1 (10%)
Ministral 3 3B Lightweight A 147.2 42 (84%) 176 (88%) 94 (94%) 2 (20%)
Phi-4 Mini 3.8B Lightweight A 176.8 45 (90%) 177 (89%) 93 (93%) 6 (60%)
Gemma 4 E2B Standard B 148.9 47 (94%) 162 (81%) 91 (91%) 0 (0%)
Gemma 3n E4B Standard A 87.6 48 (96%) 166 (83%) 95 (95%) 3 (30%)
Gemma 4 E4B Enhanced A 100.0 48 (96%) 182 (91%) 97 (97%) 3 (30%)
Ministral 3 8B Enhanced A 83.1 48 (96%) 188 (94%) 96 (96%) 5 (50%)
Gemma 3 12B Enhanced A 57.3 46 (92%) 182 (91%) 93 (93%) 3 (30%)
Ministral 3 14B Advanced A 57.0 49 (98%) 190 (95%) 97 (97%) 5 (50%)
Phi-4 14B Advanced A 71.1 48 (96%) 187 (94%) 98 (98%) 8 (80%)
Mistral Small 3.1 24B Desktop Pro A 40.7 48 (96%) 193 (96.5%) 100 (100%) 5 (50%)

TPS = tokens per second. GPU-accelerated inference via RTX 5070 Ti. Grade and MSI (Model Suitability Index) are computed by the NotesXML AI Benchmark Orchestrator. Of the 12 catalog models, 9 achieved Grade A and 3 achieved Grade B. Averages across the full 12-model catalog: HEE 44.5/50 (89%), PLE 169.9/200 (85%), PhD 89.4/100 (89%), SDB-1 3.6/10 (36%). Mistral Small 3.1 24B leads the PLE and PhD benchmarks outright (first and only model to achieve a perfect PhD score) and joins the 48/50 HEE group; Ministral 3 14B still holds the HEE lead alone at 49/50; Phi-4 14B still leads SDB-1 at 8/10. Post-CR-2026-415 re-benchmark at `--temp 0.15`: knowledge scores (HEE/PLE/PhD/SDB-1) were essentially unchanged for the Ministral family (knowledge is sampling-insensitive in this range), with the exception of Ministral 3 3B's PhD Philosophy, which improved by 3 points (91→94) — a modest reasoning-precision gain attributable to tighter sampling. TPS changes across the Mistral family are within ±2 TPS run-to-run noise.

Performance by Hardware Platform (TPS Comparison)

The same models were benchmarked across five platforms — a GPU-equipped desktop, a laptop, a mini PC, a flagship phone, and a mid-range tablet — to show how inference speed scales with hardware. Not all models fit on every device; models that exceed available memory are marked with a dash.

Model Params Desktop (Ryzen 7 7800X3D + RTX 5070 Ti, GPU) Laptop (Intel Ultra 7 155H, CPU) Mini PC (Ryzen 9 7940HS, CPU) Phone (Galaxy S24 Ultra, CPU) Tablet (Galaxy Tab S9 FE+, CPU)
Gemma 3 1B 1B 218.9 14.7 65.1 13.0 1.7
SmolLM3 3B 3B 191.6 6.4 31.3 7.6 —
Ministral 3 3B 3B 147.2 5.1 27.1 6.5 —
Phi-4 Mini 3.8B 3.8B 176.8 5.2 26.2 6.3 —
Gemma 4 E2B 2.3B eff 148.9 6.0 33.0 7.5 —
Gemma 3n E4B 4B eff 87.6 3.9 18.4 3.6 —
Gemma 4 E4B 4.5B eff 100.0 3.7 18.4 — —
Ministral 3 8B 8B 83.1 2.4 12.8 — —
Gemma 3 12B 12B 57.3 1.9 7.6 — —
Ministral 3 14B 14B 57.0 1.7 7.9 — —
Phi-4 14B 14B 71.1 1.6 7.9 — —
Mistral Small 3.1 24B 24B 40.7 — — — —

All values in tokens per second. — = model exceeds available device memory. Mistral Small 3.1 24B (Desktop Pro tier) requires 16 GB of discrete GPU VRAM and is therefore tested only on the Prism 4 desktop. The Intel Ultra 7 155H laptop and Ryzen 9 7940HS mini PC in this test set are CPU-only; the 16 GB requirement is a VRAM minimum for GPU inference, not a CPU working-set figure, and running a 24B model on these CPU-only systems would fall well below a usable inference speed (the 14B models already drop to 1.6–1.7 TPS on the laptop and 7.9 TPS on the mini PC). The CR-2026-415 Mistral-family re-benchmark at `--temp 0.15` was run on Prism 4 only; the Laptop, Mini PC, Phone, and Tablet Mistral-family TPS values shown above are from the original 2026-04-16/17 multi-platform benchmark. Temperature affects sampling distribution, not token-throughput physics, so Mistral-family TPS on CPU-only platforms would be within run-to-run noise of the values shown.

Platform Device Processor Acceleration RAM (available) Models
Desktop Skytech Prism 4 AMD Ryzen 7 7800X3D + RTX 5070 Ti 16 GB GPU (CUDA) 32 GB 12/12
Laptop LG Gram Intel Core Ultra 7 155H (16 cores / 22 threads) CPU only 31.5 GB (16.8 GB) 11/12
Mini PC ALLOY9 AMD Ryzen 9 7940HS CPU only 16 GB 11/12
Phone Samsung Galaxy S24 Ultra Qualcomm Snapdragon 8 Gen 3 CPU only 10.8 GB (5.7 GB) 6/12
Tablet Samsung Galaxy Tab S9 FE+ Samsung Exynos 1380 CPU only 7.5 GB (2.5 GB) 1/12

The GPU-equipped desktop is the fastest platform by a wide margin. Among CPU-only systems, the Ryzen 9 7940HS mini PC outperforms the Intel Ultra 7 155H laptop despite having fewer cores — the Ryzen part's memory subsystem and per-core vector throughput favor LLM inference workloads significantly. The Galaxy S24 Ultra runs 6 models comfortably at 3.6–13 TPS. The Galaxy Tab S9 FE+, with only 2.5 GB available RAM, can run only Gemma 3 1B at 1.7 TPS — demonstrating that NotesXML's Ultra-Light tier provides functional AI on even the most memory-constrained Android devices.

Published Industry Benchmark Scores

Model MMLU MMLU-Pro ARC-C TruthfulQA HellaSwag
Gemma 3 1B ~38% ~27% — — —
SmolLM3 3B 68.9% ¹ — 62.3% ¹ — 78.5% ¹
Ministral 3 3B 60.8% ² 35.3% ² 80.3% ² 62.9% ² 77.2% ²
Phi-4 Mini 3.8B 67.3% ³ 52.8% ³ 83.7% ³ 66.4% ³ 69.1% ³
Gemma 4 E2B — 60.0% ⁴ — — —
Gemma 3n E4B — — — — —
Gemma 4 E4B — 69.4% ⁴ — — —
Ministral 3 8B — 64.2% ⁵ — — —
Gemma 3 12B ~75% ⁶ ~57% ⁶ — — —
Ministral 3 14B ~74% ⁷ — — — —
Phi-4 14B 84.8% ³ 73.3% ³ 87.5% ³ 77.0% ³ 88.6% ³
Mistral Small 3.1 24B ~81% ⁸ 66.76% ⁹ 91.29% ⁸ — —

Sources:
¹ HuggingFace SmolLM3 official blog (July 2025) — Global MMLU (multilingual variant); standard MMLU-Pro not separately evaluated
² Microsoft Phi-4 Mini Technical Report (February 2025) — comparative evaluation table
³ Microsoft Phi-4 Technical Report (December 2024 / February 2025)
⁴ Google official Gemma model card (MMLU-Pro)
⁵ Artificial Analysis independent evaluation
⁶ Approximated from Google Gemma 3 Technical Report (March 2025)
⁷ Mistral official benchmark reporting (MMLU Multilingual); standard MMLU not separately published for this variant
⁸ Mistral Small 3 announcement (January 2025) and Mistral-Small-24B-Base-2501 model card — MMLU "over 81%" for the Small 3 / 3.1 family; ARC-C 91.29% (0-shot) from base model evaluation
⁹ Mistral Small 3.1 → 3.2 comparative analysis (June 2025) — MMLU-Pro reported as 66.76% for 3.1 Instruct baseline
¹⁰ Google Gemma 3 Technical Report (March 2025)

~ = approximate score from published data. — = not officially published or independently verified for this model variant. Gemma 3n E4B (MatFormer architecture) has no widely published standard MMLU evaluation.

Reading the Two Tables Together

Eleven of the 12 catalog models have at least one published industry score to set beside their NotesXML results (Gemma 3n E4B has none widely published; dashes mark metrics not officially published for a given model variant). Scores align well: models scoring high on PLE and PhD Philosophy consistently show strong MMLU and MMLU-Pro scores, confirming the NotesXML benchmarks measure genuine capability. Mistral Small 3.1 24B's ~81% MMLU and 66.76% MMLU-Pro put it above every model in the cross-platform tier set except Phi-4 14B (84.8% / 73.3%) — matching what the NotesXML results show, where Mistral leads the PLE and PhD benchmarks while Phi-4 14B retains leadership on SDB-1 deductive reasoning. Quality scores are hardware-independent: the same model weights produce the same answers on every platform, so the cross-platform TPS comparison above affects responsiveness, not accuracy.
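One way to quantify "scores align well" is to correlate the NotesXML PLE percentages with the published MMLU figures for the eight models where both exist. A minimal sketch, taking the pairs from the two tables above and using the approximate "~" MMLU values at face value:

```python
# Pearson correlation between NotesXML PLE (%) and published MMLU (%) for the
# eight catalog models with a published MMLU figure.
from math import sqrt

pairs = [  # (PLE %, published MMLU %)
    (40.5, 38.0),   # Gemma 3 1B
    (77.5, 68.9),   # SmolLM3 3B (Global MMLU)
    (88.0, 60.8),   # Ministral 3 3B
    (88.5, 67.3),   # Phi-4 Mini 3.8B
    (91.0, 75.0),   # Gemma 3 12B
    (95.0, 74.0),   # Ministral 3 14B (MMLU Multilingual)
    (93.5, 84.8),   # Phi-4 14B
    (96.5, 81.0),   # Mistral Small 3.1 24B
]

n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
cov = sum((x - mx) * (y - my) for x, y in pairs)
sx = sqrt(sum((x - mx) ** 2 for x, _ in pairs))
sy = sqrt(sum((y - my) ** 2 for _, y in pairs))
print(round(cov / (sx * sy), 2))   # roughly 0.9 with these figures: strong alignment
```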


Key Observations

The Efficiency Sweet Spot (2–5B effective parameters): Models in this range deliver the best capability-to-speed ratios. Gemma 4 E2B (2.3B effective) achieves 94% on HEE and 91% on PhD Philosophy at 149 TPS — confirmed by its official MMLU-Pro of 60.0%.

Phi-4 Architecture and Deductive Reasoning: Phi-4 Mini 3.8B is the only sub-10B model to score meaningfully on SDB-1 (60%), and Phi-4 14B is the strongest deductive reasoner among the benchmarked models (80%). Published TruthfulQA of 66.4% (Mini) and 77.0% (14B) further confirm Phi-4 leads on factual reliability — a distinct architectural strength.

Ministral 3 Family and Language Mastery: Ministral 3 14B leads HEE outright (49/50, 98%) and holds the PLE lead among the 11 cross-platform models (95% Expert Mastery) — only Mistral Small 3.1 24B exceeds it at 96.5%, but that model is Desktop Pro tier only. The Ministral 3 family is optimized for language quality and instruction following — the strongest choice on mainstream hardware for writing, editing, and summarization. The CR-2026-415 re-benchmark at `--temp 0.15` produced a notable +3-point PhD Philosophy gain for Ministral 3 3B (91 → 94), reflecting tighter sampling on reasoning tasks; other knowledge scores across the family were sampling-insensitive and unchanged within ±1 point.

Gemma 4 Generational Leap: Gemma 4 E4B (4.5B eff) achieves 97% on PhD Philosophy — tied with Ministral 3 14B and behind only Phi-4 14B among the cross-platform benchmarked models — while running on hardware suited to mini PCs and laptops. Both Gemma 4 edge models demonstrate a significant architectural advance, with official MMLU-Pro scores of 60.0% (E2B) and 69.4% (E4B) confirming generational gains over their Gemma 3 predecessors.

Desktop Pro — First PhD Mastery: Mistral Small 3.1 24B is the first and only model in the catalog to achieve a perfect 100/100 on the PhD Philosophy benchmark, and it earned the highest composite MSI (Model Suitability Index) of any model tested (1.67 post-CR-415, essentially unchanged from 1.68 at temp=1.0). It also leads the PLE benchmark outright (193/200, 96.5%, Expert / Mastery) and sits in the 48/50 (96%) HEE group, one point behind Ministral 3 14B's catalog-leading 49/50.

At the Mistral-AI-recommended `--temp 0.15`, its per-task profile on the Model Benchmark is uniformly strong on structured tasks: Grammar Polish A at 65.8 TPS, Translation ES A at 59.7 TPS, XML Generation A at 37.8 TPS with a perfect 10/10, Action Item Extraction A at 49.8 TPS with 8/9, and Vision OCR A at 16.3 TPS with a perfect 100% (11/11 target terms) — the strongest vision performance of any catalog model. The one caveat is Summarization, which graded C at 42.5 TPS with output running 37% of source length (target: under 25%). CR-2026-415 reduced the sampling temperature from 1.0 to 0.15, but Summarization verbosity remained essentially unchanged (36% pre-CR-415 vs 37% post-CR-415), indicating the verbosity is a prompt-compliance pattern rather than a sampling artifact. All four Mistral-family entries (Ministral 3 3B/8B/14B and Mistral Small 3.1 24B) show the same pattern (Grade C at 33–46% source length), suggesting a family-wide prompt-following characteristic that may warrant a future Mistral-specific summarize prompt revision. At 40.7 TPS average, Mistral Small 3.1 24B is the slowest model on the desktop GPU — a fair trade for the depth it delivers, and the price of admission for the Desktop Pro tier.
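A minimal sketch of the length-ratio check described above. The orchestrator's actual grading logic is not published here, and whether the ratio is measured in words, characters, or tokens is an assumption:

```python
# Illustrative length-ratio check for the summarization caveat above.
# Assumption: ratio measured in words; the real metric and grade thresholds
# used by the Benchmark Orchestrator are not shown in this article.

def summary_ratio(source: str, summary: str) -> float:
    """Summary length as a fraction of source length, by word count."""
    return len(summary.split()) / max(1, len(source.split()))

def meets_target(source: str, summary: str, target: float = 0.25) -> bool:
    return summary_ratio(source, summary) < target

# A 1,000-word note summarized in 370 words gives a ratio of 0.37: over the
# 25% target, matching the Grade C pattern described for the Mistral family.
```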


Model Recommendations by Hardware Tier

Ultra-Light — Up to ~2 GB RAM Available for AI

Recommended: Gemma 3 1B

The only model viable under severe memory constraints. At 219 TPS it responds instantly. Best used for quick note titling, short summaries, and basic text polish. Not suitable for complex analysis. This is NotesXML's free-tier default, ensuring every user has functional AI from day one.


Lightweight — ~2–4 GB RAM Available for AI

Best for performance: Phi-4 Mini 3.8B — The strongest all-around result in this tier: HEE 90%, PLE 89%, PhD 93%, SDB-1 60%, 177 TPS, with the most comprehensively published external profile of any sub-4B model — MMLU 67.3%, MMLU-Pro 52.8%, ARC-C 83.7%, TruthfulQA 66.4%.

Best for language quality: Ministral 3 3B — PLE 88%, PhD 94%, MMLU 60.8%, and 20% SDB-1 at 147 TPS. Better language quality than its published MMLU suggests; especially strong for drafting and summarization.

Best for multilingual tasks: SmolLM3 3B — HEE 86%, PLE 78%, PhD 85% at 191.6 TPS — the fastest 3B-class model in the catalog. Trained on 11.2 trillion tokens with native support for 6 languages and a 128K context window via YaRN. Published Global MMLU 68.9%, ARC-C 62.3%, HellaSwag 78.5%. Quality trails Ministral 3 3B on reasoning benchmarks but offers the best multilingual coverage and context length in this tier.


Standard — ~3–5 GB RAM Available for AI

Recommended: Gemma 4 E2B (2.3B effective)

Post-graduate level knowledge and reasoning in an edge-friendly package: HEE 94%, PhD Philosophy 91%, MMLU-Pro 60.0% (official Google model card) at 149 TPS. Native audio support on E2B/E4B models enables voice-based workflows. The absence of SDB-1 performance is the only meaningful limitation for this tier.

Also strong: Gemma 3n E4B (4B effective) — HEE 96%, PhD 95%, 88 TPS. Comparable knowledge scores to Gemma 4 E2B at lower speed; a solid alternative for systems where RAM is slightly more available than the E2B's footprint requires.


Enhanced — ~5–9 GB RAM Available for AI

Recommended: Ministral 3 8B

Enters Expert/Mastery territory on PLE (94%) and PhD Philosophy (96%), matching much larger models on language benchmarks at 83 TPS. Published MMLU-Pro 64.2% confirms competitive reasoning capability.

Also strong: Gemma 4 E4B (4.5B effective) — PhD Philosophy 97%, tied with Ministral 3 14B and behind only Phi-4 14B among the cross-platform models, with MMLU-Pro 69.4% from Google's official model card. Runs at 100 TPS. Particularly strong choice for knowledge-intensive and vision tasks.


Advanced — ~9–12 GB RAM Available for AI

Best for language quality: Ministral 3 14B — Tops HEE outright (98%) and leads PLE in the Advanced tier (95% Expert Mastery). The go-to choice on mainstream hardware for drafting, editing, summarization, and professional writing at 57 TPS.

Best for reasoning: Phi-4 14B — The only benchmarked model to achieve Advanced Deductive Reasoning on SDB-1 (80%), the top PhD Philosophy score in the Advanced tier (98%), with the strongest published external profile in the suite: MMLU 84.8%, MMLU-Pro 73.3%, ARC-C 87.5%, TruthfulQA 77.0%. The clear choice for structured analysis, logical reasoning, and high-reliability factual assistance at 71 TPS.


Desktop Pro — 16 GB+ Discrete GPU VRAM

Recommended: Mistral Small 3.1 24B

The deepest model in the catalog: first and only to achieve perfect 100/100 PhD Philosophy Mastery, leads PLE outright (96.5% Expert / Mastery), scores 96% on HEE (one point behind Ministral 3 14B's catalog-leading 49/50), and posts 5/10 on SDB-1 (behind Phi-4 14B's 8/10 and Phi-4 Mini's 6/10). Highest composite MSI of any model tested (1.67), vision-capable with perfect Vision OCR (11/11 terms). Runs at 40.7 TPS on RTX 5070 Ti 16 GB with 39 GPU layers at the Mistral-AI-recommended `--temp 0.15` — slower in raw tokens/sec than smaller models, but every response is graded at post-graduate / expert depth across domains. Published industry scores align: MMLU ~81%, MMLU-Pro 66.76%, ARC-C 91.29%. Requires 16 GB of discrete GPU VRAM; not suitable for CPU-only or integrated-graphics systems. The right choice when response quality matters more than response speed and the hardware is available.


Summary Recommendation Table

Your Hardware Best Model Primary Strength
Phone / low-RAM device Gemma 3 1B Only viable option under 2 GB; fast
Light device, general use (2–4 GB) Phi-4 Mini 3.8B Best all-rounder; SDB-1 leader in class
Light device, multilingual (2–4 GB) SmolLM3 3B 128K context; 6 languages; strong MMLU
Standard device (3–5 GB for AI) Gemma 4 E2B Post-graduate knowledge at edge scale
Enhanced desktop (5–9 GB) — writing Ministral 3 8B Expert/Mastery language at practical speed
Enhanced desktop (5–9 GB) — knowledge Gemma 4 E4B Top PhD/HEE scores in this tier
Advanced (9–12 GB) — writing Ministral 3 14B Best HEE in the catalog; best PLE outside Desktop Pro
Advanced (9–12 GB) — reasoning Phi-4 14B Best SDB-1; strongest external profile
Desktop Pro (16+ GB VRAM) Mistral Small 3.1 24B First PhD Mastery; perfect Vision OCR; highest MSI

Notes on GPU Acceleration

The desktop benchmarks above use GPU-accelerated inference via an NVIDIA RTX 5070 Ti, while the laptop, mini PC, and mobile benchmarks reflect CPU-only inference. The Performance by Hardware Platform table shows the full range: from 219 TPS on the GPU desktop down to 1.7 TPS on the tablet. Users with a discrete NVIDIA GPU on any platform will see similar acceleration through GPU offload. NotesXML's AI Acceleration Roadmap covers NVIDIA CUDA (supported today), with Intel iGPU acceleration via SYCL and AMD ROCm planned for future releases.


About NotesXML AI

All model inference runs locally on your device using llama.cpp. No subscription is required for AI beyond the one-time Professional license. Models are downloaded once and stored locally — no internet connection is needed at inference time. Your notes, queries, and AI responses remain entirely private.



© 2026 IWV Digital Solutions LLC. All rights reserved.
