May 2026 · IWV Digital Solutions · 10 min read

Long Context Benchmark (LongCtx)

IWV Digital Solutions LLC | NotesXML AI Evaluation Series

20 tasks × 5 points each = 100 points maximum. 4 task types across 5 context length tiers (4K–128K tokens). VRAM cascade detection prevents scores from being silently inflated by insufficient hardware. The most hardware-sensitive benchmark in the suite.

Introduction

One of the most important capabilities a note-taking AI can have is the ability to reason about long documents — not just glance at the first paragraph, but track specific facts planted deep within a document, synthesize information scattered across thousands of words, and apply rules correctly even when the relevant details are separated by tens of thousands of tokens.

The Long Context Benchmark measures exactly this. It systematically tests whether an AI model can retain and recall information at five different context lengths — from 4K to 128K tokens — using four distinct task types that probe different aspects of long-context comprehension.

Background: The “Lost in the Middle” Problem

Language models have finite context windows. In practice, as context grows, many models suffer from what researchers call the “lost in the middle” phenomenon: information at the beginning and end of a long context is reliably recalled, but facts buried in the middle are frequently forgotten. A related issue is positional bias — models retrieving information from some document positions more reliably than others. LongCtx measures both.

Reference: Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172.

Assessment Overview

LongCtx comprises 20 tasks, each worth 5 points, for a total of 100 points. Tasks span five context length tiers and four task types. Every task draws on the same pool of generic technical filler prose (procurement workflows, audit schedules, risk matrices, calibration certificates) to pad the context without introducing content that could trigger a hallucinated “correct” answer. Each response is capped at 200 tokens (maxTokens).

Per-stage timeouts protect against hung model loads: 5 minutes for model switching, and 2 minutes plus 2 seconds per 1K context tokens for each inference task (so a task at the 128K tier, which targets 131,072 tokens, gets a ceiling of about 6 min 22 sec).
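The timeout rule above can be written out as a short calculation. This is a minimal sketch, not the benchmark's own code: the function names are hypothetical, and it assumes "1K" means 1,000 tokens, which is the reading that reproduces the 6 min 22 sec figure.

```python
# Assumption: "1K" is read as 1,000 tokens. All names here are illustrative.
MODEL_SWITCH_TIMEOUT_S = 5 * 60  # 5 minutes for model switching


def inference_timeout_seconds(context_tokens: int) -> float:
    """2 minutes base plus 2 seconds per 1,000 context tokens."""
    return 120 + 2 * (context_tokens / 1000)


def format_min_sec(seconds: float) -> str:
    """Render a duration rounded to the nearest second."""
    total = round(seconds)
    return f"{total // 60} min {total % 60:02d} sec"


if __name__ == "__main__":
    for tokens in (4096, 16384, 32768, 65536, 131072):
        print(tokens, format_min_sec(inference_timeout_seconds(tokens)))
    # The last line printed is: 131072 6 min 22 sec
```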

Context Length Tiers

Tier	Target Tokens	Approx. Word Count
1	4,096	~2,800 words
2	16,384	~11,500 words
3	32,768	~23,000 words
4	65,536	~45,900 words
5	131,072	~91,800 words

Task Types

Type	Description	Tasks	Grading
needle	Single fact embedded in filler at a specified position; model asked to retrieve it	LCB-001, 005, 009, 013, 017	Regex match: 5 pts (4 if verbose); otherwise 0–3 pts by token overlap, per the Scoring Rubric
position_bias	Same as needle, but position varies (10%, 50%, 90%, 95%) to probe retrieval uniformity	LCB-003, 007, 011, 015, 019	Same rubric as needle
multi_fact	Multiple related facts at different positions; model must synthesize all	LCB-002, 006, 010, 014, 018	Partial credit per matched pattern
reasoning	Facts spread throughout context; model applies a rule and reaches a conclusion	LCB-004, 008, 012, 016, 020	Rubric-scored: correct answer + citations + reasoning
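The needle and position_bias task types both rely on placing a fact at a chosen relative depth in the filler. One way this could be done is sketched below. The function name, the word-based length target, and the sentence-level insertion are all assumptions made for illustration; the source does not describe how LongCtx assembles its contexts.

```python
def build_context(filler_sentences: list[str], needle: str,
                  position: float, target_words: int) -> str:
    """Cycle filler sentences up to target_words, then insert the needle
    at the given fractional position (0.0 = start, 1.0 = end)."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    usable = [s for s in filler_sentences if s.split()]
    if not usable:
        raise ValueError("need at least one non-empty filler sentence")

    padded: list[str] = []
    words = 0
    i = 0
    while words < target_words:
        sentence = usable[i % len(usable)]
        padded.append(sentence)
        words += len(sentence.split())
        i += 1

    # Hypothetical choice: insert at a sentence boundary, not a token offset.
    padded.insert(round(position * len(padded)), needle)
    return " ".join(padded)


if __name__ == "__main__":
    filler = ["The audit schedule is reviewed quarterly.",
              "Calibration certificates are filed by the procurement office."]
    needle = "The Helios-7 satellite completes one full orbit in 94 minutes."
    context = build_context(filler, needle, position=0.10, target_words=2800)
    print(round(context.index(needle) / len(context), 2))
```

Inserting at a sentence boundary keeps the needle intact, at the cost of the actual depth being only approximately equal to the requested percentage.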

The 20 Tasks

Tier 1 — 4K Context

ID	Type	Needle / Facts	Question	Expected Answer / Rubric Key
LCB-001	needle (pos 50%)	“Project Crestfall achieved a resolution of 0.003 arc-seconds in its phase-two calibration.”	What resolution did Project Crestfall achieve in its phase-two calibration?	0.003 arc-seconds
LCB-002	multi_fact	Aurora Initiative: (1) 3 founding members at pos 20%; (2) established July 14, 2019 at pos 60%; (3) initial funding $12 million at pos 85%	How many founding members, when established, and what was initial funding?	3 / July 14, 2019 / $12 million
LCB-003	position_bias (pos 10%)	“The Helios-7 satellite completes one full orbit in 94 minutes.”	How long does Helios-7 take to complete one full orbit?	94 minutes
LCB-004	reasoning	Policy 3.2: safety audit every 18 months (pos 15%). Unit Alpha-12 last audit March 1, 2024 (pos 50%). Today is November 1, 2025 (pos 80%).	Is Unit Alpha-12 overdue for a safety audit? Briefly explain.	Yes/overdue + cites 18-month interval + cites last audit date
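The expected answer for LCB-004 follows from simple date arithmetic, worked through here as a check. The helper `add_months` is a hypothetical utility written for this sketch, since Python's standard library has no built-in month addition.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Facts planted in LCB-004
last_audit = date(2024, 3, 1)
today = date(2025, 11, 1)
audit_interval_months = 18  # Policy 3.2

next_due = add_months(last_audit, audit_interval_months)
overdue = today > next_due

if __name__ == "__main__":
    print(next_due, overdue)  # prints: 2025-09-01 True
```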

Tier 2 — 16K Context

ID	Type	Needle / Facts	Question	Expected Answer
LCB-005	needle (pos 30%)	“The Meridian Protocol requires all field reports to be submitted within 48 hours of the observed event.”	What is the submission deadline for field reports under the Meridian Protocol?	48 hours
LCB-006	multi_fact	Compound XR-7: (1) boiling point 312°C at pos 10%; (2) synthesized 1987 at pos 55%; (3) Group II hazardous material at pos 80%	Boiling point, year of synthesis, hazard classification?	312°C / 1987 / Group II
LCB-007	position_bias (pos 50%)	“The operational frequency of the Nexus relay is 2.4 GHz.”	What is the operational frequency of the Nexus relay?	2.4 GHz
LCB-008	reasoning	Contract Section 4.1: delivery within 30 days of PO date (pos 20%). PO #4472 issued September 15, 2025 (pos 60%). Delivery October 20, 2025 (pos 75%).	Was delivery under PO #4472 made on time? Briefly explain.	No/late + cites 30 days + cites dates (delivery was due by October 15, 2025)

Tier 3 — 32K Context

ID	Type	Needle / Facts	Question	Expected Answer
LCB-009	needle (pos 40%)	“The Thornwick Array uses exactly 72 resonance chambers in its primary configuration.”	How many resonance chambers does the Thornwick Array use?	72
LCB-010	multi_fact	Kepler-9 project: (1) 6-layer neural architecture at pos 15%; (2) deployed April 3, 2023 at pos 45%; (3) 200 ms latency target at pos 75%	Architecture layers, deployment date, latency target?	6 layers / April 3, 2023 / 200 ms
LCB-011	position_bias (pos 90%)	“The Cascade Bridge was constructed using 8,400 metric tons of structural steel.”	How much structural steel was used in the Cascade Bridge?	8,400 metric tons
LCB-012	reasoning	SOP-14: minimum crew of 4 for night operations (pos 10%). Current night crew: 3 personnel (pos 40%). Night operations scheduled tonight (pos 70%).	Is tonight's crew in compliance with SOP-14? Briefly explain.	No/not in compliance + cites minimum of 4 + cites current crew of 3

Tier 4 — 64K Context

ID	Type	Needle / Facts	Question	Expected Answer
LCB-013	needle (pos 20%)	“The Veridian compound requires a minimum curing temperature of 85 degrees Celsius.”	What is the minimum curing temperature for the Veridian compound?	85 degrees Celsius
LCB-014	multi_fact	Sentinel Array: (1) 144 ground-based monitoring stations at pos 10%; (2) transmit data every 6 minutes at pos 45%; (3) 2 geographically distinct archive facilities at pos 85%	Station count, transmission interval, number of archive facilities?	144 / every 6 minutes / 2 facilities
LCB-015	position_bias (pos 10%)	“The Vega-X compound has a melting point of 1742 degrees Celsius.”	What is the melting point of the Vega-X compound?	1742 degrees Celsius
LCB-016	reasoning	Spec SC-3: bolts torqued to 145 N·m ±5% (pos 10%). Reading of 138 N·m recorded on assembly line 4 (pos 45%). SC-3 applies to all bolts on assembly line 4 (pos 75%).	Was the torque reading on assembly line 4 within spec? Briefly explain.	Yes/within spec + cites 138, 145, 5% tolerance + cites the lower bound of 137.75 N·m

Tier 5 — 128K Context

ID	Type	Needle / Facts	Question	Expected Answer
LCB-017	needle (pos 50%)	“The Quintessa platform reached commercial readiness on August 8, 2032.”	On what date did the Quintessa platform reach commercial readiness?	August 8, 2032
LCB-018	multi_fact	Strategy Sigma: (1) proposed by Mei Wong, March 2025 at pos 5%; (2) targets infrastructure resilience in coastal regions at pos 40%; (3) 2027 fiscal allocation 2.4 billion at pos 70%; (4) renamed Strategy Sigma-Plus early 2028 at pos 95%	Who proposed it and when, target domain, 2027 fiscal allocation, when renamed?	Mei Wong / March 2025 / coastal resilience / 2.4 billion / early 2028
LCB-019	position_bias (pos 95%)	“The Aurora-9 prototype completed its final endurance test on December 1, 2034.”	On what date did the Aurora-9 prototype complete its final endurance test?	December 1, 2034
LCB-020	reasoning	Policy 8.1: retire any device older than 10 years (pos 5%). Unit RGV-441 manufacturing date June 12, 2014 (pos 30%). Last refurbished November 4, 2024 (pos 55%). Refurbishment does NOT reset manufacturing date for Policy 8.1 (pos 80%).	As of April 30, 2026, must RGV-441 be retired per Policy 8.1? Briefly explain.	Yes/must be retired + cites 2014/10+ years + cites refurbishment rule

Scoring Rubric

Needle and Position Bias Tasks

Score	Criteria
5	Exact correct answer, concise (≤60 words)
4	Correct answer with significant noise (>60 words)
3	Close match (token overlap ratio >0.6)
2	Partial match (token overlap 0.3–0.6)
1	Topic mentioned, wrong answer
0	Empty response, no topic reference, or completely wrong
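The rubric above could be applied roughly as follows. This is a sketch under stated assumptions, not the benchmark's grader: the function names are hypothetical, "token overlap" is taken to mean the share of expected-answer tokens found in the response, and the topic check is a plain substring test.

```python
import re


def token_overlap(expected: str, response: str) -> float:
    """Assumed definition: fraction of expected tokens present in the response."""
    expected_tokens = set(re.findall(r"\w+", expected.lower()))
    response_tokens = set(re.findall(r"\w+", response.lower()))
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & response_tokens) / len(expected_tokens)


def score_needle(response: str, answer_pattern: str, expected: str,
                 topic_terms: list[str]) -> int:
    """Score a needle or position_bias response on the 0-5 rubric."""
    text = response.strip()
    if not text:
        return 0
    if re.search(answer_pattern, text, re.IGNORECASE):
        return 5 if len(text.split()) <= 60 else 4  # concise vs. verbose
    overlap = token_overlap(expected, text)
    if overlap > 0.6:
        return 3
    if overlap >= 0.3:
        return 2
    if any(term.lower() in text.lower() for term in topic_terms):
        return 1  # topic mentioned, wrong answer
    return 0


if __name__ == "__main__":
    print(score_needle("It takes 94 minutes.", r"\b94\s+minutes\b",
                       "94 minutes", ["Helios-7"]))  # prints: 5
```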

Multi-Fact Tasks

Points are awarded for each correctly retrieved fact, as determined by the task's required regex patterns, up to a maximum of 5 points per task.
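A minimal sketch of per-fact partial credit is shown below. The source does not say how the 5 points are divided among facts, so the proportional split and half-up rounding here are assumptions, as are the function name and the example patterns.

```python
import math
import re


def score_multi_fact(response: str, patterns: list[str],
                     max_points: int = 5) -> int:
    """Award points in proportion to the number of fact patterns matched.

    Assumption: proportional split, rounded half-up to a whole point.
    """
    if not patterns:
        raise ValueError("at least one pattern is required")
    matched = sum(1 for p in patterns if re.search(p, response, re.IGNORECASE))
    return math.floor(max_points * matched / len(patterns) + 0.5)


if __name__ == "__main__":
    # Hypothetical patterns for LCB-006 (Compound XR-7)
    xr7_patterns = [r"312\s*°?\s*C", r"\b1987\b", r"Group\s+II\b"]
    reply = "The boiling point is 312°C and it was synthesized in 1987."
    print(score_multi_fact(reply, xr7_patterns))  # prints: 3
```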

Reasoning Tasks

Component	Points
Correct conclusion stated	2
Cites the relevant numbers or rule threshold	2
Explicit reasoning chain	1
Penalty: wrong conclusion stated	−5
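The component table above can be combined into a single score as sketched here. The function is hypothetical and takes the component judgments as inputs rather than deriving them from the response text. It also assumes the score is floored at 0 after the penalty, which the source does not state explicitly.

```python
def score_reasoning(correct_conclusion: bool, cites_numbers: bool,
                    explicit_reasoning: bool, wrong_conclusion: bool) -> int:
    """Combine the rubric components into a 0-5 score."""
    points = 0
    if correct_conclusion:
        points += 2
    if cites_numbers:
        points += 2
    if explicit_reasoning:
        points += 1
    if wrong_conclusion:
        points -= 5
    return max(points, 0)  # assumption: a task score never goes negative


if __name__ == "__main__":
    # Wrong conclusion, but cited the numbers and showed reasoning
    print(score_reasoning(False, True, True, True))  # prints: 0
```

Under this reading, the penalty cancels any credit earned for citations or reasoning when the stated conclusion is wrong.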

VRAM Cascade and Tier Skipping

If a VRAM cascade reduces the effective context below a tier's target, all four tasks in that tier are recorded with score: 0 and reason: ‘insufficient_vram’, and are marked skipped: true. Lower tiers that remain achievable run normally. This ensures scores are never inflated by running a 64K-context task in an actual 2K context window.
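The tier-skip rule can be sketched as a planning step that runs before any inference. The record fields score, reason, and skipped come from the description above; the function name, the task-ID layout, and the shape of the non-skipped records are assumptions for illustration.

```python
TIER_TARGETS = {1: 4096, 2: 16384, 3: 32768, 4: 65536, 5: 131072}

# Assumption: tier t holds tasks LCB-(4t-3) through LCB-(4t), as in the task tables.
TASKS_BY_TIER = {
    tier: [f"LCB-{n:03d}" for n in range(4 * tier - 3, 4 * tier + 1)]
    for tier in TIER_TARGETS
}


def plan_tiers(effective_context: int) -> list[dict]:
    """Mark every task in an unreachable tier as skipped with score 0."""
    records = []
    for tier, target in TIER_TARGETS.items():
        for task_id in TASKS_BY_TIER[tier]:
            if effective_context < target:
                records.append({"task": task_id, "score": 0,
                                "reason": "insufficient_vram", "skipped": True})
            else:
                records.append({"task": task_id, "skipped": False})
    return records


if __name__ == "__main__":
    plan = plan_tiers(effective_context=32768)
    print(sum(r["skipped"] for r in plan))  # prints: 8
```

With an effective context of 32,768 tokens, tiers 1 to 3 run and the eight tasks in tiers 4 and 5 are skipped, so the maximum attainable score is 60.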

Overall Score Levels

Score	Percentage	Level
90–100	90–100%	Excellent Retention
75–89	75–89%	Strong Retention
60–74	60–74%	Functional Retention
40–59	40–59%	Partial Retention
20–39	20–39%	Limited Retention
0–19	0–19%	Severe Context Loss
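The level table maps directly onto a threshold lookup. The sketch below uses the band boundaries and labels from the table; the function and constant names are hypothetical.

```python
# (minimum score, level) pairs, highest band first
LEVELS = [
    (90, "Excellent Retention"),
    (75, "Strong Retention"),
    (60, "Functional Retention"),
    (40, "Partial Retention"),
    (20, "Limited Retention"),
    (0, "Severe Context Loss"),
]


def retention_level(score: int) -> str:
    """Return the level label for an overall score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for minimum, label in LEVELS:
        if score >= minimum:
            return label
    raise AssertionError("unreachable: the last band starts at 0")


if __name__ == "__main__":
    print(retention_level(60))  # prints: Functional Retention
```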

Summary

Long Context performance is one of the most hardware-dependent capabilities in the on-device AI stack. The same model may score Excellent Retention at 4K context but Partial Retention at 64K, not because the model is defective but because the device lacks sufficient VRAM for the full context tier. LongCtx makes this visible and quantifiable.

The benchmark's built-in tier-skip mechanism ensures that reported scores always reflect genuine performance at the effective context size. The position bias sub-analysis reveals whether a model retrieves facts uniformly across the document or suffers from the “lost in the middle” effect. For NotesXML users working with large research notebooks, multi-chapter documents, or long meeting transcripts, LongCtx provides the most operationally relevant signal in the entire benchmark suite.

References

Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172
Kamradt, G. (2023). LLMTest_NeedleInAHaystack. github.com/gkamradt/LLMTest_NeedleInAHaystack
Anthropic Research. anthropic.com/research
