Long Context Benchmark (LongCtx)

IWV Digital Solutions LLC | NotesXML AI Evaluation Series

20 tasks × 5 points each = 100 points maximum. 4 task types across 5 context length tiers (4K–128K tokens). VRAM cascade detection prevents scores from being silently inflated by insufficient hardware. The most hardware-sensitive benchmark in the suite.


Introduction

One of the most important capabilities a note-taking AI can have is the ability to reason about long documents — not just glance at the first paragraph, but track specific facts planted deep within a document, synthesize information scattered across thousands of words, and apply rules correctly even when the relevant details are separated by tens of thousands of tokens.

The Long Context Benchmark measures exactly this. It systematically tests whether an AI model can retain and recall information at five different context lengths — from 4K to 128K tokens — using four distinct task types that probe different aspects of long-context comprehension.

Background: The “Lost in the Middle” Problem

Language models have finite context windows. In practice, as context grows, many models suffer from what researchers call the “lost in the middle” phenomenon: information at the beginning and end of a long context is reliably recalled, but facts buried in the middle are frequently forgotten. A related issue is positional bias — models retrieving information from some document positions more reliably than others. LongCtx measures both.

Reference: Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172.


Assessment Overview

LongCtx comprises 20 tasks, each worth 5 points, for a total of 100 points. Tasks span five context length tiers and four task types. Every task uses the same pool of generic technical filler prose (procurement workflows, audit schedules, risk matrices, calibration certificates) to pad context without introducing content that could trigger hallucinated “correct” answers. Each response is capped at 200 tokens (maxTokens = 200).

Per-stage timeouts protect against hung model loads: 5 minutes for model switching, and 2 minutes + 2 additional seconds per 1K context tokens for each inference task (so a 131K-context task gets a ~6 min 22 sec ceiling).
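The inference timeout rule can be written as a one-line formula. The following is a minimal sketch, not the benchmark's actual code: the function names are hypothetical, and it assumes "1K" means 1,000 tokens, which is the reading that reproduces the ~6 min 22 sec figure quoted above.

```python
MODEL_SWITCH_TIMEOUT_SECONDS = 5 * 60  # 5 minutes for model switching


def inference_timeout_seconds(context_tokens: int) -> float:
    """2 minutes base plus 2 seconds per 1K context tokens (1K assumed = 1,000)."""
    return 120.0 + 2.0 * (context_tokens / 1000)


def format_min_sec(seconds: float) -> str:
    """Render a duration as 'M min S sec', dropping fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} sec"


print(format_min_sec(inference_timeout_seconds(131_072)))  # 6 min 22 sec
```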

Context Length Tiers

| Tier | Target Tokens | Approx. Word Count |
|---|---|---|
| 1 | 4,096 | ~2,800 words |
| 2 | 16,384 | ~11,500 words |
| 3 | 32,768 | ~23,000 words |
| 4 | 65,536 | ~45,900 words |
| 5 | 131,072 | ~91,800 words |

Task Types

| Type | Description | Tasks | Grading |
|---|---|---|---|
| needle | Single fact embedded in filler at a specified position; model asked to retrieve it | LCB-001, 005, 009, 013, 017 | Binary regex match: 0 or 5 pts (4 if verbose) |
| position_bias | Same as needle, but position varies (10%, 50%, 90%, 95%) to probe retrieval uniformity | LCB-003, 007, 011, 015, 019 | Same rubric as needle |
| multi_fact | Multiple related facts at different positions; model must synthesize all | LCB-002, 006, 010, 014, 018 | Partial credit per matched pattern |
| reasoning | Facts spread throughout context; model applies a rule and reaches a conclusion | LCB-004, 008, 012, 016, 020 | Rubric-scored: correct answer + citations + reasoning |
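To make the idea of a fact "embedded in filler at a specified position" concrete, here is a minimal sketch of how such a context might be assembled. It is an illustration only: `build_context`, the sample filler sentences, and the choice to measure position in words rather than tokens are all assumptions, not the benchmark's actual generator.

```python
from itertools import cycle

# Hypothetical stand-ins for the benchmark's filler pool.
FILLER_SENTENCES = [
    "The procurement workflow requires two approvals before release.",
    "Audit schedules are reviewed at the start of each quarter.",
    "The risk matrix is updated after every incident review.",
    "Calibration certificates are filed with the equipment log.",
]


def build_context(needles: dict[float, str], target_words: int) -> str:
    """Pad with filler up to about target_words, inserting each needle
    once the filler written so far reaches its relative position (0.0-1.0)."""
    pending = sorted(needles.items())
    parts: list[str] = []
    words = 0
    for sentence in cycle(FILLER_SENTENCES):
        # Emit every needle whose position threshold has been reached.
        while pending and words >= pending[0][0] * target_words:
            parts.append(pending.pop(0)[1])
        if words >= target_words:
            break
        parts.append(sentence)
        words += len(sentence.split())
    return " ".join(parts)


needle = ("Project Crestfall achieved a resolution of 0.003 arc-seconds "
          "in its phase-two calibration.")
context = build_context({0.5: needle}, target_words=2800)
print(round(context.index(needle) / len(context), 2))  # relative position
```

A multi_fact task would pass several entries in `needles`, one per fact and position.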

The 20 Tasks

Tier 1 — 4K Context

| ID | Type | Needle / Facts | Question | Expected Answer / Rubric Key |
|---|---|---|---|---|
| LCB-001 | needle (pos 50%) | “Project Crestfall achieved a resolution of 0.003 arc-seconds in its phase-two calibration.” | What resolution did Project Crestfall achieve in its phase-two calibration? | 0.003 arc-seconds |
| LCB-002 | multi_fact | Aurora Initiative: (1) 3 founding members at pos 20%; (2) established July 14, 2019 at pos 60%; (3) initial funding $12 million at pos 85% | How many founding members, when established, and what was initial funding? | 3 / July 14, 2019 / $12 million |
| LCB-003 | position_bias (pos 10%) | “The Helios-7 satellite completes one full orbit in 94 minutes.” | How long does Helios-7 take to complete one full orbit? | 94 minutes |
| LCB-004 | reasoning | Policy 3.2: safety audit every 18 months (pos 15%). Unit Alpha-12 last audit March 1, 2024 (pos 50%). Today is November 1, 2025 (pos 80%). | Is Unit Alpha-12 overdue for a safety audit? Briefly explain. | Yes/overdue + cites 18-month interval + cites last audit date |

Tier 2 — 16K Context

| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-005 | needle (pos 30%) | “The Meridian Protocol requires all field reports to be submitted within 48 hours of the observed event.” | What is the submission deadline for field reports under the Meridian Protocol? | 48 hours |
| LCB-006 | multi_fact | Compound XR-7: (1) boiling point 312°C at pos 10%; (2) synthesized 1987 at pos 55%; (3) Group II hazardous material at pos 80% | Boiling point, year of synthesis, hazard classification? | 312°C / 1987 / Group II |
| LCB-007 | position_bias (pos 50%) | “The operational frequency of the Nexus relay is 2.4 GHz.” | What is the operational frequency of the Nexus relay? | 2.4 GHz |
| LCB-008 | reasoning | Contract Section 4.1: delivery within 30 days of PO date (pos 20%). PO #4472 issued September 15, 2025 (pos 60%). Delivery October 20, 2025 (pos 75%). | Was delivery under PO #4472 made on time? Briefly explain. | No/late (35 days after the PO date; deadline October 15, 2025) + cites 30 days + cites dates |

Tier 3 — 32K Context

| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-009 | needle (pos 40%) | “The Thornwick Array uses exactly 72 resonance chambers in its primary configuration.” | How many resonance chambers does the Thornwick Array use? | 72 |
| LCB-010 | multi_fact | Kepler-9 project: (1) 6-layer neural architecture at pos 15%; (2) deployed April 3, 2023 at pos 45%; (3) 200 ms latency target at pos 75% | Architecture layers, deployment date, latency target? | 6 layers / April 3, 2023 / 200 ms |
| LCB-011 | position_bias (pos 90%) | “The Cascade Bridge was constructed using 8,400 metric tons of structural steel.” | How much structural steel was used in the Cascade Bridge? | 8,400 metric tons |
| LCB-012 | reasoning | SOP-14: minimum crew of 4 for night operations (pos 10%). Current night crew: 3 personnel (pos 40%). Night operations scheduled tonight (pos 70%). | Is tonight's crew in compliance with SOP-14? Briefly explain. | No/not in compliance + cites minimum of 4 + cites current crew of 3 |

Tier 4 — 64K Context

| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-013 | needle (pos 20%) | “The Veridian compound requires a minimum curing temperature of 85 degrees Celsius.” | What is the minimum curing temperature for the Veridian compound? | 85 degrees Celsius |
| LCB-014 | multi_fact | Sentinel Array: (1) 144 ground-based monitoring stations at pos 10%; (2) transmit data every 6 minutes at pos 45%; (3) 2 geographically distinct archive facilities at pos 85% | Station count, transmission interval, number of archive facilities? | 144 / every 6 minutes / 2 facilities |
| LCB-015 | position_bias (pos 10%) | “The Vega-X compound has a melting point of 1742 degrees Celsius.” | What is the melting point of the Vega-X compound? | 1742 degrees Celsius |
| LCB-016 | reasoning | Spec SC-3: bolts torqued to 145 N·m ±5% (pos 10%). Reading of 138 N·m recorded on assembly line 4 (pos 45%). SC-3 applies to all bolts on assembly line 4 (pos 75%). | Was the torque reading on assembly line 4 within spec? Briefly explain. | Yes/within spec + cites 138, 145, 5% tolerance + cites that 138 N·m is above the lower bound (145 × 0.95 = 137.75 N·m) |

Tier 5 — 128K Context

| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-017 | needle (pos 50%) | “The Quintessa platform reached commercial readiness on August 8, 2032.” | On what date did the Quintessa platform reach commercial readiness? | August 8, 2032 |
| LCB-018 | multi_fact | Strategy Sigma: (1) proposed by Mei Wong, March 2025 at pos 5%; (2) targets infrastructure resilience in coastal regions at pos 40%; (3) 2027 fiscal allocation 2.4 billion at pos 70%; (4) renamed Strategy Sigma-Plus early 2028 at pos 95% | Who proposed it and when, target domain, 2027 fiscal allocation, when renamed? | Mei Wong / March 2025 / coastal resilience / 2.4 billion / early 2028 |
| LCB-019 | position_bias (pos 95%) | “The Aurora-9 prototype completed its final endurance test on December 1, 2034.” | On what date did the Aurora-9 prototype complete its final endurance test? | December 1, 2034 |
| LCB-020 | reasoning | Policy 8.1: retire any device older than 10 years (pos 5%). Unit RGV-441 manufacturing date June 12, 2014 (pos 30%). Last refurbished November 4, 2024 (pos 55%). Refurbishment does NOT reset manufacturing date for Policy 8.1 (pos 80%). | As of April 30, 2026, must RGV-441 be retired per Policy 8.1? Briefly explain. | Yes/must be retired + cites 2014/10+ years + cites refurbishment rule |
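The date-based reasoning tasks reduce to simple interval arithmetic. As a worked check of two of them (LCB-004 and LCB-020), the sketch below recomputes the deadlines from the planted facts. The `add_months` helper is a hypothetical convenience written for this example, not part of the benchmark.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, keeping the day of month.
    Sufficient here because both start dates fall on days 1-28."""
    index = start.year * 12 + (start.month - 1) + months
    return start.replace(year=index // 12, month=index % 12 + 1)


# LCB-004: audit every 18 months, last audit March 1, 2024, today November 1, 2025.
audit_due = add_months(date(2024, 3, 1), 18)
lcb_004_overdue = date(2025, 11, 1) > audit_due

# LCB-020: retire devices older than 10 years, manufactured June 12, 2014,
# evaluated as of April 30, 2026. The refurbishment date is ignored per Policy 8.1.
ten_year_mark = add_months(date(2014, 6, 12), 120)
lcb_020_retire = date(2026, 4, 30) > ten_year_mark

print(audit_due, lcb_004_overdue)      # 2025-09-01 True
print(ten_year_mark, lcb_020_retire)   # 2024-06-12 True
```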

Scoring Rubric

Needle and Position Bias Tasks

| Score | Criteria |
|---|---|
| 5 | Exact correct answer, concise (≤60 words) |
| 4 | Correct answer with significant noise (>60 words) |
| 3 | Close match (token overlap ratio >0.6) |
| 2 | Partial match (token overlap 0.3–0.6) |
| 1 | Topic mentioned, wrong answer |
| 0 | Empty response, no topic reference, or completely wrong |
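The rubric above could be applied roughly as follows. This is a sketch under stated assumptions: the source does not define how token overlap is computed or how a topic mention is detected, so here overlap is taken as the fraction of expected-answer tokens that appear in the response, and the topic check is a plain keyword search. The function names and parameters are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"\w+(?:[.,]\w+)*")  # keeps "0.003" and "8,400" whole


def token_overlap(expected: str, response: str) -> float:
    """Assumed definition: share of expected-answer tokens found in the response."""
    expected_tokens = set(TOKEN_RE.findall(expected.lower()))
    response_tokens = set(TOKEN_RE.findall(response.lower()))
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & response_tokens) / len(expected_tokens)


def score_needle(response: str, answer_pattern: str, expected: str,
                 topic: str, concise_limit: int = 60) -> int:
    """Score a needle or position_bias response on the 0-5 rubric."""
    text = response.strip()
    if not text:
        return 0
    if re.search(answer_pattern, text, re.IGNORECASE):
        # Correct answer: full marks if concise, one point off if verbose.
        return 5 if len(text.split()) <= concise_limit else 4
    overlap = token_overlap(expected, text)
    if overlap > 0.6:
        return 3
    if overlap >= 0.3:
        return 2
    return 1 if topic.lower() in text.lower() else 0


pattern = r"0\.003\s*arc-?seconds"
print(score_needle("0.003 arc-seconds", pattern, "0.003 arc-seconds", "Crestfall"))  # 5
```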

Multi-Fact Tasks

Points are awarded per correctly retrieved fact, with each fact checked against its required regex pattern. The maximum is 5 points per task.
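One way per-fact partial credit might be computed is sketched below. The source does not state how the 5 points are divided among facts, so proportional allocation rounded to the nearest integer is an assumption, and the regex patterns shown for LCB-002 are illustrative rather than the benchmark's own.

```python
import re


def score_multi_fact(response: str, fact_patterns: list[str],
                     max_points: int = 5) -> int:
    """Award points in proportion to the number of fact patterns matched
    (assumed allocation; rounds half up to a whole point)."""
    if not fact_patterns:
        return 0
    matched = sum(
        1 for pattern in fact_patterns
        if re.search(pattern, response, re.IGNORECASE)
    )
    return int(max_points * matched / len(fact_patterns) + 0.5)


# Illustrative patterns for LCB-002 (Aurora Initiative).
lcb_002_patterns = [
    r"\b(3|three)\b",               # founding members
    r"July\s+14,?\s+2019",          # establishment date
    r"\$\s?12\s*(million|M)\b",     # initial funding
]

answer = ("It had 3 founding members, was established July 14, 2019, "
          "and started with $12 million.")
print(score_multi_fact(answer, lcb_002_patterns))  # 5
```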

Reasoning Tasks

| Component | Points |
|---|---|
| Correct conclusion stated | 2 |
| Cites the relevant numbers or rule threshold | 2 |
| Explicit reasoning chain | 1 |
| Penalty: wrong conclusion stated | −5 |
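The component rubric can be combined into a task score along these lines. This is a sketch: the judgement fields are hypothetical names for whatever checks the grader performs, and clamping the result to the 0–5 range (so the −5 penalty zeroes the task rather than producing a negative score) is an assumption not stated in the source.

```python
from dataclasses import dataclass


@dataclass
class ReasoningJudgement:
    """Hypothetical container for the grader's per-component findings."""
    correct_conclusion: bool
    cites_numbers_or_threshold: bool
    explicit_reasoning: bool
    wrong_conclusion: bool = False


def score_reasoning(judgement: ReasoningJudgement) -> int:
    """Sum rubric components, apply the penalty, clamp to 0-5 (assumed floor)."""
    points = 0
    if judgement.correct_conclusion:
        points += 2
    if judgement.cites_numbers_or_threshold:
        points += 2
    if judgement.explicit_reasoning:
        points += 1
    if judgement.wrong_conclusion:
        points -= 5
    return max(0, min(5, points))


# A response that cites the right figures and reasons explicitly,
# but states the wrong conclusion, ends at 0 under this reading.
print(score_reasoning(ReasoningJudgement(False, True, True, wrong_conclusion=True)))  # 0
```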

VRAM Cascade and Tier Skipping

If a VRAM cascade reduces effective context below a tier's target, all 4 tasks in that tier are recorded as score: 0, reason: ‘insufficient_vram’ and marked skipped: true. Lower tiers that are achievable continue to run normally. This ensures scores are never inflated by running a 65K-context task in an actual 2K context window.
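The tier-skip rule can be illustrated with a short sketch. The record fields (`score`, `reason`, `skipped`) come from the description above; the function name, the `id` and `tier` fields, and the assumption that task IDs run in blocks of four per tier (LCB-001 to 004 in Tier 1, and so on, as in the task tables) are illustrative.

```python
TIER_TARGET_TOKENS = {1: 4_096, 2: 16_384, 3: 32_768, 4: 65_536, 5: 131_072}
TASKS_PER_TIER = 4


def skipped_task_records(effective_context_tokens: int) -> list[dict]:
    """Return zero-score records for every task in a tier whose target
    exceeds the effective context left after a VRAM cascade."""
    records = []
    for tier, target in TIER_TARGET_TOKENS.items():
        if effective_context_tokens < target:
            first = (tier - 1) * TASKS_PER_TIER + 1
            for number in range(first, first + TASKS_PER_TIER):
                records.append({
                    "id": f"LCB-{number:03d}",
                    "tier": tier,
                    "score": 0,
                    "reason": "insufficient_vram",
                    "skipped": True,
                })
    return records


# With an effective context of 32,768 tokens, Tiers 4 and 5 are skipped.
for record in skipped_task_records(32_768):
    print(record["id"], record["reason"])
```

Under this rule a device limited to 32K effective context can score at most 60 of the 100 points, since the eight skipped tasks contribute zero.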

Overall Score Levels

| Score | Percentage | Level |
|---|---|---|
| 90–100 | 90–100% | Excellent Retention |
| 75–89 | 75–89% | Strong Retention |
| 60–74 | 60–74% | Functional Retention |
| 40–59 | 40–59% | Partial Retention |
| 20–39 | 20–39% | Limited Retention |
| 0–19 | 0–19% | Severe Context Loss |
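Mapping a total score to its level is a threshold lookup. A minimal sketch, with a hypothetical function name:

```python
# Lower bound of each band, highest first, taken from the table above.
SCORE_LEVELS = [
    (90, "Excellent Retention"),
    (75, "Strong Retention"),
    (60, "Functional Retention"),
    (40, "Partial Retention"),
    (20, "Limited Retention"),
    (0, "Severe Context Loss"),
]


def score_level(total: int) -> str:
    """Return the retention level for a total score between 0 and 100."""
    if not 0 <= total <= 100:
        raise ValueError("total score must be between 0 and 100")
    for lower_bound, level in SCORE_LEVELS:
        if total >= lower_bound:
            return level
    raise AssertionError("unreachable: the 0 band matches every valid score")


print(score_level(60))  # Functional Retention
```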

Summary

Long Context performance is one of the most hardware-dependent capabilities in the on-device AI stack. The same model may score Excellent Retention at 4K context but Partial Retention at 64K, not because the model is defective but because the device lacks sufficient VRAM for the full context tier. LongCtx makes this visible and quantifiable.

The benchmark's built-in tier-skip mechanism ensures that reported scores always reflect genuine performance at the effective context size. The position bias sub-analysis reveals whether a model retrieves facts uniformly across the document or suffers from the “lost in the middle” effect. For NotesXML users working with large research notebooks, multi-chapter documents, or long meeting transcripts, LongCtx provides the most operationally relevant signal in the entire benchmark suite.


© 2026 IWV Digital Solutions LLC. All rights reserved.
