Long Context Benchmark (LongCtx)
IWV Digital Solutions LLC | NotesXML AI Evaluation Series
20 tasks × 5 points each = 100 points maximum. 4 task types across 5 context length tiers (4K–128K tokens). VRAM cascade detection prevents scores from being silently inflated by insufficient hardware. The most hardware-sensitive benchmark in the suite.
Introduction
One of the most important capabilities a note-taking AI can have is the ability to reason about long documents — not just glance at the first paragraph, but track specific facts planted deep within a document, synthesize information scattered across thousands of words, and apply rules correctly even when the relevant details are separated by tens of thousands of tokens.
The Long Context Benchmark measures exactly this. It systematically tests whether an AI model can retain and recall information at five different context lengths — from 4K to 128K tokens — using four distinct task types that probe different aspects of long-context comprehension.
Background: The “Lost in the Middle” Problem
Language models have finite context windows. In practice, as context grows, many models suffer from what researchers call the “lost in the middle” phenomenon: information at the beginning and end of a long context is reliably recalled, but facts buried in the middle are frequently forgotten. A related issue is positional bias — models retrieving information from some document positions more reliably than others. LongCtx measures both.
Reference: Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172.
Assessment Overview
LongCtx comprises 20 tasks, each worth 5 points, for a total of 100 points. Tasks span five context length tiers and four task types. Every task uses the same pool of generic technical filler prose (procurement workflows, audit schedules, risk matrices, calibration certificates) to pad context without introducing content that could trigger hallucinated “correct” answers. Each response is capped at 200 tokens (maxTokens = 200).
Per-stage timeouts protect against hung model loads and stalled inference: 5 minutes for model switching, and 2 minutes plus 2 additional seconds per 1K context tokens for each inference task (so a 131K-context task gets a ceiling of about 6 min 22 sec).
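The timeout rule above can be written as a short calculation. This is a minimal sketch, not the benchmark's own code: the function and constant names are hypothetical, and it assumes that "per 1K context tokens" counts whole thousands of tokens (floor division), which reproduces the 6 min 22 sec figure for a 131,072-token context.

```python
# Hypothetical names; the constants come from the timeout rule in the text.
MODEL_SWITCH_TIMEOUT_S = 5 * 60      # 5 minutes for model switching
BASE_INFERENCE_TIMEOUT_S = 2 * 60    # 2-minute base for each inference task
SECONDS_PER_1K_TOKENS = 2            # 2 extra seconds per 1K context tokens


def inference_timeout_seconds(context_tokens: int) -> int:
    """Return the inference timeout ceiling in seconds.

    Assumption: only whole thousands of tokens count toward the extra time.
    """
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    return BASE_INFERENCE_TIMEOUT_S + SECONDS_PER_1K_TOKENS * (context_tokens // 1000)


def format_timeout(seconds: int) -> str:
    """Format a number of seconds as 'M min SS sec'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs:02d} sec"


if __name__ == "__main__":
    for tokens in (4096, 16384, 32768, 65536, 131072):
        print(tokens, format_timeout(inference_timeout_seconds(tokens)))
```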
Context Length Tiers
| Tier | Target Tokens | Approx. Word Count |
|---|---|---|
| 1 | 4,096 | ~2,800 words |
| 2 | 16,384 | ~11,500 words |
| 3 | 32,768 | ~23,000 words |
| 4 | 65,536 | ~45,900 words |
| 5 | 131,072 | ~91,800 words |
Task Types
| Type | Description | Tasks | Grading |
|---|---|---|---|
| needle | Single fact embedded in filler at a specified position; model asked to retrieve it | LCB-001, 005, 009, 013, 017 | Regex match: 5 pts (4 if verbose); partial credit for near matches per the scoring rubric |
| position_bias | Same as needle, but position varies (10%, 50%, 90%, 95%) to probe retrieval uniformity | LCB-003, 007, 011, 015, 019 | Same rubric as needle |
| multi_fact | Multiple related facts at different positions; model must synthesize all | LCB-002, 006, 010, 014, 018 | Partial credit per matched pattern |
| reasoning | Facts spread throughout context; model applies a rule and reaches a conclusion | LCB-004, 008, 012, 016, 020 | Rubric-scored: correct answer + citations + reasoning |
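
To illustrate how a needle can be planted at a fractional position in filler prose, here is a minimal sketch. It is not the benchmark's generator: `build_context` is a hypothetical helper, the two filler sentences are placeholders, and word count stands in for token count (as the approximate word counts in the tier table do).

```python
def build_context(filler_sentences: list[str], needle: str,
                  position: float, target_words: int) -> str:
    """Pad with filler to roughly target_words and insert the needle
    at the given fractional position (0.0 = start, 1.0 = end)."""
    if not filler_sentences:
        raise ValueError("filler_sentences must not be empty")
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")

    # Cycle through the filler pool until the word budget is reached.
    sentences: list[str] = []
    words = 0
    i = 0
    while words < target_words:
        sentence = filler_sentences[i % len(filler_sentences)]
        sentences.append(sentence)
        words += len(sentence.split())
        i += 1

    # Position is measured in sentences, which approximates token position
    # when the filler sentences have similar lengths.
    insert_at = round(position * len(sentences))
    sentences.insert(insert_at, needle)
    return " ".join(sentences)


if __name__ == "__main__":
    filler = [
        "Audit schedules are reviewed quarterly.",
        "Procurement workflows require two approvals.",
    ]
    needle = "The Helios-7 satellite completes one full orbit in 94 minutes."
    context = build_context(filler, needle, position=0.10, target_words=2800)
    print(round(context.index(needle) / len(context), 2))
```

A position_bias task would call the same helper with a different `position` value, so that only the needle's location changes between runs.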
The 20 Tasks
Tier 1 — 4K Context
| ID | Type | Needle / Facts | Question | Expected Answer / Rubric Key |
|---|---|---|---|---|
| LCB-001 | needle (pos 50%) | “Project Crestfall achieved a resolution of 0.003 arc-seconds in its phase-two calibration.” | What resolution did Project Crestfall achieve in its phase-two calibration? | 0.003 arc-seconds |
| LCB-002 | multi_fact | Aurora Initiative: (1) 3 founding members at pos 20%; (2) established July 14, 2019 at pos 60%; (3) initial funding $12 million at pos 85% | How many founding members, when established, and what was initial funding? | 3 / July 14, 2019 / $12 million |
| LCB-003 | position_bias (pos 10%) | “The Helios-7 satellite completes one full orbit in 94 minutes.” | How long does Helios-7 take to complete one full orbit? | 94 minutes |
| LCB-004 | reasoning | Policy 3.2: safety audit every 18 months (pos 15%). Unit Alpha-12 last audit March 1, 2024 (pos 50%). Today is November 1, 2025 (pos 80%). | Is Unit Alpha-12 overdue for a safety audit? Briefly explain. | Yes/overdue + cites 18-month interval + cites last audit date |
Tier 2 — 16K Context
| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-005 | needle (pos 30%) | “The Meridian Protocol requires all field reports to be submitted within 48 hours of the observed event.” | What is the submission deadline for field reports under the Meridian Protocol? | 48 hours |
| LCB-006 | multi_fact | Compound XR-7: (1) boiling point 312°C at pos 10%; (2) synthesized 1987 at pos 55%; (3) Group II hazardous material at pos 80% | Boiling point, year of synthesis, hazard classification? | 312°C / 1987 / Group II |
| LCB-007 | position_bias (pos 50%) | “The operational frequency of the Nexus relay is 2.4 GHz.” | What is the operational frequency of the Nexus relay? | 2.4 GHz |
| LCB-008 | reasoning | Contract Section 4.1: delivery within 30 days of PO date (pos 20%). PO #4472 issued September 15, 2025 (pos 60%). Delivery October 20, 2025 (pos 75%). | Was delivery under PO #4472 made on time? Briefly explain. | No/late (deadline October 15, 2025; delivered 35 days after the PO date) + cites 30 days + cites dates |
Tier 3 — 32K Context
| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-009 | needle (pos 40%) | “The Thornwick Array uses exactly 72 resonance chambers in its primary configuration.” | How many resonance chambers does the Thornwick Array use? | 72 |
| LCB-010 | multi_fact | Kepler-9 project: (1) 6-layer neural architecture at pos 15%; (2) deployed April 3, 2023 at pos 45%; (3) 200 ms latency target at pos 75% | Architecture layers, deployment date, latency target? | 6 layers / April 3, 2023 / 200 ms |
| LCB-011 | position_bias (pos 90%) | “The Cascade Bridge was constructed using 8,400 metric tons of structural steel.” | How much structural steel was used in the Cascade Bridge? | 8,400 metric tons |
| LCB-012 | reasoning | SOP-14: minimum crew of 4 for night operations (pos 10%). Current night crew: 3 personnel (pos 40%). Night operations scheduled tonight (pos 70%). | Is tonight's crew in compliance with SOP-14? Briefly explain. | No/not in compliance + cites minimum of 4 + cites current crew of 3 |
Tier 4 — 64K Context
| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-013 | needle (pos 20%) | “The Veridian compound requires a minimum curing temperature of 85 degrees Celsius.” | What is the minimum curing temperature for the Veridian compound? | 85 degrees Celsius |
| LCB-014 | multi_fact | Sentinel Array: (1) 144 ground-based monitoring stations at pos 10%; (2) transmit data every 6 minutes at pos 45%; (3) 2 geographically distinct archive facilities at pos 85% | Station count, transmission interval, number of archive facilities? | 144 / every 6 minutes / 2 facilities |
| LCB-015 | position_bias (pos 10%) | “The Vega-X compound has a melting point of 1742 degrees Celsius.” | What is the melting point of the Vega-X compound? | 1742 degrees Celsius |
| LCB-016 | reasoning | Spec SC-3: bolts torqued to 145 N·m ±5% (pos 10%). Reading of 138 N·m recorded on assembly line 4 (pos 45%). SC-3 applies to all bolts on assembly line 4 (pos 75%). | Was the torque reading on assembly line 4 within spec? Briefly explain. | Yes/within spec (allowed range 137.75–152.25 N·m) + cites 138, 145, 5% tolerance + cites lower bound of 137.75 |
Tier 5 — 128K Context
| ID | Type | Needle / Facts | Question | Expected Answer |
|---|---|---|---|---|
| LCB-017 | needle (pos 50%) | “The Quintessa platform reached commercial readiness on August 8, 2032.” | On what date did the Quintessa platform reach commercial readiness? | August 8, 2032 |
| LCB-018 | multi_fact | Strategy Sigma: (1) proposed by Mei Wong, March 2025 at pos 5%; (2) targets infrastructure resilience in coastal regions at pos 40%; (3) 2027 fiscal allocation 2.4 billion at pos 70%; (4) renamed Strategy Sigma-Plus early 2028 at pos 95% | Who proposed it and when, target domain, 2027 fiscal allocation, when renamed? | Mei Wong / March 2025 / coastal resilience / 2.4 billion / early 2028 |
| LCB-019 | position_bias (pos 95%) | “The Aurora-9 prototype completed its final endurance test on December 1, 2034.” | On what date did the Aurora-9 prototype complete its final endurance test? | December 1, 2034 |
| LCB-020 | reasoning | Policy 8.1: retire any device older than 10 years (pos 5%). Unit RGV-441 manufacturing date June 12, 2014 (pos 30%). Last refurbished November 4, 2024 (pos 55%). Refurbishment does NOT reset manufacturing date for Policy 8.1 (pos 80%). | As of April 30, 2026, must RGV-441 be retired per Policy 8.1? Briefly explain. | Yes/must be retired + cites 2014/10+ years + cites refurbishment rule |
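
The date-interval reasoning tasks (LCB-004 and LCB-020) reduce to adding a calendar interval to a planted date and comparing the result with the reference date. The sketch below recomputes both conclusions from the planted facts. It assumes calendar-month arithmetic, and `add_months` is a hypothetical helper written for this example, not part of the benchmark.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add whole calendar months, clamping the day to the month's length."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# LCB-004: audit every 18 months, last audit March 1, 2024, today November 1, 2025.
audit_due = add_months(date(2024, 3, 1), 18)
audit_overdue = date(2025, 11, 1) > audit_due

# LCB-020: retire devices older than 10 years; manufactured June 12, 2014.
# The refurbishment date is ignored because it does not reset the manufacturing date.
retire_after = add_months(date(2014, 6, 12), 10 * 12)
must_retire = date(2026, 4, 30) > retire_after

if __name__ == "__main__":
    print("LCB-004 due:", audit_due, "overdue:", audit_overdue)
    print("LCB-020 ten-year mark:", retire_after, "retire:", must_retire)
```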
Scoring Rubric
Needle and Position Bias Tasks
| Score | Criteria |
|---|---|
| 5 | Exact correct answer, concise (≤60 words) |
| 4 | Correct answer with significant noise (>60 words) |
| 3 | Close match (token overlap ratio >0.6) |
| 2 | Partial match (token overlap 0.3–0.6) |
| 1 | Topic mentioned, wrong answer |
| 0 | Empty response, no topic reference, or completely wrong |
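
The needle rubric above can be sketched as a small scoring function. This is an illustration under stated assumptions, not the benchmark's grader: the source does not define "token overlap ratio", so the sketch assumes it means the fraction of expected-answer tokens that also appear in the response, and it treats a case-insensitive substring check on a `topic` string as "topic mentioned". All function and parameter names are hypothetical.

```python
import re


def token_overlap(response: str, expected: str) -> float:
    """Assumed definition: share of expected tokens present in the response."""
    expected_tokens = set(re.findall(r"\w+", expected.lower()))
    response_tokens = set(re.findall(r"\w+", response.lower()))
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & response_tokens) / len(expected_tokens)


def score_needle(response: str, pattern: str, expected: str, topic: str) -> int:
    """Score a needle or position_bias response on the 0-5 rubric."""
    text = response.strip()
    if not text:
        return 0                      # empty response
    if re.search(pattern, text, flags=re.IGNORECASE):
        # Correct answer: 5 if concise (<= 60 words), otherwise 4.
        return 5 if len(text.split()) <= 60 else 4
    overlap = token_overlap(text, expected)
    if overlap > 0.6:
        return 3                      # close match
    if overlap >= 0.3:
        return 2                      # partial match
    if topic.lower() in text.lower():
        return 1                      # topic mentioned, wrong answer
    return 0                          # no topic reference


if __name__ == "__main__":
    print(score_needle("94 minutes.", r"\b94\s+minutes\b", "94 minutes", "Helios-7"))
```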
Multi-Fact Tasks
Points are awarded for each correctly retrieved fact, as determined by the task's required regex patterns, up to a maximum of 5 points per task.
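One way to turn per-pattern matches into a 0–5 score is shown below. The source does not say how points are divided among facts, so this sketch assumes proportional credit rounded half-up; the function name and the example patterns for LCB-002 are hypothetical.

```python
import math
import re


def score_multi_fact(response: str, patterns: list[str], max_points: int = 5) -> int:
    """Award proportional credit for each required pattern found in the response.

    Assumption: points are split evenly across patterns and rounded half-up.
    """
    if not patterns:
        raise ValueError("at least one pattern is required")
    matched = sum(
        1 for pattern in patterns
        if re.search(pattern, response, flags=re.IGNORECASE)
    )
    return math.floor(max_points * matched / len(patterns) + 0.5)


if __name__ == "__main__":
    # Hypothetical patterns for the three Aurora Initiative facts (LCB-002).
    lcb_002_patterns = [
        r"\b(3|three)\b",
        r"July\s+14,?\s+2019",
        r"\$\s?12\s+million",
    ]
    answer = "Three founding members, established July 14, 2019, with $12 million."
    print(score_multi_fact(answer, lcb_002_patterns))
```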
Reasoning Tasks
| Component | Points |
|---|---|
| Correct conclusion stated | 2 |
| Cites the relevant numbers or rule threshold | 2 |
| Explicit reasoning chain | 1 |
| Penalty: wrong conclusion stated | −5 |
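
The reasoning rubric can be combined into a single score as sketched below. The sketch assumes the total is clamped to the 0–5 range, so the −5 penalty can zero out a task but never push it negative; the source does not state this explicitly. The `ReasoningJudgement` class and its field names are hypothetical, and how each flag is detected from the model's response is left out.

```python
from dataclasses import dataclass


@dataclass
class ReasoningJudgement:
    """Hypothetical per-response flags produced by a grader."""
    correct_conclusion: bool   # states the correct conclusion
    wrong_conclusion: bool     # states the wrong conclusion
    cites_threshold: bool      # cites the relevant numbers or rule threshold
    explicit_reasoning: bool   # shows an explicit reasoning chain


def score_reasoning(judgement: ReasoningJudgement) -> int:
    """Apply the rubric components and clamp to 0-5 (assumed floor and cap)."""
    points = 0
    if judgement.correct_conclusion:
        points += 2
    if judgement.cites_threshold:
        points += 2
    if judgement.explicit_reasoning:
        points += 1
    if judgement.wrong_conclusion:
        points -= 5
    return max(0, min(5, points))


if __name__ == "__main__":
    full_marks = ReasoningJudgement(True, False, True, True)
    wrong_but_cited = ReasoningJudgement(False, True, True, True)
    print(score_reasoning(full_marks), score_reasoning(wrong_but_cited))
```

Under this reading, a response that cites the right numbers and reasons explicitly but reaches the wrong conclusion scores 0.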
VRAM Cascade and Tier Skipping
If a VRAM cascade reduces effective context below a tier's target, all 4 tasks in that tier are recorded as score: 0, reason: ‘insufficient_vram’ and marked skipped: true. Lower tiers that are achievable continue to run normally. This ensures scores are never inflated by running a 65K-context task in an actual 2K context window.
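The tier-skip rule can be sketched as follows. The record fields (`score`, `reason`, `skipped`) come from the description above; the function name, the `task_id` field, and the way the effective context size is obtained are assumptions made for illustration. Task IDs follow the tier tables, where tier n contains the four tasks LCB-(4n−3) through LCB-(4n).

```python
# Tier targets in tokens, taken from the Context Length Tiers table.
TIER_TARGETS = {1: 4096, 2: 16384, 3: 32768, 4: 65536, 5: 131072}
TASKS_PER_TIER = 4


def skipped_task_records(effective_context_tokens: int) -> list[dict[str, object]]:
    """Return skip records for every task in tiers the hardware cannot reach.

    A tier is skipped when the effective context is below its target;
    lower tiers produce no records here and would run normally.
    """
    records: list[dict[str, object]] = []
    for tier, target in TIER_TARGETS.items():
        if effective_context_tokens >= target:
            continue  # tier is achievable
        first_task = (tier - 1) * TASKS_PER_TIER + 1
        for task_number in range(first_task, first_task + TASKS_PER_TIER):
            records.append({
                "task_id": f"LCB-{task_number:03d}",  # hypothetical field name
                "score": 0,
                "reason": "insufficient_vram",
                "skipped": True,
            })
    return records


if __name__ == "__main__":
    # Example: a cascade leaves only a 32K effective context.
    for record in skipped_task_records(32768):
        print(record)
```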
Overall Score Levels
| Score | Percentage | Level |
|---|---|---|
| 90–100 | 90–100% | Excellent Retention |
| 75–89 | 75–89% | Strong Retention |
| 60–74 | 60–74% | Functional Retention |
| 40–59 | 40–59% | Partial Retention |
| 20–39 | 20–39% | Limited Retention |
| 0–19 | 0–19% | Severe Context Loss |
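
The level table maps directly onto a threshold lookup. A minimal sketch, with a hypothetical function name, assuming integer total scores between 0 and 100:

```python
# (minimum score, level name), ordered from highest to lowest threshold.
LEVELS = [
    (90, "Excellent Retention"),
    (75, "Strong Retention"),
    (60, "Functional Retention"),
    (40, "Partial Retention"),
    (20, "Limited Retention"),
    (0, "Severe Context Loss"),
]


def retention_level(total_score: int) -> str:
    """Return the level name for a total LongCtx score (0-100)."""
    if not 0 <= total_score <= 100:
        raise ValueError("total_score must be between 0 and 100")
    for minimum, name in LEVELS:
        if total_score >= minimum:
            return name
    raise AssertionError("unreachable: the last threshold is 0")


if __name__ == "__main__":
    for score in (95, 75, 59, 0):
        print(score, retention_level(score))
```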
Summary
Long Context performance is one of the most hardware-dependent capabilities in the on-device AI stack. The same model may score Excellent Retention at 4K context but Partial Retention at 64K, not because the model is defective but because the device lacks sufficient VRAM for the full context tier. LongCtx makes this visible and quantifiable.
The benchmark's built-in tier-skip mechanism ensures that reported scores always reflect genuine performance at the effective context size. The position bias sub-analysis reveals whether a model retrieves facts uniformly across the document or suffers from the “lost in the middle” effect. For NotesXML users working with large research notebooks, multi-chapter documents, or long meeting transcripts, LongCtx provides the most operationally relevant signal in the entire benchmark suite.
References
- Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172
- Kamradt, G. (2023). LLMTest_NeedleInAHaystack. github.com/gkamradt/LLMTest_NeedleInAHaystack
- Anthropic Research. anthropic.com/research
© 2026 IWV Digital Solutions LLC. All rights reserved.