
Tool Calling Benchmark (ToolCall)

IWV Digital Solutions LLC | NotesXML AI Evaluation Series

25 tasks × 4 points each = 100 points maximum, across 6 categories. The tools exist as schema definitions only: no tool is ever executed, no network request is made, and no external service is contacted. Grading is based entirely on the correctness of the model's JSON output.


Introduction

Tool calling is the ability of a language model to respond not with prose but with a structured API call that invokes an external function. A model that can correctly identify when to call a tool, which tool to call, and precisely what parameters to pass is far more useful for task automation than a model that can only generate text.

The Tool Calling Benchmark measures this capability systematically. It presents 25 tasks across six categories and grades the model's JSON output on four things: shape, tool selection, parameter extraction, and sequencing. Because no tool is actually executed, the score is independent of any real system calls.

Why This Matters for NotesXML

NotesXML's IWV Agent capability (on the development roadmap) relies on AI models that can correctly issue tool calls to interact with the note-taking system — creating notes, searching content, setting reminders, and executing multi-step workflows. A model scoring Reliable Tool Use (60+) can be trusted for single-step automation. A model reaching Expert Tool Use (90+) is ready for multi-step agentic workflows.

Reference: Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arxiv.org/abs/2305.15334.


Assessment Overview

Each model receives a system prompt containing the full JSON Schema definitions of six mock tools. It is then given 25 user messages — one per task — and must respond with a valid JSON tool_calls object. Context size: 4,096 tokens. Max tokens per response: 400.

The Mock Tool Catalog

All 25 tasks use the same six mock tools. These are schema definitions only: no tool is ever invoked, no network call is made, and no external service is contacted. The benchmark evaluates only the model's JSON output.

Four tools (searchNote, createReminder, calculator, unitConvert) correspond to capabilities that NotesXML's IWV Agent will ship. The remaining two serve specific evaluation roles:

| Tool | Description | Required Parameters | Optional Parameters |
| --- | --- | --- | --- |
| searchNote | Search the user notebook by keyword | query (string) | limit (integer, 1–50, default 10) |
| createReminder | Create a reminder for a future date/time | text (string), dueIso (ISO 8601 string) | - |
| weather | Get current weather at a location (benchmark fixture only, not a shipping feature) | location (string) | unit (celsius or fahrenheit, default celsius) |
| calculator | Evaluate a math expression | expression (string) | - |
| unitConvert | Convert a numeric value between units | value (number), fromUnit (string), toUnit (string) | - |
| noOp | Use when no other tool is appropriate | reason (string) | - |
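
As an illustration of what one catalog entry might look like as a JSON Schema definition, here is a minimal sketch for searchNote. The wrapper keys (name, description, parameters) and the constant name SEARCH_NOTE_SCHEMA are assumptions; the benchmark's actual schema text is not reproduced in this document, so only the parameter names, types, range, and default come from the table above.

```python
import json

# Hypothetical schema for searchNote, transcribed from the catalog table:
# query is required; limit is an optional integer from 1 to 50, default 10.
# The outer layout (name / description / parameters) is an assumption.
SEARCH_NOTE_SCHEMA = {
    "name": "searchNote",
    "description": "Search the user notebook by keyword",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50, "default": 10},
        },
        "required": ["query"],
    },
}

# Serialized form, as it might be embedded in a prompt.
print(json.dumps(SEARCH_NOTE_SCHEMA, indent=2))
```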

System Prompt

Every task is prefixed with this instruction:

You are a tool-using assistant. You have access to the following tools: [TOOLS JSON]

When the user message can be addressed by one or more tools, respond with ONLY a valid JSON object of the form:
{"tool_calls": [{"name": "<toolName>", "arguments": {...}}, ...]}

Rules: (1) Output ONLY JSON. No explanation, prose, or markdown code fences. (2) Use the noOp tool with a non-empty “reason” if no other tool fits or if the request is too ambiguous. (3) For multi-step tasks, include multiple entries in tool_calls in execution order. (4) Match parameter types in the schemas.
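
The format rules above could be checked mechanically before any scoring takes place. The following is a minimal sketch, not the benchmark's actual harness: the function name parse_tool_calls is hypothetical, and rejecting an empty tool_calls list is an assumption.

```python
import json

TOOL_NAMES = {"searchNote", "createReminder", "weather",
              "calculator", "unitConvert", "noOp"}

def parse_tool_calls(raw):
    """Return the list of calls, or None if the response breaks the format rules."""
    try:
        data = json.loads(raw)  # rule 1: prose or code fences make this fail
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    calls = data.get("tool_calls")
    if not isinstance(calls, list) or not calls:  # empty list rejected (assumption)
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        name, args = call.get("name"), call.get("arguments")
        if not isinstance(name, str) or name not in TOOL_NAMES:
            return None
        if not isinstance(args, dict):
            return None
        if name == "noOp":  # rule 2: noOp needs a non-empty reason
            reason = args.get("reason")
            if not isinstance(reason, str) or not reason.strip():
                return None
    return calls
```

A response wrapped in a markdown code fence fails at the json.loads step, which is how rule 1 would be enforced in this sketch.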


The 25 Tasks

Category 1 — Schema Understanding (TOOL-001 to TOOL-005)

Tests basic compliance with tool schemas — correct tool name and correct argument types for straightforward, unambiguous requests.

| ID | User Message | Expected Tool | Key Arguments |
| --- | --- | --- | --- |
| TOOL-001 | Search my notes for “quarterly review” and give me 5 results. | searchNote | query=“quarterly review”, limit=5 |
| TOOL-002 | What is the weather in Boston in fahrenheit? | weather | location=“Boston”, unit=“fahrenheit” |
| TOOL-003 | Compute 17 plus 28. | calculator | expression=“17 + 28” |
| TOOL-004 | Convert 50 miles to kilometers. | unitConvert | value=50, fromUnit=“miles”, toUnit=“kilometers” |
| TOOL-005 | Find my notes about “project alpha”. | searchNote | query=“project alpha” |
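
Argument-type compliance, as tested in this category, can be sketched as a lookup of required parameters and their expected types. The table REQUIRED and the function type_errors are hypothetical names; the parameter names and types are transcribed from the mock tool catalog, and treating booleans as non-numeric is an assumption.

```python
# Required parameters per tool, with the Python types a parsed JSON value
# should have. "number" in the catalog is mapped to int or float.
REQUIRED = {
    "searchNote": {"query": str},
    "createReminder": {"text": str, "dueIso": str},
    "weather": {"location": str},
    "calculator": {"expression": str},
    "unitConvert": {"value": (int, float), "fromUnit": str, "toUnit": str},
    "noOp": {"reason": str},
}

def type_errors(name, arguments):
    """List missing or mistyped required arguments for one tool call."""
    errors = []
    for param, expected in REQUIRED[name].items():
        if param not in arguments:
            errors.append(f"missing {param}")
            continue
        value = arguments[param]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {param}")
    return errors

# TOOL-004 with the number passed as a string, a typical type slip:
print(type_errors("unitConvert",
                  {"value": "50", "fromUnit": "miles", "toUnit": "kilometers"}))
```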

Category 2 — Tool Selection (TOOL-006 to TOOL-010)

Tests whether the model correctly identifies which tool to use when the task type must be inferred from the request phrasing.

| ID | User Message | Expected Tool | Key Arguments |
| --- | --- | --- | --- |
| TOOL-006 | What is 47 times 83? | calculator | expression=“47 * 83” |
| TOOL-007 | How much is 200 grams in ounces? | unitConvert | value=200, fromUnit=“grams”, toUnit=“ounces” |
| TOOL-008 | Look up my note on “marketing strategy”. | searchNote | query=“marketing strategy” |
| TOOL-009 | Set a reminder to file taxes on April 15, 2027 at 9am. | createReminder | text=“file taxes”, dueIso=“2027-04-15T09:00” |
| TOOL-010 | What is the weather in Tokyo today? | weather | location=“Tokyo” |

Category 3 — Parameter Extraction (TOOL-011 to TOOL-015)

Tests precise argument extraction, including date arithmetic (“next Tuesday”), complex math expressions, and explicit numeric limits.

| ID | User Message | Expected Tool | Key Arguments |
| --- | --- | --- | --- |
| TOOL-011 | Remind me to call Sarah next Tuesday at 3pm. Today is Wed Apr 30 2026. | createReminder | text=“call Sarah”, dueIso=“2026-05-06T15:00” |
| TOOL-012 | Calculate (12 + 8) * 3 / 5. | calculator | expression=“(12 + 8) * 3 / 5” |
| TOOL-013 | Find up to 25 notes about “ai roadmap”. | searchNote | query=“ai roadmap”, limit=25 |
| TOOL-014 | Convert 100 kilograms to pounds. | unitConvert | value=100, fromUnit=“kilograms”, toUnit=“pounds” |
| TOOL-015 | Get the weather in Paris in celsius. | weather | location=“Paris”, unit=“celsius” |
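
The date arithmetic behind TOOL-011 can be sketched as a weekday offset. This is an illustrative helper, not part of the benchmark: the function name next_weekday_iso is hypothetical, the weekday is taken from the message text as stated rather than recomputed from the calendar, and "next Tuesday" is assumed to mean the nearest upcoming Tuesday.

```python
from datetime import date, datetime, time, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def next_weekday_iso(today, today_name, target_name, hour):
    """Resolve 'next <weekday>' relative to the weekday named in the message."""
    delta = (WEEKDAYS.index(target_name) - WEEKDAYS.index(today_name)) % 7
    if delta == 0:
        delta = 7  # same weekday: assume a full week ahead
    due = datetime.combine(today + timedelta(days=delta), time(hour))
    # Same minute-precision format as the dueIso values in the task tables.
    return due.strftime("%Y-%m-%dT%H:%M")

# TOOL-011: the message states "Wed Apr 30 2026" and asks for Tuesday at 3pm.
print(next_weekday_iso(date(2026, 4, 30), "Wed", "Tue", 15))  # 2026-05-06T15:00
```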

Category 4 — Multi-Step Sequences (TOOL-016 to TOOL-019)

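Tests whether the model can plan and sequence multiple tool calls in a single response, including cases where the second call depends on the result of the first. Because no tool is executed, the model must work out any intermediate value itself (for example, value=12 in TOOL-016).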

| ID | User Message | Expected Call Sequence |
| --- | --- | --- |
| TOOL-016 | Compute 4 times 3, then convert that many cubic feet to cubic meters. | 1. calculator(“4 * 3”) → 2. unitConvert(value=12, “cubic feet”, “cubic meters”) |
| TOOL-017 | Search my notes for “Q3 budget”, then remind me to review the top result tomorrow morning at 9am. Today is Apr 30 2026. | 1. searchNote(“Q3 budget”) → 2. createReminder(text=“review”, dueIso=“2026-05-01T09:00”) |
| TOOL-018 | Calculate 25% of 480, then convert the result from miles to kilometers. | 1. calculator(“0.25 * 480”) → 2. unitConvert(value=120, “miles”, “kilometers”) |
| TOOL-019 | Get the weather in Denver, then set a reminder to check it again on May 5 2026 at noon. | 1. weather(“Denver”) → 2. createReminder(text=“weather check”, dueIso=“2026-05-05T12:00”) |
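
To make the expected shape of a multi-step response concrete, the sketch below builds the TOOL-016 answer and checks call count, order, and arguments. The helper sequence_matches is hypothetical, and exact equality on arguments is a simplifying assumption; the benchmark's own matching rules may be looser.

```python
import json

# Expected TOOL-016 sequence, in execution order. Argument names come from
# the mock tool catalog; value=12 is the precomputed result of 4 * 3.
EXPECTED_016 = [
    ("calculator", {"expression": "4 * 3"}),
    ("unitConvert", {"value": 12, "fromUnit": "cubic feet", "toUnit": "cubic meters"}),
]

def sequence_matches(raw, expected):
    """True if the response has the same calls, in the same order (exact match)."""
    calls = json.loads(raw)["tool_calls"]
    if len(calls) != len(expected):
        return False
    return all(c["name"] == name and c["arguments"] == args
               for c, (name, args) in zip(calls, expected))

response = json.dumps(
    {"tool_calls": [{"name": n, "arguments": a} for n, a in EXPECTED_016]}
)
print(response)
print(sequence_matches(response, EXPECTED_016))  # True
```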

Category 5 — Error Recovery / Graceful Ambiguity (TOOL-020 to TOOL-022)

Tests whether the model correctly recognizes underspecified requests and uses noOp rather than guessing. An incorrect tool call on these tasks is penalized.

| ID | User Message | Expected Tool | Required Reason Pattern |
| --- | --- | --- | --- |
| TOOL-020 | Remind me about the meeting. | noOp | Reason mentions: time / when / date / specify / which / meeting |
| TOOL-021 | Convert this to that. | noOp | Reason mentions: value / unit / specify / ambiguous / unclear |
| TOOL-022 | Search for it. | noOp | Reason mentions: query / what / specify / unclear |
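
One plausible way to apply these reason patterns is a case-insensitive keyword search over the noOp reason. The keyword lists come from the table above; the regular expressions themselves and the function name reason_ok are assumptions, since the benchmark's actual patterns are not shown.

```python
import re

# Keyword alternations built from the "Reason mentions" column.
REASON_PATTERNS = {
    "TOOL-020": r"time|when|date|specify|which|meeting",
    "TOOL-021": r"value|unit|specify|ambiguous|unclear",
    "TOOL-022": r"query|what|specify|unclear",
}

def reason_ok(task_id, call):
    """True if the call is noOp and its reason mentions an expected keyword."""
    if call.get("name") != "noOp":
        return False  # guessing a real tool does not satisfy these tasks
    reason = call.get("arguments", {}).get("reason", "")
    if not isinstance(reason, str):
        return False
    return re.search(REASON_PATTERNS[task_id], reason, re.IGNORECASE) is not None

call = {"name": "noOp", "arguments": {"reason": "Please specify when the meeting is."}}
print(reason_ok("TOOL-020", call))  # True
```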

Category 6 — Inappropriate Refusal (TOOL-023 to TOOL-025)

Tests whether the model correctly avoids over-calling tools for conversational or general-knowledge requests. The correct behavior is noOp with a sensible reason — not an attempt to shoehorn the request into a tool, and not a refusal to respond at all.

| ID | User Message | Expected Tool | Required Reason Pattern |
| --- | --- | --- | --- |
| TOOL-023 | Hi, how are you today? | noOp | Reason mentions: no tool / chitchat / greeting / conversation |
| TOOL-024 | Tell me a joke. | noOp | Reason mentions: no tool / joke / conversation |
| TOOL-025 | What is the capital of France? | noOp | Reason mentions: knowledge / no tool / general |

Scoring Rubric

Each task is worth 4 points. Total possible: 100 points.

Grading Mode: tool_name_strict_args_regex (TOOL-001 to TOOL-015 and TOOL-020 to TOOL-025)

| Points | Criteria |
| --- | --- |
| 4 | Correct tool name AND all required arguments match regex patterns AND output is valid JSON |
| 3 | Correct tool name, most arguments correct, with one argument slightly off or an optional argument missing |
| 2 | Correct tool name, but most arguments wrong or a key argument missing |
| 1 | Wrong tool name but valid JSON structure |
| 0 | Invalid JSON, empty response, or complete mismatch |
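
The rubric above might be approximated in code as follows. This is a rough sketch under stated assumptions, not the benchmark's grader: "most arguments correct" is read as more than half of the expected patterns matching, only the first call in tool_calls is inspected, and the names score_single_call and arg_patterns are hypothetical.

```python
import json
import re

def score_single_call(raw, expected_name, arg_patterns):
    """Score one response from 0 to 4. arg_patterns maps argument name to a regex."""
    try:
        call = json.loads(raw)["tool_calls"][0]  # first call only (assumption)
        name, args = call["name"], call["arguments"]
    except (ValueError, KeyError, IndexError, TypeError):
        return 0  # invalid JSON, empty response, or wrong structure
    if not isinstance(args, dict):
        return 0
    if name != expected_name:
        return 1  # valid structure, wrong tool
    matched = sum(
        1 for key, pattern in arg_patterns.items()
        if key in args and re.fullmatch(pattern, str(args[key]), re.IGNORECASE)
    )
    if matched == len(arg_patterns):
        return 4
    # "Most arguments correct" is taken to mean more than half matched.
    return 3 if matched * 2 > len(arg_patterns) else 2

# TOOL-004 with an abbreviated target unit: two of three arguments match.
patterns = {"value": r"50(\.0+)?", "fromUnit": r"miles?", "toUnit": r"kilomet(er|re)s?"}
raw = ('{"tool_calls": [{"name": "unitConvert", "arguments": '
       '{"value": 50, "fromUnit": "miles", "toUnit": "km"}}]}')
print(score_single_call(raw, "unitConvert", patterns))  # 3
```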

Grading Mode: sequence_strict (TOOL-016 to TOOL-019)

| Points | Criteria |
| --- | --- |
| 4 | Correct array length, both calls correct (name + args), in the right order |
| 3 | Correct array length, first call correct, second has minor arg error |
| 2 | One of two calls correct |
| 1 | Both calls named correctly but both have significant arg errors |
| 0 | Wrong number of calls, invalid JSON, or complete mismatch |
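
A sequence scorer along these lines could look like the sketch below. It is a simplified reading of the rubric for two-call tasks, with assumptions marked: arguments are compared by exact equality rather than regex, any argument error on a correctly named second call counts as "minor", and the name score_sequence is hypothetical.

```python
import json

def score_sequence(raw, expected):
    """Score a two-call response from 0 to 4.

    expected is a list of two (name, arguments) pairs in execution order.
    """
    try:
        calls = json.loads(raw)["tool_calls"]
        got = [(c["name"], c["arguments"]) for c in calls]
    except (ValueError, KeyError, TypeError):
        return 0
    if len(got) != len(expected):
        return 0  # wrong number of calls
    name_ok = [g[0] == e[0] for g, e in zip(got, expected)]
    full_ok = [g == e for g, e in zip(got, expected)]
    if all(full_ok):
        return 4
    if full_ok[0] and name_ok[1]:
        return 3  # second call has the right tool but an argument error
    if any(full_ok):
        return 2
    if all(name_ok):
        return 1
    return 0

# TOOL-018 with the wrong value passed to the second call.
expected = [
    ("calculator", {"expression": "0.25 * 480"}),
    ("unitConvert", {"value": 120, "fromUnit": "miles", "toUnit": "kilometers"}),
]
raw = json.dumps({"tool_calls": [
    {"name": "calculator", "arguments": {"expression": "0.25 * 480"}},
    {"name": "unitConvert",
     "arguments": {"value": 480, "fromUnit": "miles", "toUnit": "kilometers"}},
]})
print(score_sequence(raw, expected))  # 3
```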

Score Levels

| Score | Percentage | Level |
| --- | --- | --- |
| 90–100 | 90–100% | Expert Tool Use |
| 75–89 | 75–89% | Advanced Tool Use |
| 60–74 | 60–74% | Reliable Tool Use |
| 40–59 | 40–59% | Basic Tool Use |
| 20–39 | 20–39% | Inconsistent Tool Use |
| 0–19 | 0–19% | Cannot Use Tools |
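
The level bands translate directly into a threshold lookup. A minimal sketch follows; the function name score_level is hypothetical, and the thresholds and labels are taken from the table above.

```python
# Lower bound of each band, highest first.
LEVELS = [
    (90, "Expert Tool Use"),
    (75, "Advanced Tool Use"),
    (60, "Reliable Tool Use"),
    (40, "Basic Tool Use"),
    (20, "Inconsistent Tool Use"),
    (0, "Cannot Use Tools"),
]

def score_level(score):
    """Map a total score (0 to 100) to its level name."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, label in LEVELS:
        if score >= floor:
            return label

print(score_level(76))  # Advanced Tool Use
```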

Category Breakdown

| Category | Tasks | Max Points | What It Reveals |
| --- | --- | --- | --- |
| Schema Understanding | 5 | 20 | Basic JSON schema compliance |
| Tool Selection | 5 | 20 | Intent recognition and tool routing |
| Parameter Extraction | 5 | 20 | Precise argument parsing including date arithmetic |
| Multi-Step | 4 | 16 | Sequential planning and dependent call chaining |
| Error Recovery | 3 | 12 | Graceful handling of underspecified requests |
| Inappropriate Refusal | 3 | 12 | Avoiding over-eager tool invocation |
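
Per-category subtotals can be derived from per-task scores using the task ID ranges given in the category headings. The sketch below is illustrative only; the names CATEGORIES and category_breakdown are hypothetical, and tasks missing from the input are assumed to score 0.

```python
# Task number ranges per category (TOOL-001 to TOOL-025).
CATEGORIES = {
    "Schema Understanding": range(1, 6),
    "Tool Selection": range(6, 11),
    "Parameter Extraction": range(11, 16),
    "Multi-Step": range(16, 20),
    "Error Recovery": range(20, 23),
    "Inappropriate Refusal": range(23, 26),
}

def category_breakdown(task_scores):
    """Map each category to (points earned, max points).

    task_scores maps task number (1 to 25) to points (0 to 4).
    """
    return {
        name: (sum(task_scores.get(n, 0) for n in ids), 4 * len(ids))
        for name, ids in CATEGORIES.items()
    }

perfect = {n: 4 for n in range(1, 26)}
breakdown = category_breakdown(perfect)
print(breakdown["Multi-Step"])                         # (16, 16)
print(sum(maximum for _, maximum in breakdown.values()))  # 100
```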

Summary

Tool calling is the bridge between an AI that talks about doing things and an AI that actually does them. The ToolCall Benchmark is the NotesXML suite's most forward-looking assessment: it measures readiness for the agentic workflows that NotesXML's IWV Agent feature is being built on.

A model at Reliable Tool Use (60+) is a practical automation partner for single-step tasks. A model at Expert Tool Use (90+) demonstrates the precision required for complex multi-step workflows, handles ambiguous requests without hallucinating parameters, and knows when to step back rather than guess.



© 2026 IWV Digital Solutions LLC. All rights reserved.
