May 2026 · IWV Digital Solutions

Tool Calling Benchmark (ToolCall)

IWV Digital Solutions LLC | NotesXML AI Evaluation Series

25 tasks × 4 points each = 100 points maximum. 6 categories. Schema definitions only: no tool is ever executed, no network request is made, and no external service is contacted. The benchmark grades only the correctness of the model's JSON output.

Introduction

Tool calling is the ability of a language model to respond not with prose but with a structured API call that invokes an external function. A model that can correctly identify when to call a tool, which tool to call, and precisely what parameters to pass is far more useful for task automation than a model that can only generate text.

The Tool Calling Benchmark measures this capability systematically. It presents 25 tasks across six categories and grades the model's JSON output against a strict schema. Because no tool is actually executed, the score reflects only the output itself: its shape, tool selection, parameter extraction, and sequencing.

Why This Matters for NotesXML

NotesXML's IWV Agent capability (on the development roadmap) relies on AI models that can correctly issue tool calls to interact with the note-taking system — creating notes, searching content, setting reminders, and executing multi-step workflows. A model scoring Reliable Tool Use (60+) can be trusted for single-step automation. A model reaching Expert Tool Use (90+) is ready for multi-step agentic workflows.

Reference: Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arxiv.org/abs/2305.15334.

Assessment Overview

Each model receives a system prompt containing the full JSON Schema definitions of six mock tools. It is then given 25 user messages — one per task — and must respond with a valid JSON tool_calls object. Context size: 4,096 tokens. Max tokens per response: 400.

The Mock Tool Catalog

All 25 tasks use the same six mock tools. These are schema definitions only: no tool is ever invoked, no network call is made, and no external service is contacted. The benchmark evaluates only the model's JSON output (tool name, arguments, and call order), never the result of running a tool.

Four tools (searchNote, createReminder, calculator, unitConvert) correspond to capabilities that NotesXML's IWV Agent will ship. The remaining two serve specific evaluation roles:

weather — Included purely as a tool-discrimination fixture. NotesXML is a fully offline application; this tool does not exist in the shipping product and will never be called against a live API. It is in the catalog to create a distinct schema shape (location-based parameter extraction + unit enum) that tests whether the model can distinguish it from unitConvert, calculator, and searchNote. The weather tasks measure schema comprehension and argument correctness, not live API capability.
noOp — Used when no other tool is appropriate. Critical for error-recovery and inappropriate-refusal categories.

Tool	Description	Required Parameters	Optional Parameters
`searchNote`	Search the user notebook by keyword	query (string)	limit (integer, 1–50, default 10)
`createReminder`	Create a reminder for a future date/time	text (string), dueIso (ISO 8601 string)	—
`weather`	Get current weather at a location (benchmark fixture only, not a shipping feature)	location (string)	unit (celsius | fahrenheit, default celsius)
`calculator`	Evaluate a math expression	expression (string)	—
`unitConvert`	Convert a numeric value between units	value (number), fromUnit (string), toUnit (string)	—
`noOp`	Use when no other tool is appropriate	reason (string)	—
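The article does not reproduce the schema JSON itself. As a rough sketch, the `searchNote` row above might translate into a JSON Schema fragment like the one below. The outer layout (`name`, `description`, `parameters`) is an assumption borrowed from common function-calling formats, and `missing_required` is a hypothetical helper, not part of the benchmark.

```python
# Hypothetical schema for the searchNote tool, built from the catalog row above.
# The name/description/parameters layout is an assumption, not the benchmark's own file.
SEARCH_NOTE_SCHEMA = {
    "name": "searchNote",
    "description": "Search the user notebook by keyword",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50, "default": 10},
        },
        "required": ["query"],
    },
}


def missing_required(schema: dict, arguments: dict) -> list:
    """Return the required parameter names that are absent from arguments."""
    return [key for key in schema["parameters"]["required"] if key not in arguments]
```

Calling `missing_required(SEARCH_NOTE_SCHEMA, {"limit": 5})` returns `["query"]`, since `limit` is optional and `query` is not.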

System Prompt

Every task is prefixed with this instruction:

You are a tool-using assistant. You have access to the following tools: [TOOLS JSON]

When the user message can be addressed by one or more tools, respond with ONLY a valid JSON object of the form:
{"tool_calls": [{"name": "<toolName>", "arguments": {...}}, ...]}

Rules: (1) Output ONLY JSON. No explanation, prose, or markdown code fences. (2) Use the noOp tool with a non-empty “reason” if no other tool fits or if the request is too ambiguous. (3) For multi-step tasks, include multiple entries in tool_calls in execution order. (4) Match parameter types in the schemas.
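To make the output contract concrete, here is a minimal sketch of a shape check for a model response. It only verifies the `{"tool_calls": [...]}` envelope described in the prompt; the function name `parse_tool_calls` is hypothetical, and the benchmark's actual parser may be stricter or more lenient.

```python
import json


def parse_tool_calls(raw: str):
    """Return the list of calls if raw is a well-formed tool_calls object, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # prose, markdown code fences, or truncated output end up here
    if not isinstance(data, dict):
        return None
    calls = data.get("tool_calls")
    if not isinstance(calls, list) or not calls:
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        if not isinstance(call.get("name"), str):
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return calls
```

A response wrapped in a markdown code fence fails `json.loads` and is rejected, which mirrors rule (1) of the prompt.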

The 25 Tasks

Category 1 — Schema Understanding (TOOL-001 to TOOL-005)

Tests basic compliance with tool schemas — correct tool name and correct argument types for straightforward, unambiguous requests.

ID	User Message	Expected Tool	Key Arguments
TOOL-001	Search my notes for “quarterly review” and give me 5 results.	searchNote	query=“quarterly review”, limit=5
TOOL-002	What is the weather in Boston in fahrenheit?	weather	location=“Boston”, unit=“fahrenheit”
TOOL-003	Compute 17 plus 28.	calculator	expression=“17 + 28”
TOOL-004	Convert 50 miles to kilometers.	unitConvert	value=50, fromUnit=“miles”, toUnit=“kilometers”
TOOL-005	Find my notes about “project alpha”.	searchNote	query=“project alpha”

Category 2 — Tool Selection (TOOL-006 to TOOL-010)

Tests whether the model correctly identifies which tool to use when the task type must be inferred from the request phrasing.

ID	User Message	Expected Tool	Key Arguments
TOOL-006	What is 47 times 83?	calculator	expression=“47 * 83”
TOOL-007	How much is 200 grams in ounces?	unitConvert	value=200, fromUnit=“grams”, toUnit=“ounces”
TOOL-008	Look up my note on “marketing strategy”.	searchNote	query=“marketing strategy”
TOOL-009	Set a reminder to file taxes on April 15, 2027 at 9am.	createReminder	text=“file taxes”, dueIso=“2027-04-15T09:00”
TOOL-010	What is the weather in Tokyo today?	weather	location=“Tokyo”

Category 3 — Parameter Extraction (TOOL-011 to TOOL-015)

Tests precise argument extraction, including date arithmetic (“next Tuesday”), complex math expressions, and explicit numeric limits.

ID	User Message	Expected Tool	Key Arguments
TOOL-011	Remind me to call Sarah next Tuesday at 3pm. Today is Wed Apr 30 2026.	createReminder	text=“call Sarah”, dueIso=“2026-05-06T15:00”
TOOL-012	Calculate (12 + 8) * 3 / 5.	calculator	expression=“(12 + 8) * 3 / 5”
TOOL-013	Find up to 25 notes about “ai roadmap”.	searchNote	query=“ai roadmap”, limit=25
TOOL-014	Convert 100 kilograms to pounds.	unitConvert	value=100, fromUnit=“kilograms”, toUnit=“pounds”
TOOL-015	Get the weather in Paris in celsius.	weather	location=“Paris”, unit=“celsius”
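The rubric grades arguments with regex patterns, which the article does not list. As an alternative illustration only, a `dueIso` argument could be compared by parsing it with Python's standard `datetime.fromisoformat`, so that `2026-05-06T15:00` and `2026-05-06T15:00:00` are treated as the same instant. The helper name `due_iso_matches` is hypothetical.

```python
from datetime import datetime


def due_iso_matches(due_iso, expected: datetime) -> bool:
    """True if due_iso parses as ISO 8601 and names the same local date and time."""
    if not isinstance(due_iso, str):
        return False
    try:
        parsed = datetime.fromisoformat(due_iso)
    except ValueError:
        return False
    # Drop any UTC offset: the task prompts state times without a time zone.
    return parsed.replace(tzinfo=None) == expected


expected_due = datetime(2026, 5, 6, 15, 0)  # the dueIso listed for TOOL-011
print(due_iso_matches("2026-05-06T15:00", expected_due))
print(due_iso_matches("next Tuesday at 3pm", expected_due))
```

The first call prints `True` and the second prints `False`, because an unresolved phrase such as "next Tuesday at 3pm" is not an ISO 8601 string.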

Category 4 — Multi-Step Sequences (TOOL-016 to TOOL-019)

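Tests whether the model can plan and sequence multiple tool calls in a single response. In TOOL-016 and TOOL-018 the second call depends on the result of the first, so the model must work out the intermediate value itself (12 and 120 respectively) because no tool is executed.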

ID	User Message	Expected Call Sequence
TOOL-016	Compute 4 times 3, then convert that many cubic feet to cubic meters.	1. calculator(“4 * 3”) → 2. unitConvert(value=12, “cubic feet”, “cubic meters”)
TOOL-017	Search my notes for “Q3 budget”, then remind me to review the top result tomorrow morning at 9am. Today is Apr 30 2026.	1. searchNote(“Q3 budget”) → 2. createReminder(text=“review”, dueIso=“2026-05-01T09:00”)
TOOL-018	Calculate 25% of 480, then convert the result from miles to kilometers.	1. calculator(“0.25 * 480”) → 2. unitConvert(value=120, “miles”, “kilometers”)
TOOL-019	Get the weather in Denver, then set a reminder to check it again on May 5 2026 at noon.	1. weather(“Denver”) → 2. createReminder(text=“weather check”, dueIso=“2026-05-05T12:00”)
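As a worked example, the expected response for TOOL-017 can be assembled in a few lines. This sketch resolves "tomorrow morning at 9am" from the date given in the prompt and emits the two calls in execution order. It follows the expected call sequence in the table; whether the benchmark also accepts extra optional arguments is not stated, so none are included.

```python
import json
from datetime import date, datetime, time, timedelta

today = date(2026, 4, 30)  # "Today is Apr 30 2026" from the TOOL-017 prompt
due = datetime.combine(today + timedelta(days=1), time(9, 0))

expected_tool_017 = {
    "tool_calls": [
        {"name": "searchNote", "arguments": {"query": "Q3 budget"}},
        {
            "name": "createReminder",
            "arguments": {
                "text": "review",
                # timespec="minutes" yields the YYYY-MM-DDTHH:MM form used in the tables
                "dueIso": due.isoformat(timespec="minutes"),
            },
        },
    ]
}

print(json.dumps(expected_tool_017, indent=2))
```

The computed `dueIso` is `2026-05-01T09:00`, matching the table entry.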

Category 5 — Error Recovery / Graceful Ambiguity (TOOL-020 to TOOL-022)

Tests whether the model correctly recognizes underspecified requests and uses noOp rather than guessing. Calling any other tool on these tasks earns at most 1 point under the scoring rubric.

ID	User Message	Expected Tool	Required Reason Pattern
TOOL-020	Remind me about the meeting.	noOp	Reason mentions: time / when / date / specify / which / meeting
TOOL-021	Convert this to that.	noOp	Reason mentions: value / unit / specify / ambiguous / unclear
TOOL-022	Search for it.	noOp	Reason mentions: query / what / specify / unclear
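One plausible way to check the "Required Reason Pattern" column is a case-insensitive keyword search over the `reason` argument. The keyword lists below are copied from the table; combining them into a single regex alternation is an assumption about how the grader works, and `reason_ok` is a hypothetical name.

```python
import re

# Keywords copied from the Category 5 table above.
REASON_KEYWORDS = {
    "TOOL-020": ["time", "when", "date", "specify", "which", "meeting"],
    "TOOL-021": ["value", "unit", "specify", "ambiguous", "unclear"],
    "TOOL-022": ["query", "what", "specify", "unclear"],
}


def reason_ok(task_id: str, call: dict) -> bool:
    """True if call is a noOp whose non-empty reason mentions at least one keyword."""
    if call.get("name") != "noOp":
        return False
    arguments = call.get("arguments")
    if not isinstance(arguments, dict):
        return False
    reason = arguments.get("reason", "")
    if not isinstance(reason, str) or not reason.strip():
        return False
    pattern = "|".join(re.escape(word) for word in REASON_KEYWORDS[task_id])
    return re.search(pattern, reason, re.IGNORECASE) is not None
```

For TOOL-020, a call such as `{"name": "noOp", "arguments": {"reason": "Please specify when the meeting is."}}` passes, while a `createReminder` call with a guessed date does not.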

Category 6 — Inappropriate Refusal (TOOL-023 to TOOL-025)

Tests whether the model correctly avoids over-calling tools for conversational or general-knowledge requests. The correct behavior is noOp with a sensible reason — not an attempt to shoehorn the request into a tool, and not a refusal to respond at all.

ID	User Message	Expected Tool	Required Reason Pattern
TOOL-023	Hi, how are you today?	noOp	Reason mentions: no tool / chitchat / greeting / conversation
TOOL-024	Tell me a joke.	noOp	Reason mentions: no tool / joke / conversation
TOOL-025	What is the capital of France?	noOp	Reason mentions: knowledge / no tool / general

Scoring Rubric

Each task is worth 4 points. Total possible: 100 points.

Grading Mode: `tool_name_strict_args_regex` (TOOL-001 to TOOL-015 and TOOL-020 to TOOL-025)

Points	Criteria
4	Correct tool name AND all required arguments match regex patterns AND output is valid JSON
3	Correct tool name, most arguments correct, with one argument slightly off or an optional argument missing
2	Correct tool name, but most arguments wrong or a key argument missing
1	Wrong tool name but valid JSON structure
0	Invalid JSON, empty response, or complete mismatch
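The rubric above could be approximated in code roughly as follows. This is a sketch, not the benchmark's grader: the article does not publish the regex patterns or define "most arguments", so the patterns are supplied by the caller and the "at least half match" threshold for 3 points is an assumption.

```python
import json
import re


def grade_single(raw: str, expected_name: str, arg_patterns: dict) -> int:
    """Score one response on the 0-4 scale; arg_patterns maps argument name to a regex."""
    try:
        call = json.loads(raw)["tool_calls"][0]
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return 0  # invalid JSON, or not a tool_calls object
    if not isinstance(args, dict):
        return 0
    if name != expected_name:
        return 1  # valid structure, wrong tool
    hits = sum(
        1
        for key, pattern in arg_patterns.items()
        if key in args and re.fullmatch(pattern, str(args[key]))
    )
    if hits == len(arg_patterns):
        return 4
    # Assumed threshold: at least half of the patterns matching counts as "most".
    return 3 if hits * 2 >= len(arg_patterns) else 2
```

For TOOL-001, hypothetical patterns might be `{"query": r"(?i)quarterly review", "limit": r"5"}`. A response with the right query but `limit` set to 10 would then score 3.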

Grading Mode: `sequence_strict` (TOOL-016 to TOOL-019)

Points	Criteria
4	Correct array length, both calls correct (name + args), in the right order
3	Correct array length, first call correct, second has minor arg error
2	One of two calls correct
1	Both calls named correctly but both have significant arg errors
0	Wrong number of calls, invalid JSON, or complete mismatch
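A comparable sketch for `sequence_strict` classifies each call and then maps the pair of results onto the table above. The rubric's distinction between a "minor" and a "significant" argument error is not defined in the article, so this version treats any argument mismatch the same way; that simplification, and the names `call_status` and `grade_sequence`, are assumptions.

```python
import json
import re


def call_status(call, expected_name: str, arg_patterns: dict) -> str:
    """Classify one call as 'ok', 'name_only' (right tool, bad args), or 'wrong'."""
    if not isinstance(call, dict) or call.get("name") != expected_name:
        return "wrong"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "name_only"
    for key, pattern in arg_patterns.items():
        if key not in args or not re.fullmatch(pattern, str(args[key])):
            return "name_only"
    return "ok"


def grade_sequence(raw: str, expected: list) -> int:
    """expected is a list of two (tool_name, arg_patterns) pairs in execution order."""
    try:
        calls = json.loads(raw)["tool_calls"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0
    if not isinstance(calls, list) or len(calls) != len(expected):
        return 0  # wrong number of calls
    statuses = [call_status(c, name, pats) for c, (name, pats) in zip(calls, expected)]
    if statuses == ["ok", "ok"]:
        return 4
    if statuses == ["ok", "name_only"]:
        return 3
    if "ok" in statuses:
        return 2
    if statuses == ["name_only", "name_only"]:
        return 1
    return 0
```

Because each call is compared against the expected call at the same position, a response that lists the right two calls in the wrong order matches neither position and scores 0.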

Score Levels

Score	Percentage	Level
90–100	90–100%	Expert Tool Use
75–89	75–89%	Advanced Tool Use
60–74	60–74%	Reliable Tool Use
40–59	40–59%	Basic Tool Use
20–39	20–39%	Inconsistent Tool Use
0–19	0–19%	Cannot Use Tools
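The score bands translate directly into a lookup. The thresholds and labels below come from the table above; the function name `level_for` is hypothetical.

```python
# (minimum score, level) pairs from the Score Levels table, highest first.
LEVELS = [
    (90, "Expert Tool Use"),
    (75, "Advanced Tool Use"),
    (60, "Reliable Tool Use"),
    (40, "Basic Tool Use"),
    (20, "Inconsistent Tool Use"),
    (0, "Cannot Use Tools"),
]


def level_for(score: int) -> str:
    """Map a total score from 0 to 100 onto its level name."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, label in LEVELS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: the last floor is 0")
```

For example, `level_for(89)` returns `"Advanced Tool Use"` and `level_for(90)` returns `"Expert Tool Use"`.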

Category Breakdown

Category	Tasks	Max Points	What It Reveals
Schema Understanding	5	20	Basic JSON schema compliance
Tool Selection	5	20	Intent recognition and tool routing
Parameter Extraction	5	20	Precise argument parsing including date arithmetic
Multi-Step	4	16	Sequential planning and dependent call chaining
Error Recovery	3	12	Graceful handling of underspecified requests
Inappropriate Refusal	3	12	Avoiding over-eager tool invocation
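The per-category maximums follow from the task counts at 4 points per task, and a per-category percentage can show where a model's total score comes from. The sketch below reproduces that arithmetic; `category_percentages` is a hypothetical helper and the earned points in the usage line are made up for illustration.

```python
POINTS_PER_TASK = 4

# Task counts from the Category Breakdown table.
TASKS_PER_CATEGORY = {
    "Schema Understanding": 5,
    "Tool Selection": 5,
    "Parameter Extraction": 5,
    "Multi-Step": 4,
    "Error Recovery": 3,
    "Inappropriate Refusal": 3,
}

MAX_POINTS = {name: count * POINTS_PER_TASK for name, count in TASKS_PER_CATEGORY.items()}


def category_percentages(earned: dict) -> dict:
    """Convert earned points per category into a percentage of that category's maximum."""
    return {
        name: round(100 * earned.get(name, 0) / maximum, 1)
        for name, maximum in MAX_POINTS.items()
    }


print(sum(MAX_POINTS.values()))  # 100
print(category_percentages({"Multi-Step": 8})["Multi-Step"])  # 50.0
```

A model could reach Reliable Tool Use overall while scoring only half of the 16 multi-step points, which a per-category view makes visible.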

Summary

Tool calling is the bridge between an AI that talks about doing things and an AI that actually does them. The ToolCall Benchmark is the NotesXML suite's most forward-looking assessment: it measures readiness for the agentic workflows that NotesXML's IWV Agent feature is being built on.

A model at Reliable Tool Use (60+) is a practical automation partner for single-step tasks. A model at Expert Tool Use (90+) demonstrates the precision required for complex multi-step workflows, handles ambiguous requests without hallucinating parameters, and knows when to step back rather than guess.

References

OpenAI (2023). Function Calling and other API updates. platform.openai.com/docs/guides/function-calling
Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arxiv.org/abs/2305.15334
Qin, Y. et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arxiv.org/abs/2307.16789
