Tool Calling Benchmark (ToolCall)
IWV Digital Solutions LLC | NotesXML AI Evaluation Series
25 tasks × 4 points each = 100 points maximum, across 6 categories. The tools are schema definitions only: no tool is ever executed, no network request is made, and no external service is contacted. The benchmark grades only the correctness of the model's JSON output.
Introduction
Tool calling is the ability of a language model to respond not with prose but with a structured API call that invokes an external function. A model that can correctly identify when to call a tool, which tool to call, and precisely what parameters to pass is far more useful for task automation than a model that can only generate text.
The Tool Calling Benchmark measures this capability systematically. It presents 25 tasks across six categories and grades the model's JSON output against a strict schema on four dimensions: shape, tool selection, parameter extraction, and sequencing. Because no tool is actually executed, the score is independent of any real system calls.
Why This Matters for NotesXML
NotesXML's IWV Agent capability (on the development roadmap) relies on AI models that can correctly issue tool calls to interact with the note-taking system — creating notes, searching content, setting reminders, and executing multi-step workflows. A model scoring Reliable Tool Use (60+) can be trusted for single-step automation. A model reaching Expert Tool Use (90+) is ready for multi-step agentic workflows.
Reference: Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arxiv.org/abs/2305.15334.
Assessment Overview
Each model receives a system prompt containing the full JSON Schema definitions of six mock tools. It is then given 25 user messages — one per task — and must respond with a valid JSON tool_calls object. Context size: 4,096 tokens. Max tokens per response: 400.
The Mock Tool Catalog
All 25 tasks use the same six mock tools. These are schema definitions only — no tool is ever invoked, no network call is made, and no external service is contacted. The benchmark evaluates the model's JSON output purely for structural correctness.
Four tools (searchNote, createReminder, calculator, unitConvert) correspond to capabilities that NotesXML's IWV Agent will ship. The remaining two serve specific evaluation roles:
- weather: Included purely as a tool-discrimination fixture. NotesXML is a fully offline application; this tool does not exist in the shipping product and will never be called against a live API. It is in the catalog to create a distinct schema shape (location-based parameter extraction + unit enum) that tests whether the model can distinguish it from unitConvert, calculator, and searchNote. The weather tasks measure schema comprehension and argument correctness, not live API capability.
- noOp: Used when no other tool is appropriate. Critical for the error-recovery and inappropriate-refusal categories.
| Tool | Description | Required Parameters | Optional Parameters |
|---|---|---|---|
| searchNote | Search the user's notebook by keyword | query (string) | limit (integer, 1–50, default 10) |
| createReminder | Create a reminder for a future date/time | text (string), dueIso (ISO 8601 string) | none |
| weather | Get current weather at a location (benchmark fixture only, not a shipping feature) | location (string) | unit (celsius or fahrenheit, default celsius) |
| calculator | Evaluate a math expression | expression (string) | none |
| unitConvert | Convert a numeric value between units | value (number), fromUnit (string), toUnit (string) | none |
| noOp | Use when no other tool is appropriate | reason (string) | none |
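The catalog above can be written down as JSON Schema fragments. The following is a minimal sketch only: the real [TOOLS JSON] is not reproduced in this document, so the field layout, the `TOOLS` name, and the `check_arguments` helper are all assumptions made for illustration.

```python
# Hypothetical rendering of the six mock tools as JSON Schema fragments.
# The actual [TOOLS JSON] is not shown in this document; this layout is assumed.
TOOLS = {
    "searchNote": {
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50, "default": 10},
        },
        "required": ["query"],
    },
    "createReminder": {
        "properties": {"text": {"type": "string"}, "dueIso": {"type": "string"}},
        "required": ["text", "dueIso"],
    },
    "weather": {
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"},
        },
        "required": ["location"],
    },
    "calculator": {
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
    "unitConvert": {
        "properties": {
            "value": {"type": "number"},
            "fromUnit": {"type": "string"},
            "toUnit": {"type": "string"},
        },
        "required": ["value", "fromUnit", "toUnit"],
    },
    "noOp": {
        "properties": {"reason": {"type": "string"}},
        "required": ["reason"],
    },
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float)}


def check_arguments(tool_name, arguments):
    """Return a list of problems; an empty list means the call fits the schema."""
    schema = TOOLS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    for name in schema["required"]:
        if name not in arguments:
            problems.append(f"missing required parameter: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            # Assumption: parameters outside the schema are flagged.
            problems.append(f"unexpected parameter: {name}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name} is below {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            problems.append(f"{name} is above {spec['maximum']}")
    return problems
```

Under these assumptions, `check_arguments("unitConvert", {"value": "50", "fromUnit": "miles", "toUnit": "kilometers"})` reports a type problem, because the table lists `value` as a number rather than a string.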
System Prompt
Every task is prefixed with this instruction:
You are a tool-using assistant. You have access to the following tools: [TOOLS JSON]
When the user message can be addressed by one or more tools, respond with ONLY a valid JSON object of the form:
{"tool_calls": [{"name": "<toolName>", "arguments": {...}}, ...]}
Rules: (1) Output ONLY JSON. No explanation, prose, or markdown code fences. (2) Use the noOp tool with a non-empty “reason” if no other tool fits or if the request is too ambiguous. (3) For multi-step tasks, include multiple entries in tool_calls in execution order. (4) Match parameter types in the schemas.
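The output rules above imply a first-pass structural check before any grading. The sketch below shows one way such a check might look; `parse_tool_calls` is a hypothetical helper, and treating an empty `tool_calls` list as invalid is an assumption rather than something the rules state.

```python
import json


def parse_tool_calls(raw):
    """Return the list of calls, or None if the response breaks the format rules."""
    try:
        payload = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None  # prose, markdown code fences, or malformed JSON
    if not isinstance(payload, dict):
        return None
    calls = payload.get("tool_calls")
    # Assumption: an empty list is treated as invalid, since noOp covers "no tool".
    if not isinstance(calls, list) or not calls:
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        if not isinstance(call.get("name"), str):
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return calls
```

A response wrapped in a markdown code fence or prefixed with prose fails `json.loads` and yields `None`, which corresponds to rule (1).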
The 25 Tasks
Category 1 — Schema Understanding (TOOL-001 to TOOL-005)
Tests basic compliance with tool schemas — correct tool name and correct argument types for straightforward, unambiguous requests.
| ID | User Message | Expected Tool | Key Arguments |
|---|---|---|---|
| TOOL-001 | Search my notes for “quarterly review” and give me 5 results. | searchNote | query=“quarterly review”, limit=5 |
| TOOL-002 | What is the weather in Boston in fahrenheit? | weather | location=“Boston”, unit=“fahrenheit” |
| TOOL-003 | Compute 17 plus 28. | calculator | expression=“17 + 28” |
| TOOL-004 | Convert 50 miles to kilometers. | unitConvert | value=50, fromUnit=“miles”, toUnit=“kilometers” |
| TOOL-005 | Find my notes about “project alpha”. | searchNote | query=“project alpha” |
Category 2 — Tool Selection (TOOL-006 to TOOL-010)
Tests whether the model correctly identifies which tool to use when the task type must be inferred from the request phrasing.
| ID | User Message | Expected Tool | Key Arguments |
|---|---|---|---|
| TOOL-006 | What is 47 times 83? | calculator | expression=“47 * 83” |
| TOOL-007 | How much is 200 grams in ounces? | unitConvert | value=200, fromUnit=“grams”, toUnit=“ounces” |
| TOOL-008 | Look up my note on “marketing strategy”. | searchNote | query=“marketing strategy” |
| TOOL-009 | Set a reminder to file taxes on April 15, 2027 at 9am. | createReminder | text=“file taxes”, dueIso=“2027-04-15T09:00” |
| TOOL-010 | What is the weather in Tokyo today? | weather | location=“Tokyo” |
Category 3 — Parameter Extraction (TOOL-011 to TOOL-015)
Tests precise argument extraction, including date arithmetic (“next Tuesday”), complex math expressions, and explicit numeric limits.
| ID | User Message | Expected Tool | Key Arguments |
|---|---|---|---|
| TOOL-011 | Remind me to call Sarah next Tuesday at 3pm. Today is Wed Apr 30 2026. | createReminder | text=“call Sarah”, dueIso=“2026-05-06T15:00” |
| TOOL-012 | Calculate (12 + 8) * 3 / 5. | calculator | expression=“(12 + 8) * 3 / 5” |
| TOOL-013 | Find up to 25 notes about “ai roadmap”. | searchNote | query=“ai roadmap”, limit=25 |
| TOOL-014 | Convert 100 kilograms to pounds. | unitConvert | value=100, fromUnit=“kilograms”, toUnit=“pounds” |
| TOOL-015 | Get the weather in Paris in celsius. | weather | location=“Paris”, unit=“celsius” |
Category 4 — Multi-Step Sequences (TOOL-016 to TOOL-019)
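Tests whether the model can plan and sequence multiple tool calls in a single response. In TOOL-016 and TOOL-018 the second call depends on the result of the first; because no tool is executed, the model must compute the intermediate value itself (for example, value=12 in TOOL-016).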
| ID | User Message | Expected Call Sequence |
|---|---|---|
| TOOL-016 | Compute 4 times 3, then convert that many cubic feet to cubic meters. | 1. calculator(“4 * 3”) → 2. unitConvert(value=12, “cubic feet”, “cubic meters”) |
| TOOL-017 | Search my notes for “Q3 budget”, then remind me to review the top result tomorrow morning at 9am. Today is Apr 30 2026. | 1. searchNote(“Q3 budget”) → 2. createReminder(text=“review”, dueIso=“2026-05-01T09:00”) |
| TOOL-018 | Calculate 25% of 480, then convert the result from miles to kilometers. | 1. calculator(“0.25 * 480”) → 2. unitConvert(value=120, “miles”, “kilometers”) |
| TOOL-019 | Get the weather in Denver, then set a reminder to check it again on May 5 2026 at noon. | 1. weather(“Denver”) → 2. createReminder(text=“weather check”, dueIso=“2026-05-05T12:00”) |
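The expected sequences for TOOL-016 and TOOL-018 contain intermediate values (12 and 120) that follow from the calculator expression. The sketch below shows how a test author might derive those values with a restricted arithmetic evaluator; `evaluate` and `chained_calls` are hypothetical helpers, not part of the benchmark.

```python
import ast
import operator

# Only the four basic arithmetic operators are supported in this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression):
    """Evaluate +, -, *, / and parentheses without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


def chained_calls(expression, from_unit, to_unit):
    """Build the expected two-call sequence for a compute-then-convert task."""
    return [
        {"name": "calculator", "arguments": {"expression": expression}},
        {
            "name": "unitConvert",
            "arguments": {
                "value": evaluate(expression),
                "fromUnit": from_unit,
                "toUnit": to_unit,
            },
        },
    ]


print(chained_calls("4 * 3", "cubic feet", "cubic meters")[1]["arguments"]["value"])  # 12
print(evaluate("0.25 * 480"))  # 120.0
```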
Category 5 — Error Recovery / Graceful Ambiguity (TOOL-020 to TOOL-022)
Tests whether the model correctly recognizes underspecified requests and uses noOp rather than guessing. An incorrect tool call on these tasks is penalized.
| ID | User Message | Expected Tool | Required Reason Pattern |
|---|---|---|---|
| TOOL-020 | Remind me about the meeting. | noOp | Reason mentions: time / when / date / specify / which / meeting |
| TOOL-021 | Convert this to that. | noOp | Reason mentions: value / unit / specify / ambiguous / unclear |
| TOOL-022 | Search for it. | noOp | Reason mentions: query / what / specify / unclear |
Category 6 — Inappropriate Refusal (TOOL-023 to TOOL-025)
Tests whether the model correctly avoids over-calling tools for conversational or general-knowledge requests. The correct behavior is noOp with a sensible reason — not an attempt to shoehorn the request into a tool, and not a refusal to respond at all.
| ID | User Message | Expected Tool | Required Reason Pattern |
|---|---|---|---|
| TOOL-023 | Hi, how are you today? | noOp | Reason mentions: no tool / chitchat / greeting / conversation |
| TOOL-024 | Tell me a joke. | noOp | Reason mentions: no tool / joke / conversation |
| TOOL-025 | What is the capital of France? | noOp | Reason mentions: knowledge / no tool / general |
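The "Reason mentions" columns in Categories 5 and 6 read naturally as keyword alternations. As a rough sketch, they could be checked with case-insensitive regular expressions; the exact patterns the benchmark uses are not given here, so `REASON_PATTERNS` and `reason_matches` below are assumptions built directly from the table keywords.

```python
import re

# Keyword alternations copied from the tables; the real regexes may differ.
REASON_PATTERNS = {
    "TOOL-020": r"time|when|date|specify|which|meeting",
    "TOOL-021": r"value|unit|specify|ambiguous|unclear",
    "TOOL-022": r"query|what|specify|unclear",
    "TOOL-023": r"no tool|chitchat|greeting|conversation",
    "TOOL-024": r"no tool|joke|conversation",
    "TOOL-025": r"knowledge|no tool|general",
}


def reason_matches(task_id, call):
    """True if the call is noOp with a non-empty reason that hits a keyword."""
    if call.get("name") != "noOp":
        return False
    reason = call.get("arguments", {}).get("reason", "")
    if not isinstance(reason, str) or not reason.strip():
        return False
    # Assumption: matching is case-insensitive and substring-based.
    return re.search(REASON_PATTERNS[task_id], reason, re.IGNORECASE) is not None


call = {"name": "noOp", "arguments": {"reason": "Please specify when the meeting is."}}
print(reason_matches("TOOL-020", call))  # True
```

Substring matching is permissive by design in this sketch: a reason such as "unclear request" passes TOOL-022 even though it never names the missing query.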
Scoring Rubric
Each task is worth 4 points. Total possible: 100 points.
Grading Mode: tool_name_strict_args_regex (all tasks except TOOL-016 to TOOL-019)
| Points | Criteria |
|---|---|
| 4 | Correct tool name AND all required arguments match regex patterns AND output is valid JSON |
| 3 | Correct tool name, most arguments correct, but one argument slightly off or an optional argument missing |
| 2 | Correct tool name, but most arguments wrong or a key argument missing |
| 1 | Wrong tool name but valid JSON structure |
| 0 | Invalid JSON, empty response, or complete mismatch |
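The rubric above leaves some room for interpretation ("most arguments", "slightly off"). The sketch below is one possible reading, not the benchmark's actual grader: it awards 3 when more than half of the expected argument patterns match and 2 otherwise, and it assumes a single-call task must contain exactly one call. The function name and the example patterns are hypothetical.

```python
import json
import re


def score_single_call(raw, expected_tool, arg_patterns):
    """Score one response 0-4 under an assumed reading of the rubric."""
    try:
        calls = json.loads(raw)["tool_calls"]
        call = calls[0]
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return 0  # invalid JSON, empty response, or wrong shape
    # Assumption: extra calls on a single-call task count as a mismatch.
    if not isinstance(args, dict) or len(calls) != 1:
        return 0
    if name != expected_tool:
        return 1  # valid structure, wrong tool
    matched = sum(
        1
        for key, pattern in arg_patterns.items()
        if key in args and re.fullmatch(pattern, str(args[key]), re.IGNORECASE)
    )
    total = len(arg_patterns)
    if matched == total:
        return 4
    # Assumption: "most arguments correct" means more than half matched.
    return 3 if matched * 2 > total else 2


# Illustrative patterns for TOOL-004; the real regexes are not published here.
tool_004 = {"value": r"50(\.0+)?", "fromUnit": r"miles?|mi", "toUnit": r"kilometers?|km"}
response = (
    '{"tool_calls": [{"name": "unitConvert", "arguments": '
    '{"value": 50, "fromUnit": "miles", "toUnit": "kilometers"}}]}'
)
print(score_single_call(response, "unitConvert", tool_004))  # 4
```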
Grading Mode: sequence_strict (TOOL-016 to TOOL-019)
| Points | Criteria |
|---|---|
| 4 | Correct array length, both calls correct (name + args), in the right order |
| 3 | Correct array length, first call correct, second has minor arg error |
| 2 | One of two calls correct |
| 1 | Both calls named correctly but both have significant arg errors |
| 0 | Wrong number of calls, invalid JSON, or complete mismatch |
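A position-by-position comparison is one way to approximate the sequence_strict rubric above. The sketch below is an assumed interpretation: it collapses the "minor" versus "significant" argument-error distinction into a single "right tool, wrong arguments" state, and `call_status`, `score_sequence`, and the example patterns are hypothetical.

```python
import re


def call_status(call, expected_name, arg_patterns):
    """Return 'correct', 'named' (right tool, bad arguments), or 'wrong'."""
    if not isinstance(call, dict) or call.get("name") != expected_name:
        return "wrong"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "named"
    ok = all(
        key in args and re.fullmatch(pattern, str(args[key]), re.IGNORECASE)
        for key, pattern in arg_patterns.items()
    )
    return "correct" if ok else "named"


def score_sequence(calls, expected):
    """Score a parsed two-call sequence 0-4; `expected` is [(name, patterns), ...]."""
    if not isinstance(calls, list) or len(calls) != len(expected):
        return 0  # wrong number of calls
    # Comparing by position means a reversed sequence scores as a mismatch.
    statuses = [
        call_status(call, name, patterns)
        for call, (name, patterns) in zip(calls, expected)
    ]
    if all(status == "correct" for status in statuses):
        return 4
    if statuses[0] == "correct" and statuses[1] == "named":
        return 3
    if "correct" in statuses:
        return 2
    if all(status == "named" for status in statuses):
        return 1
    return 0


# Illustrative expectations for TOOL-016.
tool_016 = [
    ("calculator", {"expression": r"4\s*\*\s*3"}),
    ("unitConvert", {"value": r"12(\.0+)?", "fromUnit": "cubic feet", "toUnit": "cubic meters"}),
]
calls = [
    {"name": "calculator", "arguments": {"expression": "4 * 3"}},
    {"name": "unitConvert", "arguments": {"value": 12, "fromUnit": "cubic feet", "toUnit": "cubic meters"}},
]
print(score_sequence(calls, tool_016))  # 4
```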
Score Levels
| Score | Percentage | Level |
|---|---|---|
| 90–100 | 90–100% | Expert Tool Use |
| 75–89 | 75–89% | Advanced Tool Use |
| 60–74 | 60–74% | Reliable Tool Use |
| 40–59 | 40–59% | Basic Tool Use |
| 20–39 | 20–39% | Inconsistent Tool Use |
| 0–19 | 0–19% | Cannot Use Tools |
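The score bands above map directly to a threshold lookup. A minimal sketch, with `LEVELS` and `score_level` as hypothetical names:

```python
# Lower bound of each band, highest first, taken from the Score Levels table.
LEVELS = [
    (90, "Expert Tool Use"),
    (75, "Advanced Tool Use"),
    (60, "Reliable Tool Use"),
    (40, "Basic Tool Use"),
    (20, "Inconsistent Tool Use"),
    (0, "Cannot Use Tools"),
]


def score_level(total):
    """Map a 0-100 total to its level name."""
    if not 0 <= total <= 100:
        raise ValueError("total must be between 0 and 100")
    for floor, label in LEVELS:
        if total >= floor:
            return label


print(score_level(74))  # Reliable Tool Use
print(score_level(90))  # Expert Tool Use
```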
Category Breakdown
| Category | Tasks | Max Points | What It Reveals |
|---|---|---|---|
| Schema Understanding | 5 | 20 | Basic JSON schema compliance |
| Tool Selection | 5 | 20 | Intent recognition and tool routing |
| Parameter Extraction | 5 | 20 | Precise argument parsing including date arithmetic |
| Multi-Step | 4 | 16 | Sequential planning and dependent call chaining |
| Error Recovery | 3 | 12 | Graceful handling of underspecified requests |
| Inappropriate Refusal | 3 | 12 | Avoiding over-eager tool invocation |
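The per-category maxima above follow from the task ID ranges and the 4-point task weight. The sketch below shows one way to roll per-task points up into category scores; the `CATEGORIES` layout and helper names are assumptions for illustration.

```python
# First and last task number per category, from the task listings.
CATEGORIES = {
    "Schema Understanding": (1, 5),
    "Tool Selection": (6, 10),
    "Parameter Extraction": (11, 15),
    "Multi-Step": (16, 19),
    "Error Recovery": (20, 22),
    "Inappropriate Refusal": (23, 25),
}
POINTS_PER_TASK = 4


def task_ids(first, last):
    """Return IDs such as 'TOOL-001' for an inclusive task number range."""
    return [f"TOOL-{number:03d}" for number in range(first, last + 1)]


def max_points(category):
    """Maximum points available in one category."""
    first, last = CATEGORIES[category]
    return (last - first + 1) * POINTS_PER_TASK


def category_scores(points_by_task):
    """Sum awarded points per category; tasks without a score count as 0."""
    return {
        name: sum(points_by_task.get(task, 0) for task in task_ids(*bounds))
        for name, bounds in CATEGORIES.items()
    }


print(sum(max_points(name) for name in CATEGORIES))  # 100
print(category_scores({"TOOL-016": 4, "TOOL-017": 3})["Multi-Step"])  # 7
```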
Summary
Tool calling is the bridge between an AI that talks about doing things and an AI that actually does them. The ToolCall Benchmark is the NotesXML suite's most forward-looking assessment: it measures readiness for the agentic workflows that NotesXML's IWV Agent feature is being built on.
A model at Reliable Tool Use (60+) is a practical automation partner for single-step tasks. A model at Expert Tool Use (90+) demonstrates the precision required for complex multi-step workflows, handles ambiguous requests without hallucinating parameters, and knows when to step back rather than guess.
References
- OpenAI (2023). Function Calling and other API updates. platform.openai.com/docs/guides/function-calling
- Patil, S. et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arxiv.org/abs/2305.15334
- Qin, Y. et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arxiv.org/abs/2307.16789
© 2026 IWV Digital Solutions LLC. All rights reserved.