AXIS Scoring Framework
AXIS produces a composite 0 to 100 AXIS Result by evaluating four independent dimensions of agent performance. Each dimension captures a different aspect of how the agent interacted with your system.
The Four Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Goal Achievement | 0.4 | Whether the final outcome satisfies the rubric checks you defined in the scenario. |
| Environment | 0.2 | How well the agent handled OS, filesystem, and dev tooling interactions. |
| Service | 0.2 | How well the agent used external APIs, MCP tools, and network calls. |
| Agent | 0.2 | How well the agent organized itself: planning, task management, and tool discovery. |
Composite AXIS Result
The final AXIS Result is the weighted average of all four dimension scores.
AXIS Result = (Goal Achievement × 0.4) + (Environment × 0.2) + (Service × 0.2) + (Agent × 0.2)

All dimension scores are 0 to 100. The composite result is rounded to the nearest whole number.
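A minimal worked example of the composite, assuming nothing beyond the formula above; the function name and sample scores are illustrative, not real AXIS output:

```ts
// A minimal sketch of the composite formula above.
const weights = { goal: 0.4, environment: 0.2, service: 0.2, agent: 0.2 };

type DimensionScores = { goal: number; environment: number; service: number; agent: number };

function axisResult(s: DimensionScores): number {
  const composite =
    s.goal * weights.goal +
    s.environment * weights.environment +
    s.service * weights.service +
    s.agent * weights.agent;
  return Math.round(composite); // rounded to the nearest whole number
}

// 88 × 0.4 + 72 × 0.2 + 90 × 0.2 + 65 × 0.2 = 80.6 → 81
axisResult({ goal: 88, environment: 72, service: 90, agent: 65 });
```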
Goal Achievement
Goal Achievement is evaluated by a Judge that reads the full agent transcript and compares the outcome against the rubric checks you defined in the scenario. Each check receives a score from 0 to 10, which is scaled to 0 to 100 and combined using the check weights.
This is the only dimension driven entirely by your rubric. The other three dimensions are calculated from the agent's interaction transcript.
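A sketch of the check aggregation, assuming check weights are normalized by their sum (the document does not specify this, and the rubric shape here is hypothetical):

```ts
// Hypothetical rubric shape: each check is judged 0-10 and carries a weight.
interface RubricCheck {
  weight: number;
  judgeScore: number; // 0-10 from the Judge
}

// Scale each check to 0-100, then take the weight-normalized average.
// Normalizing by total weight is an assumption; AXIS may define this differently.
function goalAchievement(checks: RubricCheck[]): number {
  const totalWeight = checks.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = checks.reduce((sum, c) => sum + c.judgeScore * 10 * c.weight, 0);
  return weightedSum / totalWeight;
}

// Two checks: weight 2 scoring 9/10, weight 1 scoring 6/10 → (180 + 60) / 3 = 80
goalAchievement([
  { weight: 2, judgeScore: 9 },
  { weight: 1, judgeScore: 6 },
]);
```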
Interaction Signals
The Environment, Service, and Agent dimensions are scored by analyzing every tool interaction in the agent's transcript. Each interaction is evaluated on five signals.
| Signal | Method | What It Measures |
|---|---|---|
| Success | Judge | Did the interaction complete without errors? Were the results usable? |
| Speed | Heuristic | How long did the interaction take relative to expectations for its category? |
| Weight | Judge | Was the tool invocation right-sized? Did the agent request more or less than needed? |
| Relevance | Judge | Was the tool output relevant and useful for completing the task? |
| Necessity | Judge | Were the interactions in this category actually needed, or were they avoidable? |
Judge signals are evaluated by an LLM that reads the full content of each tool call and its result. Heuristic signals are computed deterministically from measured values like duration, with no LLM involved.
Signal Weights by Category
Each category emphasizes different signals based on what matters most for that type of interaction.
| Signal | Environment | Service | Agent |
|---|---|---|---|
| Success | 0.35 | 0.25 | 0.15 |
| Speed | 0.15 | 0.15 | 0.15 |
| Weight | 0.15 | 0.20 | 0.20 |
| Relevance | 0.15 | 0.20 | 0.25 |
| Necessity | 0.20 | 0.20 | 0.25 |
Environment scores weight Success most heavily because failed shell commands or file operations are the most direct signal of a poor interaction. Speed is weighted equally across all categories at 0.15 — the category-specific speed thresholds (below) handle the fact that different operation types have different expected durations. Agent scores weight Necessity, Weight, and Relevance more heavily because self-organization quality is best measured by whether the agent's own actions were purposeful and well-scoped.
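A sketch of how these weights combine the five signals into a dimension score, assuming each signal has already been calibrated and aggregated to a single 0 to 100 value for the category; the plain weighted sum here is an illustration, not the exact AXIS aggregation:

```ts
// Combine the five signal scores (each 0-100) using the per-category
// weights from the table above. Each column sums to 1.0.
type Signal = "success" | "speed" | "weight" | "relevance" | "necessity";

const signalWeights: Record<string, Record<Signal, number>> = {
  environment: { success: 0.35, speed: 0.15, weight: 0.15, relevance: 0.15, necessity: 0.20 },
  service:     { success: 0.25, speed: 0.15, weight: 0.20, relevance: 0.20, necessity: 0.20 },
  agent:       { success: 0.15, speed: 0.15, weight: 0.20, relevance: 0.25, necessity: 0.25 },
};

function dimensionScore(category: string, signals: Record<Signal, number>): number {
  const w = signalWeights[category];
  return (Object.keys(w) as Signal[]).reduce((sum, s) => sum + signals[s] * w[s], 0);
}
```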
Speed Thresholds
Speed scores are based on how long each interaction took. Thresholds vary by category because different types of operations have different expected durations.
| Category | Excellent | Good | Fair | Slow | Very Slow |
|---|---|---|---|---|---|
| Environment | ≤500ms | ≤2s | ≤5s | ≤10s | >10s |
| Service | ≤2s | ≤5s | ≤10s | ≤25s | >25s |
| Agent | ≤2s | ≤5s | ≤15s | ≤30s | >30s |
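A sketch of the threshold lookup driven by the table above; the inclusive boundaries (≤) follow the table headers, while the helper name and data layout are illustrative:

```ts
// Map a measured duration to a speed band using the category thresholds above.
const speedThresholdsMs: Record<string, number[]> = {
  environment: [500, 2_000, 5_000, 10_000],
  service:     [2_000, 5_000, 10_000, 25_000],
  agent:       [2_000, 5_000, 15_000, 30_000],
};
const bands = ["excellent", "good", "fair", "slow", "very slow"];

function speedBand(category: string, durationMs: number): string {
  const thresholds = speedThresholdsMs[category];
  const idx = thresholds.findIndex((t) => durationMs <= t);
  return idx === -1 ? "very slow" : bands[idx];
}

speedBand("environment", 1_200); // "good"
speedBand("service", 30_000);    // "very slow"
```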
Score Calibration
Raw signal scores are mapped to a 0 to 100 scale using an S-curve rather than linear scaling (see the sketch after this list). This means:
- Improving from 20 to 50 is relatively easy (fixing obvious problems).
- Improving from 80 to 95 requires significant quality gains.
- A score of 50 represents median performance for that category.
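The documentation describes the shape of the curve but not its formula; a logistic centered at 50 reproduces these properties, with an arbitrary steepness constant chosen for illustration:

```ts
// Illustrative only: AXIS does not specify its calibration curve here.
// A logistic centered at 50 shows the described behavior: gains come
// cheaply in the low-to-mid range and expensively near the top.
function sCurve(raw: number, midpoint = 50, steepness = 0.08): number {
  return 100 / (1 + Math.exp(-steepness * (raw - midpoint)));
}

sCurve(50); // 50: median performance maps to the middle of the scale
```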
Speed is aggregated using a severity-weighted average: slow interactions pull the score down disproportionately rather than being hidden by many fast interactions. Other signals (success, weight, relevance) are weighted by context size, so a failed API call that returns a large error response influences the score more than a trivial file read.
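A sketch of the severity-weighted average for speed; the severity function below is an assumption chosen to show the effect, not the AXIS formula:

```ts
// Severity-weighted averaging: an interaction's weight grows as its
// speed score drops, so one slow call is not hidden by many fast ones.
function severityWeightedSpeed(scores: number[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const s of scores) {
    const severity = 1 + (100 - s) / 25; // a score of 0 counts 5x a score of 100
    weightedSum += s * severity;
    totalWeight += severity;
  }
  return weightedSum / totalWeight;
}

// Nine fast interactions (95) plus one very slow one (10):
// plain mean = 86.5, severity-weighted ≈ 69.6
severityWeightedSpeed([95, 95, 95, 95, 95, 95, 95, 95, 95, 10]);
```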
Transcript Categorization
Every tool interaction is classified into a category based on the tool name. This classification determines which dimension the interaction contributes to.
Environment
OS, filesystem, and dev tooling interactions:
- Shell: bash, shell, terminal, exec.
- File ops: read, write, edit, glob, grep, cat, head, tail, find, ls, mkdir, rm, cp, mv.
- Version control: git.
- Package managers: npm, yarn, pip, cargo, go, brew, apt.
- Build and test: make, tsc, docker, kubectl, node, python.
Agent
Self-organization and metacognition:
- Tool discovery: toolsearch, listtoolsets, list_tools.
- Task management: taskcreate, taskupdate, tasklist, todo_read, todo_write.
- Planning: enterplanmode, exitplanmode.
- User interaction: askuserquestion, askfollowupquestion.
- Skill invocation: skill.
Service
Everything else: external APIs, MCP tools, network calls, and custom services. Any tool not matching the environment or agent patterns is classified as a service interaction.
Some interactions span categories. For example, running curl via bash is both an environment interaction (shell command) and a service interaction (network call). Environment tools that target agent-internal paths (like .claude/) are reclassified as agent interactions.
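A sketch of the classification order using the pattern lists above; the exact-name matching and the targetPath parameter are simplifying assumptions:

```ts
// Classify a tool interaction by name, per the pattern lists above.
const environmentTools = new Set([
  "bash", "shell", "terminal", "exec",
  "read", "write", "edit", "glob", "grep", "cat", "head", "tail",
  "find", "ls", "mkdir", "rm", "cp", "mv",
  "git", "npm", "yarn", "pip", "cargo", "go", "brew", "apt",
  "make", "tsc", "docker", "kubectl", "node", "python",
]);
const agentTools = new Set([
  "toolsearch", "listtoolsets", "list_tools",
  "taskcreate", "taskupdate", "tasklist", "todo_read", "todo_write",
  "enterplanmode", "exitplanmode",
  "askuserquestion", "askfollowupquestion",
  "skill",
]);

function categorize(toolName: string, targetPath?: string): "environment" | "agent" | "service" {
  const name = toolName.toLowerCase();
  if (agentTools.has(name)) return "agent";
  if (environmentTools.has(name)) {
    // Environment tools touching agent-internal paths are reclassified.
    return targetPath?.includes(".claude/") ? "agent" : "environment";
  }
  return "service"; // everything else: APIs, MCP tools, network calls
}
```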
Interpreting Scores
| Range | Interpretation |
|---|---|
| 90 to 100 | Excellent. Agent completed the task efficiently with minimal waste. |
| 75 to 89 | Good. Task completed with minor inefficiencies or missed optimizations. |
| 50 to 74 | Fair. Notable issues in execution quality, speed, or unnecessary operations. |
| Below 50 | Poor. Significant failures, errors, or excessive waste in the execution. |
When a category score falls below 75, the CLI displays score insights that identify the weakest signal, helping you understand where the agent struggled.
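A sketch of that insight logic; the below-75 threshold comes from the sentence above, while the function shape and message format are hypothetical:

```ts
// When a category score falls below 75, surface its lowest-scoring signal.
function scoreInsight(category: string, score: number, signals: Record<string, number>): string | null {
  if (score >= 75) return null;
  const [weakest] = Object.entries(signals).sort(([, a], [, b]) => a - b)[0];
  return `${category}: weakest signal is "${weakest}"`;
}

scoreInsight("agent", 62, { success: 80, speed: 70, weight: 55, relevance: 60, necessity: 48 });
// => 'agent: weakest signal is "necessity"'
```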