Clerk Agent Guard
Compare family-level risk profiles across tool-call robustness, procedure compliance, multilingual serialization, uncertainty calibration, and long-context degradation.
Running in unauthenticated preview mode
| Dimension | GPT-4.1 OpenAI | Claude 4 Ops Anthropic |
|---|---|---|
Tool-Call Robustness Ability to keep tool names and parameters stable under paraphrase and noise. | 88% Weakness: 12% | 86% Weakness: 14% |
Procedure Compliance Ability to follow required action ordering for critical write workflows. | 84% Weakness: 16% | 91% Weakness: 9% |
Multilingual Serialization Argument shape and locale correctness when parsing non-English prompts. | 79% Weakness: 21% | 89% Weakness: 11% |
Uncertainty Calibration Whether confidence signals align with real error probability for risky calls. | 74% Weakness: 26% | 82% Weakness: 18% |
Long-Context Degradation Drop in tool execution quality as prompt context grows. | 71% Weakness: 29% | 80% Weakness: 20% |
Chat Completion | OpenAI | 2025-01
Tool-Call Robustness
Procedure Compliance
Multilingual Serialization
Uncertainty Calibration
Long-Context Degradation
Recommended guard actions
Run JSON schema validation before execution and reject enum drift, unknown keys, and type swaps between similar fields.
Require preflight checkpoints for every multi-step write workflow and verify prerequisite actions were completed.
Normalize and canonicalize locale-specific identifiers before runtime tool dispatch so locale text cannot mutate machine IDs.
Escalate when confidence width is high or disagreement between validators crosses a defined threshold.
Reduce active tool context to the minimal action slice and rehydrate state per turn for long conversations.
Function Agent | Anthropic | 2025-02
Tool-Call Robustness
Procedure Compliance
Multilingual Serialization
Uncertainty Calibration
Long-Context Degradation
Recommended guard actions
Run JSON schema validation before execution and reject enum drift, unknown keys, and type swaps between similar fields.
Require preflight checkpoints for every multi-step write workflow and verify prerequisite actions were completed.
Normalize and canonicalize locale-specific identifiers before runtime tool dispatch so locale text cannot mutate machine IDs.
Escalate when confidence width is high or disagreement between validators crosses a defined threshold.
Reduce active tool context to the minimal action slice and rehydrate state per turn for long conversations.
Tool-Call Robustness
GPT-4.1 88% vs Claude 4 Ops 86%
Δ 2%
Procedure Compliance
GPT-4.1 84% vs Claude 4 Ops 91%
Δ +7%
Multilingual Serialization
GPT-4.1 79% vs Claude 4 Ops 89%
Δ +10%
Uncertainty Calibration
GPT-4.1 74% vs Claude 4 Ops 82%
Δ +8%
Long-Context Degradation
GPT-4.1 71% vs Claude 4 Ops 80%
Δ +9%