There is a particular kind of release-notes PR that you have to read twice, because the first reading does not tell you what actually happened. The relevant entry in the Hermes Agent v0.8.0 notes is this one:
> Self-Optimized GPT/Codex Tool-Use Guidance — The agent diagnosed and patched 5 failure modes in GPT and Codex tool calling through automated behavioral benchmarking, dramatically improving reliability on OpenAI models. (#6120)
Reading that the first time, I assumed it was a sentence written by somebody who was being a little loose with what "self-optimized" meant. Reading it the second time, I realized it was literal. The agent ran a benchmark suite against itself, spotted systematic failures in how OpenAI models were calling its tools, generated targeted guidance to fix those failures, and measured that the fixes actually worked. Humans approved the result. But the whole diagnose and patch step was automated.
This is worth unpacking, because it is the kind of capability that sneaks into a release as a single line and changes what future releases can look like.
## The specific problem
The pairing of Hermes Agent with OpenAI models — GPT-5, Codex — had been visibly wobbly for a few releases. Users would report that Anthropic Claude worked smoothly while GPT-5 sometimes produced arguments in the wrong shape, skipped steps, or lost track of what it had already done during a multi-tool sequence. These are not subtle bugs; you can watch them happen. But they are also maddening to fix by hand, because the failure modes are model-specific and depend on prompt wording in ways that are hard to intuit.
There were five recurring patterns, according to the PR description:
- Skipping recommended pre-checks before destructive tool calls.
- Producing tool arguments as raw strings where the schema wanted structured objects or numbers.
- Losing track of which tool call had already succeeded within a chained sequence, leading to duplicate calls.
- Declining to retry after transient errors even when the guidance said to.
- Drifting from "execute the plan" into "re-plan the plan" when given long contexts.
None of these are theoretical. Each of them had a trail of GitHub issues behind it.
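For concreteness, the five patterns can be captured as a small taxonomy. This is a hypothetical sketch; the PR's actual internal labels, if it has any, are not public:

```python
from enum import Enum

class FailureMode(Enum):
    """Hypothetical labels for the five recurring failure patterns."""
    SKIPPED_PRECHECK = "skipped_precheck"    # no preview before a destructive call
    UNSTRUCTURED_ARGS = "unstructured_args"  # raw string where schema wants object/number
    DUPLICATE_CALL = "duplicate_call"        # repeats a call that already succeeded
    NO_RETRY = "no_retry"                    # gives up after a transient error
    REPLANNING_DRIFT = "replanning_drift"    # re-plans instead of executing the plan
```

Having the patterns as named, enumerable values is what lets a benchmark harness tag a failure mechanically instead of relying on a human reading transcripts.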
## The loop that fixed them
The approach in PR #6120 has three moving parts.
First, an automated behavioral benchmark suite. A harness runs the agent against a set of synthetic scenarios that are designed to elicit each of the five failure modes. For each scenario, the benchmark records what the model did, what it should have done, and whether the difference counts as one of the known failure patterns.
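The shape of that first step might look something like the following. The `Scenario` and `classify` names are assumptions for illustration, not the harness's real API; in particular, a real classifier would diff the transcripts structurally rather than just echo the scenario's target pattern:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scenario:
    """One synthetic scenario built to elicit a known failure pattern."""
    name: str
    target_pattern: str        # the failure mode this scenario is designed to trigger
    expected_calls: list       # the tool-call transcript a correct run would produce

@dataclass
class BenchmarkResult:
    scenario: Scenario
    observed_calls: list       # what the model actually did

def classify(result: BenchmarkResult) -> Optional[str]:
    """Label the run with a known failure pattern, or None if it matched."""
    if result.observed_calls == result.scenario.expected_calls:
        return None
    # Simplified: report the pattern the scenario was built to elicit.
    return result.scenario.target_pattern
```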
Second, a guidance-generation step. When the benchmark flags a failure, the harness produces candidate guidance strings — short, model-specific instructions to be added to the system prompt that target the exact pattern that failed. Not "be more careful"; things like "before calling any destructive tool, first call the matching preview tool with identical arguments and report its output." The candidates are generated against the specific failures observed, not against a generic rubric.
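A minimal sketch of that mapping from observed pattern to candidate guidance. The template table and `candidates_for` helper are assumptions; the PR describes generating candidates against the specific failures observed, which a real harness would do dynamically rather than from a fixed table:

```python
# Hypothetical templates keyed by failure pattern. Each value is a list of
# candidate system-prompt additions targeting exactly that pattern.
GUIDANCE_TEMPLATES = {
    "skipped_precheck": [
        "Before calling any destructive tool, first call the matching preview "
        "tool with identical arguments and report its output.",
    ],
    "unstructured_args": [
        "Emit tool arguments as JSON matching the tool schema exactly; "
        "never wrap an object or number inside a quoted string.",
    ],
}

def candidates_for(pattern: str) -> list:
    """Return candidate guidance strings targeting one failure pattern."""
    return GUIDANCE_TEMPLATES.get(pattern, [])
```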
Third, a re-benchmarking step. Each candidate guidance string is tested on the same scenario suite. Guidance that improves scores gets kept. Guidance that regresses other scenarios gets discarded. The surviving strings are aggregated into the GPT/Codex system prompt Hermes ships.
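The keep-or-discard logic of that third step can be sketched as a small selection loop. `run_suite` is an assumed callable (guidance string in, per-scenario pass/fail out), not the PR's actual interface:

```python
def select_guidance(candidates, run_suite):
    """Keep a candidate only if it improves some scenario without
    regressing any other, relative to the no-guidance baseline.

    run_suite(guidance) -> dict mapping scenario name to pass (True/False).
    """
    baseline = run_suite(guidance=None)
    kept = []
    for guidance in candidates:
        scores = run_suite(guidance=guidance)
        improved = any(scores[s] and not baseline[s] for s in scores)
        regressed = any(baseline[s] and not scores[s] for s in scores)
        if improved and not regressed:
            kept.append(guidance)
    return kept
```

The "no regressions anywhere else" condition is the important part: it is what stops a fix for one failure mode from quietly breaking another, which is exactly the trap hand-tuned prompts fall into.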
## Why this is a different kind of thing
Historically, the prompts that ship with an agent framework are written by humans — sometimes one human, sometimes a small team — who develop opinions about what works and what does not by running their own eyeballs across conversations. That process has three problems: it does not scale to many models, it cannot be re-run cheaply when a model updates, and it produces prompts that encode one person's superstitions alongside their actual evidence.
Replacing the "human writes, human evaluates" loop with a "benchmark writes candidates, benchmark evaluates them" loop is not the same thing as letting an AI tune itself unsupervised. Humans still approve the final guidance. But the actual work of finding patterns and proposing fixes is now measurable and repeatable. When GPT-6 ships, the same harness runs again. When a new failure mode is reported, a new scenario gets added and the loop runs again. The overhead of keeping the prompt in sync with the actual behavior of the model goes down by an order of magnitude.
The related changes in v0.8.0 hint at the same direction. Thinking-only prefill continuation (#5931) handles the specific case where a model produces a reasoning block without a content block and then gets stuck. Execution discipline guidance (#5414) adds a general "do not re-plan unless something actually changed" rule to the system prompt. Coerce tool call arguments (#5265) silently converts strings to numbers and booleans when the JSON schema expects them, papering over the exact argument-typing failure the benchmark caught. These three PRs look unrelated in a changelog. Read together with PR #6120, they are the visible surface of a deliberate campaign: make the agent more reliable on a specific model family by measuring, fixing, and re-measuring.
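The coercion idea behind #5265 is simple enough to sketch. This is a minimal illustration of the technique, assuming JSON Schema `"number"` and `"boolean"` type keywords, not the PR's actual implementation:

```python
def coerce_argument(value, schema_type):
    """Best-effort coercion of a string tool argument to the schema's type.

    Leaves the value untouched when it is not a string or cannot be
    converted, so valid arguments are never damaged.
    """
    if not isinstance(value, str):
        return value
    if schema_type == "number":
        try:
            return float(value) if "." in value else int(value)
        except ValueError:
            return value
    if schema_type == "boolean" and value.lower() in ("true", "false"):
        return value.lower() == "true"
    return value
```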
## The quiet implication
The thing that makes this feel like a hinge moment, and not just a good release, is that it hints at what agent engineering looks like when you stop hand-writing prompts. A repo like Hermes has to target twelve different providers and thirty different models. Tracking the behavior of each one manually is already impossible. If you instead treat "how should we prompt model X" as a tuning problem that a benchmark harness solves, you have a process that scales with the size of the model zoo instead of drowning in it.
None of that was the headline of v0.8.0. The headline was "intelligence release" and the demo was live model switching. But the quiet thing under the floorboards is that Hermes now has a way to keep the agent smart on every model it supports, without a human spending a weekend per model re-tuning the system prompt. That is the kind of capability that will not show up as a feature next release either — it will just show up as "GPT reliability on Hermes keeps getting better" across the next six releases in a row.
Which, if you are reading release notes for fun like I am, is the shape of the thing you want to be watching.