v0.14.0 shipped a stack of performance work that, taken together, makes Hermes Agent feel like a different program. The three headlines from the release notes:
- •~19 seconds off cold start. Across the import graph, the doctor checks, the model catalog lookup, the welcome banner.
- •180× faster
browser_console. One persistent CDP connection instead of opening a new DevTools session per call. - •Cross-session 1-hour Claude prompt cache.
/newdoesn't lose your warm prefix cache.
None of these is a "rewrite from scratch." They're each a series of small, locatable PRs that targeted specific hot paths. This post is the PR-by-PR breakdown — what was slow, what changed, and what the lessons are for any agent author.
The starting point
If you ran hermes in early May 2026, the cold start path looked like this:
- 1.Python interpreter starts (~100 ms)
- 2.
import hermestriggers the rest of the import graph (~5-10 s) - 3.
hermesplugin discovery scans for installed plugins (~2-3 s) - 4.
hermes doctorruns API connectivity probes sequentially (~3-5 s) - 5.Skills index loads, often re-fetching from network (~2-4 s)
- 6.Welcome banner renders (~1 s)
- 7.Prompt appears
That's 15-25 seconds before you can type. Long enough to break flow. v0.14.0's perf wave attacks each of those steps.
The cold-start wave (10+ PRs)
Skills cache + lazy adapter imports (#22138)
The single biggest win. Two parallel changes:
- 1.Skills index is cached to disk. Previous behavior: every
hermesinvocation re-read the skills directory and ran lightweight indexing. New behavior: index lives in a cache file; only re-built when the directory changes (mtime check). - 2.Heavy adapter SDKs are lazy. Before: importing
hermestriggered import of the Feishu SDK, the WhatsApp SDK, every voice/TTS provider, every image-gen SDK. New: each is deferred until first use. If you never use Feishu, the Feishu SDK never loads.
Both changes were in #22138 and trimmed ~5-8 seconds. The lazy-import pattern is the unsung hero — it's mechanically applied across the codebase but doesn't show up as a single dramatic PR.
Skip eager plugin discovery on known subcommands (#22120)
If you run hermes config set X Y, you don't need plugin discovery. Same for hermes tools, hermes model, hermes doctor. The dispatcher now short-circuits plugin discovery when the command is a known built-in. ~2 seconds saved on most CLI invocations.
models.dev cache-first lookup (#22808)
Hermes hits models.dev for the canonical model catalog. Pre-v0.14.0, every hermes start did the HTTP call. New: read from disk cache first; only refetch when stale (>24 h). The cache hit is ~5 ms instead of ~200-500 ms (network roundtrip).
Parallel API connectivity probes in hermes doctor (#22766)
hermes doctor checks "can I reach Nous Portal, OpenRouter, Anthropic, OpenAI" sequentially. New: asyncio.gather them, drop IMDS (instance metadata service) probing that was timing out on non-cloud machines. ~3-5 seconds shaved off doctor in the typical case.
chat -q skips the welcome banner (#22904)
If you're in single-query mode (hermes chat -q "what's the time"), the welcome banner is noise. v0.14.0 skips rendering it. ~1 second back for scripted/automation use cases.
Deferred imports across the graph
A long tail of PRs each defer one heavyweight import:
- •
google-cloudingoogle_chatadapter — only loads when google chat is first used (#22681) - •
QQAdapter,YuanbaoAdapter— module-level deferral via PEP 562 (#22790) - •
httpxin the teams adapter — only on first webhook call (#22831) - •
fal_client— only on first image generation (#22859)
Each PR is small. Together they're another 2-4 seconds.
Cache Nous auth + .env loads (#25341)
The hermes tools All-Platforms screen used to do per-platform auth probes synchronously. That dropped from 14 seconds to under 1.5 seconds when the cache landed. The screen still works the same; it just doesn't burn 14 s of network time every invocation.
Net result
Add up the wins and you get the headline 19 seconds. None individually is dramatic; together they make the program feel responsive.
The 180× browser_console story (#23226)
The browser tool in Hermes uses Chrome DevTools Protocol (CDP) to drive a headless Chrome. Pre-v0.14.0, every browser_console evaluation did this:
- 1.Open a new CDP WebSocket to Chrome
- 2.Wait for handshake (~1-2 s)
- 3.Send the JavaScript to evaluate
- 4.Read the result
- 5.Close the WebSocket
For a chain of evaluations — say, the agent inspecting a page step by step — this was ~1.5 s per evaluation. A typical "look at this page and tell me about it" workflow was painful slow.
The v0.14.0 change: one persistent WebSocket per supervisor, shared across all browser tool calls. The CDP handshake happens once per session; subsequent evaluations skip steps 1-2 entirely and just send the payload.
Result: per-call latency dropped from ~1.5 s to ~8 ms. That's ~180×.
The lesson is general. Anywhere your agent is doing N tool calls and each opens a new connection: connection pooling beats parallelism for chat-paced workloads. The total work is in the connection setup, not the work itself.
Cross-session 1-hour Claude prompt cache (#23828, #25434, #24778)
Claude's prompt cache (Anthropic API feature) lets you mark a chunk of context as cacheable. Subsequent requests within the TTL — historically 5 minutes — reuse the cached prefix at much lower cost and latency.
Pre-v0.14.0, Hermes used this within a session: the system prompt + active skills got cached, hits within the same conversation were faster and cheaper. But when you ran /new, the cache was invalidated, and the next conversation started cold.
The v0.14.0 work, across three PRs, extends the cache TTL to 1 hour and makes the cache key survive /new. The first response in a fresh session benefits from a warm cache from your previous session. For people who do many short conversations a day, this is a meaningful cost-and-latency win.
The cache also covers the background memory review fork (#25434) — the periodic process that revisits and consolidates your memory file now hits the same prompt cache instead of paying full price every iteration.
A complementary perf fix in this area: the prompt cache key uses a date-only timestamp so caches don't bust across timezones. Loud gateway-DB roundtrip logging rounds out the observability.
Tools that surfaced the wins
How did the authors find these specific hot paths? Two answers:
- •Cold-start profiling on real install. They ran
hermesunder timing and tracked which imports took the longest. Then went after them one by one. - •User reports of "Hermes feels slow." The
hermes toolsAll Platforms screen being slow was a specific bug report; once it surfaced, the fix was straightforward.
There's no magic. The wins are findable with python -X importtime, cProfile, and listening to people complain.
Lessons for any agent author
If you're building an agent, the v0.14.0 perf wave generalizes into a few rules:
- 1.Don't import what you don't use. Lazy-import every adapter and provider SDK. The import graph is your cold start.
- 2.Cache anything that's stable. Skills index, model catalog, auth tokens. Re-fetch only when known-stale.
- 3.Pool connections. If you call an API or a CDP server more than once per session, hold the connection.
- 4.Run independent probes in parallel. Doctor checks, health checks, anything where order doesn't matter.
- 5.Don't render UI you don't need. Welcome banners, decorative output — skip them in non-interactive paths.
None of these is novel. They're the standard performance playbook for any CLI. The interesting thing about v0.14.0 is they applied the whole playbook in one release window and the result was felt the moment users updated.
If your Hermes felt brittle in early May, v0.14.0 is the release where the cold path went away.