Deep Dive For Power Users

Memory Architecture in Hermes Agent: Honcho and the Pluggable Memory Interface

Hermes Agent
@hermesagents

March 29, 2026

9 min read

Most AI chat interfaces you have used do not really have memory. They have a context window, which is a very different thing. Whatever you said earlier in the same conversation is still in front of the model. Whatever you said in yesterday's conversation is gone. The next day you start again from nothing, and the assistant reintroduces itself as a stranger.

Hermes Agent is different. It has a real memory layer — separate from the conversation context — that learns things about you over time, carries those things across sessions and platforms, and makes the bot behave like the same entity every time you talk to it. This post is about how that actually works, which decisions matter, and what the v0.7.0 pluggable memory interface changed.

Short-term memory vs long-term memory

First, the distinction that matters.

Short-term memory in Hermes is the session context window. It is a slice of the conversation history the agent is currently having, managed with a proactive compression strategy: when the context gets close to the model's limit, Hermes runs a summarization pass that collapses older turns into a structured summary while preserving the most recent exchanges verbatim. The compression has been tuned through several releases — structured summaries with iterative updates in v0.4.0, token-budget tail protection, a configurable summary endpoint, and fallback model support. On long conversations this quietly keeps the agent fast and cheap without dropping important context.
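The shape of this kind of proactive compression can be sketched in a few lines. Everything below is illustrative (the function names, the token heuristic, and the thresholds are assumptions, not Hermes' actual implementation): once the estimated token count nears the limit, older turns collapse into one summary message while the most recent exchanges stay verbatim.

```python
# Sketch of proactive context compression. All names and thresholds are
# illustrative, not Hermes' actual API.

def estimate_tokens(messages):
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # Stand-in for a model-backed summarization call (the "summary endpoint").
    return "Summary of %d earlier turns." % len(messages)

def compress(messages, limit=8000, keep_tail=6):
    """Collapse older turns into one summary once near the token limit,
    keeping the last `keep_tail` exchanges verbatim (tail protection)."""
    if estimate_tokens(messages) < limit or len(messages) <= keep_tail:
        return messages
    head, tail = messages[:-keep_tail], messages[-keep_tail:]
    summary = {"role": "system", "content": summarize(head)}
    return [summary] + tail

history = [{"role": "user", "content": "x" * 4000} for _ in range(10)]
compact = compress(history)
print(len(compact))  # 7: one summary message plus the 6 most recent turns
```

The key property is that compression happens before the limit is hit, so the agent never has to drop a turn mid-conversation.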

Long-term memory is the interesting part. It is a store of facts, preferences, corrections, and user models that lives outside the conversation. When you tell the bot "my name is Alice" on Telegram today, that fact gets written to long-term memory. Tomorrow, when you ask it something on Slack, the fact is pulled out and injected into the context before the agent sees your message. The model still only gets what fits in its window, but the window is primed with things it should know about you.

Short-term memory is a buffer. Long-term memory is a person.

Honcho: what it is and why it matters

The default long-term memory provider in Hermes is Honcho, a library purpose-built for AI-native memory. Honcho's job is to run behind the agent and do three specific things:

  1. Observe. Every user message and every agent response feeds into Honcho as an event stream. Honcho builds an internal user model from that stream — not raw chat history, but structured facts and preferences inferred from the conversation.
  2. Reason about the user. Honcho runs a small "dialectic" layer that tries to build a coherent picture of who you are, what you want, and what you have corrected. It is not just keyword extraction — it is a running mental model of the user.
  3. Inject. On each new turn, Honcho produces a short context snippet summarizing what it thinks matters about the user, which Hermes prefixes to the system prompt. The snippet changes as Honcho learns more.
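The per-turn loop these three steps describe can be sketched as follows. This is a toy model, not Honcho's API: the class, method names, and the naive "snippet" logic are all assumptions made for illustration.

```python
# Toy sketch of the observe/reason/inject loop. A real user model infers
# structured facts; this one just surfaces recent user statements.

class ToyUserModel:
    def __init__(self):
        self.observations = []

    def observe(self, role, text):
        self.observations.append((role, text))

    def snippet(self):
        # Stand-in for the "reason about the user" step.
        user_msgs = [t for r, t in self.observations if r == "user"]
        return "Known about user: " + "; ".join(user_msgs[-2:])

model = ToyUserModel()

def turn(user_message):
    context = model.snippet()              # inject before the model sees the turn
    prompt = f"{context}\n\nUser: {user_message}"
    reply = "ack"                          # stand-in for the LLM call
    model.observe("user", user_message)    # observe both sides of the exchange
    model.observe("agent", reply)
    return prompt

p1 = turn("my name is Alice")
p2 = turn("I prefer short answers")
print(p2.splitlines()[0])  # Known about user: my name is Alice
```

The point of the sketch is the ordering: the snippet is injected from what was learned on earlier turns, and the current exchange only feeds the model afterward.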

Two details are worth pointing at because they are easy to miss.

First, Honcho writes are asynchronous. The agent does not block on a memory write. It responds, and the memory layer processes the exchange in the background. This means long conversations do not pay a latency tax for memory updates, and a memory backend outage does not stop the bot — you lose updates during the outage, but the assistant keeps talking.
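A fire-and-forget write of this kind is easy to demonstrate with `asyncio`. The names below are illustrative, not Hermes' internals; the point is that the reply returns before the memory write lands, and a failed write would be logged rather than surfaced.

```python
# Sketch of an asynchronous, non-blocking memory write. Illustrative names.
import asyncio

memory_log = []

async def write_memory(user_id, exchange):
    # Simulate a slow memory backend. In a real system, an exception here
    # would be logged and swallowed so the conversation never blocks.
    await asyncio.sleep(0.05)
    memory_log.append((user_id, exchange))

async def respond(user_id, message):
    reply = f"echo: {message}"            # stand-in for the model call
    asyncio.create_task(write_memory(user_id, (message, reply)))
    return reply                          # returns before the write completes

async def main():
    reply = await respond("alice", "my name is Alice")
    before = list(memory_log)             # still empty: write is in flight
    await asyncio.sleep(0.1)              # give the background task time
    return before, reply

before, reply = asyncio.run(main())
print(before, memory_log)
```

The same decoupling is what makes a backend outage survivable: the task fails in the background while the reply path is untouched.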

Second, Honcho recall is kept out of the cached system prefix. Anthropic's prompt caching feature (used heavily on models like Claude Sonnet 4.6) wants the system prompt to be stable across turns so the cache hits. Honcho's injected snippet changes turn-to-turn, so Hermes deliberately appends it after the cached system section. The cache still works for the static parts; the dynamic memory layer still works for the parts that change. This is the kind of mechanical tradeoff that does not make it into release notes but decides whether your monthly bill is $50 or $500.
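The ordering constraint is simple to show concretely. In this sketch (message shapes and names are assumptions for illustration), the static system prompt stays byte-identical across turns so a provider-side prompt cache can hit, while the per-turn memory snippet is appended after it:

```python
# Sketch of cache-friendly prompt assembly: stable prefix first,
# volatile memory snippet after. Illustrative, not Hermes' actual code.

STATIC_SYSTEM = "You are Hermes, a helpful assistant."  # cacheable prefix

def build_messages(memory_snippet, user_message):
    return [
        {"role": "system", "content": STATIC_SYSTEM},                      # stable
        {"role": "system", "content": f"User context: {memory_snippet}"},  # per-turn
        {"role": "user", "content": user_message},
    ]

turn1 = build_messages("Name: Alice.", "hi")
turn2 = build_messages("Name: Alice. Likes tea.", "what do I like?")

# The cached portion is identical across turns even as memory changes.
print(turn1[0] == turn2[0], turn1[1] == turn2[1])  # True False
```

If the snippet were prepended instead, every memory update would invalidate the cache for everything after it, which is exactly the billing cliff the post describes.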

Multi-user isolation in gateway mode

The default Hermes gateway runs multiple users through the same agent process. Long-term memory has to be per-user, or else Alice's allergies end up in Bob's cooking suggestions. The v0.3.0 release added proper multi-user isolation for Honcho inside the gateway, which in practice means:

  • Each gateway user ID maps to a distinct Honcho peer, and memory writes are scoped per-peer.
  • Group-chat sessions inherit per-user sessions by default, so a shared channel still writes separate memory streams for each participant.
  • Profile-scoped memory isolation (v0.5.0/v0.6.0) means if you run multiple Hermes profiles on the same machine, each profile's memory is a separate universe. Swapping profiles does not leak one persona into another.
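The isolation rules above reduce to one idea: every memory operation is keyed by both profile and user. A minimal sketch, with all names assumed for illustration rather than taken from Hermes:

```python
# Sketch of memory scoping in gateway mode: writes and reads are keyed by
# (profile, user_id), so users and profiles never share a memory stream.

store = {}

def memory_key(profile, user_id):
    return (profile, user_id)

def remember(profile, user_id, fact):
    store.setdefault(memory_key(profile, user_id), []).append(fact)

def recall(profile, user_id):
    return store.get(memory_key(profile, user_id), [])

remember("default", "alice", "allergic to peanuts")
remember("default", "bob", "loves peanut sauce")

print(recall("default", "bob"))   # ['loves peanut sauce'] -- no allergy leak
print(recall("work", "alice"))    # []: a different profile is a separate universe
```

A group chat fits the same scheme: the shared channel fans each participant's messages out to their own key rather than to one channel-wide key.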

None of this is visible to users. All of it is the reason the bot does not accidentally remember the wrong person.

The pluggable memory interface (v0.7.0)

For the first five releases of Hermes, Honcho was hardwired. In v0.7.0 the memory layer got refactored into a proper provider interface — a small Python ABC that any memory backend can implement. The change is architecturally modest and practically enormous.

The interface lets you swap memory backends without touching Hermes' core:

  • Honcho is the reference provider (and still the default). It is full-featured, runs a real user model, and handles multi-user isolation correctly.
  • Supermemory was added in v0.8.0 as a second first-class provider, with multi-container support, configurable search modes, and identity templating.
  • mem0, OpenViking, RetainDB, Hindsight, and ByteRover all have community memory plugins in the Hermes plugin system, with varying depth of integration.
  • You can also write your own. The ABC is small: implement write(), recall(), a few lifecycle hooks, and register as a plugin.
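A provider interface of this shape can be sketched as a small ABC. `write()` and `recall()` are the methods the post names; the hook names, signatures, and the toy in-process implementation below are all assumptions, not Hermes' actual interface:

```python
# Sketch of a pluggable memory provider interface. write()/recall() come
# from the post; everything else here is hypothetical.
from abc import ABC, abstractmethod

class MemoryProvider(ABC):
    @abstractmethod
    def write(self, user_id: str, content: str) -> None:
        """Record an observation about a user."""

    @abstractmethod
    def recall(self, user_id: str, query: str, limit: int = 5) -> list:
        """Return the most relevant stored facts for this turn."""

    def on_session_start(self, user_id: str) -> None:  # hypothetical hook
        pass

    def on_session_end(self, user_id: str) -> None:    # hypothetical hook
        pass

class DictMemory(MemoryProvider):
    """Toy in-process provider, just enough to exercise the interface."""
    def __init__(self):
        self.facts = {}

    def write(self, user_id, content):
        self.facts.setdefault(user_id, []).append(content)

    def recall(self, user_id, query, limit=5):
        hits = [f for f in self.facts.get(user_id, []) if query.lower() in f.lower()]
        return hits[:limit]

mem = DictMemory()
mem.write("alice", "Name is Alice")
mem.write("alice", "Prefers concise answers")
print(mem.recall("alice", "name"))  # ['Name is Alice']
```

Because the agent core only ever talks to the ABC, swapping Honcho for anything else is a registration change, not a refactor.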

The built-in memory provider — the no-dependency default if you have not set up anything else — is an SQLite-backed fact store that handles the basics: write facts, recall them by relevance, scope by user. It is not as smart as Honcho, but it needs no external service, and for a personal assistant on a $5 VPS it is often all you need.
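A fact store with that feature set fits comfortably in the standard library. This is a sketch of the idea, not the built-in provider's actual schema or ranking: substring matching stands in for real relevance scoring, and the table layout is an assumption.

```python
# Sketch of an SQLite-backed fact store: write facts, recall by a naive
# relevance heuristic, scope by user. Illustrative schema and ranking.
import sqlite3

class SqliteFactStore:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts ("
            " user_id TEXT, fact TEXT,"
            " created REAL DEFAULT (julianday('now')))"
        )

    def write(self, user_id, fact):
        self.db.execute("INSERT INTO facts (user_id, fact) VALUES (?, ?)",
                        (user_id, fact))
        self.db.commit()

    def recall(self, user_id, query, limit=5):
        # Naive relevance: substring match, newest first. A real store
        # might use FTS or embeddings here.
        rows = self.db.execute(
            "SELECT fact FROM facts WHERE user_id = ? AND fact LIKE ?"
            " ORDER BY created DESC LIMIT ?",
            (user_id, f"%{query}%", limit))
        return [r[0] for r in rows]

store = SqliteFactStore()
store.write("alice", "name is Alice")
store.write("bob", "name is Bob")
print(store.recall("alice", "name"))  # ['name is Alice']
```

No external service, one file on disk (or `:memory:` for tests), and the per-user scoping falls out of a single `WHERE` clause.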

The quiet thing this unlocks

Pluggable memory is the kind of architectural change that looks like janitor work in release notes. "Refactored memory into a provider interface" is not a headline. What it actually does is decouple the question of "what should an AI assistant remember about you" from the question of "how does Hermes work."

You can now replace Honcho with a memory backend that is tuned for your use case — a vector store for people who want semantic search over a personal knowledge base, a graph database for people who want explicit entity relationships, a purely local SQLite store for people who do not want any memory data leaving their machine, a company-internal memory service for teams. The agent does not change. Only the thing behind the memory interface does.

That is the right abstraction for a project that wants to be around in a few years. Memory is personal, and the right memory backend for you is not necessarily the right one for anyone else. Hermes' job is to be a good citizen to whichever memory layer you plug in, and to get out of its way.
