Skip to content
skillterm ★ GitHub

← Back to writing

Agent runtime vs. raw LLM API: why skillterm calls Claude Code instead of the Anthropic API

skillterm / neullabs · ·
architectureclaude-codeagent-runtime

The skillterm README opens with a diagram that is almost a manifesto. On the left, a traditional integration: App → Prompt → LLM API → Parse Response → Result. On the right, the skillterm approach: App → Claude Code/Codex (Headless) → Result, with WebSearch, WebFetch, Bash, Read/Write, and Skills branching off the agent runtime.

That diagram is the whole architecture argument. If you only have ten minutes to understand skillterm’s design, that is the diagram to internalise. The rest of this post unpacks why it is drawn that way, what it costs, and what it buys.

What “agent runtime” means in this context

A raw LLM API call is a text-completion endpoint with a chat envelope on top. You POST a JSON body with messages and parameters. You get back a JSON body with messages and a stop reason. That is the whole contract.

A headless agent runtime is a binary you can spawn — claude or codex — that exposes the same model but adds the tool-use machinery. It already knows how to call WebSearch. It already knows how to call WebFetch. It already knows how to spawn bash and read the output back. It already knows how to write a file at a path you allow. And it knows how to do these things in a multi-step loop: search, fetch what you found, parse the result, decide if you need more, search again, write the answer.

If you call the raw API, you have to implement every one of those steps yourself. If you call the runtime, you do not.

What skillterm would have to build if it called the API directly

Run through the actual generation flow for skillterm generate stripe:

  1. Classify stripe as a tool (CLI vs. SaaS vs. cloud provider vs. DevOps platform). Without a meta-skill loaded, the model has to decide this from scratch every time.
  2. Search the web for the canonical Stripe CLI documentation. This is a real HTTP API call to a search engine, with rate limiting, deduplication, and result ranking.
  3. Fetch the top results. That is HTTP, content-type sniffing, character encoding, redirect following, robots.txt courtesy.
  4. Parse the documentation. HTML to plain text, code blocks to fenced markdown, navigation chrome stripped.
  5. Run man stripe or stripe --help locally. That is bash execution with timeouts, captured stdout, captured stderr, exit code interpretation.
  6. Stitch the pieces together into a SKILL.md following the skill-creator format. That is the actual LLM call.
  7. Write the result to ~/.skillterm/skills/stripe/SKILL.md. That is file I/O with permissions checks.
  8. Retry on transient failures. That is exponential backoff with jitter.

In a raw-API integration, every one of these is your code. In skillterm’s actual implementation, steps 2 through 5 and 7 and 8 are already implemented inside the Claude Code runtime. You write the orchestration; the runtime does the work.

The README puts it bluntly: “By using Claude Code/Codex as the runtime, we get all their tools (web search, file ops, bash) for free.”

The trade-off skillterm accepts

The trade-off is not free. Choosing a headless agent runtime over a raw API means a hard dependency on a separate binary being present and on a separate trust boundary. The skillterm install instructions say so explicitly: install one of @anthropic-ai/claude-code or @openai/codex globally before running skillterm. If neither is on the PATH, skillterm cannot generate.

That is the cost: an extra install step, and a tighter coupling to the runtime’s CLI surface than to a stable JSON API. The runtime’s CLI can change, its flags can change, its output format can change. Anyone who has shipped a Rust binary that shells out to a Node CLI knows this is a real maintenance burden.

skillterm pays that cost because the alternative — re-implementing web search, web fetch, bash orchestration, file I/O, retry, and multi-step reasoning — is several orders of magnitude more code than wrapping a CLI. The README is upfront that this is a deliberate choice. The cost is bounded; the savings are open-ended.

Where the bootstrap skills come in

The runtime gives you tools. It does not give you opinions about how to use them. The “skill” of generating a SKILL.md — the structure, the YAML frontmatter, the heading conventions, the quality bar — has to be expressed somewhere. In skillterm, that expression lives in two bundled meta-skills:

  • skill-creator — the SKILL.md format specification, including the YAML frontmatter contract (name ≤ 64 chars, description ≤ 200 chars), the heading hierarchy, and the quality checklist.
  • saas-detector — a guide for how to classify a tool. Is this a CLI? A SaaS service? A cloud provider? Each category has different documentation patterns and authentication conventions.

Both are loaded as context every time the agent runs. They are not code skillterm executes; they are instructions skillterm injects into the agent’s reasoning. That is a very specific kind of leverage.

This is why the README emphasises that improving skillterm means updating those skills, not rewriting code. If you want better Stripe-shaped skills, you teach saas-detector about Stripe’s documentation layout. If you want a tighter heading hierarchy in generated SKILL.md files, you sharpen skill-creator. The behaviour change ships at the next generation — no rebuild, no release, just better instructions.

Why this matches how the field is moving

Two years ago, an LLM integration meant choosing between “call the API and build the scaffolding” and “use an opinionated framework that wraps the API.” Both options had problems. The API path meant a lot of plumbing. The framework path meant inheriting someone else’s abstractions over the LLM, which dated rapidly as model capabilities changed.

Agent runtimes — Claude Code, Codex, and the equivalents — are a third option. They are the framework, but the framework lives outside your process. They evolve independently. New tool support shows up the day the runtime supports it. You do not have to refactor when the runtime gains web search or file editing or sub-agents; you just keep calling the runtime.

skillterm is built on the bet that this third option keeps getting better, faster than a hand-rolled API integration could match. That bet is what the architecture diagram on the first page of the README is really about.

What this means for the user experience

Two practical consequences:

First, generation is slower than a raw API call. The docs note that a simple skill generation takes 30–60 seconds, and a SaaS skill with web fetching takes 60–120 seconds. That is the multi-step reasoning loop running in real time. skillterm hides that latency by making generation an explicit operation (skillterm generate kubectl) and decoupling it from completion. You generate once; you press Tab thousands of times.

Second, the SKILL.md file the agent writes is portable. Because the runtime follows the public Claude Agent Skills format, the same SKILL.md that drives skillterm complete can be loaded into any other Claude agent as context. The artefact has independent value. Even if you stopped using skillterm tomorrow, the skills you generated would keep working anywhere skills are read.

That portability is the cleanest signal that skillterm’s architecture is correctly factored: the durable thing is the SKILL.md, not the binary that wrote it.