The default way to teach a frontier-model agent something is to write a markdown file (a "skill"), drop it in `.claude/skills/` or an equivalent directory, and have it injected into the system prompt whenever the agent detects relevance. Anthropic's SKILL.md format is the canonical example. A typical skill is 100–500 lines of prose, YAML frontmatter, and worked examples.
This works well at small scale. With 5 skills, the cost is negligible. With 50 skills it's noticeable. With 500 it breaks the agent.
Three measurable failures
Failure 1 — token cost grows with skill count, not task complexity
Every turn loads every potentially-relevant skill. With 60 skills averaging 200 lines each, you are spending ~30k tokens of context on knowledge the agent might not need. Most turns use one or two skills' worth of content, but pay for all 60.
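The arithmetic above can be sketched directly. This is a back-of-envelope calculation using the figures from the text; the per-skill token average and the "two skills actually used per turn" figure are assumptions chosen to match the ~30k total quoted above.

```python
# Illustrative cost model for bulk loading vs. actual need.
# TOKENS_PER_SKILL is an assumed average consistent with the ~30k total;
# USED_PER_TURN reflects "most turns use one or two skills' worth".
SKILLS = 60
TOKENS_PER_SKILL = 500          # ~200 lines of terse markdown per skill
USED_PER_TURN = 2               # skills a typical turn actually draws on

bulk_loaded = SKILLS * TOKENS_PER_SKILL        # paid every turn
actually_needed = USED_PER_TURN * TOKENS_PER_SKILL

print(bulk_loaded)                      # 30000
print(actually_needed)                  # 1000
print(bulk_loaded // actually_needed)   # 30x overhead
```

The ratio is the point: the waste factor scales with the skill count, not with the task, because every skill is billed whether or not the turn touches it.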
```
# A real measurement (frontend-design corpus, v3 benchmark)
Bulk-loaded skills:     60 SKILL.md × ~200 lines        ≈ ~30 KB context
Skill Wiki index:       980 atoms × ~30 tok summary     ≈  ~3 KB context
Atom bodies on demand:                                    ~600–2,000 tok per turn
Saving:                 ~10× index reduction
                        ~6× typical-turn token cost
```

Failure 2 — context pollution beats context budget
The token-budget framing misses the more serious failure: the wrong tokens in the prompt actively mislead the model. A skill on "OWASP input validation" loaded into a prompt about email template copy doesn't just waste tokens — it nudges the model toward security-flavoured responses that don't fit. We measured a quality-score delta of −13 between the bulk-loaded and on-demand conditions on the same 20-task benchmark.
A 200-line markdown blob has no boundary. If you want to use just three facts and one rule from a skill, you have to load the whole thing. Coarse-grained knowledge plus eager loading is the failure mode.
Failure 3 — skills can't reference each other meaningfully
The relationship between two skills, in the SKILL.md format, is at best a hyperlink to "see also." There is no machine-checkable way to say "this rule requires that term," "this pattern contradicts that anti-pattern," or "this example validates that fact." So the validator can't catch contradictions, the retriever can't follow dependencies, and the chunker has no kind-aware notion of what to keep vs. drop.
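A minimal sketch of what machine-checkable edges buy you. The verb names come from the text; the `Atom` model, field names, and the specific checks are hypothetical illustrations of what a validator could enforce once edges are typed.

```python
from dataclasses import dataclass, field

# The three verbs named in the text; everything else here is illustrative.
VERBS = {"requires", "validates-with", "contradicts"}

@dataclass
class Atom:
    id: str
    kind: str                                   # "fact", "rule", "pattern", ...
    edges: list = field(default_factory=list)   # (verb, target_id) pairs

def validate(atoms: dict) -> list:
    """Return problems that only a typed edge graph makes catchable."""
    problems = []
    for atom in atoms.values():
        for verb, target in atom.edges:
            if verb not in VERBS:
                problems.append(f"{atom.id}: unknown verb {verb!r}")
            elif target not in atoms:
                problems.append(f"{atom.id}: dangling {verb} -> {target}")
    return problems

atoms = {
    "rule-1": Atom("rule-1", "rule", [("requires", "term-9")]),
}
print(validate(atoms))   # ['rule-1: dangling requires -> term-9']
```

With a hyperlink, a dangling "see also" is invisible; with a typed `requires` edge, it is a validation error, and a retriever can walk the same edges to pull in dependencies.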
The skill ends up like untyped JavaScript: a typeless blob that happens to do the right thing in the cases the author tested.
What the alternative needs
The alternative needs three properties, taken together:
- Atomic granularity. The unit of knowledge must be small enough that it can be loaded individually. Not "a skill" — one fact, one rule, one pattern, one method.
- Typed edges. The relationships between atoms must be expressed as typed verbs (`requires`, `validates-with`, `contradicts`) so a validator can reason over them and a retriever can walk them.
- Lazy projection. The agent must see which atoms exist (cheap) without loading their content (expensive). Loading happens by id, on demand, after retrieval picks targets.