June 29, 2026
What Instruction Bleed means if you write SKILL.md / AGENTS.md
In a previous post we walked through instruction bleed — the finding, from Lin & Liu’s June 2026 paper, that editing one prompt module can shift the behavior of an unrelated one because self-attention draws no boundary between them. That post was the neutral tour. This one is for the people who maintain the files: if your repo has a SKILL.md, an AGENTS.md, a stack of mode definitions, what should you actually do with this?
The honest headline first: you probably don’t need to do anything urgent. The measured effect was sub-threshold — it moved scores, not decisions. But the paper does hand you a concrete way to check, and the checking is cheap relative to finding out the hard way. Here’s the practitioner’s read.
Your stack is the paper’s example
You don’t have to squint to see yourself in this work. The paper’s survey of “prompt-composed agentic systems” names the exact convention many of us use: human-authored Markdown — SOUL.md, SKILL.md, AGENTS.md — that an LLM interprets rather than executes. The defining trait is that your behavioral logic (scoring rubrics, workflow steps, persona definitions, tool-selection criteria) lives in text the model reads through attention, not in code a runtime executes.
That’s the population where instruction bleed is possible. If your agent’s decisions are encoded in prose files sharing one context window, the mechanism applies to you. Whether it bites you is the empirical question — which is the whole point of what follows.
Where bleed is most likely to show up
The paper’s one positive result — the content channel — points at the practical risk. It wasn’t volume (adding bulk did nothing measurable) and it wasn’t formatting (reordering and restyling did nothing measurable). It was semantic content: an irrelevant but meaningful persona added to a shared file nudged an unrelated score. So the places to be wary, in rough order:
- Shared rules / persona files that many skills read. The more modules co-resident in one context, the more surface for one to color another. A “voice and tone” block or a global archetype is exactly the kind of meaningful-but-off-topic text that moved the needle.
- Files that accumulate. A SKILL.md that started tight and grew a dozen sections is a dozen chances for cross-talk. Bleed is about content sharing a window, and growth quietly adds content.
- Persona and archetype language specifically. The effect in the study came from injecting a “Professional Chef” archetype. Strong characterization is high-semantic-content text, the category most likely to leak.
None of this is “delete your files.” It’s “know which edits are the high-risk ones.” Adding a 200-line unrelated module was safe in the study; adding a single flavorful sentence was not.
What the authors suggest you do
The paper’s prescription is, in a phrase, regression testing for prompt composition — and its sharpest observation is that essentially nobody does it. Of the systems it surveyed, none ran any regression test along the dimensions it proposes. The suggested checks:
1. Compositional-consistency testing. Measure a module’s behavior in isolation, then again in the full assembled context, and confirm they roughly match. A large gap is bleed. This is the core idea: a skill should do about the same thing alone as it does surrounded by its neighbors.
2. Module-interaction (add) testing. When you add a module, re-check the focal behaviors that have nothing to do with it. The whole lesson is that “unrelated” additions aren’t guaranteed inert.
3. Format-robustness checking. Re-run after meaning-preserving reformatting. (This channel was null in the study’s one system, but other research has found models elsewhere are highly format-sensitive, so it’s worth confirming for your model rather than assuming.)
4. Model-migration re-measurement. Re-run all of the above whenever you change models. The authors predict the magnitude of bleed varies by model family, so a model swap can change your exposure even if you touched no prompt.
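The first check is simple enough to sketch. What follows is a minimal illustration, not the paper's harness: `ScoreFn`, `assemble`, `consistency_gap`, `is_consistent`, and the 0.5 tolerance are all hypothetical names and values. The `fake_score` function is a stand-in that is rigged to react to persona text so the example runs offline; it is not evidence of anything. In a real setup the scorer would wrap your own model call and rubric.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical scorer signature: (assembled context, test input) -> numeric score.
ScoreFn = Callable[[str, str], float]


def assemble(modules: Sequence[str]) -> str:
    """Concatenate module texts into one context (an assumed loader behavior)."""
    return "\n\n".join(modules)


def consistency_gap(score_fn: ScoreFn, focal: str, neighbors: Sequence[str],
                    inputs: Sequence[str]) -> float:
    """Mean score in the full context minus mean score with the focal module alone."""
    alone = [score_fn(assemble([focal]), x) for x in inputs]
    composed = [score_fn(assemble([focal, *neighbors]), x) for x in inputs]
    return mean(composed) - mean(alone)


def is_consistent(gap: float, tolerance: float = 0.5) -> bool:
    """True when the gap stays within a tolerance you choose (0.5 is arbitrary)."""
    return abs(gap) <= tolerance


def fake_score(context: str, test_input: str) -> float:
    """Stand-in for a model call: rigged so persona text shifts the score by 1."""
    return 5.0 + (1.0 if "chef" in context.lower() else 0.0)


RUBRIC = "Score the answer from 1 to 10 for factual accuracy."
INPUTS = ["answer A", "answer B", "answer C"]

# Bulk neighbor: lots of text, no persona. Persona neighbor: one short sentence.
bulk_gap = consistency_gap(fake_score, RUBRIC, ["Changelog entry. " * 200], INPUTS)
persona_gap = consistency_gap(fake_score, RUBRIC, ["You are a Professional Chef."], INPUTS)

print(bulk_gap, is_consistent(bulk_gap))
print(persona_gap, is_consistent(persona_gap))
```

The same function covers the add test: pass the newly added module as the only neighbor and check focal behaviors that have nothing to do with it. Re-running the script after a model swap is the migration check.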
The catch the paper is upfront about: checking every pair of modules is O(n²), and at the scale of large skill libraries that’s prohibitive. Their answer is to sample — random or semantic-distance-weighted — rather than test exhaustively. And the deepest question they raise is left genuinely open: whether providers could ever offer “module-isolation primitives — separately cached prompt segments with restricted cross-segment attention — or whether global attention makes text-level isolation fundamentally impossible.” No one knows yet.
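To make the sampling idea concrete, here is a small sketch under stated assumptions. The function names, the seed, and the `weight` callback are hypothetical. The summary above names semantic-distance weighting but does not say which direction the weighting runs, so the weight function is left for the caller to supply; `index_distance` below is only a placeholder for a real semantic distance.

```python
import itertools
import random
from typing import Callable, Optional, Sequence

Pair = tuple[str, str]


def all_pairs(names: Sequence[str]) -> list[Pair]:
    """Every unordered module pair: n * (n - 1) / 2 of them."""
    return list(itertools.combinations(names, 2))


def sample_pairs(names: Sequence[str], k: int,
                 weight: Optional[Callable[[str, str], float]] = None,
                 seed: int = 0) -> list[Pair]:
    """Pick k distinct pairs, uniformly or in proportion to a positive weight."""
    rng = random.Random(seed)
    pool = all_pairs(names)
    if k >= len(pool):
        return pool  # small library: just test everything
    if weight is None:
        return rng.sample(pool, k)
    # Weighted sampling without replacement: draw one pair, remove it, repeat.
    weights = [weight(a, b) for a, b in pool]  # assumed to be positive
    chosen = []
    for _ in range(k):
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    return chosen


def index_distance(a: str, b: str) -> float:
    """Placeholder weight: distance between the two-digit indices in the names."""
    return 1.0 + abs(int(a[6:8]) - int(b[6:8]))


modules = [f"skill_{i:02d}.md" for i in range(40)]
uniform = sample_pairs(modules, 25)
weighted = sample_pairs(modules, 25, weight=index_distance)

print(len(all_pairs(modules)), len(uniform), len(weighted))
```

With 40 modules the exhaustive set is 780 unordered pairs, so a budget of 25 checks per run covers about 3% of them. Changing the seed between runs spreads coverage over time.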
The pragmatic version for a normal team
You are not going to build a bootstrap-CI test harness this quarter. So the proportionate takeaways:
- Keep high-semantic modules small and separate. The less meaningful off-topic text shares a window with a decision, the less there is to bleed. Compactness is cheap insurance.
- Treat persona/archetype edits as behavior-affecting changes, not cosmetic ones, because the evidence says they can be.
- When something does drift after an “unrelated” edit, you now have a name and a first test: pull the module out, run it alone, compare. That single comparison is the paper’s whole diagnostic, miniaturized.
- Re-baseline after a model upgrade. This is the one most teams skip and the one the authors most expect to matter.
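If you run that alone-versus-in-context comparison over several inputs, a standardized effect size and a rough interval make the result easier to read. The sketch below assumes the d reported in the paper is the usual Cohen's d; that is an assumption, not something the summary here confirms. The scores are made-up numbers for illustration, not data from the study, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference (b minus a) using the pooled sample SD.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(b) - mean(a)) / sqrt(pooled_var)


def bootstrap_diff_ci(a: Sequence[float], b: Sequence[float],
                      n_resamples: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    # Resample each group with replacement and record the difference in means.
    diffs = sorted(
        mean(rng.choices(b, k=len(b))) - mean(rng.choices(a, k=len(a)))
        for _ in range(n_resamples)
    )
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Made-up scores: the module run alone vs. inside the full context.
alone = [5.0, 6.0, 7.0, 5.0, 6.0, 7.0]
in_context = [7.0, 8.0, 9.0, 7.0, 8.0, 9.0]

d = cohens_d(alone, in_context)
lo, hi = bootstrap_diff_ci(alone, in_context)
print(f"d = {d:.2f}, mean shift 95% interval = ({lo:.2f}, {hi:.2f})")
```

An interval for the raw mean shift that excludes zero is a reason to look closer, not a verdict. Whether the shift matters still depends on whether it crosses a decision threshold in your own system.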
That’s the practitioner’s whole job here: not panic, but a sharper sense of which edits carry behavioral risk, and one concrete check to run when you suspect bleed. The finding is small today. The reason it’s worth knowing is that it’s invisible to the QA you already run — so the only way to see it is to look on purpose.
Source: Ching-Yu Lin & Yifan Liu, “Instruction Bleed: Cross-Module Interference in Prompt-Composed Agentic Systems,” arXiv:2606.26356 (submitted 24 June 2026). The content-channel effect (d = 0.63), the null volume/form channels, and the system survey naming SKILL.md / AGENTS.md are drawn from that paper; the regression-testing prescriptions paraphrase its proposed framework.