SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

Li, Zhiwei; Hu, Yong

A two-minute walkthrough — what changes when a harness keeps the why of every revision, and why the same skill bundle then transfers across model backbones unchanged.

The idea

A diff shows what changed. SkillHone records why.

Agent skills extend language-model agents with task-specific procedures, scripts, and references — but the tasks and environments they target continually change. Existing methods polish skills in bounded runs and retain only the final artifact, discarding the decision history that later agents need to interpret prior revisions, evaluations, and rejected alternatives.

SkillHone represents each development step as a structured decision record. When the environment shifts, later subagents can inspect the persistent history to decide whether a failure is new, whether a similar fix was already attempted, and why an alternative was rejected — preventing redundant edits and accidental rollbacks of working repairs.
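A minimal Python sketch of what such a record and its lookups might look like. The field names, the `DecisionHistory` class, and the exact-match problem labels are illustrative assumptions, not SkillHone's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DecisionRecord:
    """One development step (hypothetical fields, for illustration only)."""
    step: int
    problem: str     # diagnosis the revision targeted
    revision: str    # short description of the change
    evidence: str    # evaluation evidence that was consulted
    accepted: bool   # outcome of the step
    rationale: str   # why the revision was accepted or rejected


class DecisionHistory:
    """Append-only log that later subagents can query."""

    def __init__(self) -> None:
        self._records: List[DecisionRecord] = []

    def append(self, record: DecisionRecord) -> None:
        self._records.append(record)

    def prior_attempts(self, problem: str) -> List[DecisionRecord]:
        """Earlier records that targeted the same problem label."""
        return [r for r in self._records if r.problem == problem]

    def is_new_failure(self, problem: str) -> bool:
        """True when no earlier step targeted this problem."""
        return not self.prior_attempts(problem)

    def rejected_alternatives(self, problem: str) -> List[DecisionRecord]:
        """Attempts at this problem that were tried and rejected."""
        return [r for r in self.prior_attempts(problem) if not r.accepted]


# Made-up entries, only to exercise the lookups.
history = DecisionHistory()
history.append(DecisionRecord(1, "search-timeout", "retry with backoff",
                              "redacted report: 4 timeouts", True,
                              "timeouts no longer observed"))
history.append(DecisionRecord(2, "search-timeout", "drop page parsing",
                              "redacted report: pass rate fell", False,
                              "removed a working extraction step"))
```

With a log like this, a later subagent can check `history.is_new_failure("search-timeout")` before proposing a fix and read `rejected_alternatives(...)` to avoid repeating a change that was already turned down.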

01 — Artifact gap

Diff vs. decision

A diff says how files changed. A decision record links that change to the problem it targeted and the evidence used to evaluate it. Long-lived skills need the latter.

02 — Agent-facing harness

Portable across runtimes

SkillHone records diagnoses, revisions, evidence, and outcomes under separated roles — portable across Claude Code, Codex, and Hermes runtimes.

03 — Empirical gains

Bundles that transfer

SkillHone improves skill development on public benchmarks and in internal deployments, and its skill bundles transfer across execution backbones with no re-optimization.

The harness

One dispatcher, two role-separated teams, one persistent history.

SkillHone splits each step into optimization and evaluation dispatches. The dispatcher is a message router that creates subagents on demand and records accepted outcomes — making the optimizer/evaluator split structural rather than prompt-imposed.

SkillHone architecture: optimization team and evaluation team connected through a dispatcher that writes a persistent decision history.

Figure 1. At each step, fresh role-bounded subagents are dispatched. Routed evidence, repository operations, and recorded outcomes form a persistent decision history that later agents reuse.
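The flow in Figure 1 can be sketched as a single step function. This is a simplified illustration under stated assumptions: `ProbeResult`, `redact`, `run_step`, and the two stub callables are hypothetical names, and the real harness dispatches several subagents per team rather than one function per side.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ProbeResult:
    item_id: str
    passed: bool
    failure_mode: str    # e.g. "timeout"; empty when the item passed
    oracle_target: str   # evaluation-side only, never routed onward


Report = List[Dict[str, object]]
History = List[Dict[str, object]]


def redact(results: List[ProbeResult]) -> Report:
    """Drop oracle targets so the optimization side sees only symptoms."""
    return [
        {"item_id": r.item_id, "passed": r.passed, "failure_mode": r.failure_mode}
        for r in results
    ]


def run_step(
    skill: str,
    evaluate: Callable[[str], List[ProbeResult]],
    optimize: Callable[[str, Report, History], Tuple[str, bool, str]],
    history: History,
) -> str:
    """One step: evaluation dispatch, redaction, optimization dispatch, record."""
    report = redact(evaluate(skill))
    candidate, accepted, rationale = optimize(skill, report, history)
    # Assumption: rejected candidates are logged too, so later steps can see them.
    history.append({"step": len(history) + 1, "report": report,
                    "candidate": candidate, "accepted": accepted,
                    "rationale": rationale})
    return candidate if accepted else skill


# Stub teams, standing in for the dispatched subagents.
def fake_evaluate(skill: str) -> List[ProbeResult]:
    return [ProbeResult("q1", False, "timeout", oracle_target="Paris")]


def fake_optimize(skill: str, report: Report, history: History) -> Tuple[str, bool, str]:
    return skill + " + retry on timeout", True, "addresses the timeout on q1"


history: History = []
new_skill = run_step("seed skill", fake_evaluate, fake_optimize, history)
```

The optimization callable never receives a `ProbeResult`, only the redacted report, which is the structural separation the dispatcher is meant to provide.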

Dispatch patterns

Roles emerge from permissions, not prompts.

Each subagent is created on demand, assigned to a permission-bounded team, and granted only the actions it needs for that dispatch. The optimizer/evaluator split is enforced by what each team can read or write.

Optimization

Proposer

Reads redacted reports and decision history; writes diagnoses. No access to probe targets or validators.

Optimization

Explorer

Searches external resources for reusable patterns when a revision needs them. No access to probe targets or validators.

Optimization

Developer

Edits the current skill and proposes typed revisions. No access to unredacted evaluation assets.

Optimization

Reviewer

Reviews pending changes using redacted evidence and prior decision records. No access to probe targets.

Optimization

Decider

Accepts or rejects candidate revisions according to recorded evidence. No access to unredacted evaluation assets.

Evaluation

Executor

Runs the current skill on probe items. May inspect oracle targets, validators, and traces. No skill-repo write access.

Evaluation

Diagnoser

Analyzes outcomes and traces; updates evaluation diagnostics. No skill-repo write access.

Evaluation

Reporter

Produces redacted problem reports for the optimization side. No revision-decision authority.

Evaluation

Auditor

Checks probe metadata, traces, and redacted reports for evaluation-side consistency. No skill-repo write access.

Runtime

Dispatcher

Creates subagents and routes artifacts between repositories. No direct skill edits; no access to unredacted probe targets.
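One way to picture "roles from permissions" is a grant table checked at dispatch time. The sketch below is an assumption-laden illustration: the `Perm` flags, the `GRANTS` table, and both helper functions are hypothetical, and only a subset of the roles above is listed.

```python
from enum import Flag, auto
from typing import Dict


class Perm(Flag):
    READ_REDACTED_REPORTS = auto()
    READ_DECISION_HISTORY = auto()
    WRITE_SKILL_REPO = auto()
    READ_UNREDACTED_EVAL = auto()   # probe targets, validators, traces
    DECIDE_REVISION = auto()


# Assumed baseline for optimization-side roles.
OPT_BASE = Perm.READ_REDACTED_REPORTS | Perm.READ_DECISION_HISTORY

# Hypothetical grants, loosely following the role descriptions above.
GRANTS: Dict[str, Perm] = {
    "Proposer": OPT_BASE,
    "Developer": OPT_BASE | Perm.WRITE_SKILL_REPO,
    "Decider": OPT_BASE | Perm.DECIDE_REVISION,
    "Executor": Perm.READ_UNREDACTED_EVAL,
    "Reporter": Perm.READ_UNREDACTED_EVAL,
}


def allowed(role: str, perm: Perm) -> bool:
    """Check a single permission for a role before dispatching an action."""
    return perm in GRANTS[role]


def split_holds(grants: Dict[str, Perm]) -> bool:
    """True when no role can both edit the skill and read unredacted assets."""
    forbidden = Perm.WRITE_SKILL_REPO | Perm.READ_UNREDACTED_EVAL
    return all((g & forbidden) != forbidden for g in grants.values())
```

A single agent holding both `WRITE_SKILL_REPO` and `READ_UNREDACTED_EVAL` fails `split_holds`, which is the situation the permission boundary is meant to rule out.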

Main results

Best average on both benchmarks — without curated search.

Under the raw open-web setting, agents receive no pre-integrated search tools and must organize search, parsing, extraction, and recovery through portable skill bundles. SkillHone leads both averages, even though the curated-search baseline (a deep-research agent) keeps its commercial search stack.

Setting	System	GAIA L1	GAIA L2	GAIA L3	GAIA Avg.	WebWalkerQA-EN Easy	WebWalkerQA-EN Med.	WebWalkerQA-EN Hard	WebWalkerQA-EN Avg.
Curated search	deep-research agent	61.9	47.0	26.3	48.8	58.5	62.3	67.1	63.2
Raw open-web	Existing-Skills	64.3	33.3	21.1	41.7	51.2	48.5	52.6	50.2
Raw open-web	Skill-Creator	64.3	37.9	21.1	44.1	36.6	36.9	40.8	38.1
Raw open-web	Hermes-SE	73.8	40.9	31.6	50.4	53.7	51.5	55.3	53.0
Raw open-web	SkillHone (Ours)	76.2	66.7	31.6	64.6	53.7	69.2	68.4	66.4

Table 1. Main results. Bold marks the best score per column. SkillHone runs without curated search yet achieves the strongest averages.

GAIA accuracy under the development backbone (Qwen3.6-35B-A3B) and the transfer backbone (Claude Sonnet 4.6).

Cross-backbone transfer

With zero re-optimization on the new backbone, the SkillHone skill bundle transfers from Qwen3.6-35B-A3B (development) directly to Claude Sonnet 4.6 (transfer).

SkillHone reaches 72.4% on GAIA under Sonnet 4.6, beating Hermes-SE by 10.2 pp, Existing-Skills by 15.7 pp, and Skill-Creator by 24.4 pp. Because the bundle is not re-optimized for the new backbone, the gain is attributable to the skill procedure rather than to fitting one model.
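The margins above imply the baselines' GAIA scores under Sonnet 4.6. The short calculation below only subtracts the stated margins from the stated SkillHone score; the resulting values are derived here, not quoted from the paper's figure.

```python
# GAIA accuracy (%) stated for SkillHone under Claude Sonnet 4.6.
SKILLHONE_SONNET = 72.4

# Stated margins, in percentage points, by which SkillHone leads each baseline.
MARGINS_PP = {"Hermes-SE": 10.2, "Existing-Skills": 15.7, "Skill-Creator": 24.4}

# Implied baseline accuracy = SkillHone score minus the margin.
implied = {name: round(SKILLHONE_SONNET - m, 1) for name, m in MARGINS_PP.items()}

print(implied)
# {'Hermes-SE': 62.2, 'Existing-Skills': 56.7, 'Skill-Creator': 48.0}
```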

The controls are doing real work.

Two ablations isolate the harness's main components. The "w/o decision history" variant keeps role-separated subagents but starts each step from the latest skill artifact alone. The "w/o role separation" variant keeps the decision history but lets a single subagent access the skill repo and the unredacted evaluation assets together.

Variant	GAIA	WebWalkerQA-EN
SkillHone (full)	64.6	66.4
w/o decision history	51.2 (−13.4)	55.5 (−10.9)
w/o role separation	58.2 (−6.4)	61.1 (−5.3)

Table 2. Ablation. Decision history is the larger lever.
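The parenthesized deltas in Table 2 can be recomputed from the raw scores. The snippet below is a small consistency check using only the numbers in the table; the variable and function names are ours.

```python
from typing import Dict

# Scores copied from Table 2.
FULL = {"GAIA": 64.6, "WebWalkerQA-EN": 66.4}
ABLATIONS = {
    "w/o decision history": {"GAIA": 51.2, "WebWalkerQA-EN": 55.5},
    "w/o role separation": {"GAIA": 58.2, "WebWalkerQA-EN": 61.1},
}


def deltas(full: Dict[str, float], variant: Dict[str, float]) -> Dict[str, float]:
    """Variant minus full score per benchmark, in percentage points."""
    return {bench: round(variant[bench] - full[bench], 1) for bench in full}


drops = {name: deltas(FULL, scores) for name, scores in ABLATIONS.items()}

# The ablation with the larger drop on every benchmark.
larger_lever = min(drops, key=lambda name: max(drops[name].values()))
```

Removing decision history costs 13.4 and 10.9 points, roughly twice the 6.4 and 5.3 points lost without role separation, which is the basis for the caption's claim.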

Trajectory

Where persistent history earns its keep.

Probe-split trajectories for SkillHone and Hermes-SE across five optimization iterations from a shared seed skill. SkillHone improves from 30% to 70% while recovering from two regressed revisions through targeted follow-up edits. Hermes-SE, by contrast, accepts or rejects whole prompt candidates under a scalar validation signal.

Optimization trajectories of SkillHone and Hermes-SE.

Figure 2. Later revisions can target the offending part of a change while retaining useful edits — only possible when prior decisions are still inspectable.

Iter 1 looks like “no progress” on pass rate — but timeouts dropped 4→0 and average solve time fell 3×. Only a harness that reads its own history can see that and keep going. — Iteration 1 observation, travel-qa

Deployment study

+18.8 pp average across seven internal scenarios.

Beyond public benchmarks, SkillHone is deployed on internal tool-mediated analysis scenarios with exact-match evaluation. It improves six of seven seeded skills; gains concentrate where the initial procedure leaves reusable analysis steps underspecified.

Scenario	Gain (pp)
Counting	+30.0
Aggregation	+26.3
Structure parsing	+25.0
Density estimation	+23.1
Span retrieval	+21.5
Filtered ranking	+5.9
List filtering	+0.0
Average	+18.8

Table 3. Optimized minus seeded accuracy, in percentage points, on recurring internal scenarios.
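The headline average and the "six of seven" count follow directly from the per-scenario gains. A short check, assuming an unweighted mean over the seven scenarios (the variable names are ours):

```python
# Per-scenario gains (optimized minus seeded accuracy, in pp) from Table 3.
GAINS_PP = {
    "Counting": 30.0,
    "Aggregation": 26.3,
    "Structure parsing": 25.0,
    "Density estimation": 23.1,
    "Span retrieval": 21.5,
    "Filtered ranking": 5.9,
    "List filtering": 0.0,
}

# Unweighted mean across scenarios, rounded to one decimal place.
average_gain = round(sum(GAINS_PP.values()) / len(GAINS_PP), 1)

# Number of scenarios with a strictly positive gain.
improved = sum(1 for g in GAINS_PP.values() if g > 0)

print(average_gain, improved)  # 18.8 6
```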

Cite

BibTeX

@article{li2026skillhone,
  title   = {SkillHone: A Harness for Continual Agent Skill Evolution
             Through Persistent Decision History},
  author  = {Li, Zhiwei and Hu, Yong},
  year    = {2026},
  note    = {WeChat AI, Tencent Inc.},
  url     = {https://github.com/Tencent/SkillHone}
}

Skills that remember why they changed.
