Skills don't work the way we think they do

I just finished reading the SkillBench paper: https://arxiv.org/pdf/2602.12670

And the results are definitely not what most people expect.

What researchers did

SkillBench research setup screenshot

They ran 86 real-work tasks across 11 domains, for 7,308 runs in total.

Each task was tested in three modes:

  1. Baseline (no skills)
  2. Curated skills (human-written)
  3. Self-generated skills by the model

SkillBench result comparing smaller models with skills to larger models without skills
Haiku with good skills is better than vanilla Opus

Without further ado, here are the conclusions from the paper that I found most interesting.

Self-generated skills don't help

One of the most hyped ideas in agent research is:

“Let the model write its own tools / skills.”

But it is mostly wasted effort. In this research, self-generated skills produced no meaningful improvement over the baseline.

In some cases, they made performance worse.

Today's models simply cannot reliably create useful reusable procedural abstractions.

This matters because a huge part of current agent research assumes models can recursively improve by generating better skills/tools. This benchmark suggests that assumption is premature.

SkillBench chart showing self-generated skills did not meaningfully improve performance

Human-made skills work A LOT better

When skills were carefully written by humans, performance jumped by 16.2 percentage points on average.
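To make "percentage points" concrete, here is a minimal sketch of how an average gain like this could be computed from per-task pass rates. The pass rates below are invented for illustration and are not figures from the paper, and `mean_pp_gain` is a hypothetical helper, not something the authors published.

```python
from statistics import mean

# Hypothetical per-task pass rates (fraction of runs that passed).
# These numbers are invented for illustration; they are NOT from the paper.
baseline = {"task_a": 0.20, "task_b": 0.50, "task_c": 0.10}
curated = {"task_a": 0.45, "task_b": 0.55, "task_c": 0.30}


def mean_pp_gain(before: dict[str, float], after: dict[str, float]) -> float:
    """Average absolute gain, in percentage points, over tasks present in both."""
    shared = before.keys() & after.keys()
    return mean((after[t] - before[t]) * 100 for t in shared)


gain = mean_pp_gain(baseline, curated)
print(f"{gain:.1f} percentage points")
```

Note that a percentage-point gain is absolute: going from a 20% to a 45% pass rate is +25 points, even though the pass rate more than doubled in relative terms.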

But here's what's even more surprising:

Domain variance was extreme

SkillBench chart showing high domain variance for human-made skills

Skills don't help equally across fields. They help disproportionately in structured, procedural domains.

Smaller models + skills ≈ bigger models without skills

A smaller model with curated Skills matched or exceeded a larger model without Skills.

This is huge for cost optimization: a cheaper model with well-written skills can stand in for a more expensive one.
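Here is a rough sketch of the cost argument, assuming the two setups reach the same pass rate. The prices and pass rates are placeholders I made up, not numbers from the paper, and `Setup` is a hypothetical class used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    name: str
    cost_per_run: float  # assumed dollars per run (hypothetical)
    pass_rate: float     # assumed fraction of runs that succeed (hypothetical)

    def cost_per_success(self) -> float:
        # Expected spend to get one passing run, assuming independent retries:
        # on average 1 / pass_rate runs are needed.
        return self.cost_per_run / self.pass_rate


small_with_skills = Setup("small model + curated skills", cost_per_run=0.02, pass_rate=0.40)
large_no_skills = Setup("large model, no skills", cost_per_run=0.30, pass_rate=0.40)

for s in (small_with_skills, large_no_skills):
    print(f"{s.name}: ${s.cost_per_success():.3f} per solved task")
```

With equal pass rates, the cost per solved task scales directly with the per-run price, so the whole saving comes from the cheaper model. Plug in your own prices and measured pass rates before drawing conclusions.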

Too many skills can hurt

Overly broad or verbose skill libraries degraded performance. Focused, minimal skill modules performed better.

SkillBench result showing too many skills can degrade performance

Pick your skills carefully. Two or three skills work better than four or more.
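One way to act on this is to cap how many skills get loaded for a given task. The sketch below uses naive keyword overlap to rank skills. That ranking is my own assumption for illustration, not a method from the paper, and `pick_skills`, `MAX_SKILLS`, and the skill names are all hypothetical.

```python
MAX_SKILLS = 3  # hypothetical cap, reflecting the "few skills beat many" finding


def pick_skills(task: str, skills: dict[str, str], limit: int = MAX_SKILLS) -> list[str]:
    """Return up to `limit` skill names whose descriptions share words with the task."""
    task_words = set(task.lower().split())
    scored = []
    for name, description in skills.items():
        overlap = len(task_words & set(description.lower().split()))
        if overlap > 0:  # skip skills with nothing in common with the task
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical skill library: name -> one-line description.
library = {
    "csv-cleaning": "clean and validate csv files",
    "sql-reports": "write sql queries for reports",
    "pdf-extract": "extract tables from pdf files",
    "git-release": "tag then publish a git release",
    "chart-style": "style charts for reports",
}

selected = pick_skills("extract tables from pdf files and clean csv output", library)
print(selected)
```

A real setup would want a better relevance signal than word overlap, but the shape is the same: rank, drop anything irrelevant, and keep only the top few.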

Here is my takeaway

If this paper is right (and I think it is, mostly because of my personal experience with skill files):

Read the full paper: https://arxiv.org/pdf/2602.12670