Skills don't work the way we think they do

February 20, 2026 · originally published on X

I just finished reading the SkillBench paper: https://arxiv.org/pdf/2602.12670

And the results are definitely not what most people expect.

What the researchers did

They ran 86 real-work tasks across 11 domains, for a total of 7,308 runs.

Each task was tested in three modes:

Baseline (no skills)
Curated skills (human-written)
Self-generated skills (model-written)
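
To make the three-mode comparison concrete, here is a minimal sketch of how per-mode pass rates could be tallied from run records. The record layout, the `pass_rates` helper, and the example records are my own assumptions for illustration. They are not the paper's harness or its data.

```python
from collections import defaultdict

# The three evaluation modes described above (names are illustrative).
MODES = ("baseline", "curated", "self_generated")

def pass_rates(runs):
    """Return {mode: pass rate in percent} from (task_id, mode, passed) records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for _task_id, mode, passed in runs:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        totals[mode] += 1
        passes[mode] += int(passed)
    # Skip modes with no runs to avoid dividing by zero.
    return {m: 100.0 * passes[m] / totals[m] for m in MODES if totals[m]}

# Made-up records for illustration only, NOT results from the paper.
example_runs = [
    ("task-1", "baseline", False),
    ("task-1", "curated", True),
    ("task-1", "self_generated", False),
    ("task-2", "baseline", True),
    ("task-2", "curated", True),
    ("task-2", "self_generated", False),
]
rates = pass_rates(example_runs)
```

Running every task in every mode is what makes the comparison fair: the only thing that changes between runs of the same task is which skills the model has access to.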

SkillBench result comparing smaller models with skills to larger models without skills: Haiku with good skills beats vanilla Opus

Without further ado, below are some conclusions that I found interesting in the paper.

Self-generated skills don't help

One of the most hyped ideas in agent research is:

“Let the model write its own tools / skills.”

But it is mostly wasted effort. In this research, self-generated skills produced no meaningful improvement over baseline.

In some cases, they made performance worse.

Today's models simply cannot reliably create useful reusable procedural abstractions.

This matters because a huge part of current agent research assumes models can recursively improve by generating better skills/tools. This benchmark suggests that assumption is premature.

SkillBench chart showing self-generated skills did not meaningfully improve performance

Human-made skills work A LOT better

When Skills were carefully written by humans, performance jumped by 16.2 percentage points on average.

But here's what's even more surprising:

Domain variance was extreme

Some domains saw small gains (~4-5 pp)
Others saw enormous gains (~50+ pp)
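
A quick note on units, since "pp" is easy to misread: a percentage-point gain is the plain difference between two pass rates, not a relative increase. The sketch below shows the arithmetic and how an average can hide a wide spread between domains. The domain names and numbers are placeholders I made up to mirror the rough shape of the result. They are not figures from the paper.

```python
def gain_pp(baseline_pct, with_skills_pct):
    """Absolute gain in percentage points (pp)."""
    return with_skills_pct - baseline_pct

def relative_gain(baseline_pct, with_skills_pct):
    """Relative gain as a fraction of the baseline (0.125 means +12.5%)."""
    return (with_skills_pct - baseline_pct) / baseline_pct

# Placeholder pass rates as (baseline %, with curated skills %).
# Hypothetical values, NOT taken from the paper.
domains = {
    "domain_a": (40.0, 45.0),  # small gain
    "domain_b": (20.0, 70.0),  # very large gain
}

gains = {name: gain_pp(b, s) for name, (b, s) in domains.items()}
mean_gain = sum(gains.values()) / len(gains)
spread = max(gains.values()) - min(gains.values())
```

With these placeholder numbers the mean gain is 27.5 pp, yet neither domain is anywhere near that value. That is why the per-domain breakdown is more informative than the headline average.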

SkillBench chart showing high domain variance for human-made skills

Skills don't help equally across fields. They disproportionately help in structured, procedural domains.

Smaller models + skills ≈ bigger models without skills

A smaller model with curated Skills matched or exceeded a larger model without Skills.

This is huge for cost optimization:

Local agents
Edge deployment
Open-source models

Too many skills can hurt

Overly broad or verbose skill libraries degraded performance. Focused, minimal skill modules performed better.

SkillBench result showing too many skills can degrade performance

Pick your skills carefully. 2-3 skills work better than 4+ skills.
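
One way to act on this is to cap how many skills get loaded for a given task. Below is a minimal sketch that ranks skills by simple word overlap with the task description and keeps at most three. The scoring method, the `select_skills` helper, the stopword list, and the skill names are all my own assumptions for illustration. The paper does not prescribe a selection method.

```python
import re

# Small illustrative stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "and", "the", "for", "it", "into", "with", "from", "of", "to"}

def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def select_skills(task, skills, max_skills=3):
    """Return up to max_skills skill names, ranked by word overlap with the task.

    skills maps a skill name to a short description. Skills with no overlap
    are dropped instead of being used to pad the list.
    """
    task_words = tokenize(task)
    scored = []
    for name, description in skills.items():
        overlap = len(task_words & tokenize(description))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _overlap, name in scored[:max_skills]]

# Hypothetical skill library, for illustration only.
skills = {
    "csv-cleaning": "clean and validate csv files with missing values",
    "sql-reporting": "write sql queries for reporting dashboards",
    "pdf-extraction": "extract tables from pdf invoices",
    "git-workflow": "rebase and squash git commits",
}
chosen = select_skills("Clean the csv export and load it into sql for reporting", skills)
```

Here only the two relevant skills are kept and the unrelated ones never reach the model's context. A real setup would likely use something better than keyword overlap, but the cap is the point.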

Here is my takeaway

If this paper is right (and I think it is, mostly because of my personal experience with skill files):

Scaling alone isn't enough
Autonomy narratives are premature
Skill architecture design is now a first-class research problem

Read the full paper: https://arxiv.org/pdf/2602.12670