paper v12 final · 2026-05-10

When auto-mined skills help robot code generation

Agentic coding pipelines share a recurring loop: discover useful subroutines from prior generations, accumulate them into a library, and re-apply them later. NVIDIA CaP-X already ships with a skill library, but its helpers were manually curated by researchers reading many LLM trial programs. auto-capx (this work) instantiates that self-improvement loop on CaP-X and tests, on a single concrete task, whether an automatic mining loop can match (or, under some conditions, improve on) NVIDIA's hand-curation step. The broader direction is code-as-policy robots that scale to wider settings without humans hand-expanding their helper API. The evidence so far is single-task and reduced-API; the cross-setting claim is not yet tested. Four research questions structure the paper: extract → useful → evolve → survive.

Status: paper v12 · NVIDIA-9 closure integrated · arXiv submission tar 151 KB · cumulative cost ~$261.

No-skill baseline | Robot code agent gets only the base CaP-X-style API. | P21_a
No-dedup mined-skill library | Executable mined functions injected without structural dedup. | C2
Production structural-dedup library | Same pipeline after current survivor selection. | C3v2
Quality-ranked same-size library | Top-11 by quality_score (drops 3 lowest from C2). | manual_11
Final success rate by experiment arm with Wilson confidence intervals
No-dedup mined-skill libraries beat the no-skill baseline; production structural-hash dedup regresses toward baseline; quality-ranked same-size selection recovers performance.

One-paragraph answers to the four questions

Setup — what was compared, and what counts as success

NVIDIA CaP-X gives an LLM a small base API plus a manually-curated set of helper skills (9 functions in the published S3 configuration, FrankaControlApiReducedSkillLibrary). NVIDIA's authors chose the helpers offline: ~182 unique functions across hundreds of trials were filtered to ~73 candidates, then human-selected down to 9. auto-capx (this work) instantiates an agentic self-improvement loop on top of CaP-X (mine subroutines from prior generations → gate → dedup → re-inject into later trials) and tests whether that automatic loop can reach performance comparable to NVIDIA's hand-curation step in this reduced-API setting. Concretely, trial Python is parsed, gated, optionally deduplicated, and re-injected into later prompts and execution namespaces, with no human in the loop. The broader direction is a code-as-policy robot agent that scales to wider settings without humans hand-expanding the helper API. The diagrams below show the curation difference; the experiments compare auto-mined library variants against nvidia9 (NVIDIA's published 9-skill library, the proper baseline) and against P21_a (a no-helpers ablation used as the anchor point for "+pp" gains).

NVIDIA CaP-X vs auto-capx

NVIDIA CaP-X (within-trial retry only) vs auto-capx (this work, adds skill memory layer across trials)

Both setups share the within-trial loop (LLM → Python → execute → VDM → retry). auto-capx adds the bottom row: extract reusable functions from trial logs, gate them, dedup, and re-inject the resulting library into the LLM's namespace on subsequent trials. Everything in this paper is the ablation of that bottom row.
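
The re-injection step in that bottom row can be sketched roughly as follows. This is a minimal illustration, assuming mined skills are stored as Python source strings; `seed_namespace`, `move_to_height`, and `lift_by` are hypothetical names chosen for the example, not auto-capx's actual interface.

```python
from typing import Callable, Dict, List


def seed_namespace(base_api: Dict[str, Callable], skill_sources: List[str]) -> dict:
    """Build an execution namespace holding the base API plus mined skills."""
    namespace = dict(base_api)  # copy so the caller's base API dict is not mutated
    for source in skill_sources:
        # Each skill is defined inside the shared namespace, so skills can
        # resolve the base API (and each other) by name at call time.
        exec(source, namespace)
    return namespace


# Hypothetical stand-in for one base API call.
def move_to_height(z: float) -> str:
    return f"moved to z={z:.2f}"


# Hypothetical mined skill that depends on the base API.
LIFT_SKILL = '''
def lift_by(dz, start_z=0.0):
    """Raise the gripper by dz metres from start_z."""
    return move_to_height(start_z + dz)
'''

ns = seed_namespace({"move_to_height": move_to_height}, [LIFT_SKILL])

# Later trial code runs in the seeded namespace and can call the skill directly.
exec("result = lift_by(0.10, start_z=0.05)", ns)
print(ns["result"])
```

Defining every skill in one shared namespace is what lets a promoted skill find its dependencies at call time, which is the role the text assigns to namespace seeding.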

Conditions used in the ablation

Each card explains what is in the library, what gets tested by it, and why it matters. The library size is the most important number on each card — the dedup story (Q2d) hinges on whether two libraries with the same size behave differently.

P21_a · baseline

The reference point. The robot code agent gets only the base CaP-X-style API, with no mined skills at all. Every other arm is compared against it (P21_a's success rate at n=50 is 78.0%).

0 skills · no library · no namespace seeding
Role · Anchor for every "+pp" we report.
C1 · no-dedup max

Everything quality gates passed, no dedup. All 16 mined functions injected into prompt + execution namespace. Namespace seeding (the runtime fix that makes promoted skills find their dependencies) is active.

16 skills · no dedup · namespace seeding ✓
Role · Tests "more skills, more help" as an upper bound.
C2 · paper's ship-this

Quality-gated 14 skills, no dedup. C1 with the two lowest-passing skills removed. This is the arm the paper recommends shipping ("namespace seeding + gates + no structural dedup").

14 skills · no dedup · namespace seeding ✓
Role · The cleanest "skills help" evidence.
C3v2 · production dedup

C2 + structural-hash dedup. The current production survivor rule clusters functions by AST hash and keeps one per cluster. Library size shrinks to 11. The downstream effect is the central counter-evidence: this arm is statistically indistinguishable from the no-skill baseline.

11 skills · structural-hash dedup · namespace seeding ✓
Role · Shows dedup can erase the entire effect.
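
A structural-hash survivor rule of the kind described for C3v2 might look like the sketch below. It is an illustration under stated assumptions, not the production implementation: the hash here is taken over `ast.dump` of the function with its name and docstring stripped, and the survivor is simply the first function seen in each cluster. Both choices, and the three toy functions, are assumptions made for the example.

```python
import ast
import hashlib
from typing import Dict, List


def structural_hash(source: str) -> str:
    """Hash a function's AST with its name and docstring stripped."""
    func = ast.parse(source).body[0]
    assert isinstance(func, ast.FunctionDef)
    func.name = "_"  # ignore the function name
    # Drop a leading docstring so documentation does not affect the hash.
    if ast.get_docstring(func) is not None and len(func.body) > 1:
        func.body = func.body[1:]
    return hashlib.sha256(ast.dump(func).encode()).hexdigest()


def dedup_by_structure(sources: List[str]) -> List[str]:
    """Keep the first function seen in each structural-hash cluster."""
    survivors: Dict[str, str] = {}
    for source in sources:
        survivors.setdefault(structural_hash(source), source)
    return list(survivors.values())


# Toy functions: `documented` and `undocumented` share a structure.
documented = 'def grasp(p):\n    """Grasp at p."""\n    return p + 1\n'
undocumented = 'def grab(p):\n    return p + 1\n'
other = 'def lift(p):\n    return p * 2\n'

print(len(dedup_by_structure([documented, undocumented, other])))
```

Under this first-seen rule, input order alone decides whether the documented or the undocumented variant survives, which shows how a purely structural rule can pick survivors without regard to quality.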
manual_11 · smart dedup

Same 11 skills size, picked by quality_score instead of structure. Take C2's 14 skills, sort by quality_score desc, keep the top 11 (drops the 3 weakest). Same library size as C3v2 but different survivors.

11 skills · quality-ranked dedup · namespace seeding ✓
Role · Proves the dedup damage is selection, not size.
empty_ns · stub control

16 function signatures with empty bodies (pass). The LLM sees the typed namespace exactly like in C1, but the executable bodies are gone. Used to ask: is a typed scaffolding alone enough?

16 names · no bodies · namespace seeding ✓
Role · Falls below baseline (43.3%) — bodies matter.
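
One way to build such a stub control is to keep each function's signature and replace its body with `pass`. The sketch below assumes Python 3.9+ (for `ast.unparse`); `to_stub` and the example skill are hypothetical, and the actual empty_ns construction may differ.

```python
import ast


def to_stub(source: str) -> str:
    """Return the function's signature with its body replaced by `pass`."""
    func = ast.parse(source).body[0]
    assert isinstance(func, ast.FunctionDef)
    func.body = [ast.Pass()]  # drops the docstring and all executable statements
    return ast.unparse(func)  # requires Python 3.9+


# Hypothetical skill used only to illustrate the transformation.
SKILL = '''
def lift_by(dz: float, start_z: float = 0.0) -> float:
    """Return the target height after lifting by dz."""
    return start_z + dz
'''

stub_source = to_stub(SKILL)
stub_ns: dict = {}
exec(stub_source, stub_ns)
print(stub_source)
```

The stub keeps the name and typed parameters visible to the LLM, but calling it does nothing and returns `None`.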
nvidia9 · NVIDIA published baseline

NVIDIA's manually-curated 9-skill library, hardcoded as class methods on FrankaControlApiReducedSkillLibrary. The 9 helpers were selected by NVIDIA's authors by offline compilation: ~182 unique functions across hundreds of LLM trials, filtered to ~73 candidates, then human-picked to 9. This is NVIDIA's published S3 configuration — the proper baseline that auto-mined libraries should match.

9 skills · manually curated (offline) · class methods (no namespace seeding) · gpt-4.1: 45/50 = 90.0% [78.6, 95.7] · DeepSeek: 42/50 = 84.0% [71.5, 91.7]
Role · The proper NVIDIA baseline. Auto-mined libraries (C2/manual_11) match this within Wilson CI overlap — O1 parity (pre-registered).
dedup_v3_k10 / k12 / k13 · algorithmic robustness

The manual_11 recipe at neighbouring k values. Same "top-k by quality_score" rule, but k = 10 / 12 / 13 instead of 11. Tests whether the smart dedup recipe is robust to small k changes, or accidental at k = 11.

k = 10 / 12 / 13 · quality-ranked · n = 70 each
Role · Q2d — non-monotone (k=12 dip), open mechanism.

auto-capx pipeline (full loop)

auto-capx pipeline: generate → execute → mine → gate → dedup → reinject

Success metric — auto-capx-defined "Task completed"

For each trial we report whether its final attempt succeeded (the definition used in summaries.txt). Counting trial directories directly over-counts: a trial with multiple sandbox attempts can have one success even when its final attempt fails. All numbers in this paper use the same definition.
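
The difference between the two counting rules can be illustrated with a small sketch. The attempt logs below are made up for the example, and the data layout (one ordered list of success flags per trial) is an assumption, not the format of summaries.txt.

```python
from typing import Dict, List

# Hypothetical attempt logs: trial id -> success flag per sandbox attempt, in order.
trials: Dict[str, List[bool]] = {
    "trial_00": [False, True],   # recovered on retry: counts as success
    "trial_01": [True, False],   # early success, final attempt fails: failure
    "trial_02": [True],
    "trial_03": [False, False],
}


def final_attempt_rate(logs: Dict[str, List[bool]]) -> float:
    """Fraction of trials whose final attempt succeeded (the metric used here)."""
    return sum(attempts[-1] for attempts in logs.values()) / len(logs)


def any_attempt_rate(logs: Dict[str, List[bool]]) -> float:
    """Fraction of trials with at least one successful attempt (over-counts)."""
    return sum(any(attempts) for attempts in logs.values()) / len(logs)


print(final_attempt_rate(trials), any_attempt_rate(trials))
```

In this toy set, trial_01 is the case the text warns about: it contains a success but its final attempt fails, so only the any-attempt rule counts it.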

Q1 — Can the LLM extract reusable skills from its own code?

RQ1 · extract
Hypothesis
AST extraction + quality gates (success rate, generality, dependency resolution) can yield a compact, reusable skill library from raw trial code.
Experiment
Re-analyse Group D (n=50 trials, 89 retry directories, 227 code blocks). Build the skill usage matrix; audit the unpromoted pool; quantify how each gate filters.
Result
  • 341 candidate functions → quality gates → 11 promoted skills
  • The top-4 promoted skills account for 76% of all calls (heavy-tailed)
  • Call arity is 100% correct across 744 calls — the LLM understands its own contracts
  • Hard-fail gates and dedup do 93% of the filtering: soft gates alone would keep 148 candidates, while the actual pipeline keeps 11 (137 of 148 removed)
  • Two dead skills (zero calls); both have no docstring — extractable but not invocable (foreshadows Q4)
Boundary
Skills were mined from cube_lifting only. Whether the same mining pipeline works on harder tasks is downstream of Q2c (the vision pipeline floors cube_stack_3, so cross-task mining is currently unmeasurable).
Skill usage matrix — top-heavy distribution and dead skills
Unpromoted pool — naming explosion and near-misses

Left: top-4 of 11 promoted skills = 76% of calls; two dead skills (no docstring). Right: 330 unpromoted candidates, 91% fail the generality gate; 18 share a structural hash with a promoted skill.

Q1 answer · Yes

The mining loop works: from a single trial corpus we end up with a small library where heavy-tailed usage is concentrated on a handful of skills with correct call signatures. Extraction is the easy part. Whether the resulting library actually helps is Q2.
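
The extraction step can be sketched with Python's standard `ast` module. This is a minimal illustration of collecting top-level `def` blocks and the properties the gates above refer to (arity, docstring presence); the function name, the record fields, and the toy trial code are assumptions, not the pipeline's real schema.

```python
import ast
from typing import Dict, List


def extract_top_level_functions(code: str) -> List[Dict[str, object]]:
    """Collect name, arity, docstring flag and source of each top-level def."""
    tree = ast.parse(code)
    candidates = []
    for node in tree.body:  # top-level statements only; nested defs are skipped
        if isinstance(node, ast.FunctionDef):
            candidates.append({
                "name": node.name,
                "n_args": len(node.args.args),
                "has_docstring": ast.get_docstring(node) is not None,
                "source": ast.get_source_segment(code, node),
            })
    return candidates


# Toy trial program, invented for the example.
TRIAL_CODE = '''
def approach(pose, offset=0.05):
    """Return a pose raised above the target by offset."""
    return (pose[0], pose[1], pose[2] + offset)

def helper(x):
    return x

target = approach((0.4, 0.0, 0.02))
'''

candidates = extract_top_level_functions(TRIAL_CODE)
for c in candidates:
    print(c["name"], c["n_args"], c["has_docstring"])
```

A trial file made only of imperative statements yields an empty candidate list under this scheme, since there is no `def` to collect.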

Q2 — Do mined skills help on the next trial?

The central question. We split it into four sub-questions: (a) within-task, (b) across LLM backbones, (c) across tasks, (d) sensitivity to the dedup algorithm.

RQ2a · within-task baseline (cube_lifting)
Hypothesis
Injecting mined skills into prompt + namespace raises task success rate above the no-skill baseline.
Experiment
cube_lifting × gpt-4.1, n=50 or n=70 per condition (n=30 for the empty_ns control).
Result
Condition | n | Success | Wilson 95% CI | vs P21_a
P21_a | 50 | 39/50 = 78.0% | [64.8, 87.2] | (reference)
C1 | 70 | 67/70 = 95.7% | [88.1, 98.5] | +18pp ✅
C2 | 70 | 68/70 = 97.1% | [90.2, 99.2] | +19pp ✅
C3v2 | 70 | 58/70 = 82.9% | [72.4, 89.9] | +5pp (CI overlap)
manual_11 | 70 | 66/70 = 94.3% | [86.2, 97.8] | +16pp ✅
nvidia9 (NVIDIA published) | 50 | 45/50 = 90.0% | [78.6, 95.7] | +12pp (CI overlap w/ C2)
empty_ns | 30 | 13/30 = 43.3% | [27.4, 60.8] | −35pp
Implication
  • No-dedup libraries (C1/C2) beat baseline by 18–19pp with separated CIs.
  • Production structural-hash dedup (C3v2) erases the gain — statistically indistinguishable from baseline.
  • Function bodies, not the typed namespace, drive the effect. Empty stubs fall below baseline.
  • The dedup problem is selection, not size. Same 11 skills picked by quality_score (manual_11) outperforms structural-hash 11 by +11pp.
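
The Wilson 95% intervals in the table above can be reproduced from the raw counts with the standard Wilson score formula. The sketch below uses only the standard library; the function name is ours, and z = 1.959964 is the usual two-sided 95% normal quantile.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts taken from the RQ2a table.
for arm, k, n in [("P21_a", 39, 50), ("C2", 68, 70), ("C3v2", 58, 70)]:
    lo, hi = wilson_interval(k, n)
    print(f"{arm}: {k}/{n} -> [{100 * lo:.1f}, {100 * hi:.1f}]")
```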

Visual evidence — sample trials

C2 success — no-dedup library

The "ship this" arm: namespace seeding + gates + no structural dedup.

P21_a success — baseline

Baseline succeeds 78% of the time; the library effect needs a large enough n to detect.

P21_a fail — baseline

22% baseline failures leave headroom for the library.

RQ2b · across LLM backbones
Hypothesis
The library effect replicates across multiple backbones, not just gpt-4.1.
Experiment
cube_lifting × {gpt-4.1, Claude Sonnet 4, DeepSeek v3} × {P21_a, manual_11, nvidia9}, n=50 each. nvidia9 = NVIDIA's published 9-skill library (pre-registered A2-min closure, Phase v12).
Result
Backbone | No helpers (P21_a) | NVIDIA-9 (manual) | Auto-mined (manual_11 / C2) | Auto vs NVIDIA-9
gpt-4.1 | 78.0% [64.8, 87.2] | 90.0% [78.6, 95.7] | 97.1% [90.2, 99.2] (C2) | +7pp directional; CI overlap → O1 parity
Claude Sonnet 4 | 86.0% | not run | 98.0% | n/a
DeepSeek v3 | 6.0% [2.1, 16.2] | 84.0% [71.5, 91.7] | 94.0% [83.8, 97.9] (manual_11) | +10pp directional; CI overlap → O1 parity

Wilson CI overlap: gpt-4.1 NVIDIA-9 [78.6, 95.7] ∩ C2 [90.2, 99.2] = [90.2, 95.7]. DeepSeek NVIDIA-9 [71.5, 91.7] ∩ manual_11 [83.8, 97.9] = [83.8, 91.7]. One-sided binomial p=0.015 (gpt-4.1) and p=0.009 (DeepSeek) indicate directional signal but at n=50 the gaps sit inside the CI overlap — pre-registered O1 outcome.
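
The interval intersections quoted above are simple to check mechanically. The helper below is a small sketch (the name `interval_overlap` is ours); it reports only whether two intervals share a range and is not an equivalence test.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]


def interval_overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the intersection of two closed intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Wilson 95% CIs (in percent) from the tables above.
gpt41 = interval_overlap((78.6, 95.7), (90.2, 99.2))     # NVIDIA-9 vs C2
deepseek = interval_overlap((71.5, 91.7), (83.8, 97.9))  # NVIDIA-9 vs manual_11
c3v2_vs_c2 = interval_overlap((72.4, 89.9), (90.2, 99.2))
print(gpt41, deepseek, c3v2_vs_c2)
```

The last pair, C3v2 against C2, comes back empty, a separation the NVIDIA-9 comparisons do not show.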

Implication
Library magnitude tracks how weak the baseline is. The weakest backbone (DeepSeek at 6%) gains the most. This pattern holds for any reasonable library — NVIDIA-9 also rescues DeepSeek from 6% to 84%, indicating the +88pp magnitude finding from the auto-mined arm is not specific to auto-mining. The auto-mined library does not dominate NVIDIA's manual curation; in this single-task setting it reaches Wilson 95% CIs that overlap NVIDIA-9, with a directional ~7–10pp auto-mining advantage and no pre-registered equivalence margin. The contribution is a feasibility result, not a quantitative win over, or demonstrated substitute for, manual curation.
RQ2c · across tasks (cube_stack_3 cross-task)
Hypothesis
The cube_lifting +19pp library effect transfers to a related task family (cube_stack_3 = stack three cubes red-on-green-on-blue).
Experiment
cube_stack_3 × gpt-4.1 × {P21_a, manual_11}, n=15 each.
Result
Condition | n | Task completed | Avg reward | Code blocks | Regenerations
p21_a | 15 | 0/15 (0%) | 0.040 | 5.733 | 4.733
manual_11 | 15 | 0/15 (0%) | 0.060 | 4.133 | 3.133

Floor on both arms. All three cubes are present in the simulator (verified by sim.data.xpos), but SAM3 segmentation produces highly fragmented masks (200+ entries per cube) on multi-cube scenes — the downstream pose-from-mask functions get nothing usable.

Library decouples from task success: even at the floor, manual_11 cuts code blocks by 27.9% (5.73 → 4.13), regenerations by 33.8% (4.73 → 3.13), and wall time by 24% (2338s → 1777s). "Library helps task completion" and "library helps generation cost" are separable dimensions; only the former is captured by binary task_completed.
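
As a quick arithmetic check, the relative reductions follow directly from the per-arm averages reported above:

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


code_blocks = pct_reduction(5.733, 4.133)    # average code blocks per trial
regenerations = pct_reduction(4.733, 3.133)  # average regenerations per trial
wall_time = pct_reduction(2338.0, 1777.0)    # wall time in seconds

print(f"{code_blocks:.1f}% {regenerations:.1f}% {wall_time:.1f}%")
```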
Implication
Cross-task transfer of the +19pp effect is currently unmeasurable because the vision pipeline saturates before LLM logic gets a chance. cube_lifting's 90%+ rates are partly a perception-easy regime (one chromatically distinct cube on a neutral background). The fix is environmental, not algorithmic: per-cube SAM prompts with class anchors, contact-graspnet's instance-aware pose, or grounded-segment-anything.
cube_stack failure mode breakdown
Production dedup boost — n=50 to n=70 Wilson interval
RQ2d · dedup algorithm sensitivity
Hypothesis
Production C3v2 erases the gain because of its specific survivor rule. A quality_score-ranked top-k recipe should be more robust.
Experiment
Take the 14 C2-promoted skills, rank by quality_score, keep the top k for k ∈ {10, 11, 12, 13}. Run n=70 each on gpt-4.1.
Result
k | n | Success | vs C3v2 (82.9%)
k=10 | 70 | 64/70 = 91.4% | marginal (P=0.034)
k=11 (manual_11) | 70 | 66/70 = 94.3% | separated (P=10⁻³)
k=12 | 70 | 61/70 = 87.1% | not separated (P=0.22, dip)
k=13 | 70 | 68/70 = 97.1% | separated (P=2×10⁻⁴)
Implication
Non-monotone: 91.4 / 94.3 / 87.1 / 97.1. Recipe is "top-k by quality_score with k near 11 or 13"; avoid k=12. The k=12 dip mechanism is open — the simple "docstring-as-tiebreaker" hypothesis does not explain it (manual_11 also keeps a docstring-less skill without a dip).
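
The "top-k by quality_score" recipe itself is a one-line selection. The sketch below assumes each skill carries a scalar `quality_score`; the `Skill` record, the skill names, the placeholder scores, and the name-based tiebreak are all invented for illustration and are not the paper's measured values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Skill:
    name: str
    quality_score: float


def top_k_by_quality(skills: List[Skill], k: int) -> List[Skill]:
    """Keep the k highest-scoring skills; ties are broken by name for determinism."""
    ranked = sorted(skills, key=lambda s: (-s.quality_score, s.name))
    return ranked[:k]


# 14 hypothetical skills with placeholder scores.
library = [Skill(f"skill_{i:02d}", 0.30 + 0.05 * i) for i in range(14)]

for k in (10, 11, 12, 13):
    kept = top_k_by_quality(library, k)
    dropped = sorted(set(s.name for s in library) - set(s.name for s in kept))
    print(k, dropped)
```

Because the libraries are nested (each larger k adds one skill to the previous set), a dip at k=12 would be attributable to the single skill added between k=11 and k=12 under this recipe.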

Q2 answer · Conditional yes — with NVIDIA-9 closure context (O1 parity)

Within cube_lifting + no-dedup, the auto-mined library delivers a robust +18–19pp effect over the no-helpers baseline (P21_a), replicated across three backbones with magnitudes that scale inversely with baseline strength (Δ = +12 / +19 / +88pp). Production structural-hash dedup erases the gain. Cross-task transfer is gated by the vision pipeline, not the library.

NVIDIA-9 closure (Phase v12): In this single-task setting at n=50 each, auto-mined libraries reach Wilson 95% CIs that overlap NVIDIA's manually-curated 9-skill library on both backbones — gpt-4.1: NVIDIA-9 90.0% [78.6, 95.7] vs C2 97.1% [90.2, 99.2]; DeepSeek: NVIDIA-9 84.0% [71.5, 91.7] vs manual_11 94.0% [83.8, 97.9]. Auto-mined libraries trend ~7–10pp higher with directional binomial p<0.05 at n=50, but no pre-registered equivalence margin was set. The appropriate reading is therefore "failure to reject equality with a directional auto-mining advantage," not demonstrated equivalence. The pre-registered O1 boundary (gpt-4.1 ∈ [90%, 96%]; DeepSeek ∈ [70%, 90%]) is satisfied. The contribution is a feasibility result: in this single-task, reduced-API setting the auto-mining loop performs comparably to NVIDIA's manual offline curation. Stronger substitutability claims would require pre-registered equivalence-margin tests (TOST or Newcombe-score) and replication on tasks beyond cube_lifting.

Q3 — How does the library evolve over multi-session refinement?

RQ3 · evolve (multi-session)
Hypothesis
If we mine round-1 trials, ship the library, and then mine round-2 from the new fail trials, round-2 should add genuinely new skills and improve over round-1.
Experiment
Round-1 = dedup_v3_k12 (mid-range, 87.1%, 6 fail trials available in cold storage). Run AST extraction + quality filter over the fail trials to produce round-2 promoted set.
Result
Zero new promotable skills. The fail-trial code.py files contain no top-level def blocks. Given a dense library, the LLM composes API calls imperatively (pose = get_grasp_pose_for_mask(mask); execute_grasp(pose); ...) without wrapping anything in new helper functions. Round-2 = round-1; the trial sweep was skipped because it would replicate round-1 exactly.
Mechanism — namespace saturation. auto-capx's mining loop assumes trial code contains novel def blocks. As library density grows, that assumption fails monotonically. cube_lifting under gpt-4.1 with k=12 is at or past saturation. The LLM is doing the right thing — there is no abstraction left to invent — but the mining pipeline cannot extract anything from this regime.
Implication
Multi-session refinement requires task complexity ≫ library coverage. cube_lifting + k=12 violates this. Two paths forward: (a) round-2 mining on a harder task (after the cross-task vision fix), (b) replace def-extraction with semantic-similarity clustering of frequently-co-occurring API-call sequences.
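
Path (b) could start from something much simpler than semantic-similarity clustering: counting which API calls follow each other in imperative trial code. The sketch below does plain bigram counting over call names in source order, a stand-in for the clustering idea rather than an implementation of it. The helper names and the two toy trial snippets are assumptions; `segment` is a made-up call, while the other call names are taken from the text.

```python
import ast
from collections import Counter
from typing import Iterable, List


def call_sequence(code: str) -> List[str]:
    """Names of plain-function calls, ordered by position in the source."""
    calls = [n for n in ast.walk(ast.parse(code))
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
    calls.sort(key=lambda n: (n.lineno, n.col_offset))
    return [n.func.id for n in calls]


def bigram_counts(trial_codes: Iterable[str]) -> Counter:
    """Count adjacent call-name pairs across a set of trial programs."""
    counts: Counter = Counter()
    for code in trial_codes:
        seq = call_sequence(code)
        counts.update(zip(seq, seq[1:]))
    return counts


# Toy def-free trial programs.
trial_a = (
    "mask = segment('cube')\n"
    "pose = get_grasp_pose_for_mask(mask)\n"
    "execute_grasp(pose)\n"
)
trial_b = (
    "pose = get_grasp_pose_for_mask(m)\n"
    "execute_grasp(pose)\n"
    "move_to_pose_world(pose)\n"
)

counts = bigram_counts([trial_a, trial_b])
print(counts.most_common(1))
```

Pairs that recur across many trials would be candidates for wrapping into a new helper even when the LLM never writes a `def` itself.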

Q3 answer · Open, with a saturation boundary finding

We measured a single snapshot. The round-2 saturation finding identifies a natural termination regime for the mining loop, but true long-horizon evolution dynamics (n=100–200 with library snapshots every N) remain unmeasured. This is the largest open question for paper v3.

Q4 — Which skills survive vs die in the library?

Four evidence streams converge on a survival rule.

Pattern 1 · Docstring is the decisive tiebreaker
Result
  • Across the 11 promoted skills, docstring-having skills are called 8× more on average; both dead skills lack docstrings.
  • Forced-choice probe (Phase 3.3, n=15): when the LLM sees doc and no-doc variants of the same skill, it picks doc 100% of the time (20/0 calls).
  • Neutral-name replication (counter-balanced n=30): 89:0 doc share. Confound from the variable name _undocumented is removed.
  • Multi-backbone replication: gpt-4.1 100% / Claude 100% / DeepSeek 94.8% — the docstring preference replicates across the three backbones tested (we do not extrapolate to LLMs in general).
Implication
The 8× call rate from Phase 1.5 is causal, not just correlational. A production survivor rule of (has_docstring, success_rate, name) should outperform structural-hash dedup — and it does (manual_11 vs C3v2 = +11pp).
Pattern 2 · Pipeline-hub stability beats long-tail
Result
Top-4 skills account for 76% of calls; execute_grasp is invoked in 89/89 trial directories (universal). Long-tail skills appear only on retry. Two dead skills both lack docstrings and sit far down the call distribution.
Implication
Skill survival is a function of position in the pipeline. Hubs (perception → planning → execution branchpoints) and low-level utilities (coordinate transforms, quaternion math) are stable; mid-level abstractions are retry-only.
Pattern 3 · Retries downgrade abstractions
Result
Phase 1.9 retry analysis: when a high-level skill fails, the LLM peels it back into low-level primitives. REPLACE pattern (different skill chosen) 66.7% > SAME_SET (same call repeated) 44%. execute_grasp failures get rewritten as move_to_pose_world + pose_matrix_to_pos_quat.
Implication
Library design implication: when shipping high-level skills, also include their low-level primitives in the namespace so the LLM can fall back without leaving the library.
Retry abstraction downgrade pattern
Pattern 4 · The library is a self-correction substitute (mechanism for Q2b)
Hypothesis
The multi-backbone gap (Δ = +12 / +19 / +88pp) is a self-correction difference, not a reasoning difference.
Experiment
cube_lifting × p21_a × {Claude, DeepSeek}, n=3 each, with verbose attempt-level reward logging.
Result
  • DeepSeek 0/3: five failing sandbox attempts at reward 0.5–0.7 — grip is OK, lift trajectory is unstable, never reaches task-complete height.
  • Claude 3/3: same partial-grip-and-drop failure on first attempt (rewards 0.478–0.546), but converges to 1.000 within one or two regenerations.
  • Both backbones make the same class of first-attempt error. Their divergence is in the regeneration loop.
Mechanism finding: the library's outsized DeepSeek benefit (+88pp) is consistent with the library acting as a substitute for DeepSeek's broken self-correction loop. The library replaces a broken iteration with a fixed correct skill. DeepSeek's 6% baseline is a self-correction failure, not a reasoning failure.
Falsifiable prediction
If true, then improving DeepSeek's iteration (reflection-style regeneration prompting) should reduce the library effect on DeepSeek. We do not run this test here; it is the cleanest paper-v3 follow-up.

Q4 answer · Yes — four survival patterns

Skills survive when they are documented, sit at a pipeline hub, can be downgraded to primitives on retry, and substitute for weak self-correction in the host LLM. The production survivor rule should be (has_docstring, success_rate, name), not structural hash.
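
A survivor rule keyed on (has_docstring, success_rate, name) could be expressed as a sort key applied within each cluster of near-duplicate skills. The sketch below is one possible reading of that tuple: documented skills first, then higher success rate, then name as a stable tiebreak. The ordering directions, the `Candidate` record, and the example cluster are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    has_docstring: bool
    success_rate: float


def survivor_key(c: Candidate) -> Tuple[bool, float, str]:
    # Documented first, then higher success rate, then name as a stable tiebreak.
    return (not c.has_docstring, -c.success_rate, c.name)


def pick_survivor(cluster: List[Candidate]) -> Candidate:
    """Choose one skill to keep from a cluster of near-duplicates."""
    return min(cluster, key=survivor_key)


# Hypothetical cluster of structurally similar skills.
cluster = [
    Candidate("grab_cube", has_docstring=False, success_rate=0.95),
    Candidate("pick_cube", has_docstring=True, success_rate=0.90),
    Candidate("grasp_cube", has_docstring=True, success_rate=0.90),
]

print(pick_survivor(cluster).name)
```

Here the undocumented skill loses despite its higher success rate, which is the behaviour the docstring-first ordering is meant to encode.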

Limitations and paper-v3 candidates

Open questions, organised by which RQ they map to.

RQ | Limitation / open question | paper-v3 candidate
Q1 | Mining was run on cube_lifting only. Whether the same pipeline produces useful skills on harder tasks is not measurable until Q2c is unblocked. | Re-run mining on cube_stack_3 / LIBERO after the vision-pipeline upgrade.
Q2 | Cross-task transfer floors out (cube_stack_3 vision saturation; LIBERO floors even with privileged-API state). | Vision-pipeline upgrade: per-cube SAM with class anchors, contact-graspnet instance-aware pose, or grounded-segment-anything. cube_stack_3 floor → measurable transfer regime.
Q3 | Single-snapshot only. The saturation finding is a boundary, not an evolution measurement. | Long-horizon Q3: n=100–200 trials with library snapshots every N. Or round-2 mining on a harder task (saturation sidestep).
Q4 | The self-correction substitute mechanism is a falsifiable hypothesis we did not yet test. | DeepSeek iteration test: reflection-style regeneration prompting should reduce the +88pp library effect on DeepSeek if the mechanism is right.
General | cube_lifting's 90%+ rates may be a perception-easy ceiling; gpt-4.1 snapshot is unpinned (OpenRouter). | Harder baseline tasks (cube_stack 3+, NutAssembly), backbone snapshot pinning, variance characterisation.
What this points at. The within-paper evidence is from one task in CaP-X's reduced-API setting: on cube_lifting an automatic mining loop reaches performance comparable to NVIDIA's hand-curated 9-skill helpers (Wilson 95% CIs overlap; no equivalence margin pre-registered). The broader hypothesis behind this work is that the same self-improvement loop — discover useful subroutines from prior generations, accumulate them, re-apply them — is what would let a code-as-policy robot agent scale to a wider range of robot settings without humans hand-expanding the helper API. Testing that broader hypothesis is the paper-v3 program in the table above (vision-pipeline upgrade → re-mining on harder tasks → long-horizon evolution → equivalence-margin re-test). This paper is one piece of evidence on the way; it is not a cross-setting claim.
Methodology note. Nine narrative reversals over the 19-day project — including one inside paper v10's body, caught at v11 by mechanical re-verification — are documented in the closing retrospective. The rule that single-run retrospective causal claims must be treated as hypotheses pending controlled cross-condition checks is the most generalisable lesson.

arXiv submission package and reproducibility