paper v12 final · 2026-05-10

Do self-extracted skills help robot code generation?

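Agentic coding pipelines (Voyager, SWE-agent, OpenHands, and others) share a self-improvement loop: as they repeat a task, they extract useful functions from their own code, accumulate them in a library, and reuse them the next time. NVIDIA CaP-X already has a skill library, but its helpers were curated by hand by researchers. auto-capx (this study) layers that self-improvement loop on top of CaP-X and tests, on a single task, whether an automatic mining loop can match NVIDIA's manual curation or beat it under some conditions. The bigger picture this work points to: robots in more general environments applying code-as-policy well on their own, without humans hand-extending the helper API. Current evidence is limited to a single task in the reduced-API setting; cross-setting generalization has not yet been verified. Four research questions: extract → useful → evolve → survive.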


Status: paper v12 · NVIDIA-9 closure integrated · arXiv submission tar 151 KB · cumulative cost ~$261.

No-skill baseline (P21_a): the robot code agent uses only the base CaP-X API.

No-dedup mined-skill library (C2): automatically extracted functions injected as-is, without dedup.

Production structural-dedup library (C3v2): the current production survivor selection applied.

Quality-ranked same-size library (manual_11): top 11 by quality_score (C2 minus its bottom 3).

The no-dedup mined-skill library beats the baseline. Production structural-hash dedup regresses to the baseline. The quality-ranked same-size library recovers the performance.

The four research questions · one-paragraph summaries

Q1 · Can an LLM extract reusable skills from its own code? Yes. AST extraction plus a quality gate yields 11 promoted skills; the top 4 account for 76% of calls; call arity is 100% correct.
Q2 · Do the extracted skills help on the next trial? Conditionally yes. On cube_lifting with no dedup: +18 to +19pp over the no-helper baseline (P21_a). Production dedup erases the effect. Across three backbones Δ = +12 / +19 / +88pp (Claude / gpt-4.1 / DeepSeek; replicated on the three backbones measured in this study). Cross-task is unmeasured because of vision-pipeline limits. NVIDIA-9 closure (v12): in the single-task n=50 setting, the Wilson 95% CI of auto-mining overlaps that of NVIDIA's manual 9-skill library (pre-registered outcome O1 satisfied). No equivalence margin was pre-registered, so the accurate reading is "auto-mining is directionally +7 to +10pp ahead, but equivalence is not demonstrated." The contribution is the feasibility of automation, not proof that manual curation can be replaced.
Q3 · How does the library evolve across multiple sessions? Open + saturation. Round-2 mining yields 0 new skills once the library is dense. This identifies a natural termination point of the mining loop; long-horizon evolution is future work.
Q4 · Which skills survive and which are discarded? Yes, 4 patterns. A docstring is the strongest selection signal measured in this study (an 8× call gap in observational data; 100% selection in a counter-balanced forced-choice probe, n=15); pipeline hubs outlast the long tail; abstraction is downgraded on retry; the library acts as a self-correction substitute (the proposed mechanism for the multi-backbone gap, not yet falsified).

Experimental design · what was compared and what counts as success

NVIDIA CaP-X gives the LLM the base API plus a set of helper skills hand-picked by researchers (FrankaControlApiReducedSkillLibrary, the 9 functions of the published S3 configuration) and has it write robot-control Python. The NVIDIA authors narrowed those 9 helpers from ~182 functions across hundreds of trials to ~73 candidates, then selected 9 by hand. auto-capx (this study) adds an agentic self-improvement loop on top: trial Python is parsed, functions that pass the quality gate are (optionally) deduplicated, and the survivors are reinjected into the next trial's prompt and execution namespace, with *no human in the loop*. The study then tests whether that automatic loop reaches performance comparable to NVIDIA's manual curation in the reduced-API setting. The bigger picture: code-as-policy robots adapting to more diverse environments on their own, without humans hand-extending the helper API. The diagram below shows the difference in curation; the experiments compare auto-mined library variants against nvidia9 (NVIDIA's published 9-skill library, the appropriate baseline) and P21_a (all helpers ablated, the reference point for every "+pp").

NVIDIA CaP-X vs auto-capx

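NVIDIA CaP-X (within-trial retry) vs auto-capx (this study, adds a skill-memory layer)

Both setups share the within-trial loop (LLM → Python → execute → VDM → retry). auto-capx adds the bottom row: reusable functions are extracted from trial logs, gated, and deduplicated, and the resulting library is reinjected into the LLM's namespace. Every ablation in this paper measures the effect of *that bottom row*.

Experimental conditions compared

Each card describes *what is in the library*, *which hypothesis it tests*, and *why it matters*. The most important number is library size, because the crux of the dedup story (Q2d) is *whether two libraries of the same size behave differently*.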

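P21_a · baseline

The reference point for comparison. The robot code agent uses only the base CaP-X-style API, with 0 mined skills. Every other condition is compared against this one (P21_a success rate at n=50: 78.0%).

0 skills · no library · no namespace seeding

Role · the anchor for every "+pp".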

C1 · no-dedup max

Every function that passes the quality gate, with no dedup. 16 mined functions are injected into the prompt and the execution namespace. Namespace seeding (a runtime fix that lets a promoted skill find its dependencies) is enabled.

16 skills · no dedup · namespace seeding ✓

Role · upper-bound test of "more skills = better".
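As a rough illustration of the namespace seeding mentioned in the C1 card, the sketch below executes skill source code into one shared dictionary so that a promoted skill can resolve a helper it depends on at call time. The function `seed_namespace` and the two toy skills (`to_xyz`, `lift_height`) are hypothetical and only illustrate the idea; the actual auto-capx mechanism may differ.

```python
def seed_namespace(skill_sources, base_namespace=None):
    """Exec each skill's source into one shared namespace dict."""
    namespace = dict(base_namespace or {})
    for source in skill_sources:
        # Functions defined by exec use `namespace` as their globals, so a
        # skill can call any other name that is seeded before it is invoked.
        exec(source, namespace)
    return namespace

# Hypothetical skills: `lift_height` depends on the helper `to_xyz`.
HELPER = "def to_xyz(pose):\n    return pose[:3]\n"
SKILL = "def lift_height(pose):\n    return to_xyz(pose)[2]\n"

# Definition order does not matter; the lookup happens at call time.
ns = seed_namespace([SKILL, HELPER])
height = ns["lift_height"]([0.1, 0.2, 0.3, 0.0])
```

Without the shared namespace, calling `lift_height` in isolation would raise a `NameError` for `to_xyz`, which is the kind of missing-dependency failure the card describes seeding as fixing.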

C2 · the paper's ship-this arm

14 quality-gated skills, no dedup: C1 with its 2 weakest skills removed. This is the arm the paper recommends for production ("namespace seeding + gates + no structural dedup").

14 skills · no dedup · namespace seeding ✓

Role · the cleanest evidence that skills help.

C3v2 · production dedup

C2 + structural-hash dedup. The current production survivor rule clusters functions by AST hash and keeps only one per cluster, shrinking the library to 11. Result: this arm is statistically indistinguishable from the no-skill baseline, which is the paper's key counter-evidence.

11 skills · structural-hash dedup · namespace seeding ✓

Role · evidence that dedup erases the effect.

manual_11 · smart dedup

Same size of 11, but selected by quality_score rather than by structure. The 14 C2 skills are sorted by quality_score in descending order and the top 11 are kept (the 3 weakest are dropped). Same size as C3v2, different survivors.

11 skills · quality-ranked dedup · namespace seeding ✓

Role · shows that the problem with dedup is selection, not size.
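To make the contrast between the C3v2 and manual_11 cards concrete, here is a minimal sketch of the two selection rules. The hashing scheme (an AST dump with identifiers and constants blanked out), the "keep the first skill seen per hash" rule, and the toy skills are all assumptions for illustration; the source only states that production dedup clusters by AST hash and keeps one per cluster, and that manual_11 keeps the top k by quality_score.

```python
import ast
import hashlib

def structural_hash(source):
    """Hash a function's AST shape, ignoring identifiers and constants."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = "_"
        elif isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.Constant):
            node.value = 0
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

def dedup_structural(skills):
    """Keep the first skill seen for each structural hash (assumed rule)."""
    survivors = {}
    for skill in skills:
        survivors.setdefault(structural_hash(skill["source"]), skill)
    return list(survivors.values())

def top_k_by_quality(skills, k):
    """Keep the k skills with the highest quality_score."""
    return sorted(skills, key=lambda s: s["quality_score"], reverse=True)[:k]

# Toy library: grasp_a and grasp_b differ only in names, so they share a hash.
skills = [
    {"name": "grasp_a", "quality_score": 0.4,
     "source": "def grasp_a(p):\n    return move(p)\n"},
    {"name": "grasp_b", "quality_score": 0.9,
     "source": "def grasp_b(q):\n    return move(q)\n"},
    {"name": "lift", "quality_score": 0.7,
     "source": "def lift(p):\n    x = move(p)\n    return x\n"},
]
structural = [s["name"] for s in dedup_structural(skills)]
ranked = [s["name"] for s in top_k_by_quality(skills, 2)]
```

Both rules return two skills here, but the structural rule keeps the lower-scoring `grasp_a` while the quality-ranked rule keeps `grasp_b`: same size, different survivors.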

empty_ns · stub control

16 function signatures with empty bodies (pass). The typed namespace the LLM sees is identical to C1, but there is no executable body. Question: is the typed scaffolding itself the source of the effect?

16 names · no bodies · namespace seeding ✓

Role · 43.3%, *below* the baseline: the bodies are the source.
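One way such a stub control could be produced is sketched below: each function keeps its signature but its body is replaced with `pass`. This is an illustrative assumption, not the authors' tooling; `stub_out` and the toy `lift` skill are hypothetical, and `ast.unparse` requires Python 3.9 or newer.

```python
import ast

def stub_out(source):
    """Return the source with every function body replaced by `pass`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Pass()]
    return ast.unparse(tree)

# Hypothetical skill with a typed signature and a real body.
original = (
    "def lift(pose: list, height: float = 0.2) -> bool:\n"
    "    move(pose)\n"
    "    return True\n"
)
stub = stub_out(original)
```

The stub still exposes the same name and parameters to the LLM, but calling it does nothing and returns `None`.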

nvidia9 · NVIDIA published baseline

The 9-skill library curated by NVIDIA itself, hard-coded as methods of the FrankaControlApiReducedSkillLibrary class. The NVIDIA authors narrowed the 9 helpers from ~182 functions across hundreds of trials to ~73 candidates, then selected 9 by hand. This is NVIDIA's published S3 configuration, the appropriate baseline for an auto-mined library to match.

9 skills · manual curation (offline) · class methods (no namespace seeding) · gpt-4.1: 45/50 = 90.0% [78.6, 95.7] · DeepSeek: 42/50 = 84.0% [71.5, 91.7]

Role · the appropriate NVIDIA baseline. The auto-mined libraries (C2 / manual_11) have Wilson CIs that overlap with it: O1 parity (pre-registered).

dedup_v3_k10 / k12 / k13 · algorithmic robustness

The manual_11 recipe applied at neighboring values of k: the same "top-k by quality_score" rule with k = 10 / 12 / 13. Tests whether the smart-dedup recipe is robust to small changes in k or whether k=11 was a fluke.

k = 10 / 12 / 13 · quality-ranked · n = 70 each

Role · Q2d: non-monotone (dip at k=12), mechanism open.

auto-capx full pipeline (full loop)

auto-capx loop: generate → execute → mine → gate → dedup → reinject

Success metric — auto-capx-defined "Task completed"

Whether each trial's final attempt succeeded (the definition used in summaries.txt). Counting trial directories directly over-counts: a trial would wrongly be scored as a pass if *any* of its multiple sandbox attempts succeeded. All numbers in the paper use this same metric.
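The difference between the two counting rules can be shown with a small sketch. The trial names and attempt outcomes below are made up for illustration and are not taken from the paper's logs.

```python
def final_attempt_success(attempts):
    """Metric described above: only the last attempt of a trial counts."""
    return bool(attempts) and attempts[-1]

def any_attempt_success(attempts):
    """Over-counting rule: any successful sandbox attempt counts."""
    return any(attempts)

# Illustrative data: one list of per-attempt outcomes per trial.
trials = {
    "trial_01": [False, True],   # recovered on retry
    "trial_02": [True, False],   # an early attempt passed, the final one failed
    "trial_03": [False, False],
}
final_rate = sum(final_attempt_success(a) for a in trials.values()) / len(trials)
any_rate = sum(any_attempt_success(a) for a in trials.values()) / len(trials)
```

Here `trial_02` is the case that separates the two rules: it counts as a pass under the any-attempt rule but as a failure under the final-attempt rule.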

Q1 · Can an LLM extract reusable skills from its own code?

RQ1 · extract

Hypothesis

AST extraction plus a quality gate (success rate, generality, dependency resolution) can extract a *compact and reusable* skill library from trial code.

Experiment

Retrospective analysis of Group D (n=50 trials, 89 retry directories, 227 code blocks): build the skill usage matrix, audit the unpromoted pool, and quantify the filtering done by each gate.
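The extraction step can be sketched with Python's standard `ast` module. This is a minimal illustration under assumed record fields (`name`, `arity`, `has_docstring`, `source`), not the auto-capx implementation; the quality gate itself is not modeled. The trial code only reuses API names that appear elsewhere on this page, and `_tmp` and `current_mask` are made up.

```python
import ast

def extract_candidates(source):
    """Collect top-level function definitions from one trial's code."""
    tree = ast.parse(source)
    candidates = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.FunctionDef):
            candidates.append({
                "name": node.name,
                "arity": len(node.args.args),
                "has_docstring": ast.get_docstring(node) is not None,
                "source": ast.get_source_segment(source, node),
            })
    return candidates

# Illustrative trial code; it is parsed only, never executed.
trial_code = '''
def pick(mask):
    """Grasp the object under the mask."""
    return execute_grasp(get_grasp_pose_for_mask(mask))

def _tmp(a, b):
    return a + b

pick(current_mask)
'''
candidates = extract_candidates(trial_code)
```

Recording arity and docstring presence at extraction time is convenient because both show up later in the analysis (call-arity correctness here, docstring effects in Q4).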

Result

341 candidate functions → quality gate → 11 promoted skills
The top 4 promoted skills account for 76% of all invocations (heavy-tailed)
Call arity is 100% correct (all 744 calls): the LLM understands the contracts accurately
Hard-fail + dedup account for 93% of all filtering (soft-only gating would leave 148 skills; the actual count is 11)
The 2 dead skills (0 calls) both lack docstrings: being extractable is separate from being *invocable* (ties into Q4)

Boundary

Mining was run only on cube_lifting. Whether the same mining pipeline works on other tasks is downstream of Q2c: the vision pipeline is blocked on cube_stack_3, so cross-task mining currently cannot be measured.

Skill usage matrix: top-heavy distribution and dead skills

Unpromoted pool: naming explosion and near-misses

Left: among the 11 promoted skills, the top 4 account for 76% of calls; 2 dead skills (no docstring). Right: 330 unpromoted candidates, 91% of which fail the generality gate; 18 match a promoted skill's structural hash.

Q1 answer · Yes

The mining loop works: starting from a single trial corpus, it produces a small library with a heavy-tailed call distribution and correct call signatures. Extraction itself is the easy part. Whether the resulting library actually helps is Q2.

Q2 · Do the extracted skills actually help on the next trial?

The core question of auto-capx, broken into 4 sub-questions: (a) within-task, (b) across LLM backbones, (c) across tasks, (d) dedup algorithm sensitivity.

RQ2a · within-task baseline (cube_lifting)

Hypothesis

Injecting the library into the prompt and namespace raises the task success rate over the baseline.

Experiment

cube_lifting × gpt-4.1, n=50/70 per condition.

Result

Condition	n	Success rate	Wilson 95% CI	vs P21_a
`P21_a`	50	39/50 = 78.0%	[64.8, 87.2]	(reference)
`C1`	70	67/70 = 95.7%	[88.1, 98.5]	+18pp ✅
`C2`	70	68/70 = 97.1%	[90.2, 99.2]	+19pp ✅
`C3v2`	70	58/70 = 82.9%	[72.4, 89.9]	+5pp (CI overlap)
`manual_11`	70	66/70 = 94.3%	[86.2, 97.8]	+16pp ✅
`nvidia9` NVIDIA published	50	45/50 = 90.0%	[78.6, 95.7]	+12pp (CI overlap w/ C2)
`empty_ns`	30	13/30 = 43.3%	[27.4, 60.8]	−35pp
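The Wilson 95% intervals in the table can be recomputed from the raw counts with the standard Wilson score formula. The sketch below uses only the Python standard library and assumes z = 1.96; the function name is ours.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

p21a_lo, p21a_hi = wilson_interval(39, 50)   # P21_a row: about (0.648, 0.872)
c2_lo, c2_hi = wilson_interval(68, 70)       # C2 row: about (0.902, 0.992)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near 0% or 100%, which matters for rows such as 68/70.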

Implication

The no-dedup libraries (C1/C2) beat the baseline by a solid +18 to +19pp (CIs separated)
Production structural-hash dedup (C3v2) erases the effect: statistically indistinguishable from the baseline
The function bodies themselves are the cause: empty stubs score *below* the baseline, so typed namespace scaffolding is not the mechanism
The problem with dedup is selection, not size: manual_11, which picks its 11 skills by quality_score, is +11pp above the structural-hash 11

Visual evidence — sample trials

C2 success — no-dedup library

"ship this" arm: namespace seeding + gates + no structural dedup.

P21_a success — baseline

Baseline success rate of 78%: detecting the library effect requires a sufficiently large n.

P21_a fail — baseline

22% baseline failures: the region where a library effect is measurable.

RQ2b · across LLM backbones

Hypothesis

The library effect replicates across multiple backbones; it is not specific to gpt-4.1.

Experiment

cube_lifting × {gpt-4.1, Claude Sonnet 4, DeepSeek v3} × {P21_a, manual_11, nvidia9}, n=50 each. nvidia9 = NVIDIA's published 9-skill library (pre-registered A2-min closure, Phase v12).

Result

Backbone	No helpers (P21_a)	NVIDIA-9 (manual)	Auto-mined (manual_11 / C2)	Auto vs NVIDIA-9
gpt-4.1	78.0% [64.8, 87.2]	90.0% [78.6, 95.7]	97.1% [90.2, 99.2] (C2)	+7pp directional; CI overlap → O1 parity
Claude Sonnet 4	86.0%	n/a (not run)	98.0%	n/a
DeepSeek v3	6.0% [2.1, 16.2]	84.0% [71.5, 91.7]	94.0% [83.8, 97.9] (manual_11)	+10pp directional; CI overlap → O1 parity

Wilson CI overlap: gpt-4.1 NVIDIA-9 [78.6, 95.7] ∩ C2 [90.2, 99.2] = [90.2, 95.7]. DeepSeek NVIDIA-9 [71.5, 91.7] ∩ manual_11 [83.8, 97.9] = [83.8, 91.7]. One-sided binomial p=0.015 (gpt-4.1), p=0.009 (DeepSeek): there is a directional signal, but at n=50 the CIs overlap, which is the pre-registered O1 outcome.
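The overlap check above is plain interval intersection. A minimal sketch, using the intervals reported on this page (the helper name is ours):

```python
def interval_overlap(a, b):
    """Return the intersection of two closed intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

gpt41 = interval_overlap((78.6, 95.7), (90.2, 99.2))           # NVIDIA-9 vs C2
deepseek = interval_overlap((71.5, 91.7), (83.8, 97.9))        # NVIDIA-9 vs manual_11
baseline_vs_c2 = interval_overlap((64.8, 87.2), (90.2, 99.2))  # P21_a vs C2
```

The first two calls return the intersections quoted above, while P21_a vs C2 returns `None`, matching the "CIs separated" reading for the no-dedup library. Overlapping intervals only mean a difference was not detected; they are not an equivalence test.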

Implication

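The magnitude of the library effect scales with how weak the baseline is. This pattern holds for any reasonable library: NVIDIA-9 also lifts DeepSeek from 6% to 84%, so the +88pp magnitude is not specific to auto-mining; it means *any good library* lifts DeepSeek's weak baseline. auto-mining has not demonstrated a quantitative advantage over NVIDIA's manual curation; it shows that automation is feasible at a comparable level. The mechanism is unpacked in Q4.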

RQ2c · across tasks (cube_stack_3 cross-task)

Hypothesis

The +19pp library effect on cube_lifting transfers to a related task family (cube_stack_3 = stacking 3 cubes: red, green, blue).

Experiment

cube_stack_3 × gpt-4.1 × {P21_a, manual_11}, n=15 each.

Result

Condition	n	Task completed	Avg reward	Code blocks	Regenerations
`p21_a`	15	0/15 (0%)	0.040	5.733	4.733
`manual_11`	15	0/15 (0%)	0.060	4.133	3.133

Both arms are at the floor. All 3 cubes are present in the simulator (verified directly via sim.data.xpos), but SAM3 segmentation produces fragmented masks in the multi-cube scene (200+ entries per cube), so the downstream pose-from-mask function has no usable mask.

The library's effect decouples from task success: even at the floor, manual_11 cuts code blocks by 27.9% (5.73 → 4.13), regenerations by 33.8% (4.73 → 3.13), and wall time by 24% (2338s → 1777s). "The library helps task completion" and "the library helps generation cost" are separable dimensions; the binary task_completed metric captures only the former.
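The percentage reductions quoted above follow from the table values by simple arithmetic, sketched here for checking (the helper name is ours):

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

code_blocks = relative_reduction(5.733, 4.133)    # about 27.9
regenerations = relative_reduction(4.733, 3.133)  # about 33.8
wall_time = relative_reduction(2338, 1777)        # about 24.0
```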

Implication

Cross-task transfer of the +19pp effect is *currently unmeasurable* because of vision-pipeline saturation. The 90%+ rates on cube_lifting partly reflect a perception-easy regime (a single chromatically distinct cube on a neutral background). The fix is environmental, not algorithmic: per-cube SAM prompts with class anchors, instance-aware pose from contact-graspnet, or grounded-segment-anything.

Production dedup boost — n=50 to n=70 Wilson interval

RQ2d · dedup algorithm sensitivity

Hypothesis

Production C3v2 erased the effect because of its specific survivor rule. Is the quality_score-ranked top-k recipe robust?

Experiment

The 14 C2-promoted skills are ranked by quality_score, and top-k for k ∈ {10, 11, 12, 13} is measured at n=70 each (gpt-4.1).

Result

k	n	Success rate	vs C3v2 (82.9%)
k=10	70	64/70 = 91.4%	marginal (P=0.034)
k=11 (manual_11)	70	66/70 = 94.3%	separated (P=10⁻³)
k=12	70	61/70 = 87.1%	not separated (P=0.22, dip)
k=13	70	68/70 = 97.1%	separated (P=2×10⁻⁴)

Implication

Non-monotone: 91.4 / 94.3 / 87.1 / 97.1. Recipe = "top-k by quality_score with k near 11 or 13"; avoid k=12. The mechanism behind the k=12 dip is *open*: a simple "docstring-as-tiebreaker" hypothesis does not explain it (manual_11 also contains a docstring-less skill *without a dip*).

Q2 answer · Conditionally yes, including the NVIDIA-9 closure (O1 parity)

On cube_lifting with no dedup, the auto-mined library gives a robust +18 to +19pp over the no-helpers baseline (P21_a), with a magnitude across 3 backbones that scales inversely with baseline strength (Δ = +12 / +19 / +88pp for Claude / gpt-4.1 / DeepSeek). Production structural-hash dedup erases the effect. Cross-task transfer is blocked by the vision pipeline.

NVIDIA-9 closure (Phase v12): in the single-task n=50 setting, the Wilson 95% CI of auto-mining overlaps that of NVIDIA's manually curated 9-skill library. gpt-4.1: NVIDIA-9 90.0% [78.6, 95.7] vs C2 97.1% [90.2, 99.2]; DeepSeek: NVIDIA-9 84.0% [71.5, 91.7] vs manual_11 94.0% [83.8, 97.9]. The auto-mined library is directionally ~7 to 10pp higher, with directional binomial p<0.05 (n=50). No equivalence margin was pre-registered, so the accurate reading is "directional advantage for auto-mining, equivalence not demonstrated (failure to reject equality)"; this is not demonstrated equivalence. The pre-registered O1 boundary (gpt-4.1 ∈ [90%, 96%]; DeepSeek ∈ [70%, 90%]) is satisfied. The contribution of this study is a feasibility result: in a single-task, reduced-API setting, the auto-mining loop reaches performance comparable to NVIDIA's offline manual curation. A strong substitutability claim would require a pre-registered equivalence margin (TOST or Newcombe-score) and replication on tasks beyond cube_lifting.

Q3 · How does the library evolve under multi-session refinement?

RQ3 · evolve (multi-session)

Hypothesis

If a library mined from round-1 trials is shipped and round-2 mining then extracts from the new failed trials, round 2 adds new skills and improves on round 1.

Experiment

Round 1 = dedup_v3_k12 (mid-range, 87.1%, with 6 failed trials kept in cold storage). AST extraction plus the quality filter on the failed trials produces the round-2 promoted set.

Result

0 promotable new skills. The code.py files of the failed trials contain 0 top-level def blocks. Given a dense library, the LLM composes API calls imperatively (pose = get_grasp_pose_for_mask(mask); execute_grasp(pose); ...) and does not write new helper functions. Round 2 = round 1, so the trial sweep was skipped (it would only replicate round 1).

Mechanism: namespace saturation. The capx mining loop implicitly assumes that trial code contains novel def blocks. As library density grows, this assumption increasingly fails. cube_lifting × gpt-4.1 × k=12 is at or beyond the saturation point. This is *correct behavior* by the LLM (there is no new abstraction to create), but the mining pipeline cannot extract anything in this regime.

Implication

For multi-session refinement to be effective, task complexity must far exceed library coverage (task complexity ≫ library coverage); cube_lifting + k=12 violates this. Two paths forward: (a) round-2 mining on a harder task (after the cross-task vision fix), (b) replacing def extraction with semantic-similarity clustering of frequently co-occurring API-call sequences.
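As a much simpler stand-in for path (b), the sketch below counts adjacent API-call pairs across trials instead of looking for `def` blocks. It is plain frequency counting, not the semantic-similarity clustering proposed above, and it was not part of the study. The name `segment` is a hypothetical call; the other two API names appear in the round-2 result above.

```python
import ast
from collections import Counter

def call_sequence(source):
    """Names of plain function calls, in source order.

    Source order matches execution order only for straight-line code
    without nested calls, which is the imperative style described above.
    """
    calls = [
        node for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    ]
    calls.sort(key=lambda c: (c.lineno, c.col_offset))
    return [c.func.id for c in calls]

def frequent_bigrams(sources, min_count=2):
    """Count adjacent call pairs across trials and keep the frequent ones."""
    counts = Counter()
    for source in sources:
        seq = call_sequence(source)
        counts.update(zip(seq, seq[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Illustrative def-free trial code (parsed only, never executed).
trial_a = "pose = get_grasp_pose_for_mask(mask)\nexecute_grasp(pose)\n"
trial_b = "m = segment(image)\npose = get_grasp_pose_for_mask(m)\nexecute_grasp(pose)\n"
pairs = frequent_bigrams([trial_a, trial_b])
```

A pair that recurs across trials, such as `get_grasp_pose_for_mask` followed by `execute_grasp`, would be a candidate for wrapping into a helper even though no trial ever defined one.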

Q3 answer · Open + saturation boundary finding

Only a single snapshot was measured. The round-2 saturation finding identifies the *natural termination regime* of the mining loop, but true long-horizon evolution dynamics (n=100-200 with library snapshots every N trials) are unmeasured. This is the biggest open question for paper v3.

Q4 · Which skills survive and which are discarded?

Four evidence streams on the survival rule converge.

Pattern 1 · Docstring = the strongest selection signal

Result

Averaged over the 11 promoted skills, skills with a docstring are called 8× more often; both dead skills lack a docstring.
Forced-choice probe (Phase 3.3, n=15): shown doc / no-doc variants, the LLM picks the doc variant 100% of the time (20/0 calls).
Neutral-name replication (counter-balanced, n=30): 89:0 doc share. This removes the _undocumented variable-name confound.
Multi-backbone replication: gpt-4.1 100% / Claude 100% / DeepSeek 94.8%. The docstring preference replicates on all three backbones (no extrapolation to LLMs in general is made).

Implication

The 8× call frequency from Phase 1.5 is causal, not merely correlational. A production survivor rule of (has_docstring, success_rate, name) does better than structural-hash dedup, as supported by manual_11 vs C3v2 = +11pp.
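A survivor rule keyed on (has_docstring, success_rate, name) could be expressed as a sort key, as sketched below. The ordering within the tuple (documented first, then higher success rate, then name ascending as a deterministic tiebreaker) is our reading of the rule, and the cluster contents are made up.

```python
def survivor_key(skill):
    """Sort key for (has_docstring, success_rate, name); best sorts first."""
    return (not skill["has_docstring"], -skill["success_rate"], skill["name"])

def pick_survivor(cluster):
    """Choose one skill from a cluster of near-duplicates."""
    return min(cluster, key=survivor_key)

# Illustrative cluster of near-duplicate skills.
cluster = [
    {"name": "grasp_v2", "has_docstring": False, "success_rate": 0.95},
    {"name": "grasp_v1", "has_docstring": True, "success_rate": 0.90},
    {"name": "grasp_v0", "has_docstring": True, "success_rate": 0.90},
]
survivor = pick_survivor(cluster)
```

In this toy cluster the undocumented `grasp_v2` loses despite its higher success rate, and the tie between the two documented variants is broken by name.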

Pattern 2 · Pipeline-hub stability > long-tail

Result

The top 4 skills account for 76% of calls; execute_grasp is called in 89/89 trial directories (universal). Long-tail skills appear only on retries. The 2 dead skills both lack docstrings and sit far down the call distribution.

Implication

Skill survival is a function of pipeline position. Hubs (the perception → planning → execution branch points) and low-level utilities (coordinate transforms, quaternion math) are stable; mid-level abstractions are retry-only.

Pattern 3 · Abstraction downgrade on retry

Result

Phase 1.9 retry analysis: when a high-level skill fails, the LLM *unpacks it into low-level primitives*. REPLACE pattern (calling different skills) 66.7% > SAME_SET (repeating the same calls) 44%. An execute_grasp failure is decomposed into move_to_pose_world + pose_matrix_to_pos_quat.

Implication

Library design implication: when shipping a high-level skill, also include its low-level primitives in the namespace, so the LLM can fall back without leaving the library.

Pattern 4 · Library = self-correction substitute (the mechanism behind Q2b)

Hypothesis

The multi-backbone gap (Δ = +12 / +19 / +88pp) reflects a difference in *self-correction*, not in *reasoning*.

Experiment

cube_lifting × p21_a × {Claude, DeepSeek}, n=3 each, with verbose attempt-level reward logging.

Result

DeepSeek 0/3: all 5 sandbox attempts score reward 0.5-0.7. The grip is fine, but the lift trajectory is unstable and never reaches the task-complete height.
Claude 3/3: first attempts fail in the same partial-grip-and-drop class (0.478-0.546), but converge to 1.000 within 1-2 regenerations.
Both backbones make the *same class* of first-attempt error. The difference lies in the regeneration loop.

Mechanism finding: the library's outsized benefit for DeepSeek (+88pp) is consistent with the library substituting for DeepSeek's broken self-correction loop: it replaces broken iteration with a fixed, correct skill. On this reading, DeepSeek's 6% baseline is a self-correction failure, not a reasoning failure.

Falsifiable prediction

If the mechanism above is right, applying reflection-style regeneration prompting to DeepSeek should *reduce* the library effect. This was not tested in this phase; it is the cleanest paper-v3 follow-up.

Q4 answer · Yes: 4 survival patterns

A skill survives when it is documented, sits at a pipeline hub, can be downgraded to primitives on retry, and substitutes for the host LLM's weak self-correction. Production survivor rule = (has_docstring, success_rate, name), not structural hash.

Limitations + paper-v3 candidates

Open questions, organized by the RQ they map to.

RQ	Limitation / open question	paper-v3 candidate
Q1	Mining was run only on cube_lifting. Whether the same pipeline yields useful skills on other tasks is unmeasurable until Q2c is unblocked.	Re-run mining on cube_stack_3 / LIBERO after fixing the vision pipeline.
Q2	Cross-task transfer is at the floor (cube_stack_3 vision saturation; LIBERO and privileged-API runs are also at the floor).	Vision pipeline upgrade: per-cube SAM with class anchors, contact-graspnet instance-aware pose, or grounded-segment-anything. Moves cube_stack_3 from the floor into a measurable transfer regime.
Q3	Single snapshot only. The saturation finding is a boundary, not a measurement of evolution.	Long-horizon Q3: n=100-200 trials with library snapshots every N. Or round-2 mining on a harder task (sidestepping saturation).
Q4	The self-correction-substitute mechanism is a falsifiable hypothesis but was not tested in this phase.	DeepSeek iteration test: check the prediction that reflection-style regeneration prompting reduces the +88pp library effect.
General	The 90%+ rates on cube_lifting may be a perception-easy ceiling; the gpt-4.1 snapshot is not pinned (OpenRouter).	Harder baseline tasks (cube_stack 3+, NutAssembly), backbone snapshot pinning, variance characterization.

Where this work points. The evidence in this paper is limited to a single task (cube_lifting) in the CaP-X reduced-API setting: the automatic mining loop reaches Wilson 95% CI overlap with NVIDIA's manual 9-skill helpers (no equivalence margin was pre-registered). The larger hypothesis this work points to is that the same self-improvement loop (find useful functions in earlier generations, accumulate them, reuse them) can help code-as-policy robots adapt to more diverse environments on their own, without humans hand-extending the helper API. Testing that broader hypothesis is the paper-v3 program in the table above (vision pipeline upgrade → re-mining on harder tasks → long-horizon evolution → re-test with an equivalence margin). This paper is one piece of evidence along that path, not a cross-setting claim in itself.

Methodology note. Over the 19-day project there were 9 narrative reversals, including one mechanism hypothesis in the body of paper v10 that was caught by mechanical re-verification at the v11 stage; they are collected in the closing retrospective. The most generalisable lesson: *a single-run retrospective causal claim should be treated as a hypothesis until it is verified by a controlled cross-condition check*.

arXiv submission package + reproducibility

paper v12 PDF: NVIDIA-9 closure integrated (O1 parity)
capx-paper-arxiv.tar.gz: TeX + figure PDFs only (151 KB)
SUBMISSION_README.md: metadata + upload checklist
LaTeX source
paper v1 Korean companion · English version (this page in English)
capx-current-status.md · capx-log.md
Figure build script
GitHub repo