> **Archive notice**: This file is an archival research trace, not the reader-first public explanation. For the current public framing, condition glossary, and paper package, start at [`docs/index.html`](index.html) / <https://capx.ryeol.kim/>. Internal run IDs such as `C2` and `P21_a` are preserved here for traceability.

# CaP-X Project — Current Status

**Last updated:** 2026-05-08
**Project root:** `/Users/ryeol-13/GitHub/oh-my-cap-x/`
**Deployed status page:** https://capx.ryeol.kim — stale HTML snapshot now carries a 2026-05-03 supersession banner; use this file + `docs/paper/` as source of truth.

---

## 1. Project direction (north star)

**Curiosity-driven** observation of LLM skill emergence in robotic code generation.
Inspired by agent harnesses (Voyager, OpenClaw, SWE-agent, OpenHands): how does the same phenomenon show up in robotic code generation?

### 4 core questions
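- **Q1.** Can an LLM extract usable skills from its own code?
- **Q2.** Do the extracted skills actually help on the next trial?
- **Q3.** How does the skill library evolve over time?
- **Q4.** Which skills survive and which get discarded, and is there a pattern?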

### Success criteria
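Not aimed at publication. No p-values required. Ending with observations alone is fine.
- The unit of record is an observation: "X happened Y times."
- Claims ("X causes Y") are verified separately.
- Ceiling and null observations also count as data.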

---

## 2. Where we are

| Stage | Status |
|---|---|
| Group D 50-trial (49/50 = 98%) | ✅ Done |
| Methodology review (15 issues identified) | ✅ Done: `docs/superpowers/reviews/2026-04-17-methodology-review.md` |
| Phase 1 zero-cost observations (9 items) | ✅ Done |
| Phase 1 synthesis retrospective | ✅ `docs/superpowers/retrospectives/2026-04-24-phase-1-synthesis.md` |
| Phase 0 infrastructure (seed fix, config policy, VDM root cause, metadata) | ✅ **Done (2026-04-24)** |
| **Phase 2.4 3-fix ablation (60 trials)** | ✅ Done (2026-04-24): dedup turned out to be negative |
| **Phase 2.1 counter-factual (10 trials)** | ✅ Done (2026-04-24): the no-skills condition came out +20pp ahead |
| **Phase 2.4c Dedup v2 + cube_stack held-out** | ✅ Done (2026-04-25): Dedup v2 showed +20pp on cube_lifting at n=15, but only a marginal +7pp at n=50 |
| **Phase 2.4 E (C3v2 n=50 variance)** | ✅ **Done (2026-04-26)**: 80% (CI [67%, 89%]). The 93% seen at n=15 was an outlier. |
| **Production code cleanup (Dedup v2 applied)** | ✅ Done (2026-04-26): `library.py` + `.capx_skills.json` regenerated |
| **HTML status report update** | ✅ Done (2026-04-28): capx.ryeol.kim, added Phase 2.4e + 2.5 + paper sections |
| **Phase 3.3 adversarial decoy** | ✅ Done (2026-04-28): 14/15 success, decoy invocation rate 6.7%; only the verify-style decoys were invoked |
| **Phase 3.3 docstring on/off (causality)** | ✅ Done (2026-04-29): 15/15 success; in the doc/nodoc forced choice the doc variant was picked 100% of the time (20:0), so the 8× correlation from Phase 1.5 is read as causal |
| **North star checkpoint (2026-04-29)** | 🟢 Partially achieved: Q1/Q4 strong, Q2 an honest negative, only Q3 (long-horizon evolution) unanswered (closable for ~$32). `docs/superpowers/retrospectives/2026-04-29-northstar-checkpoint.md` |
| **Research methodology review v2 (2026-04-30)** | ✅ Written: issues organized into 8 categories. Found **A1 (name confound), rated Critical-leaning Important**. Recommends *3 wording fixes + 1 strengthening + neutral-name replication* (~$8.5) before posting to arXiv. `docs/superpowers/reviews/2026-04-30-research-methodology-review-v2.md` |
| **Confound-closure experiment plan v2 (2026-04-30)** | ✅ Written: 7 experiments (wording fix + 6 experiments) with a phased budget A→E. Phase A ($8.5) alone is enough to keep the honest scope of paper v2 clean. `docs/superpowers/plans/2026-04-30-confound-closure-experiments.md` |
| **Phase A: Wording fix + neutral-name replication (2026-04-30)** | ✅ Done: 8 paper fixes + I6/H2 closure + neutral-name 89:0 doc share (counter-balanced n=30). **A1 confound fully closed**. Wording upgraded to "causally responsible" in the paper v2 PDF (9p / 413KB). Cost $8.5. `docs/superpowers/observations/2026-04-30-phase-3-3-doc-neutral.md` |
| **Phase B: Baseline boost + dedup ROI (2026-05-01)** | ✅ Done: P21_a n=10→30 (73.3% [55.6, 85.8]), C1/C2 n=15→50 (both 96.0% [86.5, 98.9]). **CIs separated → library +23pp** vs baseline. **Dedup costs -16pp** (C2→C3v2). Paper narrative reversed on 5/1. `docs/superpowers/observations/2026-05-01-phase-B-baseline-boost.md` |
| **Phase C: Multi-backbone doc replication (2026-05-01)** | ✅ Done: gpt-4.1 100% / Claude Sonnet 4 100% / DeepSeek 94.8% doc share (all ≥90%). **Mechanism universality confirmed** (LLM-general, not backbone-specific). `docs/superpowers/observations/2026-05-01-phase-C-multi-backbone-doc.md` |
| **Phase F: Strict confound closure (2026-05-02)** | ✅ Done: empty_ns 43%, group_a_repro 63%, P21_a 78% (n=50), C3v2 83% (n=70). This reverses the +23pp claim from 5/1. The library-content effect is real (+40pp vs empty_ns), but the production library is statistically equivalent to baseline. `docs/superpowers/observations/2026-05-02-phase-F-strict-confound-closure.md` |
| **Phase G: C1/C2 boost + LIBERO priv smoke (2026-05-03)** | ✅ Done: C1 n=70 = 95.7%, C2 n=70 = 97.1% (the 96% from 5/1 was real, not lucky). **The no-dedup library has +18-19pp positive ROI vs P21_a (CIs separated)**. **Dedup itself costs -14pp** (C2 vs C3v2, CIs separated). LIBERO priv smoke 1/10 = floor → the cause lies outside perception. `docs/superpowers/observations/2026-05-03-phase-G-c1c2-boost-libero-priv-smoke.md` |
| **Paper v5 (2026-05-03)** | ✅ Corrected abstract / §VI.C / §V.B / §IX / conclusion. "ship namespace+gates+skills (C2), NOT dedup". 10p PDF; the narrative is statistically solid. |
| **Phase H': multi-backbone + manual_11 + LIBERO Object 0 priv (2026-05-03)** | ✅ Done: c2_claude 19/20 = 95%, c2_deepseek 20/20 = 100% (C2 endpoint multi-backbone), **manual_11 initial 20/20 → boosted n=70 = 94.3% [86.2, 97.8]** (the dedup defect is in the algorithm, NOT the library size, so a smart dedup is possible), libero_object_0_priv 0/10 = floor. `docs/superpowers/observations/2026-05-03-phase-Hprime-multi-backbone-and-manual-11.md` |
| **Paper v6 (2026-05-03)** | ✅ Integrated manual_11 diagnostic + multi-backbone C2 + LIBERO Object 0 priv. States explicitly that a smart dedup is possible. 10p PDF (411KB). arXiv submission tar updated (139KB). |
| **Paper v7 + v8 (2026-05-03)** | ✅ Critical fixes after strict review v3 (author/Table 2/mechanism/Δpp/group_a footnote) + manual_11 n=20→70 boost integration. **manual_11 n=70 = 94.3% [86.2, 97.8]**, vs C3v2 one-sided P≈0.6%. Paper v8 PDF + arXiv tar updated. |
| **Codex P0 arXiv readiness cleanup (2026-05-03)** | ✅ Done: SUBMISSION_README author/stale metadata cleaned up, arXiv abstract at 1475 characters, figures updated with the final n=70 numbers, bibliography metadata fixed, closing retrospective written, local + tar compile verified. `docs/superpowers/retrospectives/2026-05-03-closing-retrospective.md` |
| **manual_11 boost (2026-05-03)** | ✅ Done: pod $0.66 + LLM ~$15 = ~$16. Trials 21..70 = 46/50 = 92%. Cumulative ~$162. |
| **LIBERO simpler subtask smoke** | ✅ Done (2026-04-28): spatial_2 + object_0 both at the 0/5 floor (environment incompatibility hypothesis) |
| **Phase 1+2 integrated narrative v1** | ✅ Done (2026-04-26): `2026-04-26-capx-narrative-v1.md` |
| **Phase 2.5 LIBERO held-out (smoke 5)** | ✅ Done (2026-04-27): 5/5 at reward 0 (floor). The held-out runs were not executed because of a runner bug. |
| **arXiv-quality paper draft (10p 2col)** | ✅ Done (2026-04-27): `docs/paper/capx-paper-2026-04.tex` (451 lines, 8 figures) |
| RunPod pods | 🗑️ All terminated. Balance ~$16.5 |
| R2 backup | ✅ Added libero 1.3 MiB + paper 194 KiB |
| Phase 3 long-horizon | ⏳ Pending |
| pdflatex compile / arXiv tar | ✅ Done (2026-05-03): TinyTeX 10p PDF (441,158 bytes), tar-only source package 134KB, extraction compile OK. |
| **v2 paper foundation: smoke triplet proposal** | 📝 **Proposed (2026-05-03)**: Tier 1 Dedup v3 k sensitivity + D2 closure + LIBERO diagnosis (~$11). Decide on the real run ($15–$65) after the results are reported. `docs/superpowers/proposals/2026-05-03-v2-foundation-smoke-triplet.md` |
| **v2 smoke triplet: execution + analysis (2026-05-04)** | ✅ Done: n=140 trials, ~$8-10. P21_a sanity 80% / manual_11 100% (env reproducibility verified). Dedup v3 k=10/12/13 all ≥93%. Found a D2 backbone asymmetry (Claude=100% / DeepSeek=0%). Identified 11 environment fixes → first application of manifest §8 backup_protocol. `docs/superpowers/observations/2026-05-04-v2-smoke-triplet-results.md` |
| **Phase 1+2 real run (2026-05-04, ~9h pod, ~$64-69)** | ✅ Done: sanity + B3 + A2 D2 closure n=200 + A1 Dedup v3 real n=210. **Multi-backbone library effect**: gpt-4.1 Δ=+19pp / Claude Δ=+2pp (null) / **DeepSeek Δ=+88pp** (the largest single-task library benefit). **Dedup v3 algorithmic robustness**: k∈{10,11,12,13} all ≥90%; k=10/11/13 have CIs separated from C3v2. Bootstrap 17min (effect of the manifest fix list). Monitor v6+v7 = the first case of a genuinely active sub-agent monitor. `docs/superpowers/observations/2026-05-04-phase12-real-results.md` |
| **Paper v9 update (D2 closure + Dedup v3)** | ✅ Done: abstract + Table 2 (9 new rows) + §V.E (D2 closure) + §V.F (Dedup v3 robustness) + Limitations + Conclusion. 11 pages, 423650 bytes. arxiv-submission tar regenerated (138KB). |
| **Paper v10 metric correction (8th reversal, 2026-05-06)** | ✅ Done: the capx-defined "Task completed" (final attempt success) is the source of truth. Numbers corrected for 5 conditions (Claude P21_a 98%→86%, dedup_v3_k10 97.1%→91.4%, etc.). Claude library effect = +12pp (P=0.5%), detectable. 11p / 445,052 bytes. `docs/superpowers/observations/2026-05-06-b4-metric-correction.md` |
| **Paper v10.1 §V.F mechanism self-correction (9th reversal, 2026-05-07)** | ✅ Done: the "docstring-less get_grasp_pose_for_mask" hypothesis for the k=12 dip was *empirically refuted before publication* (that skill exists only at k=13; manual_11 also has a docstring-less skill without a dip). The §V.F mechanism paragraph was self-corrected. B4 micro-test cancelled ($3 saved). 11p / 446,953 bytes. `docs/superpowers/observations/2026-05-07-9th-reversal-k12-mechanism-refuted.md` |
| **Paper v11 closing (Smoke phase 2 + strengthened Limitations, 2026-05-08)** | ✅ Done: §V.G new section: Smoke #1 cube_stack_3 cross-task floor (0/15 baseline, 0/15 library; vision-pipeline saturation localised; library still confers 24-34% code-efficiency reduction at floor); Smoke #2 multi-session round-2 mining returned 0 promotable new skills (namespace saturation finding); Smoke #3 verbose mechanism (DeepSeek partial-grip-and-drop, Claude robust completion within 1-2 regens; multi-backbone gap = self-correction not reasoning). Abstract + Limitations strengthened. 13p / 460,482 bytes. arxiv-submission tar updated (151KB). Cumulative ~$248 (smoke phase 2 ~$7 + prior ~$241). Balance: OpenRouter $38 + RunPod $17. **Project status: closed, paper v11 publication-ready.** `docs/superpowers/observations/2026-05-08-smoke{1,2,3}-*.md` |

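The bracketed ranges in the table above (for example 96.0% [86.5, 98.9] at n=50) are numerically consistent with a 95% Wilson score interval. This file does not say which interval was used, so treat that as an inference from the numbers. The sketch below, with a hypothetical helper `wilson_interval`, reproduces two of the quoted ranges and the "CIs separated" check. The success counts 68/70 and 39/50 are the ones implied by the percentages 97.1% and 78.0%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts implied by the table: C2 = 68/70 (97.1%), P21_a = 39/50 (78.0%).
c2_lo, c2_hi = wilson_interval(68, 70)
p21_lo, p21_hi = wilson_interval(39, 50)

# "CIs separated" here means the lower bound of one arm clears the upper bound of the other.
separated = c2_lo > p21_hi

print(f"C2    [{c2_lo:.1%}, {c2_hi:.1%}]")
print(f"P21_a [{p21_lo:.1%}, {p21_hi:.1%}]")
print("separated:", separated)
```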
---

## 3. Phase 1: one-line summaries of the 9 observations

| # | Topic | Document | Key finding |
|---|---|---|---|
| 1.1 | Skill usage matrix | `docs/superpowers/observations/2026-04-19-phase-1-1.md` | Of 11 promoted skills, **2 are dead** (`select_top_grasp`, `grasp_pose_to_ik`). The top 4 account for 76% of all calls. `execute_grasp` is universal, appearing in 89/89 dirs. |
| 1.2 | Unpromoted pool audit | `2026-04-19-phase-1-2.md` | Of 330 unpromoted skills, **91% fail the generality gate**. 18 match a promoted skill by structural hash, i.e. near-duplicates. 62 near-misses. |
| 1.3 | Failure event sequence | `2026-04-24-phase-1-3.md` | The 12-block failure in trial_42 was a Molmo base-API loop. **VDM reasoning is empty 82% of the time** (228/277 blocks), a candidate system defect. |
| 1.4+1.8 | Survival + gate sensitivity | `2026-04-19-phase-1-4-and-1-8.md` | All 16 pre-GD promoted skills are borderline. **Hard-fail gates + dedup account for 93% of the total filter** (soft-only=148 → actual=11). |
| 1.5 | Prompt position + docstring | `2026-04-19-phase-1-5.md` | Position bias is null (\|r\| ~ 0). **Skills with a docstring are called 8× more often on average**. Both dead skills have no docstring. |
| 1.6 | Invocation context | `2026-04-24-phase-1-6.md` | Across 744 calls, **arity matches 100%**. Zero docstring-usage mismatches. If the name is self-explanatory, a missing docstring is fine. |
| 1.7 | Trial-order behavior | `2026-04-24-phase-1-7.md` | All correlations have \|r\| < 0.23, i.e. noise. Consistent with the LLM being stateless. |
| 1.9 | Retry behavior | `2026-04-24-phase-1-9.md` | **Abstraction downgrade** on retry (high-level → low-level). REPLACE 66.7% > SAME_SET 44%. Long-tail skills appear on retries. |
| Synthesis | All 9 combined | `docs/superpowers/retrospectives/2026-04-24-phase-1-synthesis.md` | Current answers to Q1/Q2/Q3/Q4 + a proposed Phase 2 redesign. |

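Observation 1.2 relies on a structural hash to spot near-duplicates of promoted skills. This file does not describe how that hash is computed, so the following is only a sketch of one common approach: parse the function with `ast`, rename the function, its arguments, and its local variables to positional placeholders, drop the docstring, and hash the normalized tree. The helper `structural_hash` and the three toy function bodies are hypothetical; only the name `transform_cam_to_world` comes from the observations.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Replace function, argument, and local-variable names with placeholders."""

    def __init__(self, local_names: set[str]) -> None:
        self.local_names = local_names
        self.aliases: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        # Globals (e.g. helper functions being called) keep their real names.
        if name not in self.local_names:
            return name
        return self.aliases.setdefault(name, f"v{len(self.aliases)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "f"
        if ast.get_docstring(node) is not None:
            node.body = node.body[1:] or [ast.Pass()]
        self.generic_visit(node)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._alias(node.id)
        return node


def structural_hash(source: str) -> str:
    """Hash a single function definition, ignoring naming and docstring."""
    func = ast.parse(source).body[0]
    local_names = {n.arg for n in ast.walk(func) if isinstance(n, ast.arg)}
    local_names |= {
        n.id for n in ast.walk(func)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    normalized = _Normalizer(local_names).visit(func)
    return hashlib.sha256(ast.dump(normalized).encode()).hexdigest()[:12]


SKILL_A = '''
def transform_cam_to_world(point_cam, cam_pose):
    """Map a camera-frame point into the world frame."""
    point_world = cam_pose @ point_cam
    return point_world
'''
SKILL_B = '''
def camera_to_world_coords(p, T):
    out = T @ p
    return out
'''
SKILL_C = '''
def world_to_cam(p, T):
    out = inv(T) @ p
    return out
'''

print(structural_hash(SKILL_A) == structural_hash(SKILL_B))  # same structure, different names
print(structural_hash(SKILL_A) == structural_hash(SKILL_C))  # different structure
```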
### Current answers to the 4 north-star questions

| Q | Current observational answer |
|---|---|
| Q1 | Yes, it extracts them. **However, the same skill is rediscovered under multiple names** (F1). A single function, `transform_cam_to_world`, has 10+ name variants. Dedup handles 95% of these. |
| Q2 | **Conditional YES**: the no-dedup C2 library beats the P21_a baseline by +18–19pp (C2 n=70 = 97.1% [90.2, 99.2] vs P21_a n=50 = 78.0% [64.8, 87.2], CIs separated). However, the dedup-applied production C3v2 sits at 82.9% [72.4, 89.9], statistically equivalent to baseline. So “skills help” is YES in the C2 setup and NO/NULL in the current structural-hash dedup production setup. Transfer beyond cube_lifting is unmeasured because cube_stack/LIBERO are at floor. |
| Q3 | Snapshots only for now. Answerable only with the Phase 3.1 long-horizon run (n=100–200). |
| Q4 | Patterns: ① **the dedup survivor rule is decisive**: switching it to `(has_docstring, sr, name)` gave +20pp at n=15 (a marginal +7pp at n=50). ② docstring + self-explanatory name (Phase 1.5) ③ pipeline hub or low-level util ④ retry-level abstraction downgrade. |

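The survivor rule in Q4 is given only as the tuple `(has_docstring, sr, name)`. One plausible reading, sketched here under assumptions: within each group of structurally identical skills, keep the candidate that has a docstring, then break ties by higher `sr` (assumed to be a success rate), then by name. The `Skill` dataclass, its field names, the tie-break direction on `name`, and the sample values are all hypothetical, and the production `library.py` may differ. The point of the toy pool is that a documented skill beats an undocumented twin even when the twin has the higher `sr`.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Skill:
    name: str
    struct_hash: str  # assumed: key identifying structurally identical skills
    docstring: str
    sr: float         # assumed: success rate in [0, 1]


def survivor_key(skill: Skill) -> tuple[bool, float, str]:
    """Sort key mirroring (has_docstring, sr, name); the largest key survives."""
    return (bool(skill.docstring.strip()), skill.sr, skill.name)


def dedup(skills: list[Skill]) -> list[Skill]:
    """Keep exactly one survivor per structural-hash group."""
    by_hash = sorted(skills, key=lambda s: s.struct_hash)
    return [
        max(group, key=survivor_key)
        for _, group in groupby(by_hash, key=lambda s: s.struct_hash)
    ]


# Hypothetical pool: two structural twins under h1, one singleton under h2.
pool = [
    Skill("grasp_v1", "h1", "", 0.9),
    Skill("grasp_v2", "h1", "Grasp the cube from above.", 0.8),
    Skill("lift", "h2", "", 0.7),
]
survivors = [s.name for s in dedup(pool)]
print(survivors)
```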
---

## 4. Phase 0 infrastructure: done

- [x] `capx/envs/simulators/robosuite_cube_lift.py:110-112` **seed propagation fix** (inserted `np.random.seed(seed)`)
- [x] **Config policy redefined**: never edit the existing baseline YAML; standardize on derived YAMLs. Phase 2.4/2.1 derivatives: `..._ablation_c{0..3}.yaml`, `..._p21_a.yaml`
- [x] **Per-run metadata** (`run_metadata.json`): commit hash, dirty flag, args, config — `capx/envs/runner.py::_write_run_metadata()`
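- [x] **VDM reasoning empty**, root cause identified: a model limitation plus a prompt that never asks for reasoning. The fix is not included in Phase 2.4 (to avoid a confound); it is split out as a micro-task before Phase 3.
- [ ] Library snapshot automation (only needed before Phase 3.1; not done yet)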

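The per-run metadata item lists four fields: commit hash, dirty flag, args, and config. A minimal sketch of how such a file could be written follows. It is not the body of `_write_run_metadata()`: the function name `write_run_metadata`, the JSON key names, and the fallback to `null` when git is unavailable are assumptions. A call such as `write_run_metadata("outputs/demo_run", {"seed": 1}, {"task": "cube_lifting"})` would produce `outputs/demo_run/run_metadata.json`.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def _git(*args: str) -> Optional[str]:
    """Return stripped stdout of a git command, or None if git cannot answer."""
    try:
        done = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or not inside a repository
    return done.stdout.strip()


def write_run_metadata(run_dir: str, args: dict, config: dict) -> Path:
    """Write run_metadata.json with commit hash, dirty flag, args, and config."""
    status = _git("status", "--porcelain")
    meta = {
        "commit": _git("rev-parse", "HEAD"),
        # Any porcelain output means uncommitted changes; None means unknown.
        "dirty": None if status is None else bool(status),
        "args": args,
        "config": config,
    }
    path = Path(run_dir) / "run_metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return path
```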
---

## 5. Phase 2 execution results (completed 2026-04-24)

| # | Experiment | Result |
|---|---|---|
| 2.4 | **3-fix ablation** (60 trials) | ✅ Done. C0 33% / C1 **100%** / C2 **100%** / C3 73%. **Dedup costs −27pp**, gates are neutral, namespace adds +67pp. |
| 2.1 | Counter-factual (10 trials) | ✅ Done. P21_a 90% > C3 73% (same seeds 1–10). Skills are negative. |
| 2.2 | Dose response | On hold (to be redesigned once the 2.4 results are incorporated) |
| 2.3 | Held-out task transfer | Pending (dedup redesign is a prerequisite) |
| 3.3 | Adversarial injection | Pending |

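The three per-fix effects quoted in row 2.4 are successive differences along the chain C0 → C1 → C2 → C3. The mapping of each step to a fix (namespace, gates, dedup) is read off the summary in that row; the variable names below are illustrative.

```python
# Success rates (%) from the Phase 2.4 ablation row.
rates = {"C0": 33, "C1": 100, "C2": 100, "C3": 73}

# Each fix is attributed to one step of the chain (assumed mapping, see lead-in).
steps = [("namespace", "C0", "C1"), ("gates", "C1", "C2"), ("dedup", "C2", "C3")]

deltas = {fix: rates[after] - rates[before] for fix, before, after in steps}

for fix, delta in deltas.items():
    print(f"{fix:>9}: {delta:+d}pp")
```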
**Actual cost**: Pod $1.54 + LLM ~$20 = ~$22. Balance $29.43, plus the OpenRouter balance was used.

**Narrative rewrite needed**:
- The "3-fix package" claim in the Group D retrospective → **refuted** by the ablation.
- Namespace seeding alone was the core of the Group C→D recovery.
- Dedup is **harmful** on cube_lifting (the way it preserves dead skills needs a redesign).

---

## 6. Key mechanical findings (to remember)

1. **Naming diversity explosion**: the LLM keeps rediscovering the same function under different names. The single function `transform_cam_to_world` has 10+ name variants.
2. **Docstring ≫ prompt position**: the primary predictor of call frequency is docstring presence (8× difference); position has |r| ≈ 0.
3. **Abstraction downgrade** (retry behavior): when a high-level skill fails, the LLM unpacks it into low-level primitives. Similar to human debugging.
4. **Failures come from the base API, not the skills**: trial_42 was a Molmo failure. The skill layer itself complies with arity 100% of the time.
5. **Hard-fail + dedup are 93% of the gate**: soft gates alone take 341 → 148; the actual result is 11. Within the 3-fix, this part is dominant.
6. **VDM reasoning 82% empty**: the model (gpt-4.1) does not return the reasoning field, and the prompt does not ask for an explanation. The prompt fix is a task for just before Phase 2.
7. **⭐ The 3-fix is not monotone (Phase 2.4)**: C1 100% = C2 100% > C3 73%. **Dedup costs −27pp**. The Group D narrative needs rewriting.
8. **⭐ The skill library is negative on cube_lifting (Phase 2.1)**: P21_a (0 skills, 90%) > C3 (11 skills, 73%). A 20pp gap on the same seed set.
9. **⭐ Dedup v2 (doc-prefer survivor): n=15 +20pp → n=50 +5–10pp marginal (Phase 2.4c + E)**: C3 73% → C3v2 80% (95% CI [67%, 89%]). 1 swap (`get_grasp_pose_for_mask` ↔ `plan_and_select_grasp`). The algorithm fix has an effect, but not a dramatic one.
10. **Cube_stack held-out all at 0–20% (Phase 2.4c)**: even the no-skill baseline is at 0%. The task itself is too hard. C3v2 is the only condition with any successes (2/10).
11. **⭐ LIBERO Spatial 0 baseline is also at floor, 0/5 (Phase 2.5)**: same pattern as cube_stack. Both held-out targets are at floor → **skill library transfer is currently impossible to measure** with this pipeline.
12. **arXiv-quality paper draft complete**: `docs/paper/capx-paper-2026-04.tex` (10p 2col, 8 figures). Honest framing: namespace seeding dominant, dedup as engineering hygiene, transfer unmeasured.

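The 93% in finding 5 can be recovered from the three counts it quotes. This file does not spell out the denominator, so the check below assumes the percentage means the share of soft-gate survivors that hard-fail gates plus dedup go on to remove.

```python
total_candidates = 341  # 11 promoted + 330 unpromoted
after_soft_gates = 148  # survivors if only soft gates were applied
promoted = 11           # survivors of the full filter

# Candidates that pass the soft gates but are still rejected.
removed_by_hard_fail_and_dedup = after_soft_gates - promoted
share = removed_by_hard_fail_and_dedup / after_soft_gates

print(removed_by_hard_fail_and_dedup)
print(f"{share:.1%}")
```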
---

## 7. Methodology issues: must be handled before Phase 0

`docs/superpowers/reviews/2026-04-17-methodology-review.md` 3 Critical + 7 Important.

| # | Issue | Resolved in |
|---|---|---|
| C1 | Seed propagation: the seed is not passed to robosuite_env.reset() | Phase 0 |
| C2 | Config mismatch (A=12 workers, D=1) | Phase 0 |
| C3 | Train/test contamination: skills are both extracted and evaluated on cube_lifting | Mitigated by the Phase 2.3 held-out |
| I1–I7 | Detailed methodology | Incorporated step by step in Phase 0/2 |

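Issue C1 is the failure mode where a seed is accepted by the runner but never reaches the code that draws random numbers, so two runs with the "same" seed start from different states. The toy environment below illustrates the behavior before and after the fix. It uses the standard-library `random` module in place of the `np.random.seed(seed)` call so that it runs without dependencies, and `ToyCubeEnv` is a hypothetical stand-in, not the robosuite environment.

```python
import random
from typing import Optional


class ToyCubeEnv:
    """Toy stand-in whose reset() samples an initial cube (x, y) position."""

    def __init__(self, propagate_seed: bool) -> None:
        self.propagate_seed = propagate_seed

    def reset(self, seed: Optional[int] = None) -> tuple[float, float]:
        if self.propagate_seed and seed is not None:
            # Analogue of the np.random.seed(seed) line added in Phase 0.
            random.seed(seed)
        # Placement is drawn from the global RNG, as in the unfixed code path.
        return (random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1))


fixed = ToyCubeEnv(propagate_seed=True)
fixed_repeatable = fixed.reset(seed=7) == fixed.reset(seed=7)

broken = ToyCubeEnv(propagate_seed=False)
broken_repeatable = broken.reset(seed=7) == broken.reset(seed=7)

print("fixed repeatable: ", fixed_repeatable)
print("broken repeatable:", broken_repeatable)
```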
---

## 8. Backups & asset locations

### R2 (`r2:capx/`) — 2.27 GiB / 23,869 objects
- `outputs/`: 2.19 GiB (raw data for all Group A/B/C/D trials)
- `group_d/`: 77 MiB (separate snapshot)
- `skill_backups/final_2026-04-22/`: 4 skills JSON files
- `logs/` (6 experiment) + `logs/server/` (5 server: graspnet, molmo, proxy, sam3)
- `artifacts/` (dose-response skill files), `scripts/molmo_server.py`

### Local
- `/Users/ryeol-13/GitHub/oh-my-cap-x/outputs/api_injection/group_d_{fixed,sanity}/`
- `/Users/ryeol-13/GitHub/oh-my-cap-x/.capx_skills.json` (11 promoted + 330 unpromoted)
- `/Users/ryeol-13/GitHub/oh-my-cap-x/.capx_skills.json.pre_group_d` (16 promoted snapshot)
- Phase 1 analysis: `scripts/analysis/` + `outputs/analysis/phase_1_*/`

### RunPod
- **All pods terminated.** Cost $0/hr. A new pod is needed when Phase 2 starts.

### External
- **Standby GPU**: `ryeol-u` (RTX 5060 Ti 16GB, Ubuntu). Primarily used for the HRI duplex project; can also be used for capx analysis when needed.

---

## 9. Related document links

**Top-level (always update):**
- This document: index of the current status
- `docs/capx-log.md`: date-ordered, append-only activity log
- `docs/capx-conventions.md`: writing rules and routines
- `docs/capx-status-report.html` → https://capx.ryeol.kim (2026-04-18 snapshot)

**Superpowers categorized:**
- Phase 2 proposal (original): `docs/superpowers/proposals/2026-04-18-curiosity-driven-next-steps.md`
- Methodology review: `docs/superpowers/reviews/2026-04-17-methodology-review.md`
- Group D retrospective: `docs/superpowers/retrospectives/2026-04-17-group-d-retrospective.md`
- Phase 1 synthesis: `docs/superpowers/retrospectives/2026-04-24-phase-1-synthesis.md`

**Memory (cross-session):**
- North star: `~/.claude/projects/-Users-ryeol-13/memory/project_capx_northstar.md`
- Experiment state: `~/.claude/projects/-Users-ryeol-13/memory/project_capx_experiment.md`

---

## 10. Work pointers for the next session

Readjusted based on the Phase 2.4/2.1 results:

1. **Immediately doable** ($0):
   - Write a spec for the dedup algorithm redesign: usage-based or non-destructive (demote-only) instead of signature-based
   - Rewrite the Group D retrospective ("3-fix package → in fact namespace seeding alone")
   - Decide whether to promote `.capx_skills.c2.json` (14 skills) or `c1.json` (16 skills) to the "new best skill set"

2. **Small rerun** (~$6–8): reconfirm C1 (100%) at n=50 to reduce variance

3. **Phase 2.3 held-out** (~$10, pod ~1hr): check the cube_stack prerequisites → run the transfer experiment on C2 (14 skills instead of 11)

4. **Phase 3.3 adversarial** (~$5): empirically test the hypotheses derived from Phase 1.5/1.6 (docstring / name bias)
