> **Archive notice**: This file is an archival research trace, not the reader-first public explanation. For the current public framing, condition glossary, and paper package, start at [`docs/index.html`](index.html) / <https://capx.ryeol.kim/>. Internal run IDs such as `C2` and `P21_a` are preserved here for traceability.

# CaP-X Experiment Log

**Append-only chronological log of experiments, analyses, and project events.**

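- Entries are in descending date order (newest at the top)
- One entry = one session, experiment, or decision
- Details live in the linked documents; this log is an at-a-glance summary of "what happened"
- Writing conventions: `docs/capx-conventions.md`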

---

## 2026-05-08 — capx project closing (paper v11 + Smoke phase 2 + 9 narrative reversals)

### 5/6 8th narrative reversal — metric correction (paper v9 → v10)
- The 5 conditions in paper v9 were computed with a *trial dir count* metric, which over-counts
- The capx-defined "Task completed" (final attempt success in `summaries.txt`) is the source of truth
- Corrections across 5 conditions, including Claude P21_a 98%→86%, dedup_v3_k10 97.1%→91.4%, dedup_v3_k12 90%→87.1%
- Claude library effect: +2pp null → **+12pp detectable (P=0.5%)**, a narrative reversal
- New memory: `feedback_capx_metric_definition.md`
- Artifact: observation `2026-05-06-b4-metric-correction.md` / paper v10 (11p / 445,052 bytes)
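
The difference between the two metrics can be sketched as follows. This is a minimal illustration, not the project's scoring code: the entry does not say exactly how the directory-based count over-counted, so the sketch assumes, purely for illustration, that it credited a trial whenever any attempt succeeded. The trial IDs and outcomes are made up, and each trial is assumed to have at least one attempt.

```python
from typing import Dict, List


def any_attempt_rate(trials: Dict[str, List[bool]]) -> float:
    """Hypothetical over-counting metric: a trial counts if ANY attempt succeeded."""
    return sum(any(attempts) for attempts in trials.values()) / len(trials)


def final_attempt_rate(trials: Dict[str, List[bool]]) -> float:
    """Metric named as source of truth in this log: only the FINAL attempt decides."""
    return sum(attempts[-1] for attempts in trials.values()) / len(trials)


# Made-up per-trial attempt outcomes, in chronological order.
trials = {
    "trial_01": [False, True],   # recovered on the final attempt
    "trial_02": [True, False],   # succeeded early, but the final attempt failed
    "trial_03": [True],
    "trial_04": [False, False],
}

print("any-attempt rate:  ", any_attempt_rate(trials))
print("final-attempt rate:", final_attempt_rate(trials))
```

Under this assumption the two rates diverge only on trials like `trial_02`, where an earlier attempt succeeded but the final one did not.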

### 5/7 9th narrative reversal — paper §V.F mechanism self-correction (paper v10 → v10.1)
- During the verification step of the subagent-driven Task 1 implementer, the paper v10 §V.F mechanism hypothesis was *empirically refuted*
- Hypothesis: "k=12 dip cause = docstring-less get_grasp_pose_for_mask"
- Reality (3 errors):
  - get_grasp_pose_for_mask is *not in* k=12 (it is promoted only at k=13)
  - The docstring-less skill in k=12 is select_top_grasp (q=0.900)
  - select_top_grasp is also in manual_11 (k=11) *without a dip*, so the mechanism fails the cross-check
- The §V.F mechanism paragraph was *self-corrected before publication* (paper v10.1, 11p / 446,953 bytes)
- B4 micro-test ($3, n=30) cancelled
- Artifact: observation `2026-05-07-9th-reversal-k12-mechanism-refuted.md`

### 5/8 Smoke phase 2 (3 micro-evaluations, ~$7) + paper v11
- Sanity: P21_a 6/10, manual_11 10/10 (env reproducibility OK)
- **Smoke #1 cube_stack_3 cross-task**: 0/15 baseline + 0/15 library: fail (floor). *Mechanism*: sam3 mask fragmentation (>200 entries per cube). The library does not rescue the *task*, but it improves *code efficiency* by 24-34% (5.73→4.13 blocks, 4.73→3.13 regens, 24% less wall time). The library effect therefore decomposes into two axes: *task success* and *code cost*.
- **Smoke #2 multi-session round-2**: 0 promotable new skills extractable from k=12 fail trials (modified design due to backup gap on c3v2 outputs). *Mechanism*: dense namespace → LLM imperative API-only code → no top-level function defs. This revealed a **namespace saturation regime**.
- **Smoke #3 verbose mechanism (DeepSeek + Claude n=3 each)**: results agree with the paper v10 numbers within sample variance. *Mechanism*: both backbones show *partial-grip-and-drop* failures, but Claude regenerates within 1-2 iterations to convergence, while DeepSeek cycles through similar partial solutions. The library benefit on DeepSeek (+88pp) acts as a *self-correction loop replacement*: the difference is in iteration ability, not reasoning ability.
- Artifacts: observations `2026-05-08-smoke{1,2,3}-*.md`
- Pod cycle: $0.85 GPU + $5.75 LLM = ~$6.6 (planned $13, $6 saved by Smoke #2 trial-skip + B4 cancellation)
- R2 backup: outputs 106 MiB / state 1.1 MiB / logs 940 KiB → `r2:capx/{outputs,local_state,logs}/2026-05-08-closing/`

### 5/8 Paper v10.1 → v11
- §V.G new section "Smoke evaluations: cross-task, multi-session, mechanism" (3 sub-sections)
- Limitations: 4 new bullets (vision-pipeline ceiling on cube_lifting / multi-session saturation / backup-protocol gap on C3v2 outputs)
- Abstract + Conclusion: one paragraph of Smoke summary each
- 13p / 460,482 bytes (+2 pages over v10.1 at 11p / 446,953 bytes)
- arxiv-submission tar updated (151,357 bytes), SHAs verified

### 9 narrative reversals — final summary
| # | Date | Conclusion at the time | Reason for correction |
|---|---|---|---|
| 1 | 4/17 | "All three fixes deserve the credit (Group D 98%)" | 4/24 ablation |
| 2 | 4/24 | "namespace alone gives +67pp" | 4/26 dedup v2 |
| 3 | 4/26 | "library = hygiene only" | 5/1 P21_a boost |
| 4 | 5/1 | "library +23pp ROI" | 5/2 strict critique |
| 5 | 5/2 | "production library = baseline" | 5/3 C1/C2 boost |
| 6 | 5/3 | "no-dedup +19pp, smart dedup possible" | stable through paper v8 |
| 7 | 5/4 | "Q1-Q2 solid on cube_lifting" | v2 smoke: multi-backbone Δ asymmetry |
| 8 | 5/6 | "Claude library effect = null" | metric correction → +12pp detectable |
| 9 | 5/7 | "k=12 dip = docstring-less skill" | k=11 cross-check refutation |

All 9 reversals were self-corrections made *before publication*. The *audit chain methodology* worked effectively, operating as *mechanical verification just before publication*.

### Project state: closed
- paper v11 publication-ready (13p PDF, arxiv tar synced)
- Cumulative cost ~$248 (sanity + smoke1 + smoke3 ~$7 + earlier phases ~$241)
- Remaining balance: OpenRouter ~$38 + RunPod ~$17
- Pod terminated ✅
- arXiv re-submit: done manually by the user (Task 10)
- paper v3 candidates (vision pipeline upgrade / namespace saturation refactor / multi-backbone iteration mechanism) are future work

---

## 2026-05-04 — v2 smoke triplet execution + Phase 1+2 real run + paper v9

### 5/4 v2 smoke triplet (n=140, ~$8-10)
- Sanity: P21_a 80% / manual_11 100%, verifying paper v8 reproducibility
- Dedup v3 k=10/12/13 (n=15 each): 100/100/93%, showing the algorithm is robust
- D2 closure smoke: P21_a × Claude=100% / × DeepSeek=0%, the first observation of backbone asymmetry
- 11 environment fixes identified → new memories `feedback_capx_environment_audit.md` + `feedback_capx_backup_protocol.md`
- First application of Manifest §8 backup_protocol (R2 local_state)
- Artifacts: observation 2026-05-04-v2-smoke-triplet-results.md / retro 2026-05-04-v2-smoke-synthesis.md / environment 2026-05-04-v2-smoke-triplet.md

### 5/4 Phase 1+2 real run (n=440, ~9h pod, ~$64-69)
- Pod bootstrap 17min (vs 200min for v2-smoke, a ~90% reduction from applying the 11-fix list)
- Sanity check + B3 DeepSeek diagnosis + A2 D2 closure full + A1 Dedup v3 real
- **Multi-backbone library effect (D2 closure fully closed)**:
  - gpt-4.1: Δ = +19pp (paper v8)
  - Claude Sonnet 4: Δ = +2pp (null, baseline ceiling 98%)
  - DeepSeek v3: Δ = **+88pp** (94% vs 6%, CIs fully separated), the largest single-task library benefit
- **Dedup v3 algorithmic robustness**: all k∈{10,11,12,13} reach ≥90%, and k=10/11/13 are CI-separated from C3v2 (82.9%). The non-monotone dip at k=12 is consistent with the docstring-aware mechanism.
- Monitor v6+v7: the "sleep 540 in Bash" pattern gave the first *truly active* sub-agent monitor (sustained for 4h + 3h20m)
- Artifacts: observation 2026-05-04-phase12-real-results.md / runs file 2026-05-04-phase12-real.txt

### Paper v9 update
- Abstract strengthened (new D2 closure + Dedup v3 results)
- Table 2 expanded: 9 new rows (D2 closure 4 + Dedup v3 3 + existing)
- §V.E new: Multi-backbone library effect (D2 closure)
- §V.F new: Dedup v3 algorithmic robustness
- Limitations §"Cross-backbone baseline delta": resolved → only the mechanism follow-up remains open
- Conclusion: both new findings stated
- 11 pages / 423650 bytes / arxiv tar 138672 bytes

### Cost progress
- Cumulative (through paper v8): ~$162
- v2 smoke (5/4): +$8-10
- Phase 1+2 real (5/4): +$64-69
- **Current cumulative: ~$234-241**

### Artifacts added
- `docs/superpowers/observations/2026-05-04-v2-smoke-triplet-results.md`
- `docs/superpowers/observations/2026-05-04-phase12-real-results.md`
- `docs/superpowers/retrospectives/2026-05-04-v2-smoke-synthesis.md`
- `docs/superpowers/environments/2026-05-04-v2-smoke-triplet.md`
- `scripts/experiment/runs_2026_05_04_*.txt` (3 files)
- `scripts/experiment/generate_dedup_v3_skill_set.py`
- `.capx_skills.dedup_v3_k{10,12,13}.json`
- `env_configs/cube_lifting/franka_robosuite_cube_lifting_dedup_v3_k{10,12,13}.yaml`
- 3 new memories: `feedback_capx_environment_audit.md`, `feedback_capx_backup_protocol.md`, `feedback_capx_monitor_protocol.md`

---

## 2026-05-03 — v2 paper foundation: smoke triplet proposal

### Purpose
Right after paper v8 was posted, take stock of the *remaining gaps* against the north-star questions Q1–Q4 and propose the foundation experiments for paper v2.

### Gap analysis summary
- Q1/Q2: solid for the single task cube_lifting × the single backbone gpt-4.1. Multi-backbone baseline (D2) and cross-task transfer are unverified.
- Q3 evolution: 0%, unobserved; the paper v8 limitation stands.
- Q4 algorithmic: manual_11 = top-quality 11 (94.3%), but k sensitivity (k=10/12/13) is unverified.

### 6 candidates, ranked
Tier 1 (cheap & high ROI): Dedup v3 k sensitivity / D2 closure. Tier 2: Phase 3.1 long-horizon. Tier 3: Medium-difficulty task / LIBERO diagnosis. Tier 4: Multi-session evolution.

### Proposal: smoke triplet ($11)
1. Dedup v3 k sensitivity (k=10/12/13, n=15 each): ~$2
2. D2 closure (P21_a × Claude/DeepSeek, n=10 each): ~$4
3. LIBERO floor diagnosis (5 trials with detailed logging + analysis): ~$2
4. Pod overhead: ~$1 (single A40 pod, ~3-4h)

### Key finding (planning stage)
`manual_11` is itself an instance of dedup_v3(k=11): it drops the bottom 3 by quality_score (0.722 / 0.864 / 0.878) from the 14 skills promoted in C2. The 94.3% in paper v8 is therefore a single point of dedup_v3, and the value of smoke 1 is demonstrating *robustness* (k sensitivity).
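
The "keep the top k by quality_score" selection described above can be sketched as below. This is a rough illustration under stated assumptions, not the contents of `generate_dedup_v3_skill_set.py`: the function name `select_top_k`, the skill names, and every score except the three lowest (0.722 / 0.864 / 0.878, taken from this entry) are hypothetical.

```python
from typing import Dict, List


def select_top_k(quality: Dict[str, float], k: int) -> List[str]:
    """Keep the k skills with the highest quality_score.

    Ties are broken by name so the result is deterministic
    (the tie-break rule is an assumption of this sketch).
    """
    ranked = sorted(quality, key=lambda name: (-quality[name], name))
    return ranked[:k]


# 14 promoted skills: 11 hypothetical high scorers plus the three lowest
# scores quoted in this entry (names are placeholders).
scores = {f"skill_{i:02d}": 0.90 + 0.005 * i for i in range(11)}
scores.update({"low_a": 0.722, "low_b": 0.864, "low_c": 0.878})

kept = select_top_k(scores, k=11)
dropped = sorted(set(scores) - set(kept))
print("kept:", len(kept), "dropped:", dropped)
```

With k=11 the three lowest-scoring entries fall out, which mirrors how `manual_11` relates to the 14 promoted skills; varying `k` over 10/12/13 gives the neighbouring sets that the sensitivity smoke would compare.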

### Follow-up decision tree
- All smokes pass → real Tier 1 ($65, 1 week) → fills two sections of paper v2, §VI.D + §V.E.
- Partial pass → run only the corresponding real experiment.
- All fail → Tier 4 (multi-session, $30) or shift attention to another project.

### Artifacts added
- `docs/superpowers/proposals/2026-05-03-v2-foundation-smoke-triplet.md`

### Cost status
- Cumulative ~$162 (through paper v8)
- This proposal: an additional ~$11 for smoke, then an additional $15–$65 for real runs depending on the decision

---

## 2026-05-03 — Codex P0 arXiv readiness cleanup + final package verification

### Purpose
Clean up the publication blockers and stale surfaces left after paper v8:
- Fix the arXiv author/metadata so that Claude/Anthropic is not mistaken for an author
- Regenerate paper figures and captions to reflect the final boosted verdicts (including n=70/manual_11)
- Strengthen bibliography metadata
- Insert a banner so the stale public HTML status page is not mistaken for the latest source of truth
- Verify that the arXiv tarball actually compiles as source-only

### Done
- `docs/paper/capx-paper-2026-04.tex`
  - abstract/conclusion/limitations: state that no-dedup C1/C2 beat the baseline, but that C2 cross-backbone is an endpoint replication and the baseline delta is still future work
  - Strengthened bibliographic metadata for `capx2026`, Code-as-Policies, OpenHands
  - `capx2026` arXiv ID/DOI verified against official arXiv page: arXiv:2603.22435 / doi:10.48550/arXiv.2603.22435
  - Added the Phase B/C/F/G/H' internal reports to the bibliography
  - Changed long path refs to `\path{...}` to remove the large overfull bibliography warning
- `docs/paper/figures/figures-build.py`
  - Updated Figure 3/4/6 to the final boosted numbers
  - regenerated 8 PDF figures
- `docs/paper/arxiv-submission/SUBMISSION_README.md`
  - Authors = `realkim93 (capx, independent)` only
  - arXiv metadata abstract = 1475 chars / 213 words
  - Comments = 10 pages / 8 figures / 8 references
  - tar command = `capx-paper-2026-04.tex figures/*.pdf` only
- `docs/capx-status-report.html`
  - Added a 2026-04-28 stale snapshot banner
- `docs/superpowers/retrospectives/2026-05-03-closing-retrospective.md`
  - Summarizes the 18-day research arc, six reversals, north-star Q1--Q4, next experiments, arXiv readiness
- `docs/capx-current-status.md`
  - Reflects Codex P0 cleanup completion status and compile/tar evidence

### Verification
- Local compile: TinyTeX `pdflatex` 2-pass OK
  - `docs/paper/capx-paper-2026-04.pdf`
  - log: `Output written ... (10 pages, 441158 bytes)`
  - no undefined refs/citations/fatal errors
- arXiv package:
  - `docs/paper/arxiv-submission/capx-paper-arxiv.tar.gz` = 134KB
  - tar contents: main `.tex` + 8 used figure PDFs only
  - extracted tar in `/tmp`, ran `pdflatex` 2-pass OK → 10 pages / 441158 bytes
  - main/submission tex SHA256 identical; main/submission PDF SHA256 identical
- Bibliography count: 28 `\bibitem`s
- RunPod/cost guard: `runpodctl pod list` returned `[]`; stale local background waiter for an already-terminated pod was killed; no actionable CapX/runpod processes remain.
- Incremental cost: $0 GPU / $0 LLM external spend in this Codex cleanup pass

### Remaining honest gaps
- `P21_a` was not rerun on Claude/DeepSeek, so cross-backbone baseline delta remains future work.
- Public HTML page is bannered as stale, not fully regenerated.
- arXiv submission itself remains manual by user.

---

## 2026-05-03 — manual_11 n=70 boost + paper v8 + arXiv ready

### Strict review v3 (3rd review cycle)
After the user's 5/3 request, superpowers:code-reviewer reviewed paper v6:
- **Critical**: Table 2 stale (n=15 only), author affiliation mis-attributed to Anthropic (publication blockers)
- **Important**: manual_11 mechanism not stated, Δpp consistency, group_a footnote, V4 single-point
- Recommendation: critical fixes + manual_11 n=70 boost ($8) → arXiv

### Critical fixes (paper v7, $0)
- Author: Claude → Acknowledgements (AI assistance disclosure footnote)
- Table 2: added all boost results (n=70) + Phase F controls + Phase H' rows
- manual_11 mechanism: stated the mechanism alignment across Phase 1.5/1.1/3.3
- Δpp: initial n=15 and boosted n=70 magnitudes reported together
- Table 1: group_a_repro non-reproduction footnote
- LIBERO: "suite-wide" → "two sub-suites, Spatial+Object"

### manual_11 n=20 → n=70 boost ($16)
- Pod (`nn4ld6e89nx10a`, A40 CA-MTL-1, ~$0.66)
- Boost (trial 21..70, n=50): **46/50 = 92%**
- Combined manual_11 n=70: **66/70 = 94.3%** Wilson [86.2%, 97.8%]
- vs C3v2 [72.4, 89.9]: **3.7pp CI overlap (boundary)**, short of strict separation
- However, P[X≥66 | n=70, p=0.83] ≈ **0.6%** → strong evidence that manual_11 > C3v2
- The 100% (n=20) headline from the 5/3 Phase H' was sample regression to the mean; **CI [86.2, 97.8]** is the accurate representation
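
The one-sided tail probability quoted above can be recomputed with an exact binomial sum. The sketch below is an independent check, not the script used for the paper. The helper name `binom_tail` is hypothetical, and the null rate p=0.83 is the rounded C3v2 rate from this log, so the result may differ from the quoted figure in the last digit.

```python
from math import comb


def binom_tail(k: int, n: int, p: float) -> float:
    """Exact P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# manual_11 at n=70: 66 or more successes if the true rate were 0.83
print(f"P[X>=66 | n=70, p=0.83] = {binom_tail(66, 70, 0.83):.4f}")

# Phase H' headline: 20/20 if the true rate were 0.83 (equals 0.83**20)
print(f"P[X=20  | n=20, p=0.83] = {binom_tail(20, 20, 0.83):.4f}")
```

Both values should come out well below 5%, which is the sense in which the log calls the evidence strong even though the Wilson intervals overlap.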

### Paper v8 (10p, ~424KB)
- abstract: "100% at n=20" → "94.3% at n=70 [86.2, 97.8], CI partially overlap, P=0.6%"
- §VI.C: same correction + manual_11 mechanism added
- Table 2: manual_11 n=70 = 94.3%
- Conclusion: "100% at n=20" → "94.3% at n=70 (one-sided P≈0.6%)"

### arXiv submission tar updated
- `docs/paper/arxiv-submission/capx-paper-arxiv.tar.gz` (140KB): paper v8 + figures + README
- Ready to upload (manually, by the user)

### Cost
- Phase H' boost: pod $0.66 + LLM ~$15 = ~$16
- Cumulative strict cycle: ~$116
- **Total: ~$162**

### Honest scope of the V4 claim
- Evidence for "smart dedup possible": manual_11 94.3% vs C3v2 82.9% at n=70, one-sided P≈0.6%
- Strict CI separation not reached (3.7pp overlap); the paper discloses this accurately
- Partial 6th reversal: the narrative weakens as manual_11 drops from 100% to 92%, but the V4 big picture holds (smart dedup possible, dedup algo flaw confirmed)

### Details
- review v3 results + critical fixes: see the upper part of this entry
- arXiv submission tar: `docs/paper/arxiv-submission/`

---

## 2026-05-03 — Phase H': multi-backbone + manual_11 + LIBERO Object — **smart-dedup existence proof**

### User's 5/3 decision + Phase H' design
The user's "be strict again" request plus the intuition "isn't a smart dedup search possible?" → Phase H proposal → strict review (insufficient power at n=15, flawed motivation, 6th reversal risk) → reduced to Phase H' (cheap, focused).

4 conditions, A40 pod CA-MTL-1, $0.61 + LLM ~$15.

### Results
| Cond | n | Succ | Rate | Wilson CI |
|---|---|---|---|---|
| c2_claude | 20 | 19 | **95.0%** | [76.4, 99.1] |
| c2_deepseek | 20 | 20 | **100.0%** | [83.9, 100] |
| **manual_11** | 20 | **20** | **100.0%** | [83.9, 100] |
| libero_object_0_priv | 10 | 0 | **0.0%** | [0, 27.8] |
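
The Wilson CI column can be reproduced with the standard Wilson score interval. The sketch below assumes z=1.96 and no continuity correction; the log does not state which variant was used, although these settings match the table to the reported precision. The function name `wilson_ci` is hypothetical.

```python
from math import sqrt
from typing import Tuple


def wilson_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


for label, s, n in [("c2_claude", 19, 20), ("manual_11", 20, 20),
                    ("libero_object_0_priv", 0, 10)]:
    lo, hi = wilson_ci(s, n)
    print(f"{label}: {s}/{n} -> [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Unlike the plain normal approximation, the Wilson interval stays informative at 0% and 100%, which matters here because two of the four conditions sit exactly on a boundary.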

### Key finding: manual_11 = 100%, **user intuition verified**
- C3v2 (the 11 from Dedup v2) = 82.9% vs **manual_11 (a different 11) = 100%**
- Same size, with *6 of 11 selected differently*, yet a +17pp gap.
- P(20/20 if true rate 0.83) = 2.4%: statistically, this strongly suggests manual_11 > C3v2
- → **The -14pp dedup cost is an ALGORITHM flaw, NOT a size limit**
- → Smart dedup search is a valid direction (the user's 5/3 intuition was correct)

### Multi-backbone perf — C2 robust
- gpt-4.1 (n=70): 97% / Claude (n=20): 95% / DeepSeek (n=20): 100%
- → The +19pp ROI of C2 is not gpt-4.1 specific

### LIBERO Object 0 priv = 0/10 floor
- Spatial 0 priv 10% → Object 0 priv 0%
- → LIBERO suite-wide incompatibility (beyond perception, also control API/task semantics)
- → Confirmed: D full will not proceed

### Paper v6 (10p, 411KB)
- abstract / §V.B / §VI.C / conclusion all corrected
- "Smart dedup possible" stated + manual_11 diagnostic integrated
- Multi-backbone perf added (previously only Phase 3.3 doc was covered)
- LIBERO suite-wide floor stated
- arXiv submission tar.gz v2 generated (139KB)

### Process incident
- First pod (l0yvlr0zsv6bm3): the PUBLIC_KEY env was missing, so sshd did not start → terminated immediately, only ~$0.10 leaked
- The GPU monitor detected the escalation correctly → controller intervention → sub-agent recovery
- → **Demonstrates once more the value of the GPU monitor + cost-leak guard rules**

### Cost
- Phase H': $16
- Cumulative strict cycle: ~$100
- Total: **~$146**

### Details
- `docs/superpowers/observations/2026-05-03-phase-Hprime-multi-backbone-and-manual-11.md`

---

## 2026-05-03 — Phase G: C1/C2 boost + LIBERO privileged smoke — **narrative settles**

### B (C1/C2 boost): the 96% from 5/1 is **real**
- C1 n=70 = **95.7%** [88.1, 98.5]
- C2 n=70 = **97.1%** [90.2, 99.2]
- P21_a upper bound 87.2% < C1/C2 lower bounds 88.1/90.2% → **CIs separated** (+18-19pp)
- The 5/2 "lucky sample" hypothesis is partially rejected

### Production library = baseline, unchanged
- C3v2 n=70 = 82.9% [72.4, 89.9]
- vs P21_a 78%: 14pp CI overlap → null (the 5/2 verdict stands)

### Dedup cost is statistically solid
- C2 (no dedup) 97.1% [90.2, 99.2] vs C3v2 (Dedup v2) 82.9% [72.4, 89.9]
- CIs **just barely separated** (0.3pp gap) → **the -14pp dedup cost is real**
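
The "separated" versus "overlap" verdicts above come down to comparing interval endpoints. A minimal sketch, using the intervals quoted in this entry and in the Phase F table; the helper name `ci_gap` and its sign convention are assumptions of this illustration.

```python
from typing import Tuple

Interval = Tuple[float, float]


def ci_gap(a: Interval, b: Interval) -> float:
    """Signed distance between two intervals, in the intervals' own units.

    Positive: the intervals are separated by that much.
    Negative: they overlap by that much.
    """
    return max(a[0], b[0]) - min(a[1], b[1])


c2 = (90.2, 99.2)      # C2, n=70
c3v2 = (72.4, 89.9)    # C3v2, n=70
p21_a = (64.8, 87.2)   # P21_a, n=50

print(f"C2 vs C3v2:    {ci_gap(c2, c3v2):+.1f} pp")
print(f"P21_a vs C3v2: {ci_gap(p21_a, c3v2):+.1f} pp")
```

The first comparison yields a small positive gap (the 0.3pp separation), and the second a negative value of about 14.8pp, which the log reports as a 14pp overlap. Non-overlap of two 95% intervals is a conservative criterion, so a borderline separation like the first one is best read together with the direct tail probabilities reported elsewhere in this log.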

### D smoke (LIBERO Spatial 0 + privileged): 1/10 = 10% floor
- Even the privileged API (bypassing perception) hits the floor → LIBERO incompatibility is not only a perception problem
- **D full LIBERO Spatial 0 will not proceed**. Other environment or task candidates:
  - LIBERO Object 0 + privileged smoke ($3)
  - LIBERO Goal 1 + privileged smoke ($3)
  - Robosuite NutAssembly + privileged ($3-5)

### Pod history
- 3 pods (2 broken-CUDA on EU-SE-1, 1 working on CA-MTL-1). Found a case where host-side CUDA was broken for the same GPU UUID; resolved by switching DC.

### Paper v5 → refined narrative
- **Production-relevant verdict**: "ship C2 setup (namespace+gates, NO dedup), +19pp ROI"
- "Dedup in any form costs -14pp": statistically separated
- The cause of the LIBERO incompatibility lies beyond perception (control API / prompt / task semantics)
- All 4 condition arms at n=70 (only C0/C3 remain at the prior n=15)

### Layer 2 CEO decision (final)
- ✅ Ship namespace + gates + 14-16 skills (C2)
- ❌ DO NOT ship dedup (any form); flag `enable_dedup: false` by default
- ❓ Another environment needs to be found (Q2 transfer dimension)

### Cost
- Phase G: $16 (pod $0.86 + LLM ~$15)
- Cumulative (4/30~5/3 strict cycle): $84.5
- Total cumulative: ~$130

### Details
- `docs/superpowers/observations/2026-05-03-phase-G-c1c2-boost-libero-priv-smoke.md`

---

## 2026-05-02 — Phase F: strict confound closure — **narrative reversal AGAIN**

**Goal**: After the user's "think it through strictly again" request, directly measure the confounds in the "+23pp ROI" claim of the 5/1 paper v3.

### Setup
- 4 conditions: empty_ns (n=30, 16 stub bodies=pass), group_a_repro (n=30, default prompt + no img_diff), p21_a +n=20 (50 total), c3v2 +n=20 (70 total).
- Pod (sxsno9ievt0t8r, A40, $0.44/hr), 2 hours, $0.88. LLM ~$28.

### Results
| Cond | n | Succ | Rate | Wilson 95% CI |
|---|---|---|---|---|
| empty_ns | 30 | 13 | **43.3%** | [27.4, 60.8] |
| group_a_repro | 30 | 19 | **63.3%** | [45.5, 78.1] |
| **P21_a (boosted)** | 50 | 39 | **78.0%** | [64.8, 87.2] |
| **C3v2 (boosted)** | 70 | 58 | **82.9%** | [72.4, 89.9] |

### Verdicts
- **V1 (typed-scaffolding hypothesis rejected)**: empty_ns 43% vs C3v2 83% = +40pp content effect. Skill bodies are decisive.
- **V2 (production library = baseline)**: P21_a 78% vs C3v2 83% = +5pp. **CI 14pp overlap**. Statistically *non-separable*.
- **V3 (Group A 98% not reproducible)**: group_a_repro 63%, a 35pp gap from the historical Group A 98%. Lucky run + drift.
- **V4 (setup confound is real but small)**: group_a_repro 63% vs P21_a 78% = a -15pp difference. The yaml settings alone account for +15pp.

### Paper v4 narrative
The 5/1 paper v3 ("library +23pp ROI") is **reversed**:
- Library content effect ✅ real (+40pp vs empty_ns)
- Production library (C3v2) ROI ❌ statistically null (CI overlap with baseline)
- Namespace seeding +63pp dominant ✅ holds (C0 33% → C1 96%)
- Whether the 96% of C1/C2 is real or lucky is unverified (no n=70 boost run); added as a limitation

### Layer 2 CEO decision (corrected again)
- 5/1: "ship namespace+gates+skills, no dedup → +23pp ROI"
- **After the 5/2 strict pass**: "ship the namespace seeding fix; the net contribution of the production library is equivalent to baseline within the measurement bound".

### Cost
- Phase F: $29
- Cumulative (through 5/2): ~$110+ (4 review-driven cycles).

### Value of the review-driven process demonstrated (once more)
- review v2 (4/30) → Phase B+C (5/1) → first narrative reversal
- "think it through strictly again" (5/2) → Phase F → reversal of the reversal
- **The paper gets more honest with each cycle**. Had we stopped at 5/1, we would have published the wrong +23pp claim.

### Details
- `docs/superpowers/observations/2026-05-02-phase-F-strict-confound-closure.md`

---

## 2026-05-01 — Phase B+C: baseline boost + multi-backbone — **paper narrative reversal**

**Goal**: Close review v2 §A2 (P21_a underpowered) + B1/F1 (single-backbone).

### Setup
- New Pod (`bnlaudo9ol8qpm`, A40 secure, $0.44/hr).
- Code patches: `launch.py --resume-idx N` flag (boost mode), `run_serial_robust.sh` v3 (5th column MODEL, 6th column RESUME_IDX), `_extract_code` defensive guard for None content.
- 5 conditions: P21_a boost (12-30, n=20 new), C1 boost (20-50, n=31 new), C2 boost (16-50, n=35 new), Claude doc_neutral_v1 (n=15), DeepSeek doc_neutral_v1 (n=15).

### Phase B results: **narrative reversal**

| Cond | N | Succ | Rate | Wilson 95% CI |
|---|---|---|---|---|
| **P21_a (no skills)** | **30** | 22 | **73.3%** | **[55.6%, 85.8%]** |
| **C1 (namespace ON)** | **50** | 48 | **96.0%** | **[86.5%, 98.9%]** |
| **C2 (+ gates)** | **50** | 48 | **96.0%** | **[86.5%, 98.9%]** |
| C3v2 (full, n=50 prior) | 50 | 40 | 80.0% | [67.0%, 88.8%] |

**P21_a CI upper bound 85.8% < C1/C2 CI lower bound 86.5%: the CIs are separated**.
- The earlier paper claim "the library cannot beat baseline (90% vs 80%, n=10 vs 50)" is *reversed*.
- **New conclusion**: the library WITH namespace seeding (no dedup) has a **+23pp positive ROI** over baseline.
- C3v2 (dedup v2) 80% < C2 96% = **dedup cost -16pp**. Production recommendation: **ship with dedup off**.
- The "P21_a underpowered" concern from review v2 §A2 was *exactly right*. The 9/10=90% at n=10 was lucky sampling.

### Phase C results: multi-backbone universality

| Backbone | Doc Share |
|---|---|
| gpt-4.1 | 100% (42:0) |
| Claude Sonnet 4 | 100% (49:0) |
| DeepSeek v3 | 94.8% (109:6) |

All three backbones are ≥ 90% → **docstring causality = LLM-general mechanism**, not gpt-4.1 specific.
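
The Doc Share column is the fraction of variant calls that went to the documented skill. A small sketch of that arithmetic using the call counts from the table; the function name `doc_share` is hypothetical, and the 90% threshold is the one used in the sentence above.

```python
from typing import Dict, Tuple


def doc_share(doc_calls: int, undoc_calls: int) -> float:
    """Fraction of variant calls that went to the documented skill."""
    total = doc_calls + undoc_calls
    if total == 0:
        raise ValueError("no variant calls recorded")
    return doc_calls / total


# (documented calls, undocumented calls) per backbone, from the table above.
calls: Dict[str, Tuple[int, int]] = {
    "gpt-4.1": (42, 0),
    "Claude Sonnet 4": (49, 0),
    "DeepSeek v3": (109, 6),
}

for backbone, (doc, undoc) in calls.items():
    share = doc_share(doc, undoc)
    print(f"{backbone}: {100 * share:.1f}% (>= 90%: {share >= 0.90})")
```

The explicit zero-total guard matters because some trials call neither variant, as noted in the 2026-04-29 entry.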

### Paper v3 integration
- Abstract: "ROI unmeasurable" → "library beats baseline by +23pp; dedup costs -16pp"
- §VI.C ROI: reflects the corrections above + production recommendation (ship namespace+gates, NOT dedup)
- §V.B Phase 3.3 doc: multi-backbone paragraph added + the single-backbone disclaimer in Limitations closed
- Conclusion: "two-part" → "four-part" adversarial probe + "+23pp ROI / -16pp dedup cost" reflected

### Cost
- Pod: $0.91 (~2 hours 4 minutes).
- LLM: ~$30 (boost trials + multi-backbone).
- **Total: ~$31** (matches the plan estimate of $32).
- Remaining RunPod balance ~$13.

### Details
- review v2: `docs/superpowers/reviews/2026-04-30-research-methodology-review-v2.md`
- plan v2: `docs/superpowers/plans/2026-04-30-confound-closure-experiments.md`
- observation B: `docs/superpowers/observations/2026-05-01-phase-B-baseline-boost.md`
- observation C: `docs/superpowers/observations/2026-05-01-phase-C-multi-backbone-doc.md`

---

## 2026-04-30 — Phase A: wording fix + neutral-name doc replication (A1 closure)

**Goal**: Close A1 (name-string confound) from review v2 and bring paper v2 to honest scope; this covers all of Phase A.

### Wording fix ($0)
- paper.tex: 8 fixes across A1/A2/B1/B2/F1/H3/D1. Added an "in our pipeline" qualifier to the abstract/conclusion. Weakened the §VI.C ROI conclusion to a *non-finding*. Added the retroactive dead-skill explanation to §V.B Phase 3.3 doc (H3).
- narrative-v1.md: footnote on the stale +20pp table.
- I6 first-shot SR (4/17 review's open issue): post-hoc analysis reporting the first-shot SR of every condition. Interesting: regen_gap is 0% everywhere (if the LLM succeeds on the first attempt, the trial is a full success).
- H2: documented the robust runner v2 spec (`docs/superpowers/specs/2026-04-30-robust-runner-v2-spec.md`).

### Neutral-name doc replication ($8.5)
- New RunPod (n1n1zq2chgs9zq, A40, $0.44/hr).
- Autonomous sub-agent (pod creation → setup → experiment → wrap-up). 1h 5min, $0.46.
- 2 conditions × n=15, counter-balanced:
  - **v1**: `lift_object_a` (doc) vs `lift_object_b` (no doc) → **42:0** doc 100%
  - **v2**: `lift_object_a` (no doc) vs `lift_object_b` (doc) → **47:0** doc 100%
  - **merged 89:0** = 100% pure docstring effect.
- Letter neutrality: a=42 / b=47 calls (47:53). No letter bias.

### Paper v2 PDF
- Added a neutral-name replication paragraph to the Phase 3.3 doc subsection; upgraded "consistent with causal" → **"causally responsible"**.
- Conclusion: "two-part" → **"three-part"** adversarial probe.
- Limitations: removed the confound disclaimer from the doc subsection (already closed); stated n=30 pooled 89:0, alpha < 10⁻³.
- 9 pages, 413KB. commit + push.

### A1 closure verdict
**Fully closed**. The 4/29 20:0 result = a pure docstring effect. The suspicion of `_undocumented` self-labeling is removed. The causal mechanism of the Phase 1.5 → 3.3 chain is confirmed.

### Cost
- Phase A total: $0 (wording fix) + $8.5 (experiment) = **$8.5** (matches the plan estimate).
- Remaining balance: RunPod ~$14, OpenRouter not checked.

### Details
- review v2: `docs/superpowers/reviews/2026-04-30-research-methodology-review-v2.md`
- plan v2: `docs/superpowers/plans/2026-04-30-confound-closure-experiments.md`
- observation: `docs/superpowers/observations/2026-04-30-phase-3-3-doc-neutral.md`

---

## 2026-04-30 — Research methodology review v2 + experiment plan v2

**Goal**: Right after the strong Phase 3.3 doc finding (20:0, causal), *soberly* check whether the paper is within arXiv honest scope and which additional experiments are needed.

### Outputs
- `docs/superpowers/reviews/2026-04-30-research-methodology-review-v2.md`: a second review at the same layer as the 4/17 review. 8 categories (A~H), each with 1–4 issues.
  - **Critical-leaning Important (1)**: A1, the name-string confound of doc on/off (self-labeling by the word `_undocumented` itself).
  - **Important (5)**: A2 P21_a sample power, B1 multi-backbone unverified, B2 circularity of the ceiling conclusion, B4 missing step in the phase 1.5→3.3 chain, C2 narrative-v1 stale +20pp, F1 external validity overstated.
  - **Minor (4)**: D1/D2 statistical units and tests to be stated, G1 status doc stale date, H1/H2 sub-agent reproducibility.
  - **Underclaim 1**: H3, the paper does not state that the Phase 3.3 doc result also retroactively explains the mechanism of the 2 dead skills ($0 strengthening).
- `docs/superpowers/plans/2026-04-30-confound-closure-experiments.md`: design + cost + dependency for 7 experiments (wording fix + 6 experiments). Phase A ($8.5) → B (+$26) → C (+$5) → D (+$32) → E (+$1.7) phased budget.

### Status of the 15 issues from 4/17
- 2 of 3 Critical closed (seed, config), 1 acknowledged (train/test contamination; cannot proceed further because all held-out targets are at floor).
- Of 7 Important: ~3 closed, ~3 open: I6 first-shot SR, I8 Group C raw, I10 model snapshot drift.

### Verdict
- arXiv honest scope: nearly OK. Publishable as-is with **3 wording fixes + 1 strengthening** (the wording-fix items of Phase A).
- Most cost-effective follow-up: doc neutral-name replication (~$8.5, closes the A1 confound).
- **Phase A (wording fix + neutral-name) is a prerequisite** before running the next experiments.

### Cumulative cost
- Pod: $0 (review work, no GPU use).
- LLM: $0 (the review was handled by Claude within this session).

### Details
- review v2: `docs/superpowers/reviews/2026-04-30-research-methodology-review-v2.md`
- experiment plan: `docs/superpowers/plans/2026-04-30-confound-closure-experiments.md`

---

## 2026-04-29 — Phase 3.3 docstring on/off (causality)

**Goal**: Test whether the 8× docstring call-rate correlation from Phase 1.5 is causal.

### Setup
- New RunPod (5jtpgyf9d0xtid, A40, $0.44/hr).
- Autonomous sub-agent (pod creation → setup → experiment → R2 backup → terminate). 22 min of uptime, $0.16.
- Setup got faster (X11 libs pre-installed + HF token pre-injected + SAM3 pre-downloaded).

### Results (n=15)
- **15/15 success (100%)**.
- Documented variant calls: **20** (13/15 trials).
- Undocumented variant calls: **0** (0/15 trials).
- **Doc share of calls: 100%**, Wilson 95% CI approx. [83%, 100%].
- 2/15 trials used only the `execute_grasp` path (neither variant was called, so docstrings played no role).

### Interpretation
**The 8× correlation from Phase 1.5 = causal**. In a forced choice between two functionally identical skills, the one with a docstring was chosen 100% of the time. **n=15 is small, but the effect size (20:0) is absolute, so it suffices**.

### Phase 3.3 decoy + doc combined mechanism
- Decoy (4/28): a docstring + plausible name alone does not get a task-irrelevant skill called (6.7%) → **a docstring alone does not create task fit**.
- Doc on/off (4/29): between two task-relevant skills, the documented one is chosen 100% of the time → **when task fit exists, the docstring is the decisive tiebreaker**.
- This justifies the `(has_doc, sr, name)` survivor rule of Dedup v2; the −27pp of dedup v1 is confirmed by this experiment.
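
A survivor rule keyed on `(has_doc, sr, name)` can be sketched as a lexicographic comparison. This is an illustration only, not the body of `dedupe_promoted`: the `Skill` dataclass, the function name `pick_survivor`, the example skills, and the direction of the `sr` and `name` tie-breaks (higher success rate first, then alphabetical) are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Skill:
    name: str
    has_doc: bool   # whether the skill carries a docstring
    sr: float       # observed success rate


def pick_survivor(duplicates: List[Skill]) -> Skill:
    """Pick one survivor from a group of duplicate skills.

    Priority (assumed): documented first, then higher success rate,
    then alphabetical name as a deterministic final tie-break.
    """
    return min(duplicates, key=lambda s: (not s.has_doc, -s.sr, s.name))


# Hypothetical duplicate group: the documented skill wins even though
# an undocumented sibling has a higher success rate.
group = [
    Skill("grasp_v2", has_doc=False, sr=0.95),
    Skill("grasp_v1", has_doc=True, sr=0.90),
    Skill("grasp_v0", has_doc=True, sr=0.80),
]
print(pick_survivor(group).name)
```

Putting `has_doc` ahead of `sr` encodes the finding above: when two skills fit the task equally, the LLM calls the documented one, so an undocumented survivor risks going unused regardless of its success rate.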

### Cost
- Pod $0.16 + LLM ~$4 = **~$4.16** total.
- Remaining RunPod balance ~$15.

### Details
- `docs/superpowers/observations/2026-04-29-phase-3-3-docstring-causality.md`

---

## 2026-04-28 — Phase 3.3 adversarial decoy + LIBERO simpler smoke

**Goal**: Run pending candidates 1 and 2. (1) Phase 3.3: decoy injection to test name/docstring bias directly. (2) Separate whether the LIBERO Spatial 0 floor comes from task difficulty or from environment incompatibility.

### Setup
- New RunPod (mvpa1ygqaarl4v, A40 secure CA, $0.44/hr).
- The first runs were all garbage (missing HF auth for the gated SAM3 repo + missing X11 dependencies for GraspNet pyrender).
- Fix: inject HF_TOKEN + `apt-get install libxrender1 libx11-6 libxext6 ...`.
- The rerun completed 25 trials in 21 minutes.

### Phase 3.3 results
- Injected C3v2 + 3 decoys (`optimize_grasp_with_priors`, `verify_workspace_safety`, `compute_optimal_lift_height`).
- 14/15 success (= C3v2 baseline). **Decoy invocation rate 6.7% (1/15)**.
- Only `verify_workspace_safety` was called, 8 times within a single trial: a mild bias toward the verb-style "verify_*" pattern.
- The other 2 decoys were called 0 times → **a plausible name alone rarely fools the LLM**.
- Details: `docs/superpowers/observations/2026-04-28-phase-3-3-adversarial-decoy.md`.

### LIBERO simpler results
- libero_spatial_2 (single bowl, no spatial reasoning), libero_object_0 (a different suite): **both 0/5, reward 0**.
- Hypothesis strengthened: the LIBERO floor is not about task difficulty but about environment + perception pipeline incompatibility.
- Details: `docs/superpowers/observations/2026-04-28-libero-simpler-smoke.md`.

### Cost + wrap-up
- Cumulative pod time ~3.5h (mostly setup + first failure + rerun), ~$1.5 pod.
- LLM ~$5.
- Two wrap-up attempts (the first was triggered by ALL DONE on the garbage run → stopped). Finished with a manual rsync + pod delete.
- R2 backup `r2:capx/outputs/api_injection/{phase33,libero_simpler}/`.

### Production / memory changes
- `scripts/experiment/run_serial_robust.sh` now supports a 4th column (VENV) → .venv / .venv-libero can be mixed within the same runs.txt.
- Added `.capx_skills.phase33_decoy.json` (14 promoted = C3v2 11 + 3 decoy).
- `FrankaControlApiReducedAutoSkills_PHASE33_DECOY` factory.

---

## 2026-04-28 — HTML status redeploy + Paper PDF compile

**Goal**: Proceed with pending candidates: add Phase 2.5 + paper sections to the HTML status page, and build the paper PDF locally.

### HTML status update
- Hero meta `2026-04-26 update` → `2026-04-28 update`. Added "2.4e n=50 + 2.5 LIBERO done" and "arXiv-level paper draft" to the badges.
- TL;DR one-liner: "+20pp recovery" → "+5–10pp marginal", with both held-out floors stated.
- Added two sections to the TOC: `phase2-5` (Phase 2.4e + 2.5 LIBERO) and `paper` (arXiv draft).
- Phase 2.4e n=50 table + Phase 2.5 LIBERO smoke table + runner bash bug explanation + cost (~$43).
- Paper section: tex location, 8 figures, 8 external refs, honest limitations, compile command.
- Deployed to Lightsail `/var/www/capx.ryeol.kim/`. Confirmed HTTP 200.

### Paper PDF build
- `brew install --cask basictex` failed (needs sudo). Alternative → `TinyTeX` (yihui.org/tinytex, no sudo).
- TinyTeX installed at `~/Library/TinyTeX/bin/universal-darwin/`.
- Additional packages via `tlmgr install`: `ieeetran`, `algorithms`, `algorithmicx`, `caption`, `subfig`, `pgf`, `pgfplots`, `cite`, `courier`, `relsize`, `ifoddpage`, `fp`.
- `pdflatex capx-paper-2026-04.tex` ×3 → **8 pages, 386 KB PDF** (`docs/paper/capx-paper-2026-04.pdf`).
- Remaining warnings: caption "Unknown document class" (harmless, IEEEtran not recognized), hyperref unicode token (harmless).

### Cost
- Pod cost: $0 (all work was local).
- LLM: proportional to the work done (~$2 estimate).

---

## 2026-04-27 — LIBERO held-out (Phase 2.5) + Paper draft

**Goal**: Verify cube_lifting → LIBERO transfer + write an arXiv-level paper draft.

**Local prep ($0)**:
- LIBERO base API compatibility check: confirmed that `FrankaLiberoApiReduced` exposes all REQUIRED_BASE_METHODS of cube_lifting.
- Wrote a new wrapper, `capx/integrations/franka/libero_reduced_auto_skills.py` (mirrors the cube_lifting pattern).
- Factory registration: `FrankaLiberoApiReducedAutoSkills`, `FrankaLiberoApiReducedAutoSkills_C3V2`.
- Derived 3 LIBERO YAMLs (smoke_a, heldout_a, helded_out_c3v2; typo kept as-is).
- Pre-created `~/.libero/config.yaml` (avoids the interactive prompt).

**Pod experiment (f3m8isk0mck194, A6000 $0.49/hr)**:
- Environment setup (repeated) + server launched in a tmux session (avoids the earlier ssh-disown bug).
- Upgraded einops 0.4.1 → 0.8.2 (for compatibility with Molmo's `from einops import einsum`).
- Run started 2026-04-27T02:12 KST → auto wrap-up terminated the pod automatically at 02:23 KST (11 min).

**Results (LIBERO)**:
- **smoke_a (no-skills): 5/5 reward 0.0, taskcompleted_0**. avg blocks 1.8, avg regens 0.8.
- **heldout_a, heldout_c3v2: not run**. A runner bash bug (under `set -e`, `grep -c "taskcompleted_1"` exits 1 when nothing matches) ended the run after the first condition.
- LIBERO Spatial Task 0 ("pick up the black bowl between the plate and the ramekin") is at **floor**, like cube_stack. The LLM tries only 1-3 blocks and then abandons (same as the cube_stack C3 pattern).
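
The runner bug above arises because `grep -c` prints `0` but exits with status 1 when no line matches, and `set -e` treats that as a failure. One way to sidestep the issue is to do the tally in a language where zero matches is an ordinary result. The sketch below is an illustration, not the fix applied to the runner; the function name `count_completed` and the example trial names are hypothetical, while the `taskcompleted_0` / `taskcompleted_1` markers come from this entry.

```python
from typing import Iterable


def count_completed(names: Iterable[str], marker: str = "taskcompleted_1") -> int:
    """Count entries containing the success marker.

    Zero matches is a normal result here and simply returns 0, so an
    all-fail condition cannot abort the surrounding run.
    """
    return sum(1 for name in names if marker in name)


# Hypothetical trial names for an all-fail condition like smoke_a.
smoke_a = [f"trial_{i:02d}_reward_0.0_taskcompleted_0" for i in range(1, 6)]
print("smoke_a successes:", count_completed(smoke_a))

# A mixed condition for comparison (made-up names).
mixed = ["trial_01_taskcompleted_1", "trial_02_taskcompleted_0",
         "trial_03_taskcompleted_1"]
print("mixed successes:", count_completed(mixed))
```

In the shell itself, the usual equivalent is to neutralize the exit status of that one command (for example by appending `|| true`) so that `set -e` still guards the rest of the script.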

**Paper (subagents 디스패치)**:
- Paper writer agent 완료: `docs/paper/capx-paper-2026-04.tex` (447 → 451+ lines, IEEEtran 2-col, 8 external refs after internal-report cleanup)
- Figure generator agent 완료: 8 PDFs in `docs/paper/figures/` (skill-usage-matrix, unpromoted-pool, ablation-bar, c3v2-n50-wilson, retry-abstraction-downgrade, 3fix-narrative-rewrite, cubestack-failure-modes, skill-library-architecture) + `figures-build.py`
- Paper update agent: §VII.G LIBERO subsection 채움 + abstract / limitations / conclusion 에 floor finding 통합.

**Decisions**:
- **Skill library transfer from cube_lifting → LIBERO** is unmeasured (n=5 baseline floor + runner bug). No conclusive answer yet. Both held-out targets (cube_stack + LIBERO Spatial 0) are at the floor.
- Next candidates: fix the runner bug, plus an easier LIBERO task or a simple cube_lifting variant.

**Artifacts added**:
- `capx/integrations/franka/libero_reduced_auto_skills.py`
- `capx/integrations/__init__.py` (LIBERO factory added)
- `env_configs/libero/franka_libero_spatial_0_{smoke_a,heldout_a,helded_out_c3v2}.yaml`
- `~/.libero/config.yaml` (pod local)
- `docs/paper/capx-paper-2026-04.tex`, `docs/paper/README.md`, `docs/paper/figures/*.pdf`
- `outputs/api_injection/libero/openrouter_openai_gpt-4.1/smoke_a/` (5 trials local)
- `/tmp/capx_wrap_up_libero.sh`

**Backup**: R2 `r2:capx/outputs/api_injection/libero/` (1.3 MiB), `r2:capx/paper/2026-04-26/` (194 KiB), `r2:capx/logs/2026-04-27-libero/`

**Pod state after**: 0 pods (auto-terminated). spend/hr=$0. balance ~$16.5.

**Cost actual**: Pod $0.09 + LLM ~$1.4 = **~$1.5**. Cumulative (Phase 2.4 + 2.4c + 2.4 E + 2.5) **~$48**.

---

## 2026-04-26 — Production cleanup (A/B/C/D/E)

**Goal**: Apply the Phase 2.4c finding to production code, consolidate the narrative, and check variance.

**Local ($0)**:
- **A**: Dedup v2 in production. `capx/skills/library.py::dedupe_promoted` winner = `(has_docstring, sr, name)`. Unit tests 7/7 pass (2 new). `.capx_skills.json` regenerated (1 swap).
- **C**: Consolidated Phase 1+2 narrative. `docs/superpowers/retrospectives/2026-04-26-capx-narrative-v1.md` (paper-shape draft).
- **B**: Status HTML updated. Added a Phase 2 section and redeployed capx.ryeol.kim (HTTP 200 confirmed).
- **D**: Cube_stack pipeline debugging. Causes of the p22_a 0% = deprecated scipy `from_dcm` (LLM bug) + SAM3 422 + weak green-cube segmentation. The environment itself is fine. Only C3v2 succeeded (unique 2/12). Observation: `2026-04-26-phase-2-4d-cubestack-debug.md`.

**Pod ($14)**:
- **E**: C3v2 n=50 cube_lifting (variance check). Pod `44nqp93za43byv` (A6000 $0.49/hr).
- Started 11:56 KST → auto wrap-up (`/tmp/capx_wrap_up_e.sh`) → pod terminated 13:17 KST. **1h21m, unattended.**

**Results (E)**:
- C3v2 n=50: **40/50 = 80% (95% Wilson CI [67%, 89%])**
- The 93% at n=15 was a **lucky outlier**; the true mean is ~80%
- Difference vs. C3 73% (n=15): +7pp (large CI overlap)
- Avg regens 0.13 → 2.21, blocks 0.87 → 3.21; n=50 shows the more typical pattern
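
The Wilson interval quoted above can be reproduced with a few lines of standard-library Python. This is a minimal sketch of the textbook Wilson score interval with z = 1.96; the function name is hypothetical and the log does not say which tool produced the original figure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    lo, hi = wilson_interval(40, 50)
    print(f"{lo:.0%} to {hi:.0%}")  # prints: 67% to 89%
```

The same function applies to the n=15 conditions, where the intervals are much wider, which is the reason the n=15 results are treated with caution in the decisions below.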

**Decisions (narrative corrected once more)**:
- "Dedup v2 +20pp" → **"Dedup v2 +5–10pp, marginal"**.
- C1/C2 100% (n=15) may also be lucky; the true mean is estimated at ~85–95%.
- **Only the namespace seeding +67pp is robust** (33→100, n=15 on both sides). The remaining fixes are at noise level.
- Skill library effect on cube_lifting = **roughly tied with the no-skills baseline** (C3v2 80% ≈ P21_a 90%).

**Artifacts added**:
- `capx/skills/library.py` (Dedup v2)
- `tests/test_skill_library.py` (+2 tests)
- `.capx_skills.json` (regenerated, 1 swap)
- `.capx_skills.json.pre_dedup_v2_20260426` (old backup)
- `env_configs/cube_lifting/franka_robosuite_cube_lifting_ablation_c3v2_n50.yaml`
- `docs/superpowers/retrospectives/2026-04-26-capx-narrative-v1.md`
- `docs/superpowers/observations/2026-04-26-phase-2-4d-cubestack-debug.md`
- `docs/superpowers/observations/2026-04-26-phase-2-4e-c3v2-n50.md`
- `docs/capx-status-report.html` (Phase 2 section added, redeployed)
- `outputs/api_injection/ablation/openrouter_openai_gpt-4.1/c3v2_n50/` (50 trials local)
- `/tmp/capx_wrap_up_e.sh`

**Backup**: R2 `r2:capx/outputs/api_injection/ablation/openrouter_openai_gpt-4.1/c3v2_n50/` (96 MiB / 977 obj). `r2:capx/logs/2026-04-26-c3v2-n50/`.

**Cost actual**: Pod ~$0.74 + LLM ~$13.5 = **~$14**. Cumulative ~$46.

---

## 2026-04-25 ~ 26 — Phase 2.4c (Dedup v2 + cube_stack held-out)

**Goal**: Test Phase 2.4 hypothesis H1a (the dedup algorithm keeps the wrong survivor) and test generalization to cube_stack.

**Local prep ($0)**:
- Dedup v2 implementation: `scripts/experiment/generate_c3v2_skill_set.py`. Survivor rule = (has_docstring, success_rate, name). Result: a single swap, `get_grasp_pose_for_mask` (no doc) → `plan_and_select_grasp` (doc=Y).
- Group D retrospective amendment: `docs/superpowers/retrospectives/2026-04-24-group-d-rewrite-amendment.md`. Added a ban note at the top of the original retrospective.
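
The survivor rule above can be sketched as a sort key over a group of near-duplicate skills. This is an illustration under assumptions, not the code in `generate_c3v2_skill_set.py`: the `Skill` class and `pick_survivor` are hypothetical names, "higher is better" is assumed for the first two fields, and the name is assumed to act only as an alphabetical tie-break. The success rates in the example are made-up values, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    success_rate: float
    docstring: Optional[str] = None


def pick_survivor(duplicates: Iterable[Skill]) -> Skill:
    """Keep one skill per duplicate group: docstring first, then success rate, then name."""

    def rank(skill: Skill):
        has_doc = bool(skill.docstring and skill.docstring.strip())
        # min() picks the smallest tuple: documented skills sort first,
        # then higher success rate, then alphabetical name for determinism.
        return (not has_doc, -skill.success_rate, skill.name)

    return min(duplicates, key=rank)


if __name__ == "__main__":
    group = [
        Skill("get_grasp_pose_for_mask", 0.9),  # illustrative rate, no docstring
        Skill("plan_and_select_grasp", 0.8, "Plan grasps and pick one."),  # illustrative
    ]
    print(pick_survivor(group).name)  # prints: plan_and_select_grasp
```

Putting `has_docstring` first means a documented skill wins even against an undocumented one with a higher success rate, which matches the single swap reported above.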

**Pod experiment (hq3dca6hd83ufu, A40 $0.44/hr; substituted because no A6000 was in stock)**:
- Environment setup repeated (apt + EGL ICD + uv sync + 6 submodules + tensorflow + 4 servers)
- Run 1: cubelift_c3v2 (15 trials) ✅ + cubestack_p22_a (10) ✅
- Detected **OpenRouter credits exhausted** ($80 / $80); the runner had stalled in retry backoff
- User topped up $30 → resumed the runner with cubestack_c2/c3/c3v2 fresh
- Auto wrap-up: polling + R2 backup + pod terminate (finished automatically, with no user intervention)
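
The auto wrap-up flow (poll until the runner finishes, back up to R2, then terminate the pod) can be sketched as follows. The actual `/tmp/capx_wrap_up.sh` is a shell script whose contents are not shown in this log, so everything here is an assumption: the function name, the callables, and the choice to always terminate even if the backup raises.

```python
import time
from typing import Callable, Optional


def wrap_up(
    is_done: Callable[[], bool],
    backup: Callable[[], None],
    terminate: Callable[[], None],
    poll_seconds: float = 60.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `is_done()` is true (or polls run out), then back up and terminate.

    Returns True if the run finished, False if polling timed out.
    """
    finished = is_done()
    polls = 0
    while not finished and (max_polls is None or polls < max_polls):
        sleep(poll_seconds)
        polls += 1
        finished = is_done()
    try:
        backup()  # e.g. copy outputs to remote storage
    finally:
        terminate()  # always stop the pod so it cannot keep billing
    return finished
```

Passing the three steps in as callables keeps the sketch testable without a pod; in practice `backup` and `terminate` would shell out to the backup and pod-management tools the project already uses.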

**Results** (Task completed, any sandboxrc):
- **cubelift_c3v2: 14/15 = 93.3%** (vs C3 73%, +20pp); Dedup v2 effect confirmed
- cubestack_p22_a: 0/10 = 0%
- cubestack_c2: 0/10 = 0%
- cubestack_c3: 0/10 = 0% (avg blocks 0.10 = immediate abandon)
- cubestack_c3v2: 2/10 = 20% (the only condition with any success)

**Decisions**:
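- **H1a (algorithm keeps the wrong survivor) confirmed**: a single skill swap (adding a docstring) yields +20pp.
- Cube_stack fails across the board; the task itself is too hard for the base API + gpt-4.1 combination (p22_a is also 0%).
- **New Group D narrative**: not all 3 fixes help as shipped; the story is namespace seeding + neutral gates + dedup-v2 (algorithm fix). Combined +60pp.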

**Artifacts added**:
- `capx/integrations/franka/_auto_skills_namespace.py` (enable_seeding param added, previous session)
- `scripts/experiment/generate_c3v2_skill_set.py`
- `scripts/analysis/phase_2_4c_summary.py`
- `.capx_skills.c3v2.json`
- `env_configs/cube_lifting/franka_robosuite_cube_lifting_ablation_c3v2.yaml`
- `env_configs/cube_stack/franka_robosuite_cube_stack_{p22_a,ablation_c2,ablation_c3,ablation_c3v2}.yaml`
- `docs/superpowers/retrospectives/2026-04-24-group-d-rewrite-amendment.md`
- `docs/superpowers/observations/2026-04-25-phase-2-4c-dedup-v2-and-cubestack.md`
- `outputs/analysis/phase_2_4c/{per_trial.csv, condition_summary.csv}`
- `outputs/api_injection/cube_stack/openrouter_openai_gpt-4.1/{p22_a,ablation_c2,ablation_c3,ablation_c3v2}/` (local)
- `/tmp/capx_wrap_up.sh` (auto wrap-up script, reusable)

**Backup**: R2 `r2:capx/outputs/api_injection/{ablation,cube_stack}/`, `r2:capx/logs/2026-04-25-phase24c/`, `r2:capx/skill_backups/ablation/`.

**Pod**: terminated 2026-04-26 01:04 KST (auto wrap-up). 0 pods remaining. spend/hr=$0.

**Cost actual**: Pod $2.2 + LLM ~$8 = ~$10. Cumulative (Phase 2.4 + 2.4c) ~$32.

---

## 2026-04-24 — Phase 0 + Phase 2.4 ablation + Phase 2.1 (new pod)

**Session goal**: After finishing the Phase 0 infrastructure, run the Phase 2.4 3-fix ablation + the Phase 2.1 counter-factual pair. All experiments in a single pod session.

**Phase 0 (local, $0)**:
- `capx/envs/simulators/robosuite_cube_lift.py:110-112` seed propagation fix (inserted `np.random.seed(seed)`)
- Config YAML: established the rule that existing baselines are never modified (the "narrative continuity" principle, derived from a user instruction). Restored num_workers=12 in the A/B YAMLs. Only new experiments get derived YAMLs.
- Confirmed the cause of empty VDM reasoning: gpt-4.1 does not return the OpenRouter `reasoning` field, and the prompt does not ask for an explanation (decision-only instruction). `docs/superpowers/observations/2026-04-24-phase-0-3-vdm-reasoning-cause.md`
- Per-run metadata logging: `capx/envs/runner.py::_write_run_metadata()` writes `run_metadata.json` (commit SHA, dirty flag, args, config)
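
The per-run metadata logging described above could look roughly like the sketch below. It is not the body of `_write_run_metadata()`; the function name, JSON key names, and error handling are assumptions. Only the four recorded fields (commit SHA, dirty flag, args, config) and the `run_metadata.json` filename come from this log.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Any, Optional, Sequence


def _git(args: Sequence[str], cwd: str) -> Optional[str]:
    """Run a git command and return stdout, or None if git or the repo is unavailable."""
    try:
        done = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout.strip()


def write_run_metadata(
    out_dir: str,
    config: Any,
    argv: Optional[Sequence[str]] = None,
    repo_dir: str = ".",
) -> Path:
    """Write run_metadata.json with commit SHA, dirty flag, args, and config."""
    status = _git(["status", "--porcelain"], repo_dir)
    meta = {
        "commit_sha": _git(["rev-parse", "HEAD"], repo_dir),
        # Any porcelain output means uncommitted changes; None means "unknown".
        "dirty": None if status is None else bool(status),
        "args": list(sys.argv if argv is None else argv),
        "config": config,
    }
    path = Path(out_dir) / "run_metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return path
```

Recording the dirty flag alongside the SHA matters because a result produced from uncommitted changes cannot be reproduced from the commit alone.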

**Phase 2.4 + 2.1 (pod k3dh7e3d4oixmz, A6000 $0.49/hr)**:
- Created a new pod → 100GB volume, repo uploaded via tar+ssh, cloned 6 git submodules (sam3/robosuite/contact_graspnet_pytorch/LIBERO-PRO/verl/libero_dependencies + curobo), `uv sync --extra robosuite`, reinstalled tensorflow-cpu
- Fixed the **missing EGL nvidia ICD file** (created `/usr/share/glvnd/egl_vendor.d/10_nvidia.json`)
- Started 5 servers (proxy/SAM3/Molmo/graspnet/pyroki)
- Ran 4 conditions × 15 trials + P21_a 10 trials serially (1h 49m)
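
For reference, the EGL ICD fix above amounts to creating a small JSON file that points GLVND at the NVIDIA EGL library. The log does not record the exact contents that were written on the pod; the sketch below assumes the contents commonly shipped in NVIDIA's container images, and the helper name is hypothetical.

```python
import json
from pathlib import Path

# Assumed contents: the conventional NVIDIA EGL vendor entry for GLVND.
NVIDIA_EGL_ICD = {
    "file_format_version": "1.0.0",
    "ICD": {"library_path": "libEGL_nvidia.so.0"},
}


def ensure_egl_icd(path: str = "/usr/share/glvnd/egl_vendor.d/10_nvidia.json") -> bool:
    """Create the ICD file if it is missing. Returns True if a file was written."""
    target = Path(path)
    if target.exists():
        return False  # leave an existing vendor file untouched
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(NVIDIA_EGL_ICD, indent=4) + "\n")
    return True
```

Writing to the default path requires root on the pod; the function is idempotent, so it can sit at the top of a setup script.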

**Results** (Task completed, any sandboxrc):
- C0 (buggy): 5/15 = 33%
- C1 (ns only): 15/15 = **100%**
- C2 (ns + gates): 15/15 = **100%**
- C3 (full D): 11/15 = **73%** ← dedup costs −27pp
- P21_a (A-like): 9/10 = 90%

**Decisions**:
- **The 3 fixes are not monotone**: dedup is clearly negative. The "3-fix package" narrative in the Group D retrospective needs rewriting.
- Namespace seeding is the dominant fix (+67pp). Gates are neutral; dedup is harmful.
- P21_a 90% > C3 73%: the current skill library makes a **negative contribution** on this task.

**Artifacts added**:
- `capx/envs/simulators/robosuite_cube_lift.py` (seed fix)
- `capx/envs/runner.py` (run_metadata)
- `capx/integrations/franka/_auto_skills_namespace.py` (enable_seeding toggle)
- `capx/integrations/franka/control_reduced_auto_skills.py` (enable_namespace_seeding param)
- `capx/integrations/__init__.py` (4 ablation factory)
- `env_configs/cube_lifting/franka_robosuite_cube_lifting_ablation_c{0,1,2,3}.yaml`
- `env_configs/cube_lifting/franka_robosuite_cube_lifting_p21_a.yaml`
- `scripts/experiment/generate_ablation_skill_sets.py`
- `scripts/analysis/phase_2_4_ablation_summary.py`
- `.capx_skills.c{0,1,2,3}.json`
- `docs/superpowers/observations/2026-04-24-phase-0-3-vdm-reasoning-cause.md`
- `docs/superpowers/observations/2026-04-24-phase-2-4-ablation.md`
- `docs/superpowers/observations/2026-04-24-phase-2-1-counterfactual.md`
- `docs/assets/ablation_{c0,c1,c2,c3,p21_a}/` (screenshots + video samples)
- `outputs/api_injection/ablation/`, `outputs/api_injection/openrouter_openai_gpt-4.1/p21_a/` (complete local copy)

**Backup**: R2 `r2:capx/outputs/api_injection/ablation/` (130 MiB, 1283 obj), `r2:capx/outputs/api_injection/openrouter_openai_gpt-4.1/p21_a/` (22 MiB), `r2:capx/logs/2026-04-24-ablation/`, `r2:capx/skill_backups/ablation/`

**Pod**: terminated 15:13Z. Total experiment pod time ~3.5 hr × $0.49 = **$1.54**. OpenRouter LLM cost is separate (estimated ~$20 for 70 trials).

**Cost actual**: Pod $1.54 + LLM ~$20 = ~$22 (out of a $30 budget).

---

## 2026-04-24 — Phase 1 week 2 + Synthesis

**Session goal**: Complete the remaining 4 Phase 1 observations + a retrospective synthesizing all 9.

**Completed**:
- Phase 1.3 Failure Event Sequence: 12-block analysis of trial_42. Molmo base API loop; new finding that **VDM reasoning is 82% empty** (228/277).
- Phase 1.9 Retry Behavior: classified 76 retry dirs. **Abstraction downgrade pattern**. REPLACE success rate 66.7% > SAME_SET 44%.
- Phase 1.6 Invocation Context: 744 calls with **100% arity match**, 0 docstring-usage mismatches.
- Phase 1.7 Trial-Order Behavior: order effect null (\|r\| < 0.23).
- Phase 1 synthesis retrospective: 9 observations → answers to Q1/Q2/Q3/Q4 + a proposed Phase 2 re-scoping ($41 → $25–30).

**Artifacts added**:
- `docs/superpowers/observations/2026-04-24-phase-1-{3,6,7,9}.md`
- `docs/superpowers/retrospectives/2026-04-24-phase-1-synthesis.md`
- `scripts/analysis/phase_1_{3,6,7,9}_*.py`
- `outputs/analysis/phase_1_{3,6,7,9}/`
- `docs/capx-current-status.md` (new top-level index)
- `docs/capx-log.md`, `docs/capx-conventions.md` (documentation system set up)

**Decisions**:
- Phase 2.4 3-fix ablation is the top priority ($19). Everything else stays within the $25–30 range.
- Added "fix empty VDM reasoning" to the Phase 0 checklist.
- Phase 2 experiments cannot run on the 5060 Ti (Molmo 12B fp16 needs 24GB VRAM) → pod use confirmed.

**Cost**: $0 (local analysis only).

---

## 2026-04-22 — Pod backup + terminate

**Goal**: Stop pod costs and prevent data loss.

**Actions**:
- Identified 2 RunPod pods: `nj5h663sfnjg5h` (main), `j4k0lxazgdz4e6` (s3-v2 migration). Both EXITED, each with a 100GB volume incurring a storage cost of $0.056/hr.
- Restarted the main pod → copied the pod's `rclone.conf` → ran a dry-run diff of outputs/.
- **Finding**: pod outputs were 2.19 GiB, R2 held 1.15 GiB. **9,486 files (1.04 GiB) were not backed up**.
- Ran `rclone copy outputs/`; R2 reached 2.27 GiB / 23,869 objects.
- Also backed up: 4 skills JSON files (`skill_backups/final_2026-04-22/`), 6 experiment logs + 5 server logs, molmo_server.py.
- Deleted both pods with `runpodctl pod delete`; cost reached $0/hr.

**State after**:
- Full backup to R2 `r2:capx/` complete: outputs 2.19 GiB + group_d 77 MiB + skill_backups + logs + artifacts + scripts.
- Pod resources at $0. A new pod is needed when Phase 2 starts.

**Cost**: ~$0.30 (pod restart + sync time).

---

## 2026-04-19 — Phase 1 week 1

**Goal**: First 4 Phase 1 observations (zero-cost).

**Completed**:
- Phase 0 shared parser (`scripts/analysis/parsers.py`): trial_dir parsing, AST call extraction, skills.json loading.
- Phase 1.1 Skill Usage Matrix: 89×11 matrix. **2 dead skills confirmed** (`select_top_grasp`, `grasp_pose_to_ik`). The top-4 skills account for 76% of all calls.
- Phase 1.2 Unpromoted Pool Audit: classified 330 unpromoted skills. **The generality gate accounts for 91% of rejections**. 18 match a promoted skill by structural hash (near-duplicates).
- Phase 1.5 Prompt Position Audit: position effect null. **Docstring is the primary factor** (8× difference; both dead skills have no docstring).
- Phase 1.4 + 1.8 Survival + Gate Sensitivity: all 16 pre-GD promoted skills are borderline. Hard-fail + dedup filter out 93%.
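
The AST call extraction and the skill usage matrix mentioned above can be illustrated with the standard `ast` module. This is a sketch, not the contents of `scripts/analysis/parsers.py`: the function names and the input shape (a mapping from trial id to generated source code) are assumptions.

```python
import ast
from collections import Counter
from typing import Dict, List


def called_names(source: str) -> Counter:
    """Count function calls by name; `obj.method(...)` is counted under `method`."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                counts[func.id] += 1
            elif isinstance(func, ast.Attribute):
                counts[func.attr] += 1
    return counts


def usage_matrix(trial_sources: Dict[str, str], skills: List[str]) -> Dict[str, Dict[str, int]]:
    """Build a trials × skills table of call counts (zero where a skill is unused)."""
    matrix = {}
    for trial_id, source in trial_sources.items():
        counts = called_names(source)
        matrix[trial_id] = {skill: counts.get(skill, 0) for skill in skills}
    return matrix


if __name__ == "__main__":
    code = "pose = plan_and_select_grasp(mask)\nenv.move_to(pose)\nplan_and_select_grasp(mask2)\n"
    print(usage_matrix({"trial_1": code}, ["plan_and_select_grasp", "select_top_grasp"]))
```

A skill whose column is zero in every row is a candidate "dead skill" in the sense used above.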

**Artifacts added**:
- `docs/superpowers/observations/2026-04-19-phase-1-{1,2,5,4-and-1-8}.md`
- `scripts/analysis/parsers.py`, `phase_1_{1,2,5,4_and_1_8}_*.py`
- `outputs/analysis/phase_1_*/` (CSV data)

**Cost**: $0.

---

## 2026-04-18 — Curiosity-driven pivot + HTML status report deployment

**Goal**: Redefine the project direction + deploy a public report.

**Actions**:
- Checked the NVIDIA CaP-X paper (arXiv:2603.22435): skill extraction is **automatic** (regex + LLM), not done manually by a person. Based on this, decided to exclude Group B (9 manual skills).
- Redefined the project direction: publication-driven → **curiosity-driven**. Established 4 core questions.
- Rewrote the north-star memory: "Group C success as the goal" → "observe LLM skill emergence".
- Wrote the Phase 0/1/2/3 proposal (v2, after 2 review cycles).
- Wrote the HTML status report (`docs/capx-status-report.html`, 46KB) → deployed to Lightsail `/var/www/capx.ryeol.kim/`.
- Nginx SSL setup (Cloudflare Origin wildcard cert), 443 SSL + 80→443 redirect.
- https://capx.ryeol.kim is live.

**Artifacts added**:
- `docs/superpowers/proposals/2026-04-18-curiosity-driven-next-steps.md`
- `docs/capx-status-report.html`

---

## 2026-04-17 — Group D 50-trial + Methodology review + Retrospective

**Goal**: Apply the 3 fixes that rescue the Group C failure (16%), then rerun 50 trials.

**Implementation (3 fixes)**:
1. 2-pass namespace seeding (`capx/integrations/franka/_auto_skills_namespace.py`); resolves the NameError.
2. Quality Gate 4/5/6 with hard_fail (unresolved deps / no-op / vision server).
3. Called-name + attribute-aware dedup.
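
Fix 1 can be illustrated with a toy loader. The log does not show how `_auto_skills_namespace.py` is implemented, so the sketch below is only one plausible reading of "2-pass namespace seeding": skills are first defined in isolated namespaces, then every namespace is seeded with all collected skills so that cross-skill calls resolve. `load_skills` and its parameters are hypothetical, and `seed` loosely mirrors the `enable_seeding` toggle mentioned elsewhere in this log.

```python
from typing import Callable, Dict


def load_skills(sources: Dict[str, str], base: dict, seed: bool = True) -> Dict[str, Callable]:
    """Define each skill in its own namespace, then optionally seed cross-references."""
    namespaces = {}
    skills = {}
    # Pass 1: exec each skill in isolation, starting from the base API names.
    for name, src in sources.items():
        ns = dict(base)
        exec(src, ns)
        namespaces[name] = ns
        skills[name] = ns[name]
    # Pass 2: make every collected skill visible in every namespace.
    # A function looks up globals at call time in the dict it was defined in,
    # so adding names afterwards is enough for calls to resolve.
    if seed:
        for ns in namespaces.values():
            for name, fn in skills.items():
                ns.setdefault(name, fn)
    return skills


if __name__ == "__main__":
    sources = {
        "helper": "def helper(x):\n    return x + 1\n",
        "caller": "def caller(x):\n    return helper(x) * 2\n",
    }
    print(load_skills(sources, base={})["caller"](3))  # prints: 8
```

With `seed=False`, calling `caller` raises `NameError` because `helper` is not present in its namespace, which is the kind of namespace-isolation failure attributed to Group C.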

**Result**: Group D 49/50 = 98% (Group A baseline 98%, Group C 16% → 98%).

**Methodology review (independent review)**:
- 3 CRITICAL (seed bug, config mismatch, train/test contamination) + 7 IMPORTANT + 5 MINOR.
- The current data cannot support the claim "auto-discovery outperforms no skills" (ceiling effect).

**Artifacts added**:
- `docs/superpowers/retrospectives/2026-04-17-group-d-retrospective.md`
- `docs/superpowers/reviews/2026-04-17-methodology-review.md`
- `docs/superpowers/plans/2026-04-17-group-d-implementation.md`
- `docs/superpowers/specs/2026-04-17-group-c-rescue-design.md`
- Commit HEAD `470801f`, skills regenerated at `5ebca41` (11 promoted).

---

## 2026-04-11 ~ 2026-04-16 — Group A/B/C initial runs

- Group A (baseline): 49/50 = 98%
- Group B (manual 9 skills): estimated at "~48/50" in the retrospective (unverifiable because of an OpenRouter 402 interruption). **Excluded from the project**.
- Group C (initial auto-pipeline): 8/50 = 16%. Causes: namespace isolation NameError, promotion criterion too loose.

**Artifact**: R2 `r2:capx/outputs/api_injection/openrouter_openai_gpt-4.1/group_{a_baseline,b_capx_9,c_auto_16_clean,c_final}/`.

---

## Earlier experiments (pre-April)

- Gate 1 Control C: A(93%) > B(88%) > C(82%)
- Dose-response: non-monotonic U-curve (0=80%, 3=42%, 5=38%, 8=78%, 16=74%)
- Soft prompt: no effect
