# capx project final closing retrospective (2026-05-08)

**Phase:** capx project closing — paper v11 published-ready, project state: closed / passive maintenance
**Spec:** `docs/superpowers/specs/2026-05-06-capx-closing-design.md` (Option Y)
**Plan:** `docs/superpowers/plans/2026-05-07-capx-closing.md` (subagent-driven-development)
**Cumulative cost:** ~$248 (original ~$162 paper v8 + $79 v9 D2/Dedup v3 + $7 closing Smoke phase 2)
**Closing duration:** 2026-05-06 brainstorm → 2026-05-08 paper v11 (3 days)

---

## §1 — Final empirical verdict (paper v11)

### Cube_lifting under gpt-4.1 (paper v8 base, n=50/70)
| Condition | Rate | Wilson 95% CI | vs P21_a |
|---|---|---|---|
| `P21_a` (baseline, no library) | 39/50 = 78.0% | [64.8%, 87.2%] | (reference) |
| `C1` (no-dedup 16-skill, namespace) | 67/70 = 95.7% | [88.1%, 98.5%] | +18pp ✅ |
| `C2` (no-dedup 14-skill, gates) | 68/70 = 97.1% | [90.2%, 99.2%] | +19pp ✅ |
| `C3v2` (production structural-dedup) | 58/70 = 82.9% | [72.4%, 89.9%] | +5pp (null) |
| `manual_11` (k=11 quality-rank) | 66/70 = 94.3% | [86.2%, 97.8%] | +16pp ✅ |
| `empty_ns` (typed empty stubs) | 13/30 = 43.3% | — | content effect verified |
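
The Wilson 95% CI column can be reproduced from the raw counts in the table. The following is a minimal sketch of the standard Wilson score interval; the function name `wilson_interval` is illustrative and not taken from the capx codebase.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    # Centre is pulled towards 0.5, which keeps the interval inside [0, 1].
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Counts copied from the table above.
    rows = [("P21_a", 39, 50), ("C1", 67, 70), ("C2", 68, 70),
            ("C3v2", 58, 70), ("manual_11", 66, 70)]
    for name, k, n in rows:
        lo, hi = wilson_interval(k, n)
        print(f"{name}: {k}/{n} = {k / n:.1%}  [{lo:.1%}, {hi:.1%}]")
```

The Wilson interval is preferable to the normal approximation here because several rates sit close to 100%, where the simpler interval would extend past 1.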

### Multi-backbone library effect (D2 closure, paper v9, n=50)
| Backbone | Baseline (P21_a) | Library (manual_11) | Δ |
|---|---|---|---|
| gpt-4.1 | 78.0% | 94.3% | +16pp |
| Claude Sonnet 4 | 86.0% | 98.0% | **+12pp (P=0.5%)** |
| DeepSeek v3 | 6.0% | 94.0% | **+88pp (P~10⁻⁵⁴)** |

### Dedup v3 algorithmic robustness (paper v9, n=70 each)
- k=10: 91.4%; k=11: 94.3%; k=12: 87.1% (not separated from C3v2); k=13: 97.1%
- Recipe: top-k-by-quality_score, k near 11 or 13 (avoid 12)
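
The recipe above amounts to ranking skills by quality score and keeping the first k. A minimal sketch follows, assuming each skill carries a numeric `quality_score`; the `Skill` dataclass and the tie-break by name are hypothetical and only there to make the selection deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    quality_score: float  # hypothetical field; the real score definition is not shown here


def top_k_by_quality(skills: list[Skill], k: int) -> list[Skill]:
    """Return the k highest-scoring skills, breaking ties by name for determinism."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(skills, key=lambda s: (-s.quality_score, s.name))
    return ranked[:k]
```

Under this sketch, a call such as `top_k_by_quality(all_skills, 11)` would correspond to the `manual_11` selection size, with `all_skills` standing in for whatever the mined library contains.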

### Smoke phase 2 (paper v11 §V.G, n=3-30)
- **Cross-task `cube_stack_3`**: Floor (0/15 baseline + 0/15 library); vision-pipeline mask-fragmentation localised. *Library still confers 24-34% code-efficiency gain at floor.*
- **Multi-session round-2**: 0 promotable new skills (namespace saturation regime).
- **Verbose mechanism**: DeepSeek partial-grip-and-drop ↔ Claude robust completion (1-2 regen convergence). *Library = self-correction loop substitute*, not reasoning enhancement.

---

## §2 — 9 narrative reversals (timeline)

| # | Date | Conclusion (at the time) | Reason for correction | Lesson |
|---|---|---|---|---|
| 1 | 4/17 | "All three fixes contributed (Group D 98%)" | 4/24 ablation | Attribute a combined effect only after isolated controls |
| 2 | 4/24 | "namespace alone gives +67pp" | 4/26 dedup v2 | Recognise sensitivity to the dedup algorithm |
| 3 | 4/26 | "library = hygiene only" | 5/1 P21_a boost | Effects are underestimated when the sample size is small |
| 4 | 5/1 | "library +23pp ROI" | 5/2 strict critique | Strict review reconciles sample variance with effect size |
| 5 | 5/2 | "production library = baseline" | 5/3 C1/C2 boost | Separate size from quality in the production condition |
| 6 | 5/3 | "no-dedup +19pp, smart dedup possible" | stable through paper v8 | (positive) stable |
| 7 | 5/4 | "Q1-Q2 cube_lifting is solid" | v2 smoke: multi-backbone Δ asymmetry | Limits of generalising a single-backbone claim |
| 8 | 5/6 | "Claude library effect = null" | metric correction → +12pp detectable | A consistent metric definition is the source of truth |
| 9 | 5/7 | "k=12 dip = docstring-less skill" | k=11 cross-check refutation | Attribute a mechanism only after cross-condition verification |

**Generalisable rule**: *Single-run retrospective causal claims must be treated as hypotheses pending controlled cross-condition verification.* This was demonstrated nine times, and the audit chain also blocks possible violations in the paper body.

All nine reversals were *self-corrected before publication*. The greatest value of the audit chain is *mechanical verification immediately before publication*.

---

## §3 — Generalisable methodological contributions

### 1. *capx-defined success metric* as source-of-truth
- `Task completed` in `summaries.txt` (final-attempt success) is the paper-comparable unit
- Direct analysis of trial directory counts risks over-counting (the lesson of the 8th reversal)
- Memory: `feedback_capx_metric_definition.md`
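
To illustrate why the final `Task completed` marker, rather than a directory count, is the unit to score, here is a minimal sketch. The line format `Task completed: True` / `Task completed: False` is an assumption made for this example; the actual layout of `summaries.txt` is not shown in this document, and both function names are hypothetical.

```python
import re
from typing import Iterable, Optional

# ASSUMPTION: each attempt logs a line such as "Task completed: True".
# The real summaries.txt format may differ; adjust the pattern accordingly.
_TASK_RE = re.compile(r"Task completed\s*[:=]\s*(true|false)", re.IGNORECASE)


def final_attempt_success(summary_text: str) -> Optional[bool]:
    """Return the outcome of the last attempt, or None if no marker is present."""
    matches = _TASK_RE.findall(summary_text)
    if not matches:
        return None
    return matches[-1].lower() == "true"


def success_rate(summary_texts: Iterable[str]) -> tuple[int, int, int]:
    """Return (successes, scored trials, trials without a marker)."""
    outcomes = [final_attempt_success(text) for text in summary_texts]
    scored = [o for o in outcomes if o is not None]
    return sum(scored), len(scored), len(outcomes) - len(scored)
```

Scoring one outcome per trial keeps a trial with several regenerated attempts from being counted more than once, which is the over-counting risk named above. Trials without a marker are reported separately rather than silently dropped.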

### 2. 11 environment fix list + `feedback_capx_environment_audit.md`
- Extracts the implicit knowledge in the paper v8 setup into an explicit checklist
- Bootstrap time 200 min → 17 min (about 91% shorter)
- Can be audited on every new pod

### 3. Pod-side file-based monitor + parent active polling
- Sub-agent monitor v6 pattern (`sleep 540` in Bash)
- Memory: `feedback_capx_monitor_protocol.md`
- 2-tier monitoring (pod-side state file + main session SSH polling)
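
The two tiers can be sketched as a pod-side writer and a parent-side poller. This is an illustrative sketch only: the JSON state schema (`status`, `completed`), the function names, and the local file read standing in for the SSH fetch are all assumptions, not the project's actual monitor.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def write_state(path: Path, state: dict) -> None:
    """Pod side: replace the state file atomically so a reader never sees a partial write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def read_local_state(path: Path) -> Optional[dict]:
    """Stand-in for the parent's SSH fetch; returns None if the file is absent or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def poll_until_done(read_state: Callable[[], Optional[dict]], interval_s: float,
                    max_polls: int, sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Parent side: poll until the state is terminal or max_polls is reached."""
    last = None
    for i in range(max_polls):
        state = read_state()
        if state is not None:
            last = state
            if state.get("status") in {"done", "failed"}:  # hypothetical terminal states
                return state
        if i < max_polls - 1:
            sleep(interval_s)
    return last
```

With `interval_s=540` the poller would match the `sleep 540` cadence noted above. Keeping the state in a file on the pod means the run's progress survives a dropped parent session, and the parent only needs read access.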

### 4. R2 5-step backup_protocol (with gap finding)
- Memory: `feedback_capx_backup_protocol.md`
- Five steps: outputs / venv lock / runtime snapshot / submodule SHAs / logs+state
- **Gap found in this phase**: renaming a condition lost its cross-reference, so the C3v2 raw outputs are missing (found in Smoke #2; honestly noted in paper v11 §VII.limitations)
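
A check of the following kind could surface such a gap before closing. It is a sketch under assumptions: archived outputs are identified by a per-condition prefix name, and renames are recorded in an alias map. The function name and the alias `C3_old` used in the usage note are hypothetical.

```python
from typing import Iterable, Mapping, Optional, Sequence


def missing_raw_outputs(paper_conditions: Sequence[str],
                        archived_prefixes: Iterable[str],
                        aliases: Optional[Mapping[str, Sequence[str]]] = None) -> list[str]:
    """Return paper-referenced conditions with no archived raw outputs.

    `aliases` maps a current condition name to its earlier names, so a rename
    does not hide an archive stored under the old name.
    """
    aliases = aliases or {}
    archived = set(archived_prefixes)
    missing = []
    for cond in paper_conditions:
        names = {cond, *aliases.get(cond, ())}
        if not names & archived:
            missing.append(cond)
    return missing
```

For example, with the five conditions from §1 and an archive that lacks `C3v2`, the function returns `["C3v2"]`; if the archive held the outputs under an earlier name such as `C3_old` and the alias map recorded that rename, it would return an empty list.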

### 5. Audit chain methodology
- All 9 narrative reversals were caught before publication
- Core: *when writing paper v_n, mechanically re-verify the mechanism claims of paper v_{n-1}*
- The audit demotes *single-run causal claims* to hypotheses; they enter the paper body only after cross-condition verification

---

## §4 — Limitations (carry-over to paper v3 candidates)

### Vision-pipeline ceiling on cube_lifting
- The 90%+ rates on cube_lifting come from a perception-easy regime (single chromatically distinct cube)
- The cube_stack_3 floor arises in *vision*: a limit of sam3 mask consolidation
- **Paper v3 candidate**: per-cube SAM prompt with class-conditioned anchors / contact-graspnet instance-aware pose / grounded-segment-anything

### Saturation regime in multi-session mining
- With the k=12 dense library, failed-trial code is imperative and API-only → 0 new function defs
- **Paper v3 candidate**: (a) round-2 mining on a harder task (cube_stack_3 after the vision pipeline fix) (b) semantic-similarity clustering of API-call sequences

### Multi-backbone iteration mechanism
- DeepSeek's 6% baseline reflects a *self-correction failure*, not a reasoning failure
- **Paper v3 candidate**: reflection-style regeneration prompting on DeepSeek → predicted to shrink the library effect

### Q3 long-horizon evolution
- Only single-snapshot library data exists
- **Paper v3 candidate**: n=100-200 trials with library snapshots every N trials → makes evolution measurable

### Backup-protocol output retention gap
- C3v2 raw outputs are missing (found in Smoke #2)
- **Process item**: archive the raw outputs of every paper-referenced condition under a condition-specific R2 prefix

---

## §5 — Followups / future work

### Paper v3 candidates (priority order)
1. **Vision pipeline upgrade** (cube_stack_3 floor → measurable transfer regime)
2. **DeepSeek iteration mechanism test** (falsifiable prediction on library effect modulation)
3. **Long-horizon Q3 evolution** (n=100-200 trials, library snapshots)
4. **Cross-task generalization fixed** (cube_stack_3 + LIBERO with vision fix)

### Pre-paper v3 process improvements
- Per-condition R2 archive convention (avoid C3v2-style gap)
- Capx-defined metric in CI/regression suite (avoid 8th reversal-style metric drift)
- Standard mechanism-attribution checklist (cross-condition verification required before paper integration)

---

## §6 — Cost ledger (final)

| Phase | Pod cost | LLM cost | Total | Notes |
|---|---|---|---|---|
| Paper v8 base (4/17 - 5/3) | ~$60 | ~$102 | ~$162 | 9 phases (D, ablation, baseline boost, etc.) |
| Paper v9 D2/Dedup v3 (5/4) | ~$5 | ~$74 | ~$79 | n=440 trials, 9h pod |
| Closing Smoke phase 2 (5/8) | ~$1.32 | ~$5.75 | ~$7 | Smoke #2 trial-skip + B4 cancel saved ~$6 |
| **Final cumulative** | ~$66 | ~$182 | **~$248** | OpenRouter $190 cap + RunPod $50 cap |

OpenRouter balance: $190 - $151.98 = **$38** at closing
RunPod balance: ~$17 (remaining GPU credit)

---

## §7 — Project state at closing

| Item | Status |
|---|---|
| Paper v11 PDF | ✅ 13p / 460,482 bytes |
| arxiv-submission tar | ✅ 151,357 bytes (SHA matches main PDF) |
| 9 reversals | ✅ All caught pre-publication |
| Smoke phase 2 (3 micro-eval) | ✅ Complete (1 floor + 1 saturation + 1 mechanism finding) |
| All pods | ✅ Terminated (wwseqhbvhuz3vi 5/8) |
| R2 backup (5/8) | ✅ outputs 106 MiB / state 1.1 MiB / logs 940 KiB |
| Memory updates | ✅ project_capx_experiment.md (closing state) |
| HTML status banner | ✅ Updated to "paper v11 closing 2026-05-08" |
| capx-current-status.md | ✅ Closing entry added (4 entries: paper v9 → v10 → v10.1 → v11) |
| capx-log.md | ✅ Closing log entry (5/8) added with 9-reversal table |
| arXiv re-submit | ⏳ To be done by the user (Task 10) |
| Project work | 🛑 Idle: paper v11 published-ready, attention moves to other projects (HRI duplex, etc.) |

---

## §8 — Final reflections

**Biggest lesson**: the fact that *all 9 narrative reversals were self-corrected before publication* is the real value of the *audit chain methodology*. Had they been caught *after* publication, a retraction or correction would have been needed. The *audit chain* worked as *mechanical verification immediately before publication*.

**Biggest surprise**: the namespace saturation found in Smoke #2 (0 new skills extractable). The spec's pre-registered Pass/Marginal/Fail criteria assumed the verdict would come from *trial-run results*, but in practice it was *already decided before the mining stage*, which is a *stronger negative*.

**Biggest missed opportunity**: the *vision-pipeline mechanism* behind the cube_stack_3 floor was identified, but a *fix attempt* was deferred to a paper v3 candidate for time and budget reasons. Had a *per-cube SAM prompt* been tried, cube_stack_3 might have entered a measurable transfer regime.

**Biggest process win**: the two-stage review (spec compliance + code quality) of the subagent-driven-development skill was the *direct cause* of finding the 9th reversal. The implementer subagent's verification step empirically refuted the paper v10 §V.F mechanism hypothesis. Had I implemented it myself, I might not have caught it.

**What project closing means**: publishing paper v11 is a *milestone*, not the end. The paper v3 candidates (vision pipeline / multi-session refactor / multi-backbone iteration / Q3 long-horizon) all remain *open questions*. The capx project is *closed but extensible*: the paper v3 path stays open to a future researcher or to my future self.

---

## §9 — Acknowledgements

- The audit chain behind the 9 reversals rested on the user's *request for a candid assessment* (5/6) and on the user's direct decisions ("proceed" / "I'll go with the recommendation" / "A path")
- The spec/code/quality review of the subagent-driven-development skill was the *trigger* for finding the 9th reversal
- The memory system (`feedback_capx_*`) was the *backbone* of context preservation across multi-day phases
- RunPod's instantly spawnable A40 was the *enabler* of budget-bounded experimentation
- OpenRouter's multi-backbone proxy was the *prerequisite* for the D2 closure and the Smoke #3 mechanism finding

---

**Project signed-off:** 2026-05-08

paper v11 (closing) supersedes paper v8 / v9 / v10 / v10.1.
arXiv re-submit pending user action (Task 10).
Closing tag: `closing-v1` (to be created at user's arXiv submission completion).
