Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct / Thinking checkpoints these capabilities are complementary but misaligned: the Instruct model is concise and tool-disciplined, while the Thinking model offers stronger planning and recovery but often over-deliberates and degrades agent performance.
We present CRANE, a training-free parameter-editing method that treats the Thinking–Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency, and it consistently outperforms alternative merging strategies across three agentic coding benchmarks at two model scales.
denoise the delta → gate it for tool-safety → project out format directions
The desired endpoint is not a symmetric average. It is an Instruct-style agent that keeps the deployed tool interface of $\theta_{\text{inst}}$ while selectively importing the problem-solving ability of $\theta_{\text{think}}$. A coordinate is useful only when moving along the actual Thinking–Instruct delta improves reasoning while staying compatible with tool-use preservation. The challenge is therefore not generic fusion but behaviour-conditioned directional editing.
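The three stages named above (denoise, gate, project) can be illustrated with a toy sketch on flat parameter vectors. This is not the paper's implementation: this page does not give the actual gate or projection formulas, so the function name `sketch_edit`, the top-k threshold, the first-order keep rule, the sigmoid suppression weight, and every parameter (`keep_top`, `sharpness`, `format_dir`, the two gradient vectors) are hypothetical stand-ins chosen only to show the shape of a behaviour-conditioned directional edit.

```python
import math

def sketch_edit(theta_inst, theta_think, grad_reason, grad_tool,
                format_dir, keep_top=2, sharpness=10.0):
    """Toy three-stage edit of theta_inst along (theta_think - theta_inst).

    All rules below are illustrative assumptions, not CRANE's formulas.
    """
    delta = [t - i for t, i in zip(theta_think, theta_inst)]

    # Stage 1 (assumed form of magnitude thresholding):
    # keep only the keep_top largest-magnitude coordinates of the delta.
    cutoff = sorted((abs(d) for d in delta), reverse=True)[keep_top - 1]
    delta = [d if abs(d) >= cutoff else 0.0 for d in delta]

    # Stage 2 (assumed first-order gate): grad * delta approximates the loss
    # change from moving along delta. Keep a coordinate only if it is predicted
    # to lower the reasoning loss and not raise the tool-use loss.
    delta = [d if (gr * d < 0.0 and gt * d <= 0.0) else 0.0
             for d, gr, gt in zip(delta, grad_reason, grad_tool)]

    # Stage 3 (assumed soft projection): remove part of the component along a
    # format-critical direction, with a sigmoid weight that grows smoothly
    # with how strongly the surviving delta aligns with that direction.
    u_norm = math.sqrt(sum(f * f for f in format_dir))
    u = [f / u_norm for f in format_dir]
    coeff = sum(d * x for d, x in zip(delta, u))
    d_norm = math.sqrt(sum(d * d for d in delta))
    align = abs(coeff) / d_norm if d_norm > 0.0 else 0.0
    suppress = 1.0 / (1.0 + math.exp(-sharpness * (align - 0.5)))
    delta = [d - suppress * coeff * x for d, x in zip(delta, u)]

    # The result is an edited Instruct model, not a symmetric average.
    return [i + d for i, d in zip(theta_inst, delta)]

# Hypothetical four-parameter example.
edited = sketch_edit(
    theta_inst=[0.0, 0.0, 0.0, 0.0],
    theta_think=[1.0, -2.0, 0.1, 3.0],
    grad_reason=[-1.0, 1.0, -1.0, -1.0],
    grad_tool=[0.0, -1.0, 0.0, -0.5],
    format_dir=[0.0, 0.0, 1.0, 0.0],
    keep_top=3,
)
```

In this toy run the small coordinate is dropped by the threshold, the second coordinate is rejected because it is predicted to hurt tool use, and the two remaining coordinates are applied to the Instruct weights unchanged because they are orthogonal to the assumed format direction.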
why neither checkpoint alone is enough
three agentic coding benchmarks × two model scales
Reading the cost column. Alongside success, each table reports Total Token Count (TTC), a single provider-cost proxy that weights token types by their typical pricing: $\mathrm{TTC} = N_{\text{in}} + 0.1\,N_{\text{cached}} + 5\,N_{\text{out}}$. It captures the real economics of a rollout in one number (output tokens are the expensive part), so lower is better (↓). Because the models are served locally, TTC is a budget estimate rather than billed spend. CRANE reaches the highest score on each table's headline metric while keeping TTC at or below the Instruct endpoint, so the gains are not bought with extra tokens.
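As a minimal sketch of the TTC formula, the helper below applies the weights stated above to one rollout's token counts. The function name `total_token_count` and the example counts are hypothetical; only the weights 1, 0.1, and 5 come from the text.

```python
def total_token_count(n_in, n_cached, n_out, w_cached=0.1, w_out=5.0):
    """TTC = N_in + 0.1 * N_cached + 5 * N_out (weights taken from the text)."""
    return n_in + w_cached * n_cached + w_out * n_out

# Hypothetical rollout: 1000 input, 4000 cached, 200 output tokens.
# By hand: 1000 + 0.1 * 4000 + 5 * 200 = 1000 + 400 + 1000 = 2400.
example_ttc = total_token_count(n_in=1000, n_cached=4000, n_out=200)
```

With these made-up counts the 200 output tokens contribute as much to TTC as the 1000 input tokens, which is why verbose rollouts show up so clearly in the cost column.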
| Method | Qwen3-30B-A3B | | | | Qwen3-Next-80B-A3B | | | |
|---|---|---|---|---|---|---|---|---|
| | pass@1 | pass@3 | pass_all | TTC (M)↓ | pass@1 | pass@3 | pass_all | TTC (M)↓ |
| Instruct (ref) | 46.7 | 64.1 | 32.3 | 181.1 | 72.8 | 87.2 | 53.3 | 89.6 |
| Thinking (ref) | 34.9 | 52.8 | 17.9 | 146.9 | 35.4 | 49.7 | 22.6 | 109.5 |
| AIM-TA (Best alternative) | 46.7 | 64.6 | 29.2 | 212.6 | 80.5 | 87.7 | 66.2 | 100.0 |
| Task Arithmetic | 47.2 | 61.0 | 33.3 | 208.1 | 78.5 | 88.7 | 67.7 | 93.1 |
| TIES | 47.2 | 66.2 | 29.2 | 208.9 | 79.0 | 88.2 | 62.1 | 89.0 |
| SLERP | 43.6 | 58.5 | 29.7 | 214.6 | 73.3 | 86.7 | 60.5 | 97.6 |
| AIM-TIES | 45.1 | 61.5 | 29.2 | 211.3 | 76.4 | 90.8 | 61.0 | 96.0 |
| LEWIS | 44.6 | 63.1 | 27.7 | 194.3 | 79.5 | 90.3 | 62.1 | 95.9 |
| RAIN | 39.5 | 54.4 | 21.5 | 140.2 | 46.2 | 58.5 | 25.6 | 113.2 |
| CRANE | 66.2 | 83.1 | 44.1 | 120.9 | 81.5 | 90.3 | 71.3 | 89.2 |
| Method | Qwen3-30B-A3B | | | Qwen3-Next-80B-A3B | | |
|---|---|---|---|---|---|---|
| | Resolved | % | TTC (B)↓ | Resolved | % | TTC (B)↓ |
| Instruct (ref) | 108 | 21.6 | 12.0 | 168 | 33.6 | 5.9 |
| Thinking (ref) | 47 | 9.4 | 14.3 | 125 | 25.0 | 14.2 |
| AIM-TA (Best alternative) | 113 | 22.6 | 8.2 | 172 | 34.4 | 5.5 |
| Task Arithmetic | 109 | 21.8 | 8.2 | 169 | 33.8 | 5.5 |
| TIES | 110 | 22.0 | 8.0 | 162 | 32.4 | 5.9 |
| SLERP | 110 | 22.0 | 7.8 | 169 | 33.8 | 5.5 |
| AIM-TIES | 111 | 22.2 | 8.9 | 169 | 33.8 | 5.3 |
| LEWIS | 110 | 22.0 | 7.8 | 173 | 34.6 | 5.5 |
| RAIN | 58 | 11.6 | 13.7 | 120 | 24.0 | 13.7 |
| CRANE | 122 | 24.4 | 5.7 | 180 | 36.0 | 5.2 |
| Method | Qwen3-30B-A3B | | | Qwen3-Next-80B-A3B | | |
|---|---|---|---|---|---|---|
| | pass@1 | pass@5 | TTC (M)↓ | pass@1 | pass@5 | TTC (M)↓ |
| Instruct (ref) | 5.4 | 10.1 | 112.6 | 13.5 | 22.5 | 52.6 |
| Thinking (ref) | 5.9 | 13.5 | 108.6 | 6.7 | 13.5 | 115.0 |
| AIM-TIES (Best alternative) | 5.6 | 13.5 | 77.9 | 14.2 | 24.7 | 348.5 |
| Task Arithmetic | 5.4 | 14.6 | 69.9 | 13.0 | 24.7 | 310.2 |
| TIES | 6.1 | 13.5 | 80.2 | 13.3 | 25.8 | 59.5 |
| SLERP | 5.4 | 14.6 | 73.0 | 13.5 | 27.0 | 55.6 |
| AIM-TA | 5.6 | 13.5 | 60.3 | 13.7 | 22.5 | 54.5 |
| LEWIS | 5.2 | 11.2 | 60.6 | 14.2 | 25.8 | 54.1 |
| RAIN | 5.6 | 10.1 | 99.3 | 7.9 | 15.7 | 109.3 |
| CRANE | 7.6 | 17.9 | 58.1 | 14.8 | 30.3 | 51.8 |
(Best alternative) marks the strongest non-CRANE merge on the headline metric. TTC↓ is lower-is-better. Terminal-Bench TTC is derived from recorded tokens as $N_{\text{in}} + 0.1\,N_{\text{cached}} + 5\,N_{\text{out}}$; the Task Arithmetic and AIM-TIES 80B values are inflated by lower prefix-cache hit rates in that audited sweep, as the paper notes.
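To see why a lower prefix-cache hit rate inflates TTC, a short worked derivation helps. It assumes that $N_{\text{in}}$ counts only uncached prompt tokens, which is the reading under which fewer cache hits raise the cost, and it introduces two symbols that are not in the original: $P$ for the total prompt tokens of a sweep and $h$ for the fraction of them served from cache. The hit rates in the last line are hypothetical, not measured values.

```latex
\begin{aligned}
N_{\text{cached}} &= hP, \qquad N_{\text{in}} = (1-h)\,P \\
\mathrm{TTC} &= (1-h)\,P + 0.1\,hP + 5\,N_{\text{out}}
             = (1 - 0.9\,h)\,P + 5\,N_{\text{out}} \\
h = 0.9 &\;\Rightarrow\; 0.19\,P + 5\,N_{\text{out}},
\qquad
h = 0.5 \;\Rightarrow\; 0.55\,P + 5\,N_{\text{out}}
\end{aligned}
```

Under these assumptions the prompt-side term nearly triples when the hit rate falls from 0.9 to 0.5, even though the model reads and writes exactly the same tokens.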
block-wise injection strength, derived automatically from calibration losses
The authors acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot and Red Hat Research, the Mass Open Cloud (MOC), and IBM Research for contributing to this research result.