CRANE

Constrained Reasoning Injection for Code Agents via Nullspace Editing

Mingzhi Zhu¹   Michele Merler²   Raju Pavuluri²   Stacy Patterson¹
¹Rensselaer Polytechnic Institute    ²IBM Research

Abstract

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct / Thinking checkpoints these capabilities are complementary but misaligned: the Instruct model is concise and tool-disciplined, while the Thinking model offers stronger planning and recovery but often over-deliberates and degrades agent performance.

We present CRANE, a training-free parameter-editing method that treats the Thinking–Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired checkpoints it delivers strong gains over either individual model while preserving Instruct-level efficiency — consistently outperforming alternative merging strategies across three agentic coding benchmarks at two model scales.

Method

denoise the delta → gate it for tool-safety → project out format directions

[Figure: CRANE three-stage merge pipeline]
The CRANE pipeline. (1) Magnitude thresholding sparsifies the delta δ = θthink − θinst and discards low-confidence coordinates. (2) Conservative Taylor Gate sets per-block injection strength so only directions first-order beneficial to both reasoning transfer and tool-use are retained. (3) Graduated Sigmoidal Projection attenuates updates along format-critical subspaces (tool delimiters, schema, chat-template tokens), protecting the deployed agent protocol.
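
As a rough illustration of how the three stages compose, here is a minimal Python sketch on a single toy parameter block. It is not the paper's implementation: the function names, the `keep_fraction` threshold rule, the logistic weighting with `midpoint` and `steepness`, and the assumption that format-critical directions arrive as orthonormal vectors with sensitivity scores in [0, 1] are all hypothetical choices made for the example. The gate strength is taken as a given scalar here.

```python
import math

def magnitude_threshold(delta, keep_fraction):
    """Stage 1 (assumed rule): keep the largest-magnitude coordinates, zero the rest.
    Ties at the cutoff are all kept."""
    k = max(1, math.ceil(keep_fraction * len(delta)))
    cutoff = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    return [d if abs(d) >= cutoff else 0.0 for d in delta]

def graduated_projection(delta, directions, scores, midpoint=0.5, steepness=10.0):
    """Stage 3 (assumed form): softly remove the component of delta along each
    format-critical direction. `directions` must be orthonormal; a higher score
    means stronger attenuation via a logistic weight in (0, 1)."""
    out = list(delta)
    for u, score in zip(directions, scores):
        weight = 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))
        coeff = sum(d * ui for d, ui in zip(delta, u))  # <delta, u>
        out = [o - weight * coeff * ui for o, ui in zip(out, u)]
    return out

def merge_block(theta_inst, theta_think, gate, keep_fraction, directions, scores):
    """Denoise the delta, scale it by the block's gate strength (stage 2 output,
    supplied by the caller), attenuate format directions, add to Instruct."""
    delta = [t - i for t, i in zip(theta_think, theta_inst)]
    delta = magnitude_threshold(delta, keep_fraction)
    delta = [gate * d for d in delta]
    delta = graduated_projection(delta, directions, scores)
    return [i + d for i, d in zip(theta_inst, delta)]

# Toy block: coordinate 0 is treated as format-critical (hypothetical).
merged = merge_block(
    theta_inst=[0.0, 0.0, 0.0, 0.0],
    theta_think=[1.0, 0.01, -2.0, 0.5],
    gate=0.5,
    keep_fraction=0.5,
    directions=[[1.0, 0.0, 0.0, 0.0]],
    scores=[1.0],
)
print(merged)
```

In this toy run the two small coordinates are dropped by thresholding, coordinate 2 receives half of its delta, and coordinate 0 is almost entirely suppressed because it lies along the protected direction.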

The desired endpoint is not a symmetric average. It is an Instruct-style agent that keeps the deployed tool interface of θinst while selectively importing the problem-solving ability of θthink. A coordinate is useful only when moving along the actual Thinking–Instruct delta improves reasoning while staying compatible with tool-use preservation — the challenge is not generic fusion but behaviour-conditioned directional editing.
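
One way to read "behaviour-conditioned directional editing" is as a first-order test per parameter block: a step along the delta δ changes a loss by roughly ⟨∇L, δ⟩, so a block is worth injecting only when that estimate is negative for both a reasoning loss and a tool-use loss. The sketch below illustrates that reading. The min-based scoring rule, the normalisation to [0, 1], and the block names are assumptions for the example, not the paper's definition of the Conservative Taylor Gate.

```python
def first_order_change(grad, delta):
    """Taylor estimate of the loss change from stepping along delta: <grad, delta>."""
    return sum(g * d for g, d in zip(grad, delta))

def conservative_gate(blocks):
    """blocks maps name -> (delta, grad_reason, grad_tool), all flat lists.
    Returns name -> injection strength in [0, 1] (hypothetical scoring rule)."""
    benefit = {}
    for name, (delta, grad_reason, grad_tool) in blocks.items():
        gain_reason = -first_order_change(grad_reason, delta)
        gain_tool = -first_order_change(grad_tool, delta)
        # Conservative: credit only the smaller of the two predicted gains,
        # and give zero strength if either loss is predicted to rise.
        benefit[name] = max(0.0, min(gain_reason, gain_tool))
    top = max(benefit.values(), default=0.0)
    if top == 0.0:
        return {name: 0.0 for name in benefit}
    return {name: b / top for name, b in benefit.items()}

# Toy gradients and deltas, invented for illustration only.
blocks = {
    "attn.late": ([0.5, -0.5], [-1.0, 1.0], [-0.2, 0.2]),
    "mlp.mid":   ([1.0, 0.0],  [-0.4, 0.0], [-0.8, 0.0]),
    "lm_head":   ([1.0, 1.0],  [-1.0, -1.0], [0.5, 0.5]),
}
strengths = conservative_gate(blocks)
print(strengths)
```

Here the hypothetical `lm_head` block is predicted to help reasoning but hurt tool use, so it receives zero strength, while the other two blocks are scaled by their weaker predicted gain.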

The endpoint trade-off

why neither checkpoint alone is enough

[Figure: Qualitative Roo-Eval trace comparing Instruct, Thinking, and CRANE]
A qualitative Roo-Eval trace on the python/scale-generator task. The same problem, three models — only CRANE reads the spec, fixes the bug, recovers, and passes all 17 tests.

Results

three agentic coding benchmarks × two model scales

Roo-Eval pass@1 · 30B: 66.2% (+19.5 over Instruct)
SWE-bench-Verified · 30B: +14 more resolved vs Instruct
Terminal-Bench v2 pass@5 · 80B: 30.3% (+3.3 over best baseline)

[Figure: TTC versus pass-rate frontier across three benchmarks at two scales]
Total Token Count (TTC) vs. success across all six settings. CRANE (★) sits on the upper-left frontier: higher success at a lower rollout footprint. The gains are not bought with extra tokens or longer wall time.

Reading the cost column. Alongside success, each table reports Total Token Count (TTC), a single provider-cost proxy that weights token types by their typical pricing: TTC = Nin + 0.1·Ncached + 5·Nout. It captures the economics of a rollout in one number (output tokens are the expensive part), so lower is better (↓). Because the models are served locally, TTC is a budget estimate rather than billed spend. CRANE reaches the highest pass@1 (or resolved count) in all six settings while keeping TTC below the Instruct endpoint in each.
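
The TTC formula is simple enough to compute directly from per-rollout token counts. The helper below is a small sketch of that arithmetic; the function name and the example counts are hypothetical, and it assumes Nin counts input tokens that were not served from the prefix cache, which the text does not state explicitly.

```python
def total_token_count(n_in, n_cached, n_out):
    """TTC = N_in + 0.1 * N_cached + 5 * N_out, as defined in the text."""
    return n_in + 0.1 * n_cached + 5 * n_out

# Hypothetical rollout; these counts are invented for illustration only.
rollout = {"n_in": 10_000, "n_cached": 40_000, "n_out": 2_000}
ttc = total_token_count(**rollout)
print(f"TTC = {ttc:,.0f} tokens")
```

For these made-up counts the sum is 10,000 + 0.1·40,000 + 5·2,000 = 24,000. The 2,000 output tokens contribute 10,000 of that total, which is why a model that over-deliberates pays a high TTC even when most of its context is cached.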

Table 1 — Roo-Eval · five-language in-IDE suite · pass rates (%) and Total Token Count
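30B = Qwen3-30B-A3B; 80B = Qwen3-Next-80B-A3B
Method                    | 30B pass@1 | 30B pass@3 | 30B pass_all | 30B TTC (M)↓ | 80B pass@1 | 80B pass@3 | 80B pass_all | 80B TTC (M)↓
Instruct (ref)            | 46.7       | 64.1       | 32.3         | 181.1        | 72.8       | 87.2       | 53.3         | 89.6
Thinking (ref)            | 34.9       | 52.8       | 17.9         | 146.9        | 35.4       | 49.7       | 22.6         | 109.5
AIM-TA (Best alternative) | 46.7       | 64.6       | 29.2         | 212.6        | 80.5       | 87.7       | 66.2         | 100.0
Task Arithmetic           | 47.2       | 61.0       | 33.3         | 208.1        | 78.5       | 88.7       | 67.7         | 93.1
TIES                      | 47.2       | 66.2       | 29.2         | 208.9        | 79.0       | 88.2       | 62.1         | 89.0
SLERP                     | 43.6       | 58.5       | 29.7         | 214.6        | 73.3       | 86.7       | 60.5         | 97.6
AIM-TIES                  | 45.1       | 61.5       | 29.2         | 211.3        | 76.4       | 90.8       | 61.0         | 96.0
LEWIS                     | 44.6       | 63.1       | 27.7         | 194.3        | 79.5       | 90.3       | 62.1         | 95.9
RAIN                      | 39.5       | 54.4       | 21.5         | 140.2        | 46.2       | 58.5       | 25.6         | 113.2
CRANE                     | 66.2       | 83.1       | 44.1         | 120.9        | 81.5       | 90.3       | 71.3         | 89.2
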
Table 2 — SWE-bench-Verified · 500-instance issue resolution · resolved count, % and Total Token Count
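30B = Qwen3-30B-A3B; 80B = Qwen3-Next-80B-A3B
Method                    | 30B Resolved | 30B % | 30B TTC (B)↓ | 80B Resolved | 80B % | 80B TTC (B)↓
Instruct (ref)            | 108          | 21.6  | 12.0         | 168          | 33.6  | 5.9
Thinking (ref)            | 47           | 9.4   | 14.3         | 125          | 25.0  | 14.2
AIM-TA (Best alternative) | 113          | 22.6  | 8.2          | 172          | 34.4  | 5.5
Task Arithmetic           | 109          | 21.8  | 8.2          | 169          | 33.8  | 5.5
TIES                      | 110          | 22.0  | 8.0          | 162          | 32.4  | 5.9
SLERP                     | 110          | 22.0  | 7.8          | 169          | 33.8  | 5.5
AIM-TIES                  | 111          | 22.2  | 8.9          | 169          | 33.8  | 5.3
LEWIS                     | 110          | 22.0  | 7.8          | 173          | 34.6  | 5.5
RAIN                      | 58           | 11.6  | 13.7         | 120          | 24.0  | 13.7
CRANE                     | 122          | 24.4  | 5.7          | 180          | 36.0  | 5.2
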
Table 3 — Terminal-Bench v2 · long-horizon shell workflows · pass@1 / pass@5 (%) and Total Token Count
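30B = Qwen3-30B-A3B; 80B = Qwen3-Next-80B-A3B
Method                      | 30B pass@1 | 30B pass@5 | 30B TTC (M)↓ | 80B pass@1 | 80B pass@5 | 80B TTC (M)↓
Instruct (ref)              | 5.4        | 10.1       | 112.6        | 13.5       | 22.5       | 52.6
Thinking (ref)              | 5.9        | 13.5       | 108.6        | 6.7        | 13.5       | 115.0
AIM-TIES (Best alternative) | 5.6        | 13.5       | 77.9         | 14.2       | 24.7       | 348.5
Task Arithmetic             | 5.4        | 14.6       | 69.9         | 13.0       | 24.7       | 310.2
TIES                        | 6.1        | 13.5       | 80.2         | 13.3       | 25.8       | 59.5
SLERP                       | 5.4        | 14.6       | 73.0         | 13.5       | 27.0       | 55.6
AIM-TA                      | 5.6        | 13.5       | 60.3         | 13.7       | 22.5       | 54.5
LEWIS                       | 5.2        | 11.2       | 60.6         | 14.2       | 25.8       | 54.1
RAIN                        | 5.6        | 10.1       | 99.3         | 7.9        | 15.7       | 109.3
CRANE                       | 7.6        | 17.9       | 58.1         | 14.8       | 30.3       | 51.8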

(Best alternative) marks the strongest non-CRANE merge on the headline metric. TTC↓ is lower-is-better. Terminal-Bench TTC is derived from recorded tokens as Nin + 0.1·Ncached + 5·Nout; the Task Arithmetic and AIM-TIES 80B values are inflated by lower prefix-cache hit rates in that audited sweep, as the paper notes.

[Figure: Per-language Roo-Eval pass@1 heatmap across methods at both scales]
Per-language Roo-Eval pass@1 across methods. CRANE leads on Python, JavaScript, and Go at 30B and stays among the top methods at 80B; the residual Java / Rust gap reflects asymmetric coverage in the underlying Thinking-model training.

What the gate learns

block-wise injection strength, derived automatically from calibration losses

[Figure: Conservative Taylor Gate importance heatmap on Qwen3-30B-A3B]
Conservative Taylor Gate importance across layers and components on Qwen3-30B-A3B. Late-layer attention, mid-depth experts, and the routing gate dominate; layer norms and the LM head receive near-zero injection. The ordering is stable: across five resampled calibration sets it holds a Spearman correlation above 0.99.
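
A stability claim of this kind can be checked by computing the Spearman rank correlation between the per-block importance vectors from every pair of calibration resamples and reporting the minimum. The sketch below shows one such check in plain Python. Taking the pairwise minimum is an assumption about how the statistic is aggregated, and the importance values are made up for the example rather than taken from the paper.

```python
import itertools
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def min_pairwise_spearman(runs):
    """Worst-case rank agreement across all pairs of resampled runs."""
    return min(spearman(a, b) for a, b in itertools.combinations(runs, 2))

# Hypothetical importance scores for five blocks under three resamples.
runs = [
    [0.90, 0.70, 0.40, 0.05, 0.01],
    [0.88, 0.72, 0.35, 0.06, 0.02],
    [0.91, 0.65, 0.42, 0.03, 0.04],  # the two least important blocks swap order
]
print(min_pairwise_spearman(runs))
```

With only five blocks, a single swap of adjacent ranks already pulls the correlation well below 0.99, so a value above 0.99 over the full set of layers and components indicates that the ordering barely moves between calibration sets.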

Citation

 
@misc{zhu2026crane,
  title         = {CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing},
  author        = {Zhu, Mingzhi and Merler, Michele and Pavuluri, Raju and Patterson, Stacy},
  year          = {2026},
  eprint        = {2605.14084},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2605.14084}
}

Acknowledgements

The authors acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot, Red Hat Research, the Mass Open Cloud (MOC), and IBM Research for contributing to this research result.