CRANE — Constrained Reasoning Injection for Code Agents via Nullspace Editing

Abstract

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct / Thinking checkpoints these capabilities are complementary but misaligned: the Instruct model is concise and tool-disciplined, while the Thinking model offers stronger planning and recovery but often over-deliberates and degrades agent performance.

We present CRANE, a training-free parameter-editing method that treats the Thinking–Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired checkpoints, it delivers strong gains over either individual model while preserving Instruct-level efficiency, and it consistently outperforms alternative merging strategies across three agentic coding benchmarks at two model scales.

Method

denoise the delta → gate it for tool-safety → project out format directions

Figure: The CRANE three-stage merge pipeline. **(1) Magnitude thresholding** sparsifies the delta δ = θ_think − θ_inst and discards low-confidence coordinates. **(2) Conservative Taylor Gate** sets per-block injection strength so only directions first-order beneficial to *both* reasoning transfer and tool-use are retained. **(3) Graduated Sigmoidal Projection** attenuates updates along format-critical subspaces (tool delimiters, schema, chat-template tokens), protecting the deployed agent protocol.
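The thresholding and projection stages can be illustrated with a small pure-Python sketch on a toy parameter vector. It is a sketch under stated assumptions, not the authors' implementation: the keep-fraction rule, the sigmoid parameters `tau` and `beta`, the per-direction `scores`, and the helper names `magnitude_threshold` and `sigmoid_project` are all hypothetical, and the format-critical directions are simply assumed to be given as orthonormal vectors.

```python
import math

def magnitude_threshold(delta, keep_frac=0.2):
    """Zero all but the largest-magnitude coordinates (assumed keep-fraction rule)."""
    k = max(1, round(keep_frac * len(delta)))
    # The k-th largest magnitude is the cutoff; ties at the cutoff are kept.
    cutoff = sorted((abs(x) for x in delta), reverse=True)[k - 1]
    return [x if abs(x) >= cutoff else 0.0 for x in delta]

def sigmoid_project(delta, directions, scores, tau=0.5, beta=10.0):
    """Attenuate delta along each direction (assumed orthonormal and given).

    scores[i] is a hypothetical format-criticality score for directions[i].
    A score well above tau suppresses that component almost entirely;
    a score well below tau leaves it almost untouched.
    """
    out = list(delta)
    for u, s in zip(directions, scores):
        suppress = 1.0 / (1.0 + math.exp(-beta * (s - tau)))  # value in (0, 1)
        coeff = sum(ui * di for ui, di in zip(u, delta))      # component along u
        out = [oi - suppress * coeff * ui for oi, ui in zip(out, u)]
    return out

# Toy example: delta = theta_think - theta_inst on four coordinates.
theta_inst = [0.0, 0.0, 0.0, 0.0]
theta_think = [0.5, -0.01, 0.02, -0.8]
delta = [t - i for t, i in zip(theta_think, theta_inst)]

sparse = magnitude_threshold(delta, keep_frac=0.5)  # keeps 0.5 and -0.8
# Treat the first coordinate as a format-critical direction (hypothetical).
projected = sigmoid_project(sparse, [[1.0, 0.0, 0.0, 0.0]], [1.0])
theta_merged = [i + p for i, p in zip(theta_inst, projected)]
```

In this toy case the small coordinates are dropped by the threshold, the large update along the protected direction is almost entirely removed, and the other large update passes through unchanged. A real merge would apply the same idea to weight tensors rather than Python lists.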

The desired endpoint is not a symmetric average. It is an Instruct-style agent that keeps the deployed tool interface of θ_inst while selectively importing the problem-solving ability of θ_think. A coordinate is useful only when moving along the actual Thinking–Instruct delta improves reasoning while staying compatible with tool-use preservation. The challenge is not generic fusion but behaviour-conditioned directional editing.
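One way to read "first-order beneficial to both" is through a Taylor expansion: for a small step α·δ_b on block b, a loss changes by roughly α·⟨g_b, δ_b⟩, where g_b is that loss's gradient at θ_inst. The sketch below gates each block on the worse of two such predictions. It is an illustration under assumptions, not the paper's gate: the binary keep/drop rule, the two loss gradients (`grad_reason`, `grad_tool`), the block names, and `alpha_max` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def taylor_gate(delta_blocks, grad_reason, grad_tool, alpha_max=1.0):
    """Return a per-block injection strength (assumed binary rule).

    A block is injected only if the first-order predicted loss change
    <grad, delta> is negative for BOTH the reasoning loss and the
    tool-use loss; otherwise its strength is set to zero.
    """
    strengths = {}
    for name, delta in delta_blocks.items():
        d_reason = dot(grad_reason[name], delta)
        d_tool = dot(grad_tool[name], delta)
        worst = max(d_reason, d_tool)  # conservative: judge by the worse objective
        strengths[name] = alpha_max if worst < 0.0 else 0.0
    return strengths

# Toy example with two hypothetical blocks.
delta_blocks = {"attn": [1.0, 0.0], "mlp": [0.0, 1.0]}
grad_reason = {"attn": [-1.0, 0.0], "mlp": [0.0, -1.0]}  # both help reasoning
grad_tool = {"attn": [-0.5, 0.0], "mlp": [0.0, 2.0]}     # "mlp" hurts tool use
strengths = taylor_gate(delta_blocks, grad_reason, grad_tool)
```

Here the `attn` block is kept because it is predicted to lower both losses, while the `mlp` block is dropped because it is predicted to raise the tool-use loss even though it would help reasoning.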

The endpoint trade-off

why neither checkpoint alone is enough

Figure: A qualitative Roo-Eval trace on the python/scale-generator task, comparing Instruct, Thinking, and CRANE. The same problem, three models: only CRANE reads the spec, fixes the bug, recovers, and passes all 17 tests.

INSTRUCT: Acts fast, but blind. Edits before reading the test file, then loops on the same failed tool call. Fails in 905s and 28k tokens.
THINKING: Deliberates, but never retests. Spends 98.5k characters in a single inner-monologue block, emits a malformed payload, and times out. Fails in 905s and 30k tokens.
CRANE: Reads first, recovers, passes. Reads the spec, applies the root fix (tonic.capitalize()), recovers after a partial failure, and passes all 17 tests in 226s and 8.8k tokens.

Results

three agentic coding benchmarks × two model scales

66.2% · Roo-Eval pass@1 · 30B · +19.5 over Instruct

+14 · SWE-bench-Verified · 30B · more resolved vs Instruct

30.3% · Terminal-Bench v2 pass@5 · 80B · +3.3 over best baseline

Figure: Total Token Count (TTC) versus pass rate across three benchmarks at two scales (six settings). CRANE (★) sits on the upper-left frontier: higher success at a **lower** rollout footprint. The gains are not bought with extra tokens or longer wall time.

Reading the cost column. Alongside success, each table reports Total Token Count (TTC), a single provider-cost proxy that weights token types by their typical pricing: TTC = N_in + 0.1·N_cached + 5·N_out. It captures the economics of a rollout in one number (output tokens are the expensive part), so lower is better (↓). Because the models are served locally, TTC is a budget estimate rather than billed spend. In every setting, CRANE reaches the highest score on the headline metric while keeping TTC at or below the Instruct endpoint.
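The TTC formula above is simple enough to compute directly. The snippet below is a minimal sketch: the function name `total_token_count` and the example token counts are hypothetical, and it assumes N_in counts only uncached input tokens, which the text does not spell out.

```python
def total_token_count(n_in, n_cached, n_out):
    """TTC = N_in + 0.1 * N_cached + 5 * N_out, as defined in the text.

    Assumes n_in excludes cached tokens (an assumption, not stated in the source).
    """
    return n_in + 0.1 * n_cached + 5 * n_out

# Hypothetical rollout: 10k fresh input, 40k cached input, 2k output tokens.
# Weighted terms: 10,000 + 4,000 + 10,000 = 24,000.
ttc = total_token_count(10_000, 40_000, 2_000)
```

In this made-up rollout the 2k output tokens cost as much as the 10k fresh input tokens, while the 40k cached tokens contribute less than either, which is why shorter generations move TTC far more than a shorter prompt does.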

Table 1 — Roo-Eval · five-language in-IDE suite · pass rates (%) and Total Token Count
Method	Qwen3-30B-A3B				Qwen3-Next-80B-A3B
Method	pass@1	pass@3	pass_all	TTC (M)↓	pass@1	pass@3	pass_all	TTC (M)↓
Instruct (ref)	46.7	64.1	32.3	181.1	72.8	87.2	53.3	89.6
Thinking (ref)	34.9	52.8	17.9	146.9	35.4	49.7	22.6	109.5
AIM-TA (Best alternative)	46.7	64.6	29.2	212.6	80.5	87.7	66.2	100.0
Task Arithmetic	47.2	61.0	33.3	208.1	78.5	88.7	67.7	93.1
TIES	47.2	66.2	29.2	208.9	79.0	88.2	62.1	89.0
SLERP	43.6	58.5	29.7	214.6	73.3	86.7	60.5	97.6
AIM-TIES	45.1	61.5	29.2	211.3	76.4	90.8	61.0	96.0
LEWIS	44.6	63.1	27.7	194.3	79.5	90.3	62.1	95.9
RAIN	39.5	54.4	21.5	140.2	46.2	58.5	25.6	113.2
CRANE	66.2	83.1	44.1	120.9	81.5	90.3	71.3	89.2

Table 2 — SWE-bench-Verified · 500-instance issue resolution · resolved count, % and Total Token Count
Method	Qwen3-30B-A3B			Qwen3-Next-80B-A3B
Method	Resolved	%	TTC (B)↓	Resolved	%	TTC (B)↓
Instruct (ref)	108	21.6	12.0	168	33.6	5.9
Thinking (ref)	47	9.4	14.3	125	25.0	14.2
AIM-TA (Best alternative)	113	22.6	8.2	172	34.4	5.5
Task Arithmetic	109	21.8	8.2	169	33.8	5.5
TIES	110	22.0	8.0	162	32.4	5.9
SLERP	110	22.0	7.8	169	33.8	5.5
AIM-TIES	111	22.2	8.9	169	33.8	5.3
LEWIS	110	22.0	7.8	173	34.6	5.5
RAIN	58	11.6	13.7	120	24.0	13.7
CRANE	122	24.4	5.7	180	36.0	5.2

Table 3 — Terminal-Bench v2 · long-horizon shell workflows · pass@1 / pass@5 (%) and Total Token Count
Method	Qwen3-30B-A3B			Qwen3-Next-80B-A3B
Method	pass@1	pass@5	TTC (M)↓	pass@1	pass@5	TTC (M)↓
Instruct (ref)	5.4	10.1	112.6	13.5	22.5	52.6
Thinking (ref)	5.9	13.5	108.6	6.7	13.5	115.0
AIM-TIES (Best alternative)	5.6	13.5	77.9	14.2	24.7	348.5
Task Arithmetic	5.4	14.6	69.9	13.0	24.7	310.2
TIES	6.1	13.5	80.2	13.3	25.8	59.5
SLERP	5.4	14.6	73.0	13.5	27.0	55.6
AIM-TA	5.6	13.5	60.3	13.7	22.5	54.5
LEWIS	5.2	11.2	60.6	14.2	25.8	54.1
RAIN	5.6	10.1	99.3	7.9	15.7	109.3
CRANE	7.6	17.9	58.1	14.8	30.3	51.8

(Best alternative) marks the strongest non-CRANE merge on the headline metric. TTC↓ is lower-is-better. Terminal-Bench TTC is derived from recorded tokens as N_in + 0.1·N_cached + 5·N_out; the Task Arithmetic and AIM-TIES 80B values are inflated by lower prefix-cache hit rates in that audited sweep, as the paper notes.

Figure: Per-language Roo-Eval pass@1 heatmap across methods at both scales. CRANE leads on Python, JavaScript, and Go at 30B and stays among the top methods at 80B; the residual Java / Rust gap reflects asymmetric coverage in the underlying Thinking-model training.

Acknowledgements

The authors acknowledge the National Artificial Intelligence Research Resource (NAIRR) Pilot and Red Hat Research, the Mass Open Cloud (MOC), and IBM Research for contributing to this research result.
