
SWE-Star: Scaling Agentic Coding Distillation

Authored by
Christian Mürtz, Mark Niklas Müller

Mark and I wrote this blog post during my master’s thesis. You can also find it on the LogicStar blog.

This project was built using MareNostrum 5 ACC, one of Europe’s largest operational GPU clusters with 4,480 H100s. All European researchers and startups can apply for 5,000–50,000 H100-hours via EuroHPC AI Factory calls to reproduce, extend, and improve this work. The grant process is fast and straightforward!

Today we’re excited to share the SWE-Star family of models: a 7B, a 14B, and a 32B model, each fine-tuned from the corresponding Qwen2.5-Coder base model, together with a dataset of roughly 250,000 agent trajectories distilled from Devstral-2-Small using SWE-Smith environments and tasks. Our largest model, SWE-Star-32B, reaches 51% Pass@1 on SWE-Bench Verified, setting a new state of the art among open-data models in this size class. The 14B model reaches 48.0% at less than half the parameter count, while the 7B model reaches 30%.


All data generation, training, and evaluation were run on MareNostrum 5 (MN5), a European public supercomputer with 4,480 H100 GPUs. In this post, we describe how we scaled agentic data generation, training, and evaluation in a highly restricted HPC environment (no Docker, no outbound internet, and massive parallelism) and how these constraints shaped the system design. We also open-source our full agent scaffold, data generation pipeline, and training infrastructure so other researchers can build on this work on similar clusters.

Scaling, Scaling, Scaling

Ever since the original scaling laws paper, “just scale it” has been the dominant recipe for improving models — first through larger pretraining runs and more data, and more recently through heavier post-training and reinforcement learning. Large open-weight models are now approaching 70% on SWE-Bench Verified, likely driven by significant RL investment.

We were wondering: how much of this agentic capability can we distill into smaller, cheaper, and easier-to-deploy models using supervised data alone?

Distilling from strong teacher models is attractive because it is sample-efficient and allows smaller models to imitate the long-horizon reasoning and tool-use behaviors that larger models learn through extended RL. Over the past year, several works have built SWE-style environments to enable this. Most notably, SWE-Smith introduced a scalable pipeline for injecting bugs into real codebases and back-translating them into realistic but synthetic issues. Using this pipeline, they created over 50k tasks and distilled 5k successful trajectories with Claude 3.7 Sonnet. As the number of trajectories grew, they observed almost perfect log-linear scaling, pushing Qwen2.5-Coder-32B from under 10% to 40.2%:

[Figure: SWE-Smith scaling curve, Pass@1 of Qwen2.5-Coder-32B vs. number of training trajectories]

Because their work relied on cloud training credits and subsidized expert inference, they stopped scaling even though their issue generation pipeline had already produced over 50k issues.

So we asked: what happens if we scale this further?

An H100-hour typically costs $4–10, and a single training run on 100k trajectories consumes roughly 4,500 H100-hours. Since we wanted to run large-scale ablations and generate scaling curves (see our next post), we expected to need well over 10,000 H100-hours. As a small startup with no commercial training budget, we turned to EuroHPC, an EU initiative that provides researchers and startups with free access to Europe’s largest supercomputers, including MareNostrum 5.

Making Use of MN5

Access to MN5 is powerful, but it comes with some unique constraints. For historical reasons common in HPC environments, the cluster has no outbound internet access, and the only interface is SSH access to two login nodes. The system is managed by SLURM, and compute jobs run in a highly restricted user mode. This is very different from typical cloud VMs, where you have full system control.

This implies:

  • Setting up dependencies, models, and datasets is non-trivial (no outbound internet).
  • Both the agent and the inference engine must run entirely on the cluster.
  • Standard Docker setups are unavailable; only restricted user-mode Podman is allowed.
  • Most existing agent scaffolds assume internet access or Docker, or do not scale to hundreds of parallel environments.

We therefore built a custom agent scaffold, forked from mini-swe-agent, that supports OpenHands tooling and scales efficiently under MN5’s constraints. Expert models are hosted via SGLang, data generation is orchestrated through SLURM submissions, and post-training is done with torchtune. The pipeline supports massive parallel data generation and hundreds of concurrent training runs for systematic scaling studies.

Our Agent Scaffold

OpenHands is currently the most popular open-source ReAct-style agent scaffold, providing basic tools for editing and browsing codebases as well as context condensation. While large proprietary models perform reasonably well with minimal tooling, smaller models with limited context windows (e.g., 32k tokens) struggle without structured editing and condensation.

Our design mirrors OpenHands in both tooling and condensation. The agent has access to four tools: think, execute_bash, str_replace_editor, and submit. When the context limit is reached, older observations are masked until the condensed context fits back into the model’s window while preserving space for reasoning and tool calls. We use XML-style tool calls for simplicity, since Qwen2.5-Coder does not support native tool-calling tokens.
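
A minimal sketch of this condensation step, assuming a simple message-list representation and an illustrative count_tokens helper (the exact masking rule and token budget in our scaffold differ in detail):

    # Sketch: mask the oldest tool observations until the prompt fits the
    # model's window, keeping headroom for the next reasoning step and tool call.
    MASKED = "[Old observation masked to save context]"

    def condense(messages, count_tokens, max_tokens=32_768, reserve=4_096):
        budget = max_tokens - reserve
        msgs = list(messages)
        for i, msg in enumerate(msgs):
            if sum(count_tokens(m) for m in msgs) <= budget:
                break  # already fits, stop masking
            if msg["role"] == "tool" and msg["content"] != MASKED:
                msgs[i] = {**msg, "content": MASKED}  # mask the oldest observation first
        return msgs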

Due to MN5’s restricted user mode, each agent runs inside a single-UID Podman container, communicating through two interactive Bash sessions. This differs from common execution-server designs, which require privileged container builds. We translate all str_replace_editor calls into equivalent Bash operations (e.g., first reading a file, editing it on the host side, and writing it back via cat -). A dedicated long-running Bash session handles all execute_bash commands (see the full thesis for details).
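
As an illustration, a str_replace edit can be lowered to Bash roughly as follows; session.run is a hypothetical wrapper around the container’s interactive Bash session, and a real implementation would add error handling, diffs, and output truncation:

    import shlex

    # Sketch: lower a str_replace_editor call onto the container's Bash session.
    def str_replace(session, path, old_str, new_str):
        # 1) Read the file out of the container through Bash.
        content = session.run(f"cat {shlex.quote(path)}")
        if content.count(old_str) != 1:
            return f"Error: old_str must occur exactly once in {path}."
        # 2) Apply the edit on the agent side, where full Python is available.
        edited = content.replace(old_str, new_str, 1)
        # 3) Stream the edited file back into the container via `cat -`.
        session.run(f"cat - > {shlex.quote(path)}", stdin=edited)
        return f"Edited {path}."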

Generating a Massive Dataset

Because everything had to run on MN5, we self-hosted our teacher models. Shortly before our project began, Mistral released Devstral-2-Small, a 24B model achieving up to 68% on SWE-Bench Verified with their own agent scaffold. In our offline OpenHands setup, we achieved around 60%, which still provides a strong margin over the ~40% baseline we aimed to surpass. Our ablations also suggested that teacher strength is secondary during early scaling.

Devstral-2-Small fits efficiently on a single 4×H100 node (256 GB VRAM) using SGLang. In agentic workloads, the main bottleneck is KV cache memory: with up to ~100 turns per trajectory, evicting cached prefixes and re-prefilling them repeatedly severely degrades throughput. A full 32k context occupies ~5.4 GB, and we found ~40 parallel agents per node to be a good trade-off between cache reuse and decode batch size. We further used N-gram speculative decoding, which proved highly effective due to repetitive code patterns.
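
The ~5.4 GB figure follows from back-of-the-envelope KV cache arithmetic; the model-shape values below (40 layers, 8 KV heads, head dimension 128, bf16 cache) are assumptions for a Mistral-style 24B model rather than numbers taken from the model card:

    # Sketch: KV cache size per agent at full 32k context.
    layers, kv_heads, head_dim, bytes_per_value = 40, 8, 128, 2  # assumed shape, bf16
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    context_len = 32_768
    per_agent_gb = kv_bytes_per_token * context_len / 1e9
    print(f"{per_agent_gb:.1f} GB per full 32k context")                 # ~5.4 GB
    free_kv_gb = 200  # rough cache budget left after weights, an assumption
    print(f"~{int(free_kv_gb / per_agent_gb)} concurrent full contexts")  # ~37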

Each node can unroll roughly 200–300 tasks per hour. Sequentially, generating 250k trajectories would take over a month — so we parallelized aggressively. With ~200 nodes, the entire dataset can be generated in under five hours. Each node operates independently, making job scheduling and dataset partitioning straightforward:

[Figure: independent per-node data-generation jobs scheduled in parallel across MN5]
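
For instance, with a SLURM job array that assigns one array index per node, every job can pick its own slice of the task list; the file layout and zero-based array indexing below are illustrative assumptions:

    import json, os

    # Sketch: each array job (one per node) takes a strided shard of the tasks.
    # Assumes the array was submitted as --array=0-(N-1).
    def load_shard(task_file="tasks.jsonl"):
        rank = int(os.environ["SLURM_ARRAY_TASK_ID"])
        world = int(os.environ["SLURM_ARRAY_TASK_COUNT"])
        with open(task_file) as f:
            tasks = [json.loads(line) for line in f]
        return tasks[rank::world]  # strided split keeps shards balanced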

Training with Torchtune

We filtered the 250k trajectories to retain only those that passed the final SWE-Smith tests. Because Devstral-2-Small supports contexts up to 256k tokens while Qwen2.5-Coder is trained on 32k, we segmented long traces into approximations of what the agent would observe under context condensation:

[Figure: segmenting long teacher traces into condensed 32k-token training contexts]
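
One possible way to do this segmentation, reusing the condense helper sketched earlier (passed in as an argument); the per-turn splitting and sample format are illustrative assumptions rather than our exact recipe:

    # Sketch: turn one long teacher trace into <=32k-token SFT samples that
    # approximate the condensed context the student would have seen.
    def segment_trace(messages, condense, count_tokens, student_ctx=32_768):
        samples = []
        for t, msg in enumerate(messages):
            if msg["role"] != "assistant":
                continue
            # Context the 32k student would observe before this assistant turn.
            context = condense(messages[:t], count_tokens, max_tokens=student_ctx)
            samples.append({"prompt": context, "completion": msg})
        return samples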

We chose torchtune for supervised fine-tuning due to its simplicity, memory efficiency, and FSDP2 support. Each H100 provides only 64 GB of VRAM, so we trained across four nodes (16 GPUs total) with full sharding of weights, gradients, and optimizer state in bf16. All models used a learning rate of 5e-5 with a cosine schedule. Activation checkpointing and offloading were necessary to support full 32k context training under these memory constraints.
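
For readers unfamiliar with FSDP2, the core of this setup looks roughly like the sketch below; in practice it is driven entirely by torchtune’s recipe and config rather than hand-written code, and the fully_shard API shown assumes a recent PyTorch release:

    import torch
    from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

    # Sketch: shard parameters, gradients, and optimizer state across all GPUs
    # in bf16 (activation checkpointing/offloading is applied separately).
    def shard_model(model):
        mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16)
        for block in model.layers:        # shard each transformer block first
            fully_shard(block, mp_policy=mp)
        fully_shard(model, mp_policy=mp)  # then the root module
        return model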

Results

With our infrastructure, scheduling hundreds of training jobs to generate scaling curves is straightforward. The results are shown below:

[Figure: SWE-Bench Verified Pass@1 vs. number of training trajectories for the 7B, 14B, and 32B models]

The scaling is not log-linear at higher performance levels. All model sizes exhibit rapid early gains as they adapt to the task and scaffold. Beyond ~40% Pass@1, the 32B model begins to saturate and continues improving only slowly. Interestingly, the 14B model performs close to the 32B model despite a more than 2× reduction in parameter count. The 7B model lags behind but still shows unsaturated scaling.

Final Thoughts

Although scaling further let us surpass SWE-Smith by over 10 percentage points on SWE-Bench Verified, the near log-linear scaling observed in earlier work transitions into diminishing returns at higher performance levels. Models adapt quickly to the scaffold and task format, but improvements beyond ~40% become increasingly incremental, even with substantial additional data.

If you find yourself wondering: Is masking observations really necessary? Is rejection sampling actually helpful? Are we bottlenecked by environment diversity or trajectory quality? Does unrolling each task multiple times help or hurt? — these are exactly the questions we explore next.

In Part 2, we present extensive ablations showing that the early rapid scaling phase is remarkably robust to many design choices. Some results may be surprising: repository diversity appears less critical than commonly assumed, rejection sampling is not strictly necessary for strong gains, and several presumed bottlenecks turn out to have weaker effects than expected.

We hope our work helps demystify large-scale agentic coding distillation and encourages more open experimentation in this space.