cmuertz.dev
  • Published on
    Today we’re excited to share the SWE-Star family of models: 7B, 14B, and 32B models trained on Qwen2.5-Coder, together with a dataset of roughly 250,000 agent trajectories distilled from Devstral-2-Small using SWE-Smith environments and tasks. Our largest model, SWE-Star-32B, reaches 51% Pass@1 on SWE-Bench Verified, setting a new state of the art among open-data models in this size class. The 14B model reaches 49.6% at less than half the parameter count, while the 7B model reaches 30%.
  • Published on
    Evaluating coding agents shouldn’t feel like watching paint dry. Yet with SWE-Bench Verified, it often does: hundreds of Docker images totaling 240 GiB, throttled by rate limits, turn the first setup on a new machine into a 30-hour ordeal. By restructuring layers, trimming unnecessary files, and compressing the results, we shrank SWE-Bench Verified from 240 GiB to just 5 GiB. Now it downloads in under a minute, making large-scale evaluation and trace generation on cloud machines fast and painless.