cmuertz.dev
Evaluating coding agents shouldn’t feel like watching paint dry. Yet with SWE-Bench Verified it often does: hundreds of Docker images totaling 240 GiB, throttled by rate limits*, turn the first setup on a new machine into a 30-hour ordeal. By restructuring layers, trimming unnecessary files, and compressing the result, we shrank SWE-Bench Verified from 240 GiB to just 5 GiB. It now downloads in under a minute, which makes large-scale evaluation and trace generation on cloud machines fast and painless.
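To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (240 GiB, 30 hours, 5 GiB, one minute). The helper name `effective_throughput_mib_s` is made up for illustration, and treating the 30-hour setup as a single average transfer rate is a simplifying assumption, since part of that time goes to rate limiting rather than raw bandwidth.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def effective_throughput_mib_s(size_gib: float, seconds: float) -> float:
    """Average throughput in MiB/s needed to move size_gib in the given time."""
    return size_gib * GIB / MIB / seconds


# Figures quoted in the text above.
original_gib, compressed_gib = 240, 5

reduction = original_gib / compressed_gib
# Assumption: the whole 30-hour setup is modeled as one steady transfer.
setup_rate = effective_throughput_mib_s(original_gib, 30 * 3600)
# "Under a minute" means 60 seconds is an upper bound on the download time.
download_rate = effective_throughput_mib_s(compressed_gib, 60)

print(f"size reduction: {reduction:.0f}x")
print(f"30-hour setup averages about {setup_rate:.1f} MiB/s")
print(f"a one-minute download needs at least {download_rate:.1f} MiB/s")
```

Under these assumptions the shrink is 48x, the original setup averages roughly 2.3 MiB/s end to end, and the compressed download needs a sustained rate of about 85 MiB/s, which is well within reach of typical cloud machines.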