Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs).
Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification.
To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings.
Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning.
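For concreteness, the sketch below shows one way these four tools could be exposed to a model through a function-calling interface. The tool names follow Build-bench, but the schema format and argument fields are illustrative assumptions rather than the benchmark's actual interface.

```python
# Hypothetical tool schema for Build-bench's four auxiliary tools.
# Tool names follow the paper; argument fields are illustrative assumptions.
TOOLS = [
    {
        "name": "structure_extraction",
        "description": "Return the directory tree of the package source.",
        "parameters": {"package_root": "str"},
    },
    {
        "name": "file_content_extraction",
        "description": "Return the contents of a file, optionally restricted to a line range.",
        "parameters": {"path": "str", "start_line": "int", "end_line": "int"},
    },
    {
        "name": "content_modification",
        "description": "Apply an edit (e.g., a unified diff) to a file in the package.",
        "parameters": {"path": "str", "patch": "str"},
    },
    {
        "name": "build_verification",
        "description": "Rebuild the package on the target ISA and return the build log.",
        "parameters": {"target_arch": "str"},  # e.g., "aarch64" or "x86_64"
    },
]
```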
The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts.
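A minimal sketch of this loop, assuming a cap of N_max iterations, is shown below; the injected callables (call_llm, apply_edits, run_build) and the history record format are hypothetical placeholders, not Build-bench's implementation.

```python
# Minimal sketch of the iterative repair loop, capped at n_max iterations.
# call_llm, apply_edits, and run_build are hypothetical placeholders.
def repair_package(package, target_arch, initial_log,
                   call_llm, apply_edits, run_build, n_max=3):
    log, history = initial_log, []
    for iteration in range(1, n_max + 1):
        # The model sees the latest build log plus the outcomes of earlier attempts.
        edits = call_llm(package=package, build_log=log, history=history)
        apply_edits(package, edits)
        success, log = run_build(package, target_arch)  # rebuild on the target ISA
        history.append({"iteration": iteration, "edits": edits, "success": success})
        if success:
            return True, history   # verified build on the target architecture
    return False, history          # give up after n_max failed attempts
```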
Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and that tool usage patterns differ significantly across models.
By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
System Workflow
The overall workflow of Build-bench, illustrated in Fig. 2, consists of three major stages: (1) Input & Diagnosis Context, (2) LLM-driven Repair Process, and (3) Verification & Evaluation.
Cross-ISA Build Tasks
As shown in Table 1, we evaluate both the effectiveness and efficiency of model-driven repair. The following metrics are adopted in Build-bench:
- Build Success Rate: the percentage of packages that are successfully built on the target architecture within N_max iterations.
- Average Repair Time (min): the average time a package takes until a successful build or termination.
- Average Token Consumption (K): the average number of input and output tokens the model consumes for each package during the entire repair process.
This formulation provides a clear and measurable framework for assessing whether LLMs can understand, adapt, and repair software packages in cross-ISA migration scenarios, emphasizing their reasoning depth, contextual utilization, and cost-effectiveness.
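As a rough illustration, these three metrics could be computed from per-package repair records as in the sketch below; the record fields are an assumed format, not Build-bench's actual data schema.

```python
# Sketch of metric computation from per-package repair records.
# The record format is an assumption for illustration only.
def summarize(records):
    """records: list of dicts with keys 'success' (bool),
    'repair_time_min' (float), and 'tokens' (int, input + output)."""
    n = len(records)
    return {
        "build_success_rate": 100.0 * sum(r["success"] for r in records) / n,
        "avg_repair_time_min": sum(r["repair_time_min"] for r in records) / n,
        "avg_token_consumption_k": sum(r["tokens"] for r in records) / n / 1000.0,
    }

# Example: two packages, one repaired successfully.
print(summarize([
    {"success": True, "repair_time_min": 12.5, "tokens": 48_000},
    {"success": False, "repair_time_min": 30.0, "tokens": 95_000},
]))
```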
Overall, the results reveal a trade-off between reasoning depth and efficiency. Models that maintain longer reasoning chains and richer tool interactions tend to achieve higher success rates but consume more time and tokens.
Conversely, faster models often exhibit insufficient context retention or under-exploration of repair strategies.
Table 2 summarizes the effect of iterative feedback on cumulative repair success across three iterations in Build-bench.
Each iteration reuses the latest build log and prior repair output as contextual feedback, allowing the model to refine its reasoning and avoid repeating ineffective edits.
Across both migration directions, iterative feedback consistently improves repair outcomes, confirming that LLMs benefit from exposure to updated diagnostic information.
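One plausible way to assemble this feedback context is sketched below; the field names, truncation length, and prompt wording are our own assumptions for illustration, not the prompts used in Build-bench.

```python
# Sketch of feedback-context assembly for the next repair iteration.
# Field names, truncation length, and wording are illustrative assumptions.
def build_feedback_prompt(build_log: str, prior_attempts: list, max_log_chars: int = 8000) -> str:
    """Combine the latest build log with a summary of earlier repair attempts."""
    tail = build_log[-max_log_chars:]  # long logs are truncated to the most recent portion
    summary = "\n".join(
        f"Attempt {i + 1}: edits={a['edits']!r}, build succeeded={a['success']}"
        for i, a in enumerate(prior_attempts)
    )
    return (
        "Previous repair attempts:\n" + (summary or "none") +
        "\n\nLatest build log (tail):\n" + tail +
        "\n\nPropose the next set of edits; avoid repeating edits that already failed."
    )
```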
Tool Invocation Behavior Across LLMs
To further understand the behavioral characteristics of different models during cross-ISA repair, we analyze their tool invocation patterns.
In the corresponding figure, the bars indicate the total number of invocations for each tool across all 268 packages (including both migration directions), while the gray line represents the average number of tool calls per iteration, averaged over all repair attempts of each package.
These findings reveal that LLMs differ not only in their linguistic reasoning capabilities but also in their operational strategies during tool-assisted repair, which is an important factor contributing to their divergent success rates in cross-ISA build repair.
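For reference, the per-tool totals (the bars) and the average number of calls per iteration (the gray line) could be derived from raw tool-call traces roughly as follows; the trace format here is an assumption.

```python
# Sketch of tool-invocation statistics from per-iteration tool-call traces.
# The trace format (one list of tool names per iteration) is an assumption.
from collections import Counter

def tool_usage_stats(traces):
    """traces: list of repair iterations, each a list of tool-call names."""
    totals = Counter(call for iteration in traces for call in iteration)
    avg_per_iteration = sum(totals.values()) / len(traces) if traces else 0.0
    return totals, avg_per_iteration

totals, avg = tool_usage_stats([
    ["structure_extraction", "file_content_extraction", "content_modification"],
    ["file_content_extraction", "content_modification", "build_verification"],
])
print(totals, avg)
```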
BibTeX citation
@misc{zhao2025languagemodelscodingassessing,
      title={Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems},
      author={Chenyu Zhao and Shenglin Zhang and Zeshun Huang and Weilin Jin and Yongqian Sun and Dan Pei and Chaoyun Zhang and Qingwei Lin and Chetan Bansal and Saravan Rajmohan and Minghua Ma},
      year={2025},
      eprint={2511.00780},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.00780},
}