Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs).
Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification.
To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings.
Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning.
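For concreteness, the sketch below shows one way these four tools could be exposed to a model through a function-calling interface. The tool names follow Build-bench, but the schema format and argument fields are illustrative assumptions rather than the benchmark's actual interface.

```python
# Hypothetical tool schema for Build-bench's four auxiliary tools.
# Tool names follow the paper; argument fields are illustrative assumptions.
TOOLS = [
    {
        "name": "structure_extraction",
        "description": "Return the directory tree of the package source.",
        "parameters": {"package_root": "str"},
    },
    {
        "name": "file_content_extraction",
        "description": "Return the contents of a file, optionally restricted to a line range.",
        "parameters": {"path": "str", "start_line": "int", "end_line": "int"},
    },
    {
        "name": "content_modification",
        "description": "Apply an edit (e.g., a unified diff) to a file in the package.",
        "parameters": {"path": "str", "patch": "str"},
    },
    {
        "name": "build_verification",
        "description": "Rebuild the package on the target ISA and return the build log.",
        "parameters": {"target_arch": "str"},  # e.g., "aarch64" or "x86_64"
    },
]
```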
The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts.
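A minimal sketch of this loop, assuming a cap of N_max iterations, is shown below; the injected callables (call_llm, apply_edits, run_build) and the history record format are hypothetical placeholders, not Build-bench's implementation.

```python
# Minimal sketch of the iterative repair loop, capped at n_max iterations.
# call_llm, apply_edits, and run_build are hypothetical placeholders.
def repair_package(package, target_arch, initial_log,
                   call_llm, apply_edits, run_build, n_max=3):
    log, history = initial_log, []
    for iteration in range(1, n_max + 1):
        # The model sees the latest build log plus the outcomes of earlier attempts.
        edits = call_llm(package=package, build_log=log, history=history)
        apply_edits(package, edits)
        success, log = run_build(package, target_arch)  # rebuild on the target ISA
        history.append({"iteration": iteration, "edits": edits, "success": success})
        if success:
            return True, history   # verified build on the target architecture
    return False, history          # give up after n_max failed attempts
```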
Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and that tool usage patterns differ significantly across models.
By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
System Workflow
The overall workflow of Build-bench, illustrated in Fig. 2, consists of three major stages: (1) Input & Diagnosis Context, (2) LLM-driven Repair Process, and (3) Verification & Evaluation.
Cross-ISA Build Tasks
As shown in Table 1, we evaluate both the effectiveness and efficiency of model-driven repair. The following metrics are adopted in Build-bench:
- Build Success Rate: the percentage of packages that are successfully built on the target architecture within N_max iterations.
- Average Repair Time (min): the average time a package takes until a successful build or termination.
- Average Token Consumption (K): the average number of input and output tokens the model consumes for each package during the entire repair process.
This formulation provides a clear and measurable framework for assessing whether LLMs can understand, adapt, and repair software packages in cross-ISA migration scenarios, emphasizing their reasoning depth, contextual utilization, and cost-effectiveness.
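As a rough illustration, these three metrics could be computed from per-package repair records as in the sketch below; the record fields are an assumed format, not Build-bench's actual data schema.

```python
# Sketch of metric computation from per-package repair records.
# The record format is an assumption for illustration only.
def summarize(records):
    """records: list of dicts with keys 'success' (bool),
    'repair_time_min' (float), and 'tokens' (int, input + output)."""
    n = len(records)
    return {
        "build_success_rate": 100.0 * sum(r["success"] for r in records) / n,
        "avg_repair_time_min": sum(r["repair_time_min"] for r in records) / n,
        "avg_token_consumption_k": sum(r["tokens"] for r in records) / n / 1000.0,
    }

# Example: two packages, one repaired successfully.
print(summarize([
    {"success": True, "repair_time_min": 12.5, "tokens": 48_000},
    {"success": False, "repair_time_min": 30.0, "tokens": 95_000},
]))
```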
Overall, the results reveal a trade-off between reasoning depth and efficiency. Models that maintain longer reasoning chains and richer tool interactions tend to achieve higher success rates but consume more time and tokens.
Conversely, faster models often exhibit insufficient context retention or under-exploration of repair strategies.
Table 2 summarizes the effect of iterative feedback on cumulative repair success across three iterations in Build-bench.
Each iteration reuses the latest build log and prior repair output as contextual feedback, allowing the model to refine its reasoning and avoid repeating ineffective edits.
Across both migration directions, iterative feedback consistently improves repair outcomes, confirming that LLMs benefit from exposure to updated diagnostic information.
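One plausible way to assemble this feedback context is sketched below; the field names, truncation length, and prompt wording are our own assumptions for illustration, not the prompts used in Build-bench.

```python
# Sketch of feedback-context assembly for the next repair iteration.
# Field names, truncation length, and wording are illustrative assumptions.
def build_feedback_prompt(build_log: str, prior_attempts: list, max_log_chars: int = 8000) -> str:
    """Combine the latest build log with a summary of earlier repair attempts."""
    tail = build_log[-max_log_chars:]  # long logs are truncated to the most recent portion
    summary = "\n".join(
        f"Attempt {i + 1}: edits={a['edits']!r}, build succeeded={a['success']}"
        for i, a in enumerate(prior_attempts)
    )
    return (
        "Previous repair attempts:\n" + (summary or "none") +
        "\n\nLatest build log (tail):\n" + tail +
        "\n\nPropose the next set of edits; avoid repeating edits that already failed."
    )
```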
Tool Invocation Behavior Across LLMs
To further understand the behavioral characteristics of different models during cross-ISA repair, we analyze their tool invocation patterns.
In the corresponding figure, the bars indicate the total number of invocations for each tool across all 268 packages (including both migration directions), while the gray line represents the average number of tool calls per iteration, averaged over all repair attempts of each package.
These findings reveal that LLMs differ not only in their linguistic reasoning capabilities but also in their operational strategies during tool-assisted repair, which is an important factor contributing to their divergent success rates in cross-ISA build repair.
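For reference, the per-tool totals (the bars) and the average number of calls per iteration (the gray line) could be derived from raw tool-call traces roughly as follows; the trace format here is an assumption.

```python
# Sketch of tool-invocation statistics from per-iteration tool-call traces.
# The trace format (one list of tool names per iteration) is an assumption.
from collections import Counter

def tool_usage_stats(traces):
    """traces: list of repair iterations, each a list of tool-call names."""
    totals = Counter(call for iteration in traces for call in iteration)
    avg_per_iteration = sum(totals.values()) / len(traces) if traces else 0.0
    return totals, avg_per_iteration

totals, avg = tool_usage_stats([
    ["structure_extraction", "file_content_extraction", "content_modification"],
    ["file_content_extraction", "content_modification", "build_verification"],
])
print(totals, avg)
```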
BibTeX citation
@misc{zhao2025languagemodelscodingassessing,
      title={Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems},
      author={Chenyu Zhao and Shenglin Zhang and Zeshun Huang and Weilin Jin and Yongqian Sun and Dan Pei and Chaoyun Zhang and Qingwei Lin and Chetan Bansal and Saravan Rajmohan and Minghua Ma},
      year={2025},
      eprint={2511.00780},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.00780},
}