BoLT is the first black-box optimization (BBO) benchmark grounded in real LLM experiments, and it can be used without large-scale compute.
Critical decisions in the modern LLM pipeline, framed as optimization problems.
| Problem | Dim | Challenges |
|---|---|---|
| HPO | 7 | mixed variables |
| HPO-MF-Cont | 8 | mixed variables, multi-fidelity (continuous) |
| HPO-MF-Disc | 8 | mixed variables, multi-fidelity (discrete) |

| Problem | Dim | Challenges |
|---|---|---|
| DMO | 6 | simplex constraint |
| DMO-MO | 6 | simplex constraint, multi-objective |
| DMO-Het | 6 | simplex constraint, heteroscedastic noise |

| Problem | Dim | Challenges |
|---|---|---|
| PO-128 | 128 | high-dimensional |
| PO-256 | 256 | high-dimensional |
| PO-512 | 512 | high-dimensional |
| PO-768 | 768 | high-dimensional |
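
The DMO problems in the tables above carry a simplex constraint: candidate points must be non-negative and sum to one. A minimal sketch of how an optimizer might generate feasible initial points is shown below. It assumes, for illustration only, that all six DMO dimensions are mixture weights on one simplex; the function name `sample_simplex` is hypothetical and not part of BoLT.

```python
import random


def sample_simplex(dim, rng):
    """Draw one point uniformly at random from the (dim - 1)-simplex."""
    # Normalizing i.i.d. Exponential(1) draws gives a Dirichlet(1, ..., 1)
    # sample, i.e. a uniform sample over the probability simplex.
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the initial design is reproducible

# Assumption: 6 matches the "Dim" column of the DMO rows above.
initial_design = [sample_simplex(6, rng) for _ in range(8)]

for point in initial_design:
    print([round(w, 3) for w in point], "sum =", round(sum(point), 6))
```

Sampling directly on the simplex avoids the rejection step that box-bounded samplers would otherwise need, which matters because a uniformly drawn point in a 6-dimensional box almost never satisfies the sum-to-one constraint.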
Backed by 20k+ real LLM experiments, queried in milliseconds.
Emulators fitted on real LLM runs reproduce the objective landscape at negligible query cost, validated via Spearman rank correlation on held-out test sets (ρ ≥ 0.72 across all tasks).
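
To make the validation metric concrete, here is a minimal sketch of Spearman rank correlation between held-out measurements and emulator predictions. The numbers are made up for illustration and are not BoLT results, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative (fabricated) held-out scores and emulator predictions.
held_out_scores = [0.61, 0.55, 0.70, 0.42, 0.66]
emulator_preds = [0.60, 0.62, 0.69, 0.40, 0.58]

rho = spearman_rho(held_out_scores, emulator_preds)
print(f"Spearman rho = {rho:.2f}")
```

Rank correlation is a natural check for an optimization benchmark because an optimizer only needs the emulator to order candidates correctly, not to match absolute values.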
2D slices of the DMO emulator landscape across instruction-following, math, and code data proportions. The varied terrain reflects real trade-offs in the underlying LLM experiments.
BO methods consistently outperform baselines.
We benchmark 15+ Bayesian optimization (BO) and black-box methods across all 10 problems. Three key findings are presented here.
Evaluate on BoLT.

BoLT is built for BBO researchers. Every problem subclasses BoTorch’s BaseTestProblem, so your existing code plugs straight in. Emulator weights and tabular data are fetched automatically from HuggingFace on first use.
The benchmark is not limited to the 10 bundled problems. You can construct new optimization settings (constrained, contextual, etc.) using the public emulators or underlying data, without running any new LLM experiments.
Install the Python package, point your optimizer at a BoLT problem, and you have a reproducible, grounded evaluation. See the documentation to get started.
@article{chew2026bolt,
  title = {BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive {LLM} Tasks},
  author = {Chew, Ruth Wan Theng and Chen, Zhiliang and Hemachandra, Apivich and Low, Bryan Kian Hsiang},
  journal = {arXiv preprint arXiv:2605.17000},
  year = {2026}
}