
BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

National University of Singapore
BoLT replaces slow LLM calls with a fast emulator. Without an emulator (left), a BBO loop must query real LLM runs which are slow and costly. With BoLT (right), the BBO loop queries a lightweight emulator trained on real LLM experiments, returning emulated observations almost instantly and at negligible cost.

BoLT is the first BBO benchmark grounded in real LLM experiments, accessible without large-scale compute.

LLMs involve expensive, derivative-free decisions — hyperparameters, data mixtures, prompts — that black-box optimization (BBO) is built to handle. Yet most BBO research validates on synthetic functions that miss the real structure of LLM tasks. BoLT provides surrogate-based benchmarks grounded in real LLM experiments, so everyone can evaluate BBO methods against realistic objectives.
3 task families
10 LLM tasks
20k+ real experiments
4 emulators
Hyperparameter optimization
LoRA fine-tuning on Qwen3-4B/8B, evaluated on MATH-500.
Problem       Dim   Challenges
HPO           7     mixed variables
HPO-MF-Cont   8     mixed variables, multi-fidelity (continuous)
HPO-MF-Disc   8     mixed variables, multi-fidelity (discrete)
Data mixture optimization
Search over instruction-following, math, and code proportions from TULU-3, evaluated on IFEval, MATH-500, and MBPP+.
Problem   Dim   Challenges
DMO       6     simplex constraint
DMO-MO    6     simplex constraint, multi-objective
DMO-Het   6     simplex constraint, heteroscedastic noise
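
The simplex constraint means a candidate mixture has non-negative proportions that sum to one. The sketch below is only an illustration of that constraint, not BoLT code: `sample_simplex` and `is_on_simplex` are hypothetical helper names, and the dimension of 6 is taken from the DMO entries listed above.

```python
import random

def sample_simplex(dim, rng):
    """Draw one point uniformly from the probability simplex.

    Normalizing i.i.d. Exponential(1) draws is equivalent to sampling
    from a flat Dirichlet distribution.
    """
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def is_on_simplex(x, tol=1e-9):
    """Check non-negativity and that the entries sum to one."""
    return all(v >= 0.0 for v in x) and abs(sum(x) - 1.0) <= tol

rng = random.Random(0)
mixture = sample_simplex(6, rng)  # 6 matches the listed DMO dimension; illustrative only
```

An optimizer that proposes unconstrained points would need a step like this (or a normalization of its proposal) before querying a mixture objective.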
Prompt optimization
Discrete search over 5,014 pre-embedded prompts, using Matryoshka embeddings at four truncation dimensions (128–768).
Problem   Dim   Challenges
PO-128    128   high-dimensional
PO-256    256   high-dimensional
PO-512    512   high-dimensional
PO-768    768   high-dimensional
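
Matryoshka embeddings are trained so that a prefix of the full vector is itself a usable embedding, which is what makes the four truncation dimensions possible. A minimal sketch of truncation follows; the helper name `truncate_embedding` is hypothetical, and rescaling to unit norm after truncation is a common convention assumed here, not something the page states about BoLT's preprocessing.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    if dim > len(vec):
        raise ValueError("truncation dimension exceeds embedding size")
    head = vec[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero prefix
    return [v / norm for v in head]

full = [0.6, 0.0, 0.8, 0.0]          # toy 4-d embedding, not real data
short = truncate_embedding(full, 2)  # 2-d prefix, renormalized
```

The same call with 128, 256, 512, or 768 on a real 768-dimensional vector would give inputs of the sizes listed for the PO problems.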

Emulators fitted on real LLM runs reproduce the objective landscape at negligible query cost, validated via Spearman rank correlation on held-out test sets (ρ ≥ 0.72 across all tasks).
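
Spearman rank correlation measures how well the emulator preserves the ordering of configurations, which is what matters most to an optimizer. The sketch below computes it from scratch on made-up numbers; the variable names and values are illustrative assumptions, not BoLT data or BoLT's validation code.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

held_out_scores = [0.41, 0.55, 0.38, 0.62, 0.50]  # made-up "real run" scores
emulator_scores = [0.43, 0.52, 0.40, 0.60, 0.53]  # made-up emulator predictions
rho = spearman_rho(held_out_scores, emulator_scores)
```

For these toy values the emulator swaps one adjacent pair in the ranking, giving ρ = 0.9.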

Figure panels: IFEval emulator landscape, MATH-500 emulator landscape, MBPP+ emulator landscape.

2D slices of the DMO emulator landscape across instruction-following, math, and code data proportions. The varied terrain reflects real trade-offs in the underlying LLM experiments.

We benchmark 15+ Bayesian optimization (BO) and black-box methods across all 10 problems. Three key findings are presented here.

HPO
GP-based methods (LogNEI, MES, GIBBON) consistently outperform HPO baselines (TPE, CMA-ES), and any additional compute overhead is negligible compared to real LLM training costs.
DMO-MO
NEHVI matches NSGA2 using 50× fewer evaluations, with substantially tighter confidence intervals.
PO-768
Trust-region methods (dTuRBO, dBAxUS) adapted for discrete candidate sets are essential, reaching near-optimal solutions within 200 iterations.


BoLT is built for BBO researchers. Every problem subclasses BoTorch’s BaseTestProblem, so your existing code plugs straight in. Emulator weights and tabular data are fetched automatically from HuggingFace on first use.

The benchmark is not limited to the 10 bundled problems. You can construct new optimization settings (constrained, contextual, etc.) using the public emulators or underlying data, without running any new LLM experiments.
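
One way to picture this is as wrapping an emulator in a new objective. The sketch below is a hedged illustration only: `toy_emulator` is a stand-in quadratic rather than a real BoLT emulator, and `make_contextual` and `make_constrained` are hypothetical helpers, not part of the BoLT package.

```python
import random

def toy_emulator(x):
    """Stand-in for a trained emulator: maps an input vector to a score."""
    return -sum((v - 0.3) ** 2 for v in x)

def make_contextual(emulator, context):
    """Fix the leading coordinates to `context`; optimize over the rest."""
    def objective(free):
        return emulator(list(context) + list(free))
    return objective

def make_constrained(objective, is_feasible):
    """Return -inf for infeasible points so any maximizer avoids them."""
    def wrapped(x):
        return objective(x) if is_feasible(x) else float("-inf")
    return wrapped

# The context pins the first input to 0.9; the constraint requires sum(x) <= 1.
objective = make_constrained(
    make_contextual(toy_emulator, context=[0.9]),
    is_feasible=lambda x: sum(x) <= 1.0,
)

# Plain random search over the two free inputs, as a placeholder optimizer.
rng = random.Random(0)
candidates = [[rng.random(), rng.random()] for _ in range(200)]
best = max(candidates, key=objective)
```

Because the wrapper only calls the emulator, building such a setting costs no additional LLM runs, which is the point the paragraph above makes.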

Install the Python package, point your optimizer at a BoLT problem, and you have a reproducible, grounded evaluation. See the documentation to get started.

Citation

@article{chew2026bolt,
  title     = {BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive {LLM} Tasks},
  author    = {Chew, Ruth Wan Theng and Chen, Zhiliang and Hemachandra, Apivich and Low, Bryan Kian Hsiang},
  journal   = {arXiv preprint arXiv:2605.17000},
  year      = {2026}
}