BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

BoLT is the first BBO benchmark grounded in real LLM experiments, accessible without large-scale compute.

LLM pipelines involve expensive, derivative-free decisions (hyperparameters, data mixtures, prompts) that black-box optimization (BBO) is built to handle. Yet most BBO research validates on synthetic functions that miss the real structure of LLM tasks. BoLT provides surrogate-based benchmarks grounded in real LLM experiments, so anyone can evaluate BBO methods against realistic objectives.

3 task families

10 LLM tasks

20k+ real experiments

4 emulators

Critical decisions in the modern LLM pipeline, framed as optimization problems.

Hyperparameter optimization

LoRA fine-tuning on Qwen3-4B/8B, evaluated on MATH-500.

Problem	Dim	Challenges
HPO	7	mixed variables
HPO-MF-Cont	8	mixed variables, multi-fidelity (continuous)
HPO-MF-Disc	8	mixed variables, multi-fidelity (discrete)

Data mixture optimization

Search over instruction-following, math, and code proportions from TULU-3, evaluated on IFEval, MATH-500, and MBPP+.

Problem	Dim	Challenges
DMO	6	simplex constraint
DMO-MO	6	simplex constraint, multi-objective
DMO-Het	6	simplex constraint, heteroscedastic noise
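The simplex constraint means every candidate mixture must have non-negative weights that sum to 1. The sketch below shows one way to generate and check such candidates for a random-search baseline. It assumes the six DMO coordinates are the mixture weights themselves; the exact parameterization is not stated here, and `sample_simplex` and `is_feasible` are hypothetical helper names, not part of BoLT.

```python
import random


def sample_simplex(dim, rng):
    """Draw one point uniformly from the probability simplex in `dim` dimensions."""
    # Normalizing i.i.d. Exponential(1) draws yields a Dirichlet(1, ..., 1) sample,
    # which is uniform over the simplex.
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def is_feasible(x, tol=1e-9):
    """Check the simplex constraint: non-negative entries that sum to 1."""
    return all(v >= 0.0 for v in x) and abs(sum(x) - 1.0) <= tol


# Assumption: 6 matches the Dim column of the DMO table.
rng = random.Random(0)
candidates = [sample_simplex(6, rng) for _ in range(5)]
```

Sampling directly on the simplex avoids the rejection or projection step that a box-bounded optimizer would otherwise need.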

Prompt optimization

Discrete search over 5,014 pre-embedded prompts, using Matryoshka embeddings at four truncation dimensions (128, 256, 512, and 768).

Problem	Dim	Challenges
PO-128	128	high-dimensional
PO-256	256	high-dimensional
PO-512	512	high-dimensional
PO-768	768	high-dimensional
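Matryoshka embeddings are trained so that a prefix of the full vector is itself a usable embedding, which is why one set of prompts can serve all four problem sizes. The sketch below illustrates that truncation on a toy vector. Rescaling the prefix to unit length is a common convention and an assumption here; whether BoLT renormalizes is not stated. `truncate_embedding` is a hypothetical helper, and the toy vector is not BoLT data.

```python
import math


def truncate_embedding(vec, k):
    """Keep the first k coordinates and rescale the result to unit L2 norm."""
    head = list(vec[:k])
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [v / norm for v in head]


# Toy 768-dimensional vector standing in for one pre-embedded prompt.
full = [math.sin(i + 1) for i in range(768)]

# One view per truncation dimension listed in the table above.
views = {k: truncate_embedding(full, k) for k in (128, 256, 512, 768)}
```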

Backed by 20k+ real LLM experiments, queried in milliseconds.

Emulators fitted on real LLM runs reproduce the objective landscape at negligible query cost, validated via Spearman rank correlation on held-out test sets (ρ ≥ 0.72 across all tasks).
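Spearman's ρ measures whether the emulator orders configurations the same way the real experiments do, which is what matters when an optimizer only needs to tell better from worse. Below is a minimal stdlib sketch of the statistic: Pearson correlation computed on ranks, with ties given their average rank. The numbers are toy values, not BoLT results, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


# Toy held-out scores and emulator predictions (illustrative only).
held_out = [0.31, 0.45, 0.52, 0.60, 0.74]
predicted = [0.30, 0.50, 0.48, 0.65, 0.70]
rho = spearman_rho(held_out, predicted)
```

In this toy case the emulator swaps one adjacent pair of configurations, so ρ falls slightly below 1 even though the predicted values stay close to the held-out scores.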

2D slices of the DMO emulator landscape across instruction-following, math, and code data proportions. The varied terrain reflects real trade-offs in the underlying LLM experiments.

BO methods consistently outperform baselines.

We benchmark 15+ Bayesian optimization (BO) and other black-box methods across all 10 problems. Three key findings follow.

HPO

GP-based methods (LogNEI, MES, GIBBON) consistently outperform HPO baselines (TPE, CMA-ES), and any additional compute overhead is negligible compared to real LLM training costs.

DMO-MO

NEHVI matches NSGA2 using 50× fewer evaluations, with substantially tighter confidence intervals.
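NEHVI selects points by expected hypervolume improvement, so dominated hypervolume is a natural way to compare multi-objective runs. The sketch below computes it for two maximized objectives. Treating the objectives as maximized scores, and the choice of reference point, are assumptions; the exact metric behind the comparison above is not stated here, and the points are toy values.

```python
def pareto_front(points):
    """Return the non-dominated points for two maximized objectives."""
    front = []
    # Scan by first objective (descending); a point is kept only if its
    # second objective beats every point already kept.
    for p in sorted(points, key=lambda q: (-q[0], -q[1])):
        if not front or p[1] > front[-1][1]:
            front.append(p)
    return front


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by the reference point `ref`."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:  # f1 is descending, f2 is ascending along the front
        area += (f1 - ref[0]) * (f2 - prev_f2)
        prev_f2 = f2
    return area


# Toy objective pairs (illustrative only).
observed = [(0.6, 0.2), (0.4, 0.5), (0.2, 0.7), (0.3, 0.4)]
front = pareto_front(observed)
hv = hypervolume_2d(observed, ref=(0.0, 0.0))
```

Tracking this quantity against the number of evaluations gives the kind of sample-efficiency curve on which two optimizers can be compared.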

PO-768

Trust-region methods (dTuRBO, dBAxUS) adapted for discrete candidate sets are essential, reaching near-optimal solutions within 200 iterations.

Evaluate on BoLT.

BoLT API example

BoLT is built for BBO researchers. Every problem subclasses BoTorch’s BaseTestProblem, so your existing code plugs straight in. Emulator weights and tabular data are fetched automatically from HuggingFace on first use.

The benchmark is not limited to the 10 bundled problems. You can construct new optimization settings (constrained, contextual, etc.) using the public emulators or underlying data, without running any new LLM experiments.

Install the Python package, point your optimizer at a BoLT problem, and you have a reproducible, grounded evaluation. See the documentation to get started.

Citation