📄 Read the challenge announcement post by the Alignment Research Center
📕 Make your first submission with the starter kit
✨ Try out the interactive visualisation, WhestBench Explorer
🔍 Overview
How do you test a system for behaviour it might never show you?
The obvious approach is to run it many times and look for the failure. But when the behaviour is rare, or the system is capable enough to dodge obvious test cases, brute-force sampling becomes too expensive and teaches us too little. This is a real concern for AI safety: a sufficiently capable AI system is unlikely to fall for "honey-pots", so questions like whether such a system would undermine human control in unusual situations are unlikely to be reliably answered by running it on many inputs. But a neural network isn't a black box: we have its weights, and an algorithm that exploits them should do better than one that only observes outputs. This motivates white-box approaches that leverage access to a model's internals.
The ARC White-Box Estimation Challenge (WhestBench), organized by the Alignment Research Center (ARC) and AIcrowd, is a contest in compute-efficient white-box estimation. Given the weights of a neural network, can you predict its expected per-neuron activations more accurately than running it on a comparable number of sampled inputs?
Producing white-box methods for trained networks is the ultimate goal, and remarkably hard. WhestBench starts in the base case of randomly-initialized networks, a tractable setting in which to develop the algorithmic toolkit that mechanistic estimation, and the safety questions that motivate it, will eventually need. A recent paper from ARC produced white-box methods that outperform black-box sampling for MLPs at large width, but they break down as the depth grows, and the authors, who are also organizing this challenge, believe they can be significantly improved.
Participants build an estimator: an algorithm that takes a network's weights and returns per-neuron activation estimates within a fixed compute budget. Submissions are ranked by accuracy against a high-precision reference, and the contest is set up so that algorithm design, not hardware or implementation tricks, decides the winner.
The best possible algorithms are expected to be mechanistic, avoiding black-box sampling entirely, but the leaderboard will decide. Although "white-box" is in the name, any compliant method (black-box sampling, white-box estimation, or a hybrid approach) is allowed.
🎯 The Task
For each submission, your estimator will run on multiple fixed MLP configurations and weights. For each evaluation MLP $M_\theta$, your estimator returns an $L \times n$ matrix $\hat{Y}$ of per-neuron post-ReLU activation means. The activations are defined layer by layer as
$$h^{(0)} = X, \qquad h^{(\ell)} = \mathrm{ReLU}\big(W^{(\ell)} h^{(\ell-1)}\big) \quad \text{for } \ell = 1, \dots, L$$
and the target for entry $(\ell, i)$ of $\hat{Y}$ is the expected value of the corresponding post-ReLU neuron under standard-normal inputs:
$$\hat{Y}_{\ell,i} \approx \mathbb{E}_{X \sim \mathcal{N}(0, I_n)}\big[h^{(\ell)}_i(X)\big]$$
The expected value on the right-hand side has no closed form, so we approximate it for each MLP by running the network on a large pool of standard-normal samples and averaging the post-ReLU activations layer by layer. The resulting Monte-Carlo reference target Y is used as the ground truth reference.
The primary leaderboard score is computed on the final-layer row $\hat{Y}_{L,\cdot}$ against this reference. The earlier rows are reported as a diagnostic that reveals where approximation error accumulates across layers.
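The Monte-Carlo reference described above can be sketched in a few lines of plain NumPy. This is only an illustration under stated assumptions, not the grader's code: the helper names `sample_he_mlp` and `monte_carlo_means`, the batch size, and the small width and depth are all hypothetical choices made to keep the example quick to run.

```python
import numpy as np

def sample_he_mlp(n, L, rng):
    """Draw L weight matrices of shape (n, n) with i.i.d. N(0, 2/n) entries."""
    return [rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n)) for _ in range(L)]

def monte_carlo_means(weights, num_samples, rng, batch_size=4096):
    """Estimate E[h^(l)_i(X)] for X ~ N(0, I_n) by averaging over samples."""
    n = weights[0].shape[1]
    totals = np.zeros((len(weights), n))
    done = 0
    while done < num_samples:
        b = min(batch_size, num_samples - done)
        h = rng.standard_normal((b, n))        # each row is one sample of X
        for layer, W in enumerate(weights):
            h = np.maximum(h @ W.T, 0.0)       # h^(l) = ReLU(W^(l) h^(l-1))
            totals[layer] += h.sum(axis=0)     # accumulate post-ReLU activations
        done += b
    return totals / num_samples                # shape (L, n)

rng = np.random.default_rng(0)
weights = sample_he_mlp(n=32, L=4, rng=rng)
Y = monte_carlo_means(weights, num_samples=20_000, rng=rng)
print(Y.shape)  # (4, 32)
```

The last row `Y[-1]` plays the role of the final-layer reference, and the earlier rows correspond to the per-layer diagnostic.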
Example configuration:
| Parameter | Value |
|---|---|
| Width n | 256 |
| Hidden layers L | 8 |
| Weight initialization | He-Gaussian, variance 2/n |
| Input distribution | $X \sim \mathcal{N}(0, I_n)$ |
| Analytical FLOP budget B | $\approx 3.4 \times 10^{10}$ per MLP |
Each row of $\hat{Y}$ therefore has 256 entries; the full prediction is an $8 \times 256$ matrix.
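To get a feel for the budget, it helps to convert it into an approximate number of forward passes. The sketch below reads the example budget as roughly 3.4e10 FLOPs and assumes 2 FLOPs per multiply-add while ignoring the cost of the ReLU; flopscope's actual accounting may differ, so treat the result as an order-of-magnitude guide. The function name `forward_pass_flops` is hypothetical.

```python
def forward_pass_flops(n, L):
    """Rough cost of one forward pass: L matrix-vector products of size n x n.

    Assumes 2 FLOPs per multiply-add and ignores the ReLU (an assumption,
    not flopscope's documented accounting).
    """
    return 2 * n * n * L

n, L, budget = 256, 8, 3.4e10
per_sample = forward_pass_flops(n, L)
print(per_sample)                  # 1048576
print(round(budget / per_sample))  # 32425
```

Under these assumptions the budget corresponds to roughly thirty thousand sampled inputs, which is the scale a pure sampling baseline could afford.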
🧮 Compute Model & Constraints
The contest is decided by an analytical FLOP budget rather than wall-clock time, so faster hardware does not confer an advantage. The accounting library is flopscope, a NumPy-compatible drop-in that counts every floating-point operation it executes.
In practice, your estimator imports flopscope.numpy in place of numpy, and the FLOPs of every call are tallied automatically. For example, the bundled mean-propagation estimator looks like:
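As a rough illustration of the idea (not the starter kit's actual code), mean propagation can be sketched in plain NumPy. The sketch assumes each pre-activation is Gaussian and that neurons within a layer are independent, which is exact only for the first layer; the function names are hypothetical. Per the text, a submission would import flopscope.numpy in place of numpy.

```python
from math import erf, pi, sqrt
import numpy as np  # a submission would swap this for flopscope.numpy, per the text

_erf = np.vectorize(erf)  # elementwise erf; a Python-level loop, not a fast kernel

def relu_gaussian_moments(mu, var):
    """Mean and variance of ReLU(z) for z ~ N(mu, var), elementwise."""
    sigma = np.sqrt(np.maximum(var, 1e-12))
    t = mu / sigma
    cdf = 0.5 * (1.0 + _erf(t / sqrt(2.0)))        # standard normal CDF at t
    pdf = np.exp(-0.5 * t * t) / sqrt(2.0 * pi)    # standard normal PDF at t
    mean = mu * cdf + sigma * pdf
    second = (mu * mu + sigma * sigma) * cdf + mu * sigma * pdf
    return mean, np.maximum(second - mean * mean, 0.0)

def mean_propagation(weights):
    """Return an (L, n) array of estimated post-ReLU means, layer by layer."""
    n = weights[0].shape[1]
    m, v = np.zeros(n), np.ones(n)      # input X ~ N(0, I_n)
    rows = []
    for W in weights:
        mu = W @ m                      # pre-activation means
        var = (W * W) @ v               # pre-activation variances (independence assumed)
        m, v = relu_gaussian_moments(mu, var)
        rows.append(m)
    return np.stack(rows)

rng = np.random.default_rng(0)
demo_weights = [rng.normal(0.0, np.sqrt(2.0 / 16), size=(16, 16)) for _ in range(3)]
estimate = mean_propagation(demo_weights)
print(estimate.shape)  # (3, 16)
```

Because correlations between neurons are dropped, the error of a sketch like this would be expected to grow with depth.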
Anything done through fnp.* or flops.* is analytically FLOP-counted as $F_m$. Anything done in plain numpy, in a Python for loop over scalars, or via uninstrumented libraries is not FLOP-counted; instead, its wall-clock time $R_m$ is charged back to your budget at an unfavorable conversion rate $\lambda$. The effective compute used on MLP $m$ is therefore
$$C_m = F_m + \lambda \cdot R_m$$
Flopscope's own dispatch overhead is excluded, so calling many small fnp ops does not hurt you. It also tracks matrix symmetry, so operations on symmetric arrays (covariance updates, Gram matrices) are charged at the cheaper symmetric-matrix cost rather than the full general cost. Submissions must keep $C_m \le B_m$ on every MLP in the suite; if they don't, that MLP's prediction is replaced with zeros.
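The budget rule can be made concrete with a small calculation. The value of the conversion rate is not given here, so the `lam` used below is purely hypothetical, as are the function names; only the formula for effective compute and the budget comparison come from the text.

```python
def effective_compute(counted_flops, uninstrumented_seconds, lam):
    """C_m = F_m + lambda * R_m."""
    return counted_flops + lam * uninstrumented_seconds

def within_budget(counted_flops, uninstrumented_seconds, lam, budget):
    """True when the effective compute stays at or below the budget B_m."""
    return effective_compute(counted_flops, uninstrumented_seconds, lam) <= budget

LAM = 1e10       # hypothetical FLOPs charged per second of uninstrumented time
BUDGET = 3.4e10  # example budget, read as roughly 3.4e10 FLOPs

print(within_budget(2.0e10, 0.5, LAM, BUDGET))  # True  (2.5e10 <= 3.4e10)
print(within_budget(2.0e10, 2.0, LAM, BUDGET))  # False (4.0e10 >  3.4e10)
```

With a rate like this, even a second or two of uninstrumented work can consume a large share of the budget, which is why keeping computation inside the counted API is likely to matter.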
Hardware. Submissions run in an isolated, standardized CPU-only environment with a pinned dependency set and disabled network access. A wall-clock guard backstops the FLOP budget to keep the evaluation queue reliable, but the FLOP budget, not wall time, is the binding constraint.
Rules highlights (see the Rules page for details):
- Submissions are executable Python code, not prediction files. Your `estimator.py` must conform to the contract published in the official starter kit.
- All weights, lookup tables, and precomputed artifacts must be bundled inside the submission tarball; no network access is available at evaluation time.
- Do not attempt to modify flopscope, read private seeds, or access grader-internal state. Doing so is grounds for disqualification.
📊 Evaluation & Scoring
For each evaluation MLP, the grader computes the final-layer MSE between your prediction and the Monte-Carlo reference:
$$\mathrm{MSE}^{\text{final}}_m = \frac{1}{n} \sum_{i} \big(\hat{Y}_{L,i} - Y_{L,i}\big)^2$$
The per-MLP leaderboard score multiplies this by a compute-usage factor that rewards staying under the budget. The factor is floored at 0.5, so that extremely cheap but inaccurate estimators cannot dominate:
$$s_m = \mathrm{MSE}^{\text{final}}_m \cdot \max\big(0.5,\; C_m / B_m\big)$$
The overall leaderboard score is the average of $s_m$ across the private evaluation suite. Lower is better.
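The scoring formulas above translate directly into code. The following is a minimal sketch for local experimentation, not the grader's implementation; the function names and the toy arrays are hypothetical.

```python
import numpy as np

def final_layer_mse(pred, ref):
    """Mean squared error on the last row of the (L, n) matrices."""
    return float(np.mean((pred[-1] - ref[-1]) ** 2))

def per_mlp_score(pred, ref, compute_used, budget):
    """s_m = MSE_final * max(0.5, C_m / B_m)."""
    return final_layer_mse(pred, ref) * max(0.5, compute_used / budget)

def leaderboard_score(per_mlp_scores):
    """Average of s_m over the evaluation suite; lower is better."""
    return float(np.mean(per_mlp_scores))

ref = np.zeros((2, 4))
pred = ref + 0.1                                                     # final-layer MSE is about 0.01
cheap = per_mlp_score(pred, ref, compute_used=0.25e10, budget=1e10)  # factor floored at 0.5
full = per_mlp_score(pred, ref, compute_used=1e10, budget=1e10)      # factor equals 1.0
print(cheap, full, leaderboard_score([cheap, full]))
```

Note that using a quarter of the budget earns the same factor as using half of it, since the factor never drops below 0.5.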
Secondary diagnostic. All-layer MSE ($\mathrm{MSE}^{\text{all}}_m$, averaged across all $L \times n$ neurons) is reported alongside the primary score but does not enter ranking.
Fallback for failed runs. If a submission exceeds the budget, raises an exception, returns invalid shapes or non-finite values, exhausts memory, or trips an operational guard on a given MLP, the grader substitutes a zero prediction for that MLP and continues evaluating the rest of the suite. No compute discount is applied to the fallback.
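Some of these failure modes can be caught locally before submitting. The helper below is a hypothetical self-check that mimics only the shape and finiteness part of the fallback; it does not reproduce the grader's actual logic, budget enforcement, or operational guards.

```python
import numpy as np

def prediction_or_zeros(pred, L, n):
    """Return pred as a float array if it is a finite (L, n) matrix, else zeros."""
    try:
        arr = np.asarray(pred, dtype=float)
    except (TypeError, ValueError):
        return np.zeros((L, n))          # not convertible to a numeric array
    if arr.shape != (L, n) or not np.isfinite(arr).all():
        return np.zeros((L, n))          # wrong shape, NaN, or infinity
    return arr

good = np.full((8, 256), 0.5)
bad = good.copy()
bad[3, 7] = np.nan

print(prediction_or_zeros(good, 8, 256).sum() > 0)   # True: kept as-is
print(prediction_or_zeros(bad, 8, 256).sum() == 0)   # True: replaced by zeros
```

Since a single non-finite entry discards the whole prediction for that MLP, a check like this may be worth running at the end of an estimator.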
Final private re-run. The public leaderboard during Phase 1 / Phase 2 is not the final ranking. After the Phase 2 deadline, the grader re-executes each team's one designated final submission against a fresh, unseen MLP suite generated from a private seed that was not used during the open phases. Prize ranking is decided exclusively from this re-run. Submissions that overfit to specific MLPs or seeds will be penalized accordingly.
Tie-breaking. Statistically close finishes are first resolved by generating additional private MLPs to reduce uncertainty; any remaining ties are broken using all-layer MSE and effective compute usage.
🚀 How to Participate
The starter kit is structured as a six-stage ladder, where each stage adds one more layer of harness rigor. You can start with pure local Python at Stage 1 and climb as far as you need.
That last command runs your estimator.py against a Monte-Carlo convergence harness on a locally generated MLP, printing FLOPs used and MSE against ground truth. To climb the ladder:
- Iterate locally: `uv run python estimator.py`. The math. Estimator vs. Monte Carlo.
- Validate the contract: `uv run whest validate --estimator estimator.py`. Contract correctness (shapes, types).
- Run locally: `uv run whest run --estimator estimator.py --runner local`. Real scoring, in-process, `pdb`-debuggable.
- Subprocess runner: `… --runner subprocess`. Isolation; closer to the grader environment.
- Docker runner: `… --runner docker`. Production-equivalent grader env.
- Package your submission: `uv run whest package -o submission.tar.gz`. Submission tarball.
Submission format. You upload a single tarball produced by whest package. Per-team submission caps per phase will be published on the challenge site. Every submission counts against the shared team budget regardless of which member initiated it. Before the Phase 2 deadline, each team designates one valid submission to carry forward into the final private re-run. That designation is what determines your prize ranking, not your best public-leaderboard score.
Interactive visualization. The hosted WhestBench Explorer lets you generate random MLPs at chosen (width, depth) and view their per-neuron Monte-Carlo ground truth as a heatmap.
🤖 Use of LLMs
LLM-assisted development is explicitly permitted: contestants are encouraged to use LLMs to whatever extent helps them improve their submissions. Two kinds of prize are planned (see the Prizes section):
- A best-performing-submission prize on the final private leaderboard.
- A best-explanation prize for the clearest write-up of a submission that performs above a threshold.
Especially for the latter, contestants may benefit from having a good understanding of any LLM-written code themselves, but the rules do not require this. Exploring LLM-assisted progress on a well-specified algorithmic problem is itself one of the motivations for running WhestBench: hill-climbing on a clear metric is an emerging research mode, and the contest is partly a test of how far it goes.
As a word of caution, the flopscope FLOP-counting utility can be subverted in ways that would unambiguously count as hacking once pointed out, such as by modifying constants or counts held in memory. Contestants are responsible for ensuring that their submissions do not hack flopscope, regardless of whether or how they choose to use LLMs.
🏆 Prizes And Recognition
The challenge carries a prize pool of $100,000. The planned prize categories for Phase 2 are:
🏆 Best Score
- 🥇1st place: $50,000
- 🥈2nd place: $20,000
- 🥉3rd place: $10,000
📝 Best Algorithmic Contribution
- 🥇1st place: $20,000
Prizes for Best Score will be determined solely based on grader performance.
The Best Algorithmic Contribution prize will recognise the submission that most advances ARC’s research agenda on mechanistic estimation, taking into account both grader score and the algorithmic ideas presented in the accompanying technical write-up.
Additional prizes and prize pool increases may be introduced in either Phase 1 or Phase 2. Stay tuned for further updates.
🗓 Timeline
| Milestone | Date |
|---|---|
| Warm-up Round opens | May 28, 2026 |
| Phase 1 (Open Competition) | June 18 – July 31, 2026 |
| Phase 2 (Final Submission) | August 1 – September 19, 2026 |
| Final evaluations & due diligence | September 20 – 30, 2026 |
| Results announced | October 1, 2026 |
| NeurIPS Competition Track workshop | December 11 – 12, 2026 (TBD) |
The warm-up round exercises the end-to-end pipeline (packaging, validation, grader execution, FLOP accounting, leaderboard reporting). Scores from the warm-up round do not carry forward to the final leaderboard. After Phase 2 starts, only operational fixes applied consistently to all submissions are allowed.
📖 Citing the Challenge
If you use WhestBench in academic work, please cite the companion paper:
Wilson Wu, Victor Lecomte, Michael Winer, George Robinson, Jacob Hilton, and Paul Christiano. "Estimating the expected output of wide random MLPs more efficiently than sampling." arXiv:2605.05179, 2026.
A post-competition results report covering the final leaderboard, baseline comparisons, and a taxonomy of submitted methods will be published by the organizers and linked here.
👥 Organising Committee
- Paul Christiano, Alignment Research Center
- Jacob Hilton, Alignment Research Center
- Wilson Wu, Alignment Research Center
- Sharada Mohanty, AIcrowd
- Dipam Chakraborty, AIcrowd
🤝 Links and Contact
The ARC White-Box Estimation Challenge is run by the Alignment Research Center in partnership with AIcrowd.
- 💬 Forum: Questions about the task, rules, or your submissions → AIcrowd discussion forum.
- 🐛 GitHub Issues: Bugs in flopscope, the starter kit, or the evaluation harness → file on the relevant repo under AIcrowd/whest-starterkit.
- Discord channel: https://discord.gg/4gyQvzWPJ
- 📧 Private / administrative matters: `arc-whestbench@aicrowd.com`.
Good luck to all contestants!