Global Chess Challenge 2025
The Global Chess Challenge uses chess as a clean, rigorous testbed for studying reasoning in language models. Classical chess engines like Stockfish reach superhuman strength through heuristics, deep search and precise calculation. Large language models, by contrast, operate very differently and often struggle with basic issues such as move legality, tactical consistency, or planning several moves ahead. Rather than viewing this gap as a weakness, this challenge treats it as an opportunity: to understand how structured reasoning can be learned, constrained, and evaluated inside language models.
This challenge frames chess as a text-only problem. Models receive a symbolic description of the position and must decide what to play without access to boards, search procedures, or external tools. Every position is fully observable, every move can be checked for legality, and move quality can be evaluated objectively using Stockfish. This makes chess unusually well suited for controlled experimentation: the rules are fixed, the state space is precise, and progress can be measured reliably.
For the chess community, the challenge points toward a new kind of learning experience. Instead of outputting only the best move, models are required to explain their choice in simple, human-readable language. By articulating ideas, plans, and trade-offs, these models resemble a commentator or coach rather than a silent engine. This opens the door to more intuitive and conversational analysis tools built directly on top of players' own games.
For the AI research community, it offers a transparent and reproducible environment for studying text-based reasoning under strict constraints. Participants can combine large public chess datasets with engine-based verification to explore a wide range of approaches, from supervised finetuning to reinforcement learning with verifiable rewards, all within a domain that is both rich and precisely defined. The result is a shared benchmark that connects human learning, language-based reasoning, and the enduring complexity of chess.
The Global Chess Challenge 2025 is a global hybrid competition organized by AGI House and sponsored by Amazon Web Services (AWS), with platform and leaderboard infrastructure provided by AIcrowd. The Challenge asks whether small language models can make strong, legal chess decisions from text-only inputs, under strict execution constraints, while also producing a short, human-readable explanation of their intent.
What is the Global Chess Challenge?
You will build a text-only chess agent that does two things for every position:
- Outputs exactly one legal move (in UCI format)
- Outputs a one-sentence rationale explaining the idea behind the move
All submissions are executed independently by the Organizers on controlled infrastructure. At inference time, your model must behave as a standalone language model: it must decide moves solely via token-level prediction conditioned on the provided text input.
Allowed: training with open datasets; offline preprocessing; finetuning; RL using verifiable rewards; using Stockfish during training to label or score data.
Not allowed at inference: external tools, function calling, heuristic search procedures, retrieval systems, embedded chess engines, or any auxiliary decision system beyond the submitted model's own forward pass.
The goal is to build reliable structured reasoning and strong play within a constrained, reproducible evaluation setting.
Suggested Approaches
The following are representative approaches we encourage participants to explore as part of this Challenge. They are not rigid tracks, and teams are explicitly encouraged to try any other methods they believe fit within the Challenge constraints.
The only requirement is that the submitted artifact respects the inference-time constraints: at evaluation time, the model must operate as a standalone language model with no tools, search, or external systems.
1. Data-centric finetuning (SFT)
Train a model to map text positions to high-quality moves (and short explanations).
Possible ingredients:
- Open chess corpora such as the Lichess Open Database / puzzle sets (as permitted by their licenses)
- Tuples like: {FEN, side_to_move, legal_moves_uci, move_played, optional Stockfish labels}
- Offline Stockfish annotations for training labels (best move / PV / eval), used during training only
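For concreteness, a rough sketch of building such tuples from PGN data is shown below. It assumes python-chess, a local Stockfish binary at a path you supply, and a JSONL output format; none of these choices are mandated by the Challenge.

```python
# A minimal sketch, assuming python-chess and a local Stockfish install;
# field names follow the tuple format above, everything else is illustrative.
import json
import chess
import chess.pgn
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # adjust to your local install

def games_to_sft_rows(pgn_path, out_path, depth=12):
    engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    with open(pgn_path) as pgn, open(out_path, "w") as out:
        while (game := chess.pgn.read_game(pgn)) is not None:
            board = game.board()
            for move in game.mainline_moves():
                info = engine.analyse(board, chess.engine.Limit(depth=depth))
                row = {
                    "fen": board.fen(),
                    "side_to_move": "white" if board.turn == chess.WHITE else "black",
                    "legal_moves_uci": [m.uci() for m in board.legal_moves],
                    "move_played": move.uci(),
                    # optional Stockfish labels, used during training only
                    "stockfish_best": info["pv"][0].uci() if info.get("pv") else None,
                    "stockfish_cp": info["score"].pov(board.turn).score(mate_score=10000),
                }
                out.write(json.dumps(row) + "\n")
                board.push(move)
    engine.quit()
```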
2. RLVR (Reinforcement Learning with Verifiable Rewards)
Use Stockfish as a verifier during training to generate rewards (e.g., legality + evaluation improvement + top-K alignment), and optimize with RL methods such as PPO / GRPO.
Key point: the submitted model must still run without tools/search at inference time.
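One possible shape for such a reward function is sketched below. The weights, search depth, normalization constant, and Stockfish path are illustrative assumptions, not Challenge requirements; the reward runs only at training time.

```python
# A minimal sketch of a training-time verifiable reward (never used at inference).
import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # illustrative path
engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)

def verifiable_reward(fen: str, move_uci: str, depth: int = 12, top_k: int = 3) -> float:
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(move_uci)
    except ValueError:
        return -1.0  # unparseable output
    if move not in board.legal_moves:
        return -1.0  # illegal move

    # Evaluation before the move, from the mover's point of view.
    before = engine.analyse(board, chess.engine.Limit(depth=depth))
    cp_before = before["score"].pov(board.turn).score(mate_score=10000)

    # Top-K alignment: does the move appear among Stockfish's best lines?
    multi = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=top_k)
    top_moves = {info["pv"][0] for info in multi if info.get("pv")}

    mover = board.turn
    board.push(move)
    after = engine.analyse(board, chess.engine.Limit(depth=depth))
    cp_after = after["score"].pov(mover).score(mate_score=10000)

    eval_term = max(-1.0, min(1.0, (cp_after - cp_before) / 300.0))  # clipped eval delta
    align_term = 1.0 if move in top_moves else 0.0
    return 0.2 + 0.5 * eval_term + 0.3 * align_term  # 0.2 base reward for legality
```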
Submission Format
Participants submit a language model via a gated Hugging Face repository. Once a submission is accepted, the Organizers pull and run the model on their own controlled infrastructure; participants never run code inside the evaluation environment.
Submissions interact with the provided environment purely through text, using standardized prompt templates. Teams may submit up to 20 entries per day.
At inference time:
- The model is loaded from the submitted Hugging Face repo
- Inputs are provided via a prompt template with predefined variables
- The model must respond with a move and a short rationale, following a strict output format
Details of the prompt templates and available variables are documented in the starter kit: https://github.com/AIcrowd/global-chess-challenge-2025-starter-kit/tree/master/player_agents
Model input
For every turn, the agent receives a prompt built from multiple variables described in the docs, including:
- Position as a FEN string (e.g., r1bk3r/p2pBpNp/n4n2/1p1NP2P/6P1/3P4/P1P1K3/q5b1)
- Side to move (White / Black)
- List of legal moves in UCI format (e.g., e2e4,g1f3,e7e8q)
You do not need to implement chess rules or generate legal moves; this is handled by the environment.
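For intuition, a prompt assembled from these variables might be rendered roughly as follows. The actual template text and variable names are defined in the starter kit and may differ.

```python
# Illustrative only: the official prompt templates live in the starter kit.
import chess

PROMPT_TEMPLATE = """You are playing chess as {side_to_move}.
Current position (FEN): {fen}
Legal moves (UCI): {legal_moves_uci}

Choose exactly one legal move and give a one-sentence rationale.
Answer with <uci_move>...</uci_move> and <rationale>...</rationale> tags."""

board = chess.Board()  # starting position, for illustration
prompt = PROMPT_TEMPLATE.format(
    fen=board.fen(),
    side_to_move="White",
    legal_moves_uci=",".join(m.uci() for m in board.legal_moves),
)
print(prompt)
```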
Model output
For each input position, the agent must return:
- Exactly one UCI move, chosen from the provided legal move list, wrapped in:
<uci_move>...</uci_move>
- A one-sentence rationale (required but not scored), typically wrapped in:
<rationale>...</rationale>
Important: Evaluation is based exclusively on the UCI move inside <uci_move> tags. Any text outside <uci_move> is ignored for scoring.
If your submission does not provide a valid UCI move in the correct tags (missing tags, malformed output, or illegal move), the evaluator will retry up to three (3) times for the same position. If it still fails, the model is treated as having resigned, and the game is recorded as a loss.
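For local testing, the tag extraction, legality check, and three-attempt retry described above could be mimicked with a sketch like the one below; `generate_response` is a placeholder for your own inference call, and the official evaluator may differ in details.

```python
# Sketch of output handling for local testing; the official evaluator may differ.
import re
import chess

MOVE_TAG = re.compile(r"<uci_move>\s*([a-h][1-8][a-h][1-8][qrbn]?)\s*</uci_move>")

def extract_move(response: str, board: chess.Board):
    """Return the legal move found inside <uci_move> tags, or None."""
    match = MOVE_TAG.search(response)
    if not match:
        return None
    try:
        move = chess.Move.from_uci(match.group(1))
    except ValueError:
        return None
    return move if move in board.legal_moves else None

def play_turn(board: chess.Board, generate_response, prompt: str, max_attempts: int = 3):
    """Query the model up to three times; returning None corresponds to resigning."""
    for _ in range(max_attempts):
        move = extract_move(generate_response(prompt), board)
        if move is not None:
            return move
    return None
```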
Data & Environment
The Challenge provides a shared environment for standardized evaluation.
- Uses [python-chess](https://github.com/AIcrowd/chess-env) for board representation, FEN/PGN, and legality checks
- Uses local Stockfish for baseline opponents and for post-game analysis
- Produces game logs (PGNs) and evaluation summaries
Participants can use the starter kit to run local games and verify output format, legality, and end-to-end execution.
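As a rough illustration of such a local check, the sketch below plays one game against a weak Stockfish opponent and prints the PGN. It reuses the `play_turn` helper sketched above and takes a `build_prompt` callable as a placeholder; the starter kit remains the authoritative environment.

```python
# Sketch of a local self-play check against Stockfish; the starter kit's
# environment is the authoritative version of this loop.
import chess
import chess.pgn
import chess.engine

def play_local_game(generate_response, build_prompt, stockfish_path="/usr/bin/stockfish"):
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    engine.configure({"Skill Level": 0})
    while not board.is_game_over():
        if board.turn == chess.WHITE:                # our agent plays White here
            move = play_turn(board, generate_response, build_prompt(board))
            if move is None:                         # treated as a resignation
                break
            board.push(move)
        else:                                        # Stockfish plays Black
            result = engine.play(board, chess.engine.Limit(depth=1))
            board.push(result.move)
    engine.quit()
    game = chess.pgn.Game.from_board(board)          # PGN log of the game
    print(game)
```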
Evaluation & Metrics (What the leaderboard actually uses)
Evaluation proceeds in multiple stages.
Round 1 & Round 2: Baseline Evaluation (Leaderboard)
In both rounds, every submission is evaluated against fixed Stockfish opponents to create a stable, comparable baseline.
Each submission plays:
- 50 games vs Stockfish Skill 0 (Depth 1)
- 50 games vs Stockfish Skill 0 (Depth 5)
Primary leaderboard score: Average Centipawn Loss (ACPL)
- ACPL is computed by analyzing played games using Stockfish Level 20 (Depth 20) as the reference evaluator.
- Lower ACPL is better.
Secondary score: Win Rate
- Win rate is computed across all baseline games and is used as a secondary metric (e.g., for tie-breaking and analysis).
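For reference, average centipawn loss for one side of a game can be approximated along these lines. This is a sketch using python-chess and Stockfish at depth 20; the official scoring pipeline may differ in details such as mate handling or capping.

```python
# Sketch: approximate ACPL for one side of a finished game (python-chess + Stockfish).
import chess
import chess.engine

def average_centipawn_loss(game, color, stockfish_path="/usr/bin/stockfish", depth=20):
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    board = game.board()
    losses = []
    for move in game.mainline_moves():
        if board.turn == color:
            best = engine.analyse(board, chess.engine.Limit(depth=depth))
            cp_best = best["score"].pov(color).score(mate_score=10000)
            board.push(move)
            played = engine.analyse(board, chess.engine.Limit(depth=depth))
            cp_played = played["score"].pov(color).score(mate_score=10000)
            losses.append(max(0, cp_best - cp_played))  # loss relative to best line
        else:
            board.push(move)
    engine.quit()
    return sum(losses) / len(losses) if losses else 0.0
```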
Eligibility for the Final Tournament
At the end of Round 2, only submissions with ACPL lower than the official baseline model defined by the Organizers are considered eligible submissions and advance to the Final Tournament.
Final Tournament: Swiss-Style Competition (Determines winners)
Eligible submissions compete in a Swiss-system tournament.
- Final ranking is based only on game outcomes:
- Win = 1 point
- Draw = 0.5 points
- Loss = 0 points
- ACPL is not used during the Swiss tournament for scoring or ranking.
The final winners and the prize allocations are decided based on the results of this tournament.
Execution Constraints
All models are run in a controlled environment with standardized resources (a trn1.2xlarge instance) and no external network access.
At inference time, submissions:
- Must not call external tools or APIs
- Must not use function calling, retrieval, or heuristic search procedures
- Must not embed or invoke chess engines or auxiliary decision systems
- Must produce decisions solely through language model inference over the provided text prompt
Eligible Models, Backends, and Size Limit
Because evaluation runs on AWS Trainium with a specific runtime stack, only a subset of model families and execution backends are supported.
Participants must use only the architectures and backends documented here: https://github.com/AIcrowd/global-chess-challenge-2025-starter-kit/blob/master/docs/neuron-and-vllm-tuning.md#supported-model-types-backends
Model size restriction
Only models with a total parameter count of strictly fewer than 8,000,000,000 (8B) parameters are eligible for leaderboard ranking, Final Tournament qualification, and prizes.
Parameter count is determined from the model weights at inference time (excluding optimizer state).
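To sanity-check the limit before submitting, a quick count over the checkpoint's weights might look like this. This is a sketch using Hugging Face transformers with a placeholder repo id; the Organizers' exact counting procedure may differ.

```python
# Sketch: count inference-time parameters of a Hugging Face checkpoint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-model")  # placeholder repo id
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} parameters; under 8B limit: {total_params < 8_000_000_000}")
```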
AWS Trainium
This Challenge runs on AWS Trainium, using the AWS Neuron software stack and supported model execution backends.
Resources to get started
- AWS Trainium overview: https://aws.amazon.com/ai/machine-learning/trainium/
- Neuron SDK docs: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html
- Hugging Face Optimum Neuron: https://huggingface.co/docs/optimum-neuron/index
- AWS Neuron workshops repo: https://github.com/aws-neuron/neuron-workshops
- Run Trainium in your own AWS account (AWS re:Post): https://repost.aws/articles/ARgiH8VXXuQ22iSUmwX7ffiQ
- YouTube reference: https://www.youtube.com/watch?v=CyTCTuq1z0Q
- Live Workshop Recording: https://discourse.aicrowd.com/t/recording-for-live-workshop/17661
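To get a feel for the runtime stack, exporting and running a causal LM with Optimum Neuron typically looks roughly like the sketch below. The model id and compilation arguments are illustrative placeholders; the starter kit's Neuron/vLLM tuning doc is the authoritative reference for supported configurations.

```python
# Sketch: exporting and running a causal LM with Optimum Neuron on Trainium.
# The model id and compilation arguments are illustrative placeholders.
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "your-org/your-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,           # compile for Neuron on first load
    batch_size=1,
    sequence_length=2048,
    num_cores=2,           # trn1.2xlarge exposes 2 Neuron cores
    auto_cast_type="bf16",
)

inputs = tokenizer("Choose a move for White in the starting position.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```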
Prizes
Cash prize pool: USD 17,000
Compute credits: USD 8,000
- First Place: USD 10,000 + USD 5,000 credits
- Second Place: USD 5,000 + USD 2,000 credits
- Third Place: USD 2,000 + USD 1,000 credits
(Prize eligibility is subject to the Official Rules.)
Timeline
- Launch & Registration Opens: December 4, 2025
- Round 1 Submissions Close: December 31, 2025 (23:55 UTC)
- Team Freeze Deadline: January 15, 2026 (23:55 UTC)
- Round 2 Submissions Close: January 31, 2026 (23:55 UTC)
- Final Tournament: February 1 – February 7, 2026
- Winners Announced: February 15, 2026
Starter Kit
Make your first submission using the starter kit: https://github.com/AIcrowd/global-chess-challenge-2025-starter-kit
It includes:
- A ready-to-run environment
- Example agents
- A template submission
- Documentation for supported model/backends on Trainium