Imagine you ask your agent to gather wood, craft torches, and survive the first night in a video game. In controlled demos, it succeeds. In a real game with partial observability, shifting UI, and long goals, it stalls or forgets. Orak exposes that gap and pushes you to build agents that plan, adapt, and finish the job.
Starter Kit Update: Run `git pull origin master --rebase`. Read the announcement for more information.
Note: Street Fighter III is not part of this instance of the competition due to operational reasons.
To make your first submission with ease, check out the starter kit.
Claim compute credits to improve your solution. Click here to learn how.
Overview
Orak is an open benchmark to test agentic LLMs in real games. You will submit an MCP-connected agent that consumes textual and visual state across five iconic titles, including Super Mario, Pokémon, StarCraft II, and 2048.
The challenge measures planning, adaptability, efficiency, and success rate. Build an agent that goes beyond toy-grid tricks and proves reliable in real games. To understand the problem in depth and design a solution, check out the Orak Benchmark Paper.
The Task
Agents that shine in toy setups often falter in real games with partial observability, shifting UI, and long goals. Orak evaluates the skills that close this gap: visual grounding, long-horizon control, tool use and memory, and cross-genre generalization. Your goal is to build an agent that completes real-game objectives within defined time and resource limits.
Create one MCP-connected agent that operates on text-only or text-plus-vision. Document the playbook so others can follow: how it prompts, plans step by step, recovers from mistakes, writes and reads memory, and selects and calls tools. Package it for replay with a runnable, config-driven setup that any team can execute to reproduce your results.
The Competition evaluates participant-created LLM agents on five free video games using the Orak benchmark:
- Super Mario (SM): Action game requiring spatial reasoning and timing.
- Pokémon (Pkmon): Turn-based RPG with strategic decision making.
- StarCraft II (SC2): Real-time strategy game with resource management.
- 2048: Puzzle game requiring mathematical reasoning.

Tracks
Track 1. Lightweight (SLM): Total parameter budget of up to 8 billion, including adapters and LoRA. Focus on efficient, fast, reproducible designs.
Track 2. Open: No parameter limit. Separate leaderboard and sponsored credits.
Both tracks share the evaluation logic and cash prize structure, with separate leaderboards and separate sponsored credit pools.
Evaluation modes
Single-player. The evaluation uses standardised scenarios with fixed seeds, and the resulting scores drive the leaderboard.
Freeform MCP play. The evaluation is open ended, focused on strategy and tool use. Results are reviewed qualitatively and highlighted, and the leaderboard emphasis remains on single-player.
A single agent implementation must connect to all five games and play each of them in the same session without manual intervention.
Evaluation Metrics
The Competition uses the Orak benchmark metrics for each game.
Final ranking is determined by the weighted average of the five game scores using the following weights:
Game weights
| Game | Weight |
|---|---|
| Super Mario (SM) | 15% |
| PokΓ©mon (Pkmon) | 30% |
| StarCraft II (SC2) | 30% |
| 2048 | 15% |
Ranking process
- For each game, compute the official game score using Orak's standard metrics.
- Compute the weighted average of the five game scores to obtain the final score. Teams are ranked by final score. For details on tie-breakers, refer to the rules.
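To make the ranking rule concrete, the sketch below applies the published weights to a set of per-game scores. It is illustrative only: the per-game scores are made up, normalizing by the weight sum is an assumption, and the organizers' official computation is authoritative.

```python
# Illustrative final-score computation using the published game weights.
GAME_WEIGHTS = {
    "SM": 0.15,     # Super Mario
    "Pkmon": 0.30,  # Pokemon
    "SC2": 0.30,    # StarCraft II
    "2048": 0.15,
}

def final_score(game_scores: dict[str, float]) -> float:
    """Weighted average of per-game scores over the games present."""
    total_weight = sum(GAME_WEIGHTS[g] for g in game_scores)
    weighted_sum = sum(GAME_WEIGHTS[g] * s for g, s in game_scores.items())
    return weighted_sum / total_weight

# Made-up per-game scores, purely for illustration.
print(final_score({"SM": 72.0, "Pkmon": 55.0, "SC2": 40.0, "2048": 90.0}))
```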
How to participate
- Install and configure each game using the per-game setup guides. All five titles are free.
- Connect your agent to MCP using the provided client and game servers in the starter kit (a generic connection sketch follows this list).
- Run the local evaluation to verify environment parity and confirm logging.
- Submit your run to the leaderboard with your config, logs, and action trajectories.
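For orientation, here is a generic sketch of connecting an agent to an MCP server over stdio and calling a tool, using the official MCP Python SDK. The server command, script path, and the `get_state` tool name are placeholders; the starter kit's own client and server launch instructions define the real interface.

```python
# Generic sketch: connect to an MCP game server over stdio and call a tool.
# Command, script path, and tool name below are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

SERVER = StdioServerParameters(command="python", args=["path/to/game_server.py"])

async def play_one_step() -> None:
    async with stdio_client(SERVER) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])
            # Hypothetical tool call returning the current game state.
            result = await session.call_tool("get_state", arguments={})
            print(result)

if __name__ == "__main__":
    asyncio.run(play_one_step())
```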
Local run and packaging requirements (TBD)
- You run everything locally on your machine. You can choose your own compute.
- Submit either a repository link or a zip file. We must be able to clone or unzip and run end to end with no edits.
- Use the official starter kit. Fork it, make changes on top, keep it runnable locally, and submit the same starter kit structure back.
- Reproducibility is required. Pin dependencies and include a clear README with a single command to run.
- Add a git tag that maps the submission to a specific iteration. Example: `teamname-v1.2` or `teamname-iter-3`.
- Use the provided helper script to log submission metadata before packaging.
Submission format
The competition follows a local execution model. During the competition, you run your agent on your own machine using the evaluation helper scripts provided in the starter kit. These scripts interact with the game servers and report your scores to the leaderboard. You do not upload any code to the leaderboard during the competition.
At the end of the competition, winners will be required to share their full source code and reproduction materials with Krafton for due diligence.
Leaderboard submissions (During competition)
Each time you run the evaluation helper scripts, a submission is recorded on the leaderboard. We recommend you track the following locally for each submission ID to help you iterate (a sketch of such a record follows this list):
- Agent version – the specific commit or version of your code used.
- Run config – the model/API, prompts, tool definitions, seeds, and time limits used.
- Artifacts – the structured logs and action traces generated by the helper scripts.
Note: The starter kit includes a reference Agent interface and tools. You are free to adapt or replace these as needed, provided your agent allows the helper scripts to execute the evaluation flow.
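As an illustration, the sketch below writes one such per-submission record to disk. The field names and file layout are hypothetical, not a required schema.

```python
# Hypothetical per-submission record kept locally for iteration.
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def record_submission(submission_id: str, run_config: dict, artifact_dir: str) -> None:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    record = {
        "submission_id": submission_id,
        "agent_version": commit,    # exact commit used for this run
        "run_config": run_config,   # model/API, prompts, tool definitions, seeds, time limits
        "artifacts": artifact_dir,  # where the helper scripts wrote logs and action traces
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    out = Path("submissions") / f"{submission_id}.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
```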
Final verification package (For winners)
After the competition, track winners must submit a verification package to Krafton containing:
- Agent code – the full source code, including all scripts and dependencies needed to reproduce the winning results.
- Winning config – the exact configuration (seeds, models, prompts) used for the winning leaderboard entry.
- Dataset declaration – a list of any datasets used for training or fine-tuning, with specific details for the winning model.
- Optional report – a summary of design choices and ablations.
Submission Limits
- Each team may make up to 5 submissions per 24 hour period.
- Teams may designate one final submission for judging before the final submission deadline.
Track 1 model requirements
- Parameter limit – max 8B total parameters (active + frozen, embeddings, adapters, LoRA).
- Release date – model must be released before 1 Nov 2025.
- Fine-tuning – only on public or team-created data. If fine-tuned, submit the dataset or provide licensing/provenance docs; if you cannot share data, contact organizers in advance for an alternative verification process.
- Examples – Qwen3 (<8B), LLaMA-3.1 (<7B), Minitron (<8B), or any model meeting the limits.
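A quick way to sanity-check the Track 1 budget is to count every parameter of the loaded model, trainable or frozen, which covers the base weights, embeddings, and any attached adapter/LoRA modules. The PyTorch sketch below is illustrative, not the official verification procedure.

```python
# Count every parameter (trainable or frozen) of a PyTorch model, which is what
# the 8B budget covers: base weights, embeddings, and any adapters/LoRA attached.
import torch

TRACK1_PARAM_BUDGET = 8_000_000_000  # 8 billion total parameters

def total_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def within_track1_budget(model: torch.nn.Module) -> bool:
    return total_parameters(model) <= TRACK1_PARAM_BUDGET
```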
Evaluation phases
- Phase 1 · Live (Leaderboard) – lightweight scripts make live LLM calls via MCP to the host game servers; the leaderboard updates in real time. Latency and throughput limits will be announced before submissions open.
- Phase 2 · Final (Reproduction) – after the deadline, submit everything needed to reproduce your Phase 1 results: model weights, custom agent logic and prompts, inference/integration code, and a 2-page PDF covering architecture, training (if any), and reproduction steps. Submissions that cannot be reproduced may be disqualified.
Final submission package (for verification and awards)
- Model artifacts – weights (≤8B where applicable), license/usage notes, run instructions; if fine-tuned, include data/provenance or documentation; note closed/proprietary weights with provider/version/specs.
- Agent code – runnable Python agent(s) compatible with all five games and the Orak framework, with local and organizer run instructions.
- Design & training doc – 2-page PDF: architecture, training, data (high level), key implementation details.
- Reproducibility – scripts, dependencies, seeds, Dockerfile/YAML (or equivalent).
- Submission meta – team name, contacts, members, short README, git tag identifying the submitted iteration, and metadata file produced by the helper script.
Required final artifacts (tie-breaks & checks)
- Model declaration – for each model: `name`, `version`, `provider`, `parameter_count` (numeric) or organizer tier; mixed-model entries use call-weighted mean parameters for tie-breaks.
- Evaluation summary (JSON/CSV) – at minimum: `total_inference_calls`, `total_tokens` (or raw requests), `evaluation_episodes`, `mean_calls_per_episode`, and `mean_tokens_per_episode`.
- Raw requests / re-tokenizable text – per-call request texts (JSONL/ZIP) for organizer re-tokenization, or per-call token counts computed with the organizer-designated tokenizer. Include a README if any content is redacted or encoded.
- Optional per-episode breakdown – `episode_id`, `game_name`, `seed`, `inference_calls`, `tokens`, `final_score`.
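The sketch below shows how the required summary fields could be aggregated from per-episode records shaped like the optional breakdown above and written to JSON. Only the field names come from the requirements; the aggregation logic and example values are illustrative.

```python
# Aggregate the required evaluation-summary fields from per-episode records.
import json

def build_evaluation_summary(episodes: list[dict]) -> dict:
    n = len(episodes)
    total_calls = sum(e["inference_calls"] for e in episodes)
    total_tokens = sum(e["tokens"] for e in episodes)
    return {
        "total_inference_calls": total_calls,
        "total_tokens": total_tokens,
        "evaluation_episodes": n,
        "mean_calls_per_episode": total_calls / n if n else 0.0,
        "mean_tokens_per_episode": total_tokens / n if n else 0.0,
    }

# Example record matching the optional per-episode breakdown (values made up).
episodes = [
    {"episode_id": 1, "game_name": "2048", "seed": 0,
     "inference_calls": 120, "tokens": 45000, "final_score": 2048},
]
with open("evaluation_summary.json", "w") as f:
    json.dump(build_evaluation_summary(episodes), f, indent=2)
```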
Scoring
Automatic evaluation
- Per-game metrics include completion, score, win rate, or task-specific objectives.
- Both text-only and text-plus-vision agents are supported.
- The aggregate score is the average across games, with variance reported.
- Runs use fixed seeds, versioned environments, and config-driven execution.
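If you want your local runs to mirror this, a minimal seeding helper looks like the sketch below. Which libraries need seeding depends on your agent; the selection here is an assumption, not a starter-kit requirement.

```python
# Minimal seeding helper for reproducible local runs.
import os
import random

import numpy as np

def set_seed(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional for text-only agents
```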
Qualitative review
- Freeform MCP plays are reviewed for reliability and recovery.
Timeline
- Competition launch: 7 Nov 2025
- Submissions open: 28 Nov 2025
- Team registration deadline: 12 Dec 2025
- Final submission deadline: 1 Feb 2026
- Offline evaluation complete: 14 Feb 2026
- Winner announcement: 14 Feb 2026
Prizes
The challenge carries a prize pool of USD 20,000 across two tracks.
Track 1: Lightweight
- 1st Place: USD 6,000
- 2nd Place: USD 3,000
- 3rd Place: USD 1,000
Track 2: Open
- 1st Place: USD 6,000
- 2nd Place: USD 3,000
- 3rd Place: USD 1,000
Sponsored credits
- Track 1 · Lightweight: NVIDIA Brev credits USD 15,000
- Track 2 · Open: AWS Bedrock credits USD 20,000 and OpenAI API credits USD 10,000
Each track has a separate leaderboard and award pool. Awards may be adjusted or reallocated if participation is uneven or rules are violated.
Claim compute credits to improve your solution. Click here to learn how.
Starter kit and resources
To make your first submission with ease, check out the starter kit. The starter kit includes:
- Game setup guides
- MCP client and sample tools
- Baselines and logging templates
- Leaderboard submission guide
Resources and references
Orak benchmark paper
Read the Orak Benchmark Paper here.
Amazon Bedrock
- Setup and usage: Example code for calling FMs in Amazon Bedrock (Python) https://docs.aws.amazon.com/code-library/latest/ug/python_3_bedrock-runtime_code_examples.html
- IAM policy configuration · Refer to section (b) to allow only necessary API calls https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html
- Runtime API operations https://docs.aws.amazon.com/bedrock/latest/APIReference/API_Operations_Amazon_Bedrock_Runtime.html
- Supported models by region https://docs.aws.amazon.com/ko_kr/bedrock/latest/userguide/models-regions.html
- Service quotas https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
NVIDIA Brev
- Setup and usage guidelines here.
- Credits usage and limits here.
Sponsors
Eligibility and rules
- Max team size: 5
- Submission limit: up to 5 per 24 hours
- Model cutoff: models must be released before 1 Nov 2025
- Fine-tuning: allowed only on public or team-created datasets, which must be shared if used
- Final selection: each team may mark one submission for judging
- Reproducibility: packages required for final evaluation
- Multiple accounts: disqualification
- Age and jurisdiction: open to participants 18+ and subject to Korean law