Imagine you ask your agent to gather wood, craft torches, and survive the first night in a video game. In controlled demos, it succeeds. In a real game with partial observability, shifting UI, and long goals, it stalls or forgets. Orak exposes that gap and pushes you to build agents that plan, adapt, and finish the job.
Starter Kit Update: Run `git pull origin master --rebase`. Read the announcement for more information.
Note: Street Fighter III is not part of this instance of the competition due to operational reasons.
To make your first submission with ease, check out the starter kit.
Claim compute credits to improve your solution. Click here to learn how.
Overview
Orak is an open benchmark to test agentic LLMs in real games. You will submit an MCP-connected agent that consumes textual and visual state across five iconic titles, including Super Mario, Pokémon, StarCraft II, and 2048.
The challenge measures planning, adaptability, efficiency, and success rate. Build an agent that goes beyond toy-grid tricks and proves reliable in real games. To understand the problem in depth and design a solution, check out the Orak Benchmark Paper.
The Task
Agents that shine in toy setups often falter in real games with partial observability, shifting UI, and long goals. Orak evaluates the skills that close this gap: visual grounding, long-horizon control, tool use and memory, and cross-genre generalization. Your goal is to build an agent that completes real-game objectives within defined time and resource limits.
Create one MCP-connected agent that operates on text-only or text-plus-vision. Document the playbook so others can follow: how it prompts, plans step by step, recovers from mistakes, writes and reads memory, and selects and calls tools. Package it for replay with a runnable, config-driven setup that any team can execute to reproduce your results.
The Competition evaluates participant-created LLM agents on five free video games using the Orak benchmark:
- Super Mario (SM): Action game requiring spatial reasoning and timing.
- Pokémon (Pkmon): Turn-based RPG with strategic decision making.
- StarCraft II (SC2): Real-time strategy game with resource management.
- 2048: Puzzle game requiring mathematical reasoning.

Tracks
Track 1. Lightweight (SLM): Total parameter budget of up to 8 billion, including adapters and LoRA. Focus on efficient, fast, reproducible designs.
Track 2. Open: No parameter limit. Separate leaderboard and sponsored credits.
Both tracks share the evaluation logic and cash prize structure, with separate leaderboards and separate sponsored credit pools.
Evaluation modes
Single-player. The evaluation uses standardised scenarios with fixed seeds, and the resulting scores drive the leaderboard.
Freeform MCP play. The evaluation is open ended, focused on strategy and tool use. Results are reviewed qualitatively and highlighted, and the leaderboard emphasis remains on single-player.
A single agent implementation must connect to all five games and play each of them in the same session without manual intervention.
Evaluation Metrics
The Competition uses the Orak benchmark metrics for each game.
Final ranking is determined by the weighted average of the five game scores using the following weights:
Game weights
| Game | Weight |
|---|---|
| Super Mario (SM) | 15% |
| PokΓ©mon (Pkmon) | 30% |
| StarCraft II (SC2) | 30% |
| 2048 | 15% |
Ranking process
- For each game, compute the official game score using Orak's standard metrics.
- Compute the weighted average of the five game scores to obtain the final score. Teams are ranked by final score. For details on tie-breakers, refer to the rules.
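To make the ranking rule concrete, the sketch below applies the published weights to a set of per-game scores. It is illustrative only: the per-game scores are made up, normalizing by the weight sum is an assumption, and the organizers' official computation is authoritative.

```python
# Illustrative final-score computation using the published game weights.
GAME_WEIGHTS = {
    "SM": 0.15,     # Super Mario
    "Pkmon": 0.30,  # Pokemon
    "SC2": 0.30,    # StarCraft II
    "2048": 0.15,
}

def final_score(game_scores: dict[str, float]) -> float:
    """Weighted average of per-game scores over the games present."""
    total_weight = sum(GAME_WEIGHTS[g] for g in game_scores)
    weighted_sum = sum(GAME_WEIGHTS[g] * s for g, s in game_scores.items())
    return weighted_sum / total_weight

# Made-up per-game scores, purely for illustration.
print(final_score({"SM": 72.0, "Pkmon": 55.0, "SC2": 40.0, "2048": 90.0}))
```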
How to participate
- Install and configure each game using the per-game setup guides. All five titles are free.
- Connect your agent to MCP using the provided client and game servers in the starter kit (a generic connection sketch follows this list).
- Run the local evaluation to verify environment parity and confirm logging.
- Submit your run to the leaderboard with your config, logs, and action trajectories.
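For orientation, here is a generic sketch of connecting an agent to an MCP server over stdio and calling a tool, using the official MCP Python SDK. The server command, script path, and the `get_state` tool name are placeholders; the starter kit's own client and server launch instructions define the real interface.

```python
# Generic sketch: connect to an MCP game server over stdio and call a tool.
# Command, script path, and tool name below are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

SERVER = StdioServerParameters(command="python", args=["path/to/game_server.py"])

async def play_one_step() -> None:
    async with stdio_client(SERVER) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])
            # Hypothetical tool call returning the current game state.
            result = await session.call_tool("get_state", arguments={})
            print(result)

if __name__ == "__main__":
    asyncio.run(play_one_step())
```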
Local run and packaging requirements (TBD)
- You run everything locally on your machine. You can choose your own compute.
- Submit either a repository link or a zip file. We must be able to clone or unzip and run end to end with no edits.
- Use the official starter kit. Fork it, make changes on top, keep it runnable locally, and submit the same starter kit structure back.
- Reproducibility is required. Pin dependencies and include a clear README with a single command to run.
- Add a git tag that maps the submission to a specific iteration. Example: `teamname-v1.2` or `teamname-iter-3`.
- Use the provided helper script to log submission metadata before packaging.
Submission format
The competition follows a local execution model. During the competition, you run your agent on your own machine using the evaluation helper scripts provided in the starter kit. These scripts interact with the game servers and report your scores to the leaderboard. You do not upload any code to the leaderboard during the competition.
At the end of the competition, winners will be required to share their full source code and reproduction materials with Krafton for due diligence.
Leaderboard submissions (During competition)
Each time you run the evaluation helper scripts, a submission is recorded on the leaderboard. We recommend you track the following locally for each submission ID to help you iterate (a sketch of such a record follows this list):
- Agent version – the specific commit or version of your code used.
- Run config – the model/API, prompts, tool definitions, seeds, and time limits used.
- Artifacts – the structured logs and action traces generated by the helper scripts.
Note: The starter kit includes a reference Agent interface and tools. You are free to adapt or replace these as needed, provided your agent allows the helper scripts to execute the evaluation flow.
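As an illustration, the sketch below writes one such per-submission record to disk. The field names and file layout are hypothetical, not a required schema.

```python
# Hypothetical per-submission record kept locally for iteration.
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def record_submission(submission_id: str, run_config: dict, artifact_dir: str) -> None:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    record = {
        "submission_id": submission_id,
        "agent_version": commit,    # exact commit used for this run
        "run_config": run_config,   # model/API, prompts, tool definitions, seeds, time limits
        "artifacts": artifact_dir,  # where the helper scripts wrote logs and action traces
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    out = Path("submissions") / f"{submission_id}.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(record, indent=2))
```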
Final verification package (For winners)
After the competition, track winners must submit a verification package to Krafton containing:
- Agent code – the full source code, including all scripts and dependencies needed to reproduce the winning results.
- Winning config – the exact configuration (seeds, models, prompts) used for the winning leaderboard entry.
- Dataset declaration – a list of any datasets used for training or fine-tuning, with specific details for the winning model.
- Optional report – a summary of design choices and ablations.
Submission Limits
- Each team may make up to 5 submissions per 24 hour period.
- Teams may designate one final submission for judging before the final submission deadline.
Track 1 model requirements
- Parameter limit – max 8B total parameters (active + frozen, embeddings, adapters, LoRA).
- Release date – model must be released before 1 Nov 2025.
- Fine-tuning – only on public or team-created data. If fine-tuned, submit the dataset or provide licensing/provenance docs; if you cannot share data, contact organizers in advance for an alternative verification process.
- Examples – Qwen3 (<8B), LLaMA-3.1 (<7B), Minitron (<8B), or any model meeting the limits.
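A quick way to sanity-check the Track 1 budget is to count every parameter of the loaded model, trainable or frozen, which covers the base weights, embeddings, and any attached adapter/LoRA modules. The PyTorch sketch below is illustrative, not the official verification procedure.

```python
# Count every parameter (trainable or frozen) of a PyTorch model, which is what
# the 8B budget covers: base weights, embeddings, and any adapters/LoRA attached.
import torch

TRACK1_PARAM_BUDGET = 8_000_000_000  # 8 billion total parameters

def total_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def within_track1_budget(model: torch.nn.Module) -> bool:
    return total_parameters(model) <= TRACK1_PARAM_BUDGET
```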
Evaluation phases
- Phase 1 · Live (Leaderboard) – lightweight scripts make live LLM calls via MCP to the host game servers; the leaderboard updates in real time. Latency and throughput limits will be announced before submissions open.
- Phase 2 · Final (Reproduction) – after the deadline, submit everything needed to reproduce your Phase 1 results: model weights, custom agent logic and prompts, inference/integration code, and a 2-page PDF covering architecture, training (if any), and reproduction steps. Submissions that cannot be reproduced may be disqualified.
Final submission package (for verification and awards)
- Model artifacts – weights (≤8B where applicable), license/usage notes, run instructions; if fine-tuned, include data/provenance or documentation; note closed/proprietary weights with provider/version/specs.
- Agent code – runnable Python agent(s) compatible with all five games and the Orak framework, with local and organizer run instructions.
- Design & training doc – 2-page PDF: architecture, training, data (high level), key implementation details.
- Reproducibility – scripts, dependencies, seeds, Dockerfile/YAML (or equivalent).
- Submission meta – team name, contacts, members, short README, git tag identifying the submitted iteration, and metadata file produced by the helper script.
Required final artifacts (tie-breaks & checks)
- Model declaration – for each model: `name`, `version`, `provider`, `parameter_count` (numeric) or organizer tier; mixed-model entries use call-weighted mean parameters for tie-breaks.
- Evaluation summary (JSON/CSV) – at minimum: `total_inference_calls`, `total_tokens` (or raw requests), `evaluation_episodes`, `mean_calls_per_episode`, and `mean_tokens_per_episode`.
- Raw requests / re-tokenizable text – per-call request texts (JSONL/ZIP) for organizer re-tokenization, or per-call token counts computed with the organizer-designated tokenizer. Include a README if any content is redacted or encoded.
- Optional per-episode breakdown – `episode_id`, `game_name`, `seed`, `inference_calls`, `tokens`, `final_score`.
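The sketch below shows how the required summary fields could be aggregated from per-episode records shaped like the optional breakdown above and written to JSON. Only the field names come from the requirements; the aggregation logic and example values are illustrative.

```python
# Aggregate the required evaluation-summary fields from per-episode records.
import json

def build_evaluation_summary(episodes: list[dict]) -> dict:
    n = len(episodes)
    total_calls = sum(e["inference_calls"] for e in episodes)
    total_tokens = sum(e["tokens"] for e in episodes)
    return {
        "total_inference_calls": total_calls,
        "total_tokens": total_tokens,
        "evaluation_episodes": n,
        "mean_calls_per_episode": total_calls / n if n else 0.0,
        "mean_tokens_per_episode": total_tokens / n if n else 0.0,
    }

# Example record matching the optional per-episode breakdown (values made up).
episodes = [
    {"episode_id": 1, "game_name": "2048", "seed": 0,
     "inference_calls": 120, "tokens": 45000, "final_score": 2048},
]
with open("evaluation_summary.json", "w") as f:
    json.dump(build_evaluation_summary(episodes), f, indent=2)
```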
Scoring
Automatic evaluation
- Per-game metrics include completion, score, win rate, or task-specific objectives.
- Both text-only and text-plus-vision agents are supported.
- The aggregate score is the average across games, with variance reported.
- Runs use fixed seeds, versioned environments, and config-driven execution.
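If you want your local runs to mirror this, a minimal seeding helper looks like the sketch below. Which libraries need seeding depends on your agent; the selection here is an assumption, not a starter-kit requirement.

```python
# Minimal seeding helper for reproducible local runs.
import os
import random

import numpy as np

def set_seed(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional for text-only agents
```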
Qualitative review
- Freeform MCP plays are reviewed for reliability and recovery.
Timeline
- Competition launch: 7 Nov 2025
- Submissions open: 28 Nov 2025
- Team registration deadline: 12 Dec 2025
- Final submission deadline: 1 Feb 2026
- Offline evaluation complete: 14 Feb 2026
- Winner announcement: 14 Feb 2026
Prizes
The challenge carries a prize pool of USD 20,000 across two tracks.
Track 1: Lightweight
- 1st Place: USD 6,000
- 2nd Place: USD 3,000
- 3rd Place: USD 1,000
Track 2: Open
- 1st Place: USD 6,000
- 2nd Place: USD 3,000
- 3rd Place: USD 1,000
Sponsored credits
- Track 1 · Lightweight: NVIDIA Brev credits USD 15,000
- Track 2 · Open: AWS Bedrock credits USD 20,000 and OpenAI API credits USD 10,000
Each track has a separate leaderboard and award pool. Awards may be adjusted or reallocated if participation is uneven or rules are violated.
Claim compute credits to improve your solution. Click here to learn how.
Starter kit and resources
To make your first submission with ease, check out the starter kit. The starter kit includes:
- Game setup guides
- MCP client and sample tools
- Baselines and logging templates
- Leaderboard submission guide
Resources and references
Orak benchmark paper
Read the Orak Benchmark Paper here.
Amazon Bedrock
- Setup and usage: Example code for calling FMs in Amazon Bedrock (Python) https://docs.aws.amazon.com/code-library/latest/ug/python_3_bedrock-runtime_code_examples.html
- IAM policy configuration · Refer to section (b) to allow only necessary API calls https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html
- Runtime API operations https://docs.aws.amazon.com/bedrock/latest/APIReference/API_Operations_Amazon_Bedrock_Runtime.html
- Supported models by region https://docs.aws.amazon.com/ko_kr/bedrock/latest/userguide/models-regions.html
- Service quotas https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
NVIDIA Brev
- Setup and usage guidelines here.
- Credits usage and limits here.
Sponsors
Eligibility and rules
- Max team size: 5
- Submission limit: up to 5 per 24 hours
- Model cutoff: models must be released before 1 Nov 2025
- Fine-tuning: allowed only on public or team-created datasets, which must be shared if used
- Final selection: each team may mark one submission for judging
- Reproducibility: packages required for final evaluation
- Multiple accounts: disqualification
- Age and jurisdiction: open to participants 18+ and subject to Korean law