
🙋‍♀️ New to the challenge? 🤔 Want to make your first submission? 

⚙️ Access the Starter-Kit here.

✨ This challenge is a shared task of the Wordplay - EMNLP 2025 Workshop 📕

🕵️ Introduction

To sustain coherent and engaging conversations, dialogue agents must consider the personas of listeners to produce utterances that cater to their interests. They must also maintain a consistent speaker persona, worldview, and role so that their counterparts feel engaged in a realistic conversation and immersed in that world.

True immersion, however, requires not only natural small talk that aligns with the worldview and the NPCs’ personas, but also task-oriented dialogue that reflects actions connected to that world.

In this task, we seek dialogue response generation systems that can:

  • Appropriately utilize role-based knowledge and available functions.
  • Accurately represent and incorporate personas grounded in commonsense.
  • Behave according to the assigned worldview and role.

We evaluate whether models can perform both persona-based dialogue and task-oriented functions effectively within a single system.

📑 The Task

In this task, participants will submit a dialogue response generation system.

We evaluate whether a single model can:

  • Engage in natural, human-like conversations based on personas.
  • Perform necessary tasks based on roles.

Submitting to Task 3 will automatically trigger evaluation under both Task 1 and Task 2, and those evaluation results will contribute to the Task 3 result.

Participants must prepare a model/system that meets the requirements of both tasks.

GPU and API Tracks

We provide two separate settings for participants to choose from: the GPU Track and the API Track.

GPU Track

In this track, participants are provided with access to a GPU node, allowing them to fine-tune and submit their own LLMs tailored to this task.

API Track

In the API Track, participants are given access to the OpenAI API, enabling them to test their prompt engineering skills with a powerful LLM. 

💾 Evaluation Data

There is no dedicated evaluation dataset for Task 3. Performance will be assessed comprehensively based on the evaluation results of Task 1 and Task 2, and participants will compete on the Task 3 leaderboard.

The submitted systems will be evaluated using dialogue datasets based on personas and roles within the game, following the same evaluation format as Task 1 and Task 2.

The evaluation dataset will include:

  • Persona and worldview information as common knowledge.
  • Available function definitions and role-specific knowledge for task execution.

Format

In this challenge, the model itself acts as an NPC and is expected to interact with the player, who serves as the conversation partner. The model can utilize several pieces of information:

  • Worldview
  • Basic information about the player
  • Detailed persona settings of the NPC
  • Knowledge known to the NPC
  • The state of the environment and/or the NPC
  • Dialogue history
  • Function definitions

You can gain a general understanding of each piece of information from the provided training data.

Example

We provide an illustrative example of the training data.
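
As a rough, hypothetical sketch of how these pieces of information might be bundled, one training instance could look like the following. All field names are illustrative assumptions; consult the released training data in the starter kit for the actual schema.

```python
# Hypothetical sketch of one training instance, based on the information types
# listed in the Format section. Field names are illustrative assumptions only;
# the released training data defines the real schema.
example_instance = {
    "worldview": "...",              # shared setting / world description
    "player_info": {"...": "..."},   # basic information about the player
    "npc_persona": {"...": "..."},   # detailed persona settings of the NPC
    "npc_knowledge": ["..."],        # knowledge known to the NPC
    "state": {"...": "..."},         # state of the environment and/or the NPC
    "dialogue_history": [
        {"speaker": "player", "utterance": "..."},
        {"speaker": "npc", "utterance": "..."},
    ],
    "functions": [                   # available function definitions
        {"name": "...", "description": "...", "parameters": {"...": "..."}},
    ],
    "response": "...",               # target NPC utterance and/or function call
}
```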

Policy

The test set for the CPD Challenge will remain closed. Participants will not have access to it at any stage, even outside the challenge. This ensures a fair comparison of all submissions.

The test set was created by Sony Group Corporation specifically for the evaluation of the CPD Challenge and is therefore confidential. It will not be shared with anyone outside the challenge’s organizing team.

🎓 Evaluation Metrics

Automatic Evaluation Metrics

The ranking will be displayed on the leaderboard based on automatic evaluation results. Submitted systems (models) will be evaluated using a closed evaluation dataset prepared specifically for the CPD Challenge.

Note:

  • Systems must be self-contained, functioning without dependencies on external services or network access.
  • Systems should generate responses within a reasonable time to support natural conversations.
  • The metrics listed here are not the only automatic evaluation metrics that will be used.

📕 Baselines

We provide an illustrative baseline for this task: an un-tuned LLaMA-3.1-8B-Instruct model, which you can find in the starter kit.
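
For orientation, here is a minimal sketch of how such an un-tuned LLaMA-3.1-8B-Instruct model could be queried with the Hugging Face transformers library. This is not the official baseline code; the prompt construction, function signature, and generation settings are assumptions, and the starter kit defines the actual interface.

```python
# Minimal sketch (not the official baseline) of querying LLaMA-3.1-8B-Instruct
# with Hugging Face transformers. Prompt assembly and settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def respond(system_prompt: str, dialogue_history: list[dict]) -> str:
    """Generate one NPC utterance from a system prompt and chat-style history."""
    messages = [{"role": "system", "content": system_prompt}] + dialogue_history
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=200, do_sample=False)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
```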

✍️ Submission Format and Compute Constraints

GPU Track

Your model will be run on an AWS g6e.2xlarge node. This node has 8 vCPUs, 64 GB of RAM, and one NVIDIA L40S GPU with 48 GB of VRAM. Timeout per turn: 7s.
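
As a quick sanity check against the 7-second budget, you may want to time your system's per-turn responses locally before submitting. The helper below is a simple sketch; `respond_fn` and `payload` are placeholder names for whatever interface your system exposes.

```python
import time

def check_turn_budget(respond_fn, payload, budget_s: float = 7.0):
    """Run one turn and report whether it finished within the per-turn timeout."""
    start = time.monotonic()
    reply = respond_fn(payload)
    elapsed = time.monotonic() - start
    status = "OK" if elapsed <= budget_s else "TOO SLOW"
    print(f"turn took {elapsed:.2f}s (budget {budget_s}s): {status}")
    return reply
```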

API Track

Your model will be run on an AWS m5.large node. This node has 2 vCPUs and 8 GB of RAM.

API Usage Constraints

  • A maximum of 2 API calls per utterance is allowed.
  • Input token limit per turn: 2,000 tokens.
  • Output token limit per turn: 200 tokens.
  • Only gpt-4o-mini is allowed and available on the servers.
  • Fine-tuned API models are not allowed.
  • Network access is blocked except for OpenAI API usage.
  • Timeout per turn: 7s.
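
Below is a minimal sketch of a constraint-aware call using the official openai Python client. The helper name and prompt assembly are assumptions, not the official interface; staying under roughly 2,000 input tokens (for example by truncating older dialogue history) and making at most two such calls per utterance is the participant's responsibility.

```python
# Sketch of a single constraint-aware call to gpt-4o-mini with the openai client.
# Helper name and prompt assembly are assumptions, not the official interface.
from openai import OpenAI

client = OpenAI()  # API access is provided in the evaluation environment

def generate_npc_reply(system_prompt: str, dialogue_history: list[dict]) -> str:
    """One chat completion; at most 2 such calls may be made per utterance."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # the only model allowed in the API Track
        messages=[{"role": "system", "content": system_prompt}] + dialogue_history,
        max_tokens=200,        # respects the 200-token output limit per turn
        timeout=7,             # stay within the 7s per-turn timeout
    )
    # Keep the combined prompt under ~2,000 input tokens before calling this.
    return response.choices[0].message.content
```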

📅 Timeline

The challenge will take place across three rounds, each using a different evaluation dataset for ranking the systems.

  • Warm-up Round: 9th April 2025
  • Round 1: 20th April 2025
  • Round 2: 25th May 2025
  • Challenge End: 30th June 2025

🏆 Prizes

The total prize pool is 20,000 USD, divided across six leaderboards. Participating teams are eligible to win prizes on multiple leaderboards in both tracks.

Task 3: Hybrid Evaluation of Task 1 and Task 2 (12,000 USD)

  • GPU Track
    • 🥇 First place: 3,000 USD
    • 🥈 Second place: 2,000 USD
    • 🥉 Third place: 1,000 USD
  • API Track
    • 🥇 First place: 3,000 USD
    • 🥈 Second place: 2,000 USD
    • 🥉 Third place: 1,000 USD

Please refer to the Challenge Rules for details on the open-sourcing criteria each leaderboard requires for prize eligibility.

This challenge is a shared task of the Wordplay Workshop at EMNLP 2025; participants will get a chance to submit a technical report in the form of a paper, with the exact submission format and venue to be confirmed.


📱 Challenge Organizing Committee

  • Hiromi Wakaki (Sony)
  • Antoine Bosselut (EPFL)
  • Silin Gao (EPFL)
  • Yuki Mitsufuji (Sony)
  • Yoshinori Maeda (Sony)
  • Yukiko Nishimura (Sony)
  • Keiichi Yamada (Sony)
  • Shiva Sundaram (Sony)
  • Sergey Bashkirov (Sony)
  • Prithviraj Ammanabrolu (UCSD)

If you have queries or feedback, or are looking for teammates, drop a message on the AIcrowd Community. Don’t forget to hop onto the Discord channel to collaborate with fellow participants and connect directly with the organisers. Share your thoughts, spark collaborations, and get your queries addressed promptly.