
🙋‍♀️ New to the challenge? 🤔 Want to make your first submission? 

⚙️ Access the Starter-Kit here.


✨ This challenge is a shared task of the Wordplay - EMNLP 2025 Workshop 📕

🕵️ Introduction

To sustain coherent and engaging conversations, dialogue agents must consider the personas of listeners to produce utterances that cater to their interests. They must also maintain a consistent speaker persona, worldview, and role so that their counterparts feel engaged in a realistic conversation and immersed in that world.

Furthermore, to behave according to their role, dialogue agents must engage in task-oriented dialogue that connects with the world (e.g., the game space) and reflects actions taken in it. They must also be able to call functions to exercise the capabilities of the given role, check information, and draw on the knowledge that the role possesses.

In this task, we seek dialogue response generation systems that can appropriately utilize role-based knowledge and available functions while behaving in accordance with the assigned worldview and role.

📑 The Task

In this task, participants will submit a dialogue response generation system. We evaluate whether appropriate responses can be generated based on the input evaluation data (see "Evaluate②" in the figure). Additionally, we assess whether the necessary function calls for information retrieval and task execution (e.g., actions within the game) can be generated before response generation (see "Evaluate①" in the figure).

Two interlocutors are first assigned background profiles that describe their personas in the dialogue. Each background profile contains basic information (e.g., name, age, etc.) and persona descriptions consisting of elements deemed necessary to represent characters appearing in a game, in addition to perspectives similar to those in the PeaCoK† knowledge graph.

Based on the assigned personas of the two interlocutors, the task is to develop a dialogue model that generates one (orange) interlocutor’s response to their (blue) counterpart, given the dialogue history between these two interlocutors. The generated responses must maintain consistency with the dialogue history and knowledge while acting in accordance with the conversation partner's intentions and expectations. Additionally, the model must execute knowledge and functions according to its assigned role.

The dialogue model is assumed to play an NPC in the game, while the conversation partner is a player. During evaluation data collection, the player side is given an objective describing what the player wants to accomplish; this objective is not disclosed to the model during dialogue generation, since such information would not normally be available.

Note: A training dataset is provided as a reference, but its use is not mandatory. Participants may use any other datasets of their choice. To help identify issues with this task, a baseline model that can be tested with the provided training dataset is available in the starter kit.
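
To make the two evaluation points concrete, the sketch below shows one way a submitted system could be organized: a first step that decides whether to emit a function call (Evaluate①) and a second step that generates the NPC's utterance (Evaluate②). The class and method names here are illustrative assumptions, not the official starter-kit interface.

```python
# Minimal sketch of a two-step system: (1) optionally emit a function call,
# (2) generate the NPC response. Names are illustrative, not the official API.
from typing import Optional


class DialogueAgent:
    def __init__(self, model):
        self.model = model  # any response-generation backend

    def plan_function_call(self, dialogue_state: dict) -> Optional[dict]:
        """Evaluate(1): decide whether a function call is needed this turn.

        Returns e.g. {"name": "check_inventory", "arguments": {...}} or None.
        """
        # A real system would inspect the dialogue history, role knowledge,
        # and the available function definitions in `dialogue_state`.
        return None

    def generate_response(self, dialogue_state: dict,
                          function_result: Optional[dict] = None) -> str:
        """Evaluate(2): produce the NPC's next utterance.

        The response should stay consistent with the persona, worldview,
        knowledge, and (if present) the result of the function call.
        """
        return self.model.generate(dialogue_state, function_result)
```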

GPU and API Tracks

We provide two separate settings for participants to choose from: the GPU Track and the API Track.

GPU Track

In this track, participants are provided with access to a GPU node, allowing them to fine-tune and submit their own LLMs tailored to this task.

API Track

In the API Track, participants are given access to the OpenAI API, enabling them to test their prompt engineering skills with a powerful LLM. 

💾 Evaluation Data

The submitted systems will be evaluated using dialogue datasets based on personas and roles within the game. The evaluation data includes persona and worldview information as common knowledge, along with available function definitions and role-specific knowledge. Participants must use this information to call functions when necessary and incorporate the results into response generation.

Format

In this challenge, the model itself acts as an NPC and is expected to interact with the player, who serves as the conversation partner. The model can utilize several pieces of information:

  • Worldview
  • Basic information about the player
  • Detailed persona settings of the NPC
  • Knowledge known to the NPC
  • The state of the environment and/or NPC
  • Dialogue history
  • Function definitions

You can gain a general understanding of each piece of information from the provided training data.
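
As a rough illustration of how these pieces fit together, the fragment below sketches the shape of a single evaluation turn. The field names and values are assumptions made for illustration only; the authoritative schema is the one used in the provided training data.

```python
# Illustrative shape of one evaluation turn; field names are assumptions,
# not the official schema -- consult the training data for the real format.
example_turn = {
    "worldview": "A medieval fantasy town troubled by monsters in the nearby woods.",
    "player_profile": {"name": "Aria", "age": 23},
    "npc_persona": {
        "name": "Bram",
        "role": "blacksmith",
        "persona": ["gruff but kind", "proud of his craftsmanship"],
    },
    "npc_knowledge": ["Iron swords cost 120 gold.", "The mine closed last winter."],
    "state": {"shop_open": True, "player_gold": 150},
    "dialogue_history": [
        {"speaker": "player", "text": "Do you have any swords for sale?"},
    ],
    "functions": [
        {
            "name": "check_inventory",
            "description": "Return the items currently in stock.",
            "parameters": {"type": "object", "properties": {}},
        }
    ],
}
```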

Example

We provide an illustrative example of the training data.

Policy

The test set for the CPD Challenge will remain closed. Participants will not have access to it at any stage, even outside the challenge. This ensures a fair comparison of all submissions. The test set was created by Sony Group Corporation specifically for the evaluation of the CPD Challenge and is therefore confidential. It will not be shared with anyone outside the challenge’s organizing team.

🎓 Evaluation

Evaluation Protocols

In this task, we assume a multi-turn dialogue and evaluate response generation at each turn.

The model is required to output response sentences, and we check whether these responses are natural with respect to the context and the assigned role. In doing so, the system is expected to behave as an NPC in that world, based on its predefined role, persona, and knowledge.

In the previous CPDC2023, we initially used two metrics, Word F1 and BLEU, and later added three more: CPDScore, USEScore, and BERTScore. Our analysis showed that each of these metrics has its own weaknesses.

For CPDC2025, we aim to conduct automatic evaluation in a way that reduces the impact of any single metric's weaknesses. Specifically, we plan to combine multiple different metrics and avoid displaying each model's raw scores, in order to discourage excessive optimization toward specific metrics. We will continue to explore better evaluation methods through this challenge.
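
For reference, the snippet below is a minimal sketch of word-level F1, one of the lexical-overlap metrics mentioned above. It is illustrative only and not the scoring code used in the challenge.

```python
from collections import Counter


def word_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between two whitespace-tokenized strings (illustrative only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(word_f1("I can forge you an iron sword", "I could forge an iron sword for you"))
```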

Automatic metrics alone are not fully reliable for evaluating dialogue systems (Liu et al., 2016; Novikova et al., 2017). Therefore, we will also conduct human evaluations on the dialogue responses. However, due to the high workload, human evaluation will not be conducted for all submitted systems. Instead, pairwise comparisons will be made between the top-performing systems selected by automatic evaluation.

In human evaluation, we will comprehensively consider Fluency, Coherence, Consistency, Engagingness, and Humanness. While the evaluation will be fundamentally similar to the previous one, we will also take into account aspects such as Persona, Worldview, Knowledge, and Role, considering the consistency required for NPC behavior.

Automatic Evaluation Metrics

The ranking will be displayed on the leaderboard based on automatic evaluation results. Submitted systems (models) will be evaluated using a closed evaluation dataset prepared specifically for the CPD Challenge.

Human Evaluation Criteria

Only the top-performing systems selected through automatic evaluation will undergo human evaluation. The criteria described above (Fluency, Coherence, Consistency, Engagingness, and Humanness, together with consistency of Persona, Worldview, Knowledge, and Role) will be considered, though the evaluation is not limited to them.

Note:

  • Systems must be self-contained, functioning without dependencies on external services or network access.
  • Systems should generate responses within a reasonable time to support natural conversations.
  • The metrics listed here are not the only automatic evaluation metrics that will be used.

📕 Baselines

We provide an illustrative baseline model for this task, which is an un-tuned LLaMA-3.1-8B-Instruct model. Please find it in the starter kit. 
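
For orientation, the snippet below shows one way to run such an un-tuned instruct model with the Hugging Face transformers library. This is a hedged sketch, not the baseline's actual code (which lives in the starter kit), and it assumes you have accepted the model's license on the Hugging Face Hub.

```python
# Sketch of running an un-tuned Llama-3.1-8B-Instruct model with transformers.
# The official baseline is in the starter kit; this is only an approximation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Persona and dialogue history would come from the evaluation input.
messages = [
    {"role": "system", "content": "You are Bram, a gruff but kind blacksmith NPC."},
    {"role": "user", "content": "Do you have any swords for sale?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```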

✍️ Submission Format and Compute Constraints

GPU Track

Your model will be run on an AWS g6e.2xlarge node. This node has 8 vCPUs, 64 GB of RAM, and one NVIDIA L40S GPU with 48 GB of VRAM. Timeout per turn: 7 s.

API Track

Your model will be run on an AWS m5.large node. This node has 2 vCPUs and 8 GB of RAM.

API Usage Constraints

  • A maximum of 2 API calls per utterance is allowed.
  • Input token limit per turn: 2,000 tokens
  • Output token limit per turn: 200 tokens
  • Only gpt-4o-mini is allowed and available on the servers.
  • Fine-tuned API models are not allowed.
  • Network access is expected to be blocked except for OpenAI API usage.
  • Timeout per turn: 7 s
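
As a concrete illustration, the sketch below shows a single gpt-4o-mini call that stays within these limits, using the official OpenAI Python client. How the client and credentials are provisioned inside the evaluation environment is determined by the starter kit, so treat the setup here as an assumption.

```python
# Sketch of one gpt-4o-mini call respecting the per-turn limits
# (at most 2 calls, <= 2,000 input tokens, <= 200 output tokens, 7 s budget).
# How credentials are supplied inside the evaluator is starter-kit specific.
from openai import OpenAI

client = OpenAI()  # assumes the evaluation environment provides the API key


def npc_reply(system_prompt: str, dialogue_history: list[dict]) -> str:
    """Generate the NPC's next utterance from a compact system prompt and history."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt}] + dialogue_history,
        max_tokens=200,  # output token limit per turn
        timeout=7,       # keep within the 7 s per-turn budget
    )
    return response.choices[0].message.content
```

Keeping the system prompt compact (persona, relevant knowledge, and function results only) helps stay under the 2,000 input-token limit.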

📅 Timeline

The challenge will take place across three rounds, each using a different evaluation dataset for ranking the systems.

  • Warm-up Round: 9th April 2025
  • Round 1: 20th April 2025
  • Round 2: 25th May 2025
  • Challenge End: 30th June 2025

🏆 Prizes

The prize pool totals 20,000 USD, divided across six tracks. Participating teams are eligible to win prizes across multiple leaderboards in both tracks.

Task 1: Task-Oriented Dialogue Response Generation (4,000 USD)

  • GPU Track
    • 🥇 First place: 1,000 USD
    • 🥈 Second place: 500 USD
    • 🥉 Third place: 500 USD
  • API Track
    • 🥇 First place: 1,000 USD
    • 🥈 Second place: 500 USD
    • 🥉 Third place: 500 USD

Please refer to the Challenge Rules for more details about the open-sourcing criteria for each leaderboard to be eligible for the associated prizes.

This challenge is a shared task of the Wordplay Workshop at EMNLP 2025; participants will get a chance to submit a technical report in the form of a paper, with the exact submission format and venue to be confirmed.

🔗 Reference

📱 Challenge Organizing Committee

  • Hiromi Wakaki (Sony)
  • Antoine Bosselut (EPFL)
  • Silin Gao (EPFL)
  • Yuki Mitsufuji (Sony)
  • Yoshinori Maeda (Sony)
  • Yukiko Nishimura (Sony)
  • Keiichi Yamada (Sony)
  • Shiva Sundaram (Sony)
  • Sergey Bashkirov (Sony)
  • Prithviraj Ammanabrolu (UCSD)

If you have queries or feedback, or are looking for teammates, drop a message on the AIcrowd Community. Don't forget to hop onto the Discord channel to collaborate with fellow participants and connect directly with the organizers. Share your thoughts, spark collaborations, and get your queries addressed promptly.