
🙋‍♀️ New to the challenge? 🤔 Want to make your first submission? 

⚙️ Access the Starter-Kit here.

 

✨ This challenge is a shared task of the Wordplay - EMNLP 2025 Workshop 📕

🕵️ Introduction

To sustain coherent and engaging conversations, dialogue agents must consider the personas of listeners to produce utterances that cater to their interests. They must also maintain a consistent speaker persona, worldview, and role so that their counterparts feel engaged in a realistic conversation and immersed in that world.

Beyond this, to behave according to their role, dialogue agents must engage in task-oriented dialogue that is grounded in the world (e.g., the game space) and reflects actions taken within it. They must also be able to call functions to exercise the given role's capabilities, look up information, and utilize the knowledge possessed by that role.

In this task, we seek dialogue response generation systems that can appropriately utilize role-based knowledge and available functions while behaving in accordance with the assigned worldview and role.

📑 The Task

In this task, participants will submit a dialogue response generation system. We evaluate whether appropriate responses can be generated based on the input evaluation data (see "Evaluate②" in the figure). Additionally, we assess whether the necessary function calls for information retrieval and task execution (e.g., actions within the game) can be generated before response generation (see "Evaluate①" in the figure).

Two interlocutors are first assigned background profiles that describe their personas in the dialogue. Each background profile contains basic information (e.g., name, age, etc.) and persona descriptions consisting of elements deemed necessary to represent characters appearing in a game, in addition to perspectives similar to those in the PeaCoK† knowledge graph.

Based on the assigned personas of the two interlocutors, the task is to develop a dialogue model that generates one (orange) interlocutor's response to their (blue) counterpart, given the dialogue history between these two interlocutors. The generated responses must remain consistent with the dialogue history and knowledge while acting in accordance with the conversation partner's intentions and expectations. Additionally, the model must apply its knowledge and execute functions in accordance with its assigned role.

The dialogue model is assumed to represent an NPC in the game, while the conversation partner is a player. Although an objective describing what the player wants to accomplish is provided to the player side during evaluation data collection, this objective is not disclosed to the model during dialogue generation, since such information would not normally be available.

Note: A training dataset is provided as a reference, but its use is not mandatory. Participants may use any other datasets of their choice. To help participants get a feel for the task, a baseline model that can be tested with the provided training dataset is available in the starter kit.

GPU and API Tracks

We provide two separate settings for participants to choose from: the GPU Track and the API Track.

GPU Track

In this track, participants are provided with access to a GPU node, allowing them to fine-tune and submit their own LLMs tailored to this task.

API Track

In the API Track, participants are given access to the OpenAI API, enabling them to test their prompt engineering skills with a powerful LLM. 

💾 Evaluation Data

The submitted systems will be evaluated using dialogue datasets based on personas and roles within the game. The evaluation data includes persona and worldview information as common knowledge, along with available function definitions and role-specific knowledge. Participants must use this information to call functions when necessary and incorporate the results into response generation.

Format

In this challenge, the model itself acts as an NPC and is expected to interact with the player, who serves as the conversation partner.
The model can utilize several pieces of information:

  • Worldview
  • Basic information about the player
  • Detailed persona settings of the NPC
  • Knowledge known to the NPC
  • The state of the environment and/or the NPC
  • Dialogue history
  • Function definitions

You can gain a general understanding of each piece of information from the provided training data.

Example

We provide an illustrative example of the training data.
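
The official schema is defined by the released data itself; purely as a hypothetical sketch (all field names below are illustrative, not the official ones), a single training instance assembled from the information sources listed above might look like this:

```python
# Hypothetical sketch of one training instance. Field names and content are
# illustrative only; consult the starter kit for the official schema.
example_instance = {
    "worldview": "A medieval fantasy town where adventurers trade and rest.",
    "player_profile": {"name": "Aldo", "age": 24},
    "npc_persona": {
        "name": "Mira",
        "role": "innkeeper",
        "traits": ["cheerful", "protective of regulars"],
    },
    "npc_knowledge": [
        "Rooms cost 10 gold per night.",
        "The east road is closed due to flooding.",
    ],
    "state": {"time_of_day": "evening", "rooms_available": 2},
    "dialogue_history": [
        {"speaker": "player", "utterance": "Do you have a room for tonight?"},
    ],
    "function_definitions": [
        {"name": "check_room_availability",  # tool function: has a return value
         "parameters": {"date": "string"}},
        {"name": "hand_over_key",            # action function: no return value
         "parameters": {"room_id": "string"}},
    ],
}
```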

Policy

The test set for the CPD Challenge will remain closed. Participants will not have access to it at any stage, even outside the challenge. This ensures a fair comparison of all submissions. The test set was created by Sony Group Corporation specifically for the evaluation of the CPD Challenge and is therefore confidential. It will not be shared with anyone outside the challenge’s organizing team.

🎓 Evaluation Metrics

Evaluation Protocols

In this task, we assume a multi-turn dialogue and evaluate the following two points for each turn.

(1) Function Generation

Tool functions and action functions are defined and provided. For each turn, the system must generate and call functions as needed.

Specifically:

  • Action functions perform predetermined actions within the game; they have no return values.
  • Tool functions obtain the information necessary for the response; their return values can be used to generate the response.

We will evaluate whether the correct functions are called with the necessary arguments at each turn by comparing against gold labels. In some turns, the correct behavior is to call no function at all. Calling unnecessary functions lowers the score, while calling the appropriate functions at the right timing (turn) raises it.
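
The official metrics are not disclosed. Purely as an illustration of what comparing predicted calls against gold labels could look like (this is not the challenge's metric), here is a set-based per-turn F1 over (function name, arguments) pairs:

```python
# Illustrative only -- NOT the official challenge metric.
# Scores one turn by comparing predicted function calls against gold calls,
# where each call looks like {"name": "...", "args": {...}}.
def turn_call_f1(predicted: list[dict], gold: list[dict]) -> float:
    def as_set(calls):
        # Freeze each call into a hashable (name, sorted-args) pair.
        return {(c["name"], tuple(sorted(c["args"].items()))) for c in calls}

    pred, ref = as_set(predicted), as_set(gold)
    if not pred and not ref:
        return 1.0  # correctly calling nothing is a perfect turn
    if not pred or not ref:
        return 0.0  # spurious calls or missed calls score zero
    tp = len(pred & ref)
    precision, recall = tp / len(pred), tp / len(ref)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
```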

There are two types of function calls. The first is action functions, which must be executed at the right timing according to predetermined roles, conditions, and settings. These are the functions an NPC needs in order to act within the game, and they have no return values. The second is tool functions. When information that must appear in a response can only be obtained by calling a function, the corresponding tool function is expected to be called; whether a call is needed is determined by the flow of the conversation.
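
The concrete function definitions are provided with the evaluation data; the following is a hypothetical sketch of the two call types within one turn (the function names and game content are invented for illustration):

```python
# Hypothetical sketch of the two call types. Function names are invented;
# the actual definitions come with the evaluation data.

# Action function: performs a predetermined in-game action, no return value.
def hand_over_key(room_id: str) -> None:
    print(f"[game] NPC hands over the key to room {room_id}")

# Tool function: retrieves information whose return value feeds the response.
def check_room_availability(date: str) -> dict:
    return {"date": date, "rooms_available": 2, "price_per_night": 10}

# Per-turn flow: generate the needed calls first (Evaluate 1),
# then generate the response, using any tool results (Evaluate 2).
info = check_room_availability("today")
hand_over_key("2F-east")
response = (
    f"You're in luck, we have {info['rooms_available']} rooms left tonight. "
    f"That will be {info['price_per_night']} gold. Here is your key!"
)
```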

(2) Response Generation

In each turn, a response must be output. We will evaluate whether the response is natural given the context and the assigned role. The model is expected to act as an NPC in that world, based on predetermined roles, personas, and knowledge. When information that should appear naturally in the response requires a function call to obtain, the return values of tool function calls can be used. The model is also expected to produce utterances with the necessary content at the necessary timing relative to its action function calls.

We will not disclose the specific metrics used for evaluation, but all evaluations are automated. (1) and (2) are evaluated with different metrics, which are then integrated into a single overall score.

Automatic Evaluation Metrics

The ranking will be displayed on the leaderboard based on automatic evaluation results. Submitted systems (models) will be evaluated using a closed evaluation dataset prepared specifically for the CPD Challenge.

Note:

  • Systems must be self-contained, functioning without dependencies on external services or network access.
  • Systems should generate responses within a reasonable time to support natural conversations.
  • The metrics listed here are not the only automatic evaluation metrics that will be used.

📕 Baselines

We provide an illustrative baseline model for this task: an un-tuned LLaMA-3.1-8B-Instruct model. You can find it in the starter kit.
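
As a minimal sketch of how such a baseline could be run with Hugging Face transformers (the starter kit's actual entry point, prompt construction, and model identifier may differ):

```python
# Minimal sketch of using an un-tuned instruct model as the NPC.
# The starter kit's actual interface and prompting will differ.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed HF model id
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Mira, a cheerful innkeeper NPC."},
    {"role": "user", "content": "Do you have a room for tonight?"},
]
# max_time keeps generation under the 7 s per-turn budget described below.
out = generator(messages, max_new_tokens=200, max_time=6.0)
print(out[0]["generated_text"][-1]["content"])
```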

✍️ Submission Format and Compute Constraints

GPU Track

Your model will be run on an AWS g6e.2xlarge node. This node has 8 vCPUs, 64 GB RAM, and one NVIDIA L40S GPU with 48 GB VRAM. Timeout per turn: 7 s.

API Track

Your model will be run on an AWS m5.large node. This node has 2 vCPUs and 8 GB RAM.

API Usage Constraints

  • A maximum of 2 API calls per utterance is allowed.
  • Input token limit per turn: 2,000 tokens
  • Output token limit per turn: 200 tokens
  • Only gpt-4o-mini is allowed and available on the servers.
  • Fine-tuned API models are not allowed.
  • Network access is blocked except for OpenAI API usage.
  • Timeout per turn: 7 s
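
A minimal sketch of one API-track turn under these constraints, using the OpenAI Python SDK (prompt content and call budgeting here are illustrative):

```python
# Minimal sketch of one API-track turn. Constraints: at most 2 API calls per
# utterance, <= 2,000 input / <= 200 output tokens, gpt-4o-mini only.
# Prompt content and budgeting strategy are illustrative.
from openai import OpenAI

client = OpenAI()  # credentials are assumed to be provided by the evaluator

def generate_turn(system_prompt: str, history: list[dict]) -> str:
    # An optional first call could decide on function calls; this sketch
    # spends a single call on the NPC's utterance, staying under the cap.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt}, *history],
        max_tokens=200,  # respect the per-turn output limit
    )
    return completion.choices[0].message.content
```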

📅 Timeline

The challenge will take place across three rounds, each using a different evaluation dataset for ranking the systems.

  • Warm-up Round: 9th April 2025
  • Round 1: 20th April 2025
  • Round 2: 25th May 2025
  • Challenge End: 30th June 2025

🏆 Prizes

The prize pool is a total of 20,000 USD, divided among six tracks. Participating teams are eligible to win prizes across multiple leaderboards in both tracks.

Task 1: Task-Oriented Dialogue Response Generation (4,000 USD)

  • GPU Track
    • 🥇 First place: 1,000 USD
    • 🥈 Second place: 500 USD
    • 🥉 Third place: 500 USD
  • API Track
    • 🥇 First place: 1,000 USD
    • 🥈 Second place: 500 USD
    • 🥉 Third place: 500 USD

Please refer to the Challenge Rules for more details about the open-sourcing criteria for each leaderboard to be eligible for the associated prizes.

This challenge is a shared task of the Wordplay Workshop at EMNLP 2025; participants will get a chance to submit a technical report in the form of a paper, with the exact submission format and venue to be confirmed.

🔗 Reference

† Silin Gao, Beatriz Borges, Soyoung Oh, Deniz Bayazit, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, and Antoine Bosselut. "PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives." ACL 2023.

📱 Challenge Organizing Committee

  • Hiromi Wakaki (Sony)
  • Antoine Bosselut (EPFL)
  • Silin Gao (EPFL)
  • Yuki Mitsufuji (Sony)
  • Yoshinori Maeda (Sony)
  • Yukiko Nishimura (Sony)
  • Keiichi Yamada (Sony)
  • Shiva Sundaram (Sony)
  • Sergey Bashkirov (Sony)
  • Prithviraj Ammanabrolu (UCSD)

If you have queries or feedback or are looking for teammates, drop a message on the AIcrowd Community. Don't forget to hop onto the Discord channel to collaborate with fellow participants & connect directly with the organisers. Share your thoughts, spark collaborations, and get your queries addressed promptly.