
👥 Looking for teammates or advice? Join the competition Slack!

This task is about determining when and what clarifying questions to ask. Given an instruction from the Architect (e.g., “Help me build a house.”), the Builder needs to decide whether it has sufficient information to carry out the described task or whether further clarification is needed. For instance, the Builder might ask “What material should I use to build the house?” or “Where do you want it?”. In this NLP task, we focus on the research question "what to ask to clarify a given instruction" independently from learning to interact with the 3D environment. The original instruction and its clarification can be used as input for the Builder to guide its progress.

Top: the architect's instruction was clear, so no clarifying question gets asked. Bottom: 'leftmost' is ambiguous, so the Builder asks a clarifying question.

The original description of the baselines and the methodologies can be found in the following paper:

@inproceedings{aliannejadi-etal-2021-building,
    title = "Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions",
    author = "Aliannejadi, Mohammad and Kiseleva, Julia and Chuklin, Aleksandr and Dalton, Jeff and Burtsev, Mikhail",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    doi = "10.18653/v1/2021.emnlp-main.367",
    pages = "4473--4484",
}

## 🖊 Evaluation

Models submitted to this track will be evaluated on both the when-to-ask and the what-to-ask criteria, with a two-step scoring process.

• When to ask: This is a binary classification problem: does the provided instruction require a clarifying question? We use the macro-averaged F1 score to evaluate your classifier. However, we do not believe that heavily optimizing this metric is the best use of your time from a research perspective. Hence we quantize the F1 score into the following bins:

• 0.90 - 1.0
• 0.85 - 0.90
• 0.75 - 0.85
• 0.65 - 0.75
• 0.50 - 0.65
• 0.35 - 0.50
• 0.0 - 0.35

So if your classifier gets an F1 score of 0.82, the binned F1 score will be 0.75; for an F1 score of 0.93, the binned score will be 0.90, and so on.
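The bin mapping can be sketched as follows, assuming (per the 0.82 → 0.75 example above) that each raw score maps to the lower edge of its bin:

```python
# Lower edges of the F1 bins listed above, highest first.
BIN_EDGES = [0.90, 0.85, 0.75, 0.65, 0.50, 0.35, 0.0]

def binned_f1(f1: float) -> float:
    """Map a raw macro-F1 score to the lower edge of its bin."""
    for edge in BIN_EDGES:
        if f1 >= edge:
            return edge
    return 0.0

print(binned_f1(0.82))  # 0.75, matching the example above
print(binned_f1(0.93))  # 0.9
```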

• What to ask: The second problem evaluates how well your model can rank a list of human-issued clarifying questions for a given ambiguous instruction. Your model will be evaluated on Mean Reciprocal Rank (MRR), rounded off to 3 significant digits.
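As a concrete illustration, MRR averages the reciprocal rank of the relevant question over all unclear instructions. The data layout below (dicts keyed by a game id) is hypothetical, chosen only to make the computation explicit:

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over instructions: `rankings` maps an instruction id to the
    model's ranked list of question ids; `relevant` maps it to the
    single relevant qid. Returns the mean of 1/rank, rounded to 3 digits."""
    total = 0.0
    for inst_id, ranked_qids in rankings.items():
        rank = ranked_qids.index(relevant[inst_id]) + 1  # 1-based rank
        total += 1.0 / rank
    return round(total / len(rankings), 3)

# Toy example: relevant question ranked 1st and 3rd -> (1 + 1/3) / 2
mrr = mean_reciprocal_rank(
    {"g1": ["q5", "q2"], "g2": ["q7", "q1", "q3"]},
    {"g1": "q5", "g2": "q3"},
)
print(mrr)  # 0.667
```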

The leaderboard will be ranked by binned F1 score; submissions with the same binned F1 score will be ordered by MRR.

Please note that the above metrics are subject to change after the warm-up phase of the competition.

## 💾 Dataset

Download the public dataset for this task using the link below; you will need to accept the rules of the competition to access the data.

The dataset consists of:

• clarifying_questions_train.csv
• question_bank.csv
• initial_world_paths folder

clarifying_questions_train.csv has the following columns:

• GameId - Id of the game session.
• InitializedWorldPath - Path to the file under initial_world_paths that contains the state of the world presented to the architect. The architect provides a building instruction based on this world state. More information will follow on how the world state can be parsed/visualized.
• InputInstruction - Instruction provided by the architect.
• IsInstructionClear - Specifies whether the instruction provided by the architect is clear. This is marked by another annotator, not the architect.
• ClarifyingQuestion - Question asked by the annotator after marking the instruction as unclear.
• qrel - Question id (qid) of the relevant clarifying question for the current instruction.
• qbank - List of clarifying question ids that need to be ranked for each unclear instruction. The mapping between clarifying questions and ids is present in the question_bank.csv.

The merged list of ids in the qrel and qbank columns gives the full list of qids to be ranked for each unclear instruction.

question_bank.csv: This file contains the mapping between the qids mentioned in the qrel and qbank columns of clarifying_questions_train.csv and the bank of clarifying questions issued by annotators.
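A minimal sketch of that merge for a single row, assuming the qbank cell is read from the CSV as a string-encoded Python list (the actual encoding in the released files may differ):

```python
import ast

# Hypothetical row values, mirroring the column descriptions above:
# qrel holds the relevant qid, qbank a string-encoded list of candidate qids.
qrel = 17
qbank_cell = "[3, 12, 45]"

# Merge qrel and qbank to get every qid to rank for this instruction.
candidates = ast.literal_eval(qbank_cell)
qids_to_rank = sorted(set(candidates) | {qrel})
print(qids_to_rank)  # [3, 12, 17, 45]
```

Each resulting qid can then be looked up in question_bank.csv to recover the question text to rank.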

## 🚀 Getting Started

Make your first submission using the Starter Kit. 🚀

## 📅 Timeline

• July: Releasing materials: IGLU framework and baselines code.
• 29th July: Competition begins! Participants are invited to start submitting their solutions.
• 31st October: Submission deadline. Submissions are closed and organizers begin the evaluation process.
• November: Winners are announced and are invited to contribute to the competition writeup.
• 2nd-3rd of December: Presentation at NeurIPS 2022 (online/virtual).

## 🏆 Prizes

The challenge features a total cash prize pool of $16,500 USD. The prize pool for the NLP Task is divided as follows:

• 1st place: $4,000 USD
• 2nd place: $1,500 USD
• 3rd place: $1,000 USD

Task winners. For each task, we will evaluate submissions as described in the Evaluation section. The three teams that score highest on this evaluation will receive prizes of $4,000 USD, $1,500 USD, and $1,000 USD.

Research prizes. We have reserved $3,500 USD of the prize pool to be given out at the organizers’ discretion to submissions that we think made a particularly interesting or valuable research contribution. If you wish to be considered for a research prize, please include some details on interesting research-relevant results in the README of your submission. We expect to award around 2-5 research prizes in total.


## Baselines

We will be releasing the baselines soon; keep an eye on the forums.

## 👥 Team

The organizing team:

• Julia Kiseleva (Microsoft Research)
• Alexey Skrynnik (MIPT)
• Artem Zholus (MIPT)
• Shrestha Mohanty (Microsoft Research)
• Negar Arabzadeh (University of Waterloo)
• Marc-Alexandre Côté (Microsoft Research)
• Milagro Teruel (Microsoft Research)
• Ziming Li (Amazon Alexa)
• Mikhail Burtsev (MIPT)
• Maartje ter Hoeve (University of Amsterdam)
• Zoya Volovikova (MIPT)
• Aleksandr Panov (MIPT)
• Yuxuan Sun (Meta AI)
• Kavya Srinet (Meta AI)
• Arthur Szlam (Meta AI)
• Dipam Chakraborty (AIcrowd)

• Tim Rocktäschel (UCL & DeepMind)
• Julia Hockenmaier (University of Illinois at Urbana-Champaign)
• Bill Dolan (Microsoft Research)
• Ryen W. White (Microsoft Research)
• Maarten de Rijke (University of Amsterdam)
• Oleg Rokhlenko (Amazon Alexa Shopping)

Special thanks to our sponsors for their contributions.