The digitization of ancient texts is essential for analyzing ancient corpora and preserving cultural heritage. However, the transcription of ancient handwritten text using optical character recognition (OCR) methods remains challenging. Handwritten text recognition (HTR) concerns the conversion of scanned images of handwritten text into machine-encoded text. In contrast with OCR where the text to be transcribed is printed, HTR is more challenging and can lead to transcribed text that includes many more errors or even to no transcription at all when training data on the specific script (e.g., medieval) are not available.
Existing work on HTR combines OCR models and natural language processing (NLP) methods from fields such as grammatical error correction (GEC), which can assist with the task of post-correcting transcription errors. The post-correction task has been reported as expensive, time-consuming, and challenging for the human expert, especially for OCRed text of historical newspapers, where the error rate is as low as 10%. This challenge aims to invite more researchers to work on HTR, initiate the implementation of new state-of-the-art models, and obtain meaningful insights for this task. The focus of this challenge will be on the post-correction of HTR transcription errors, attempting to build on recent NLP advances such as the successful applications of Transformers and transfer learning.
Participants will be provided with data consisting of images of handwritten text and the corresponding text transcribed by a state-of-the-art HTR model. They will have one month to implement and train their systems, after which an evaluation set will be released. Participants will then be asked to submit a file with their systems' predictions for the evaluation set within a few days. The ground truth of the evaluation set will be used to score participating systems in terms of character error rate (CER). We will provide a script that calculates CER. Participating teams will be asked to author 4-page system description papers, which will be used to compile an overview of the task and outline the state of the art in the field.
- Since the task of this challenge is the post-correction of transcribed text, the HTRed text provided must be used as input. The images, other data or any information extracted from the images can be used as an additional input. However, models that use only the images and perform HTR will not be accepted.
- Participants are welcome to form teams. Teams should submit their predictions under a single account. For each team/participant the maximum number of submissions per day is five (5).
We provide training instances, consisting of handwritten texts that have been transcribed by human experts (the ground truth) and by a state-of-the-art HTR model (the input). The selected images of handwritten texts comprise Greek papyri and Byzantine manuscripts. First, more than 1,800 lines of transcribed text will be released in order to serve as training and validation data. The use of other resources for training is allowed and suggested. Next, an evaluation set will be released, for which we will only share the input. A very small part of the evaluation set is used to keep an up-to-date leaderboard.
An example of what the data look like is the following:
| ImageID | Human transcription | HTR transcription |
|---|---|---|
| Bodleian-Library-MS-Barocci-10200157fol-75r.jpg | ἐγγινομένα πάθη μὴ σβεννύντες ἀλλὰ τῆ εκλύσει | ἐγγενομεναπαδημησμεννωτες ἀλλατῆε κλησει |
| Bodleian-Library-MS-Barocci-10200157fol-75r.jpg | τοῦ βίου τοῦ καθ ΄ εαυτοὺς πολλὰ γίνεσθαι συγχωροῦν | του β ου του καλεαυτοὺς πολλαγινεσθαι συγχωρ όν |
| Bodleian-Library-MS-Barocci-10200157fol-75r.jpg | τες ἐμπυρίζουσι τὸν ἀμπελῶνα ἀλλὰ καὶ ὁ διὰ | τες εμπυριζου σιμαμπελῶνα ἀλλακαι ὅδξα |
| Bodleian-Library-MS-Barocci-10200157fol-75r.jpg | τῆς ἡδεῖας πλεονεξίας πολλοὺς εἰς τὴν τῶν ἀλλ | της ἐδίας πλσον ἐξιας πολλους ἐις τὴν τῶν ἀλ |
This example corresponds to fol. 75r from the Bodleian Library of the University of Oxford (Oxford, Bodleian Library MS. Barocci 102).
The dataset files are available in the Resources tab. They comprise:
- train.csv: Training set containing the HTRed text (input) and the human transcription (ground truth)
- test.csv: Unseen evaluation set containing only the HTRed text (input)
In the Notebooks tab, we provide a Starter Kit notebook that contains code to download the data, run two baselines and submit the predictions file in the right format.
The submissions should be in CSV format.
- Prepare a CSV file with two columns: 'ImageID', which corresponds to the images in test.csv, and 'Transcriptions', which holds the transcribed texts produced by your system.
- Code for creating the submission file can be found in the Starter Kit.
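As a rough illustration of the format described above, the following sketch writes a two-column submission file using only Python's standard library. The column names 'ImageID' and 'Transcriptions' come from the description; the function name `write_submission`, the default file name, and the choice to keep rows in the same order as test.csv are assumptions made for this example. The Starter Kit remains the reference for the exact format.

```python
import csv


def write_submission(image_ids, transcriptions, path="submission.csv"):
    """Write predictions to a CSV with the columns 'ImageID' and 'Transcriptions'.

    `write_submission` and the default path are hypothetical names for this
    sketch; they are not taken from the Starter Kit.
    """
    if len(image_ids) != len(transcriptions):
        raise ValueError("Each ImageID needs exactly one transcription.")

    # newline="" lets the csv module control line endings;
    # UTF-8 keeps the polytonic Greek characters intact.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ImageID", "Transcriptions"])
        # Rows are written in the order given. Several lines can share one
        # ImageID, so this sketch assumes the order of test.csv is preserved.
        writer.writerows(zip(image_ids, transcriptions))
```

A call such as `write_submission(ids, predictions)` would then produce `submission.csv` in the current directory, where `ids` and `predictions` are parallel lists built from test.csv and from the system's output.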
🖊 Evaluation Criteria
Character error reduction rate (CERR) measures the reduction in character error rate (CER) with and without the post-correction step. That is, two CERs are computed. The first is computed between the input and the ground truth. The second is computed between the post-corrected input text and the ground truth. CERR is the CER reduction from the first to the second. CERR is computed per line, and the macro-average across all lines of the sample is reported. This will be the official measure of the challenge. Secondary measures may be employed to allow a more detailed evaluation. The measure that is currently shown on the leaderboard as a secondary score is word error reduction rate (WERR).
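The measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script. It assumes that CER is the Levenshtein edit distance divided by the length of the ground-truth line, and that the per-line reduction is the plain difference between the two CERs (the description does not say whether the reduction is absolute or relative). The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalised by reference length (assumed)."""
    if not reference:
        # Assumption for this sketch: an empty reference scores 0 only
        # when the hypothesis is empty too.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def cerr(htr_lines, corrected_lines, truth_lines) -> float:
    """Macro-averaged per-line CER reduction (absolute difference, assumed)."""
    reductions = [
        cer(htr, truth) - cer(corrected, truth)
        for htr, corrected, truth in zip(htr_lines, corrected_lines, truth_lines)
    ]
    return sum(reductions) / len(reductions)
```

Under these assumptions, a positive value means post-correction moved the text closer to the ground truth, and a negative value means it introduced more errors than it fixed. The comparison here is done on raw code points; whether the official script applies any Unicode normalisation first is not stated in the description.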
- Training data release date: May 1st, 2022
- Evaluation data release date: June 1st, 2022
- Predictions submission deadline: 11:59 July 1st, 2022 (EXTENDED)
- Rankings release date: July 7th, 2022
- System description paper submission deadline: 11:59 August 1st, 2022
- Best system description paper announced: October 1st, 2022
- Workshop: November 7-8, 2022
All deadlines are in the UTC-12 time zone (Anywhere on Earth).
The participant with the best-performing system will be invited to attend a workshop in Venice upon completion of the challenge and to present the respective system description paper, with all expenses covered.