Round 1: Completed

OpenFood Ingredients List Challenge

Extracting language-specific text from food packaging

The OpenFood project has photographed and manually extracted text from 1243 food packages, and we expect to photograph more than 20,000 in the course of the project. The goal of this challenge is to produce an automated solution that extracts the ingredients lists and classifies them by language.

Photographs of ingredients lists from food packaging are available in the Dataset section, with manually extracted text in French, German, Italian and English. Photographs and extracted text for 800 products have been provided to participants.

The challenge is to build an OCR script and language classification model that will process each folder of product images and produce a text file of the contents.

The code must perform two tasks:

  • Convert the images to text
  • Classify the language of the text

Some images may contain text in languages other than German (de), French (fr), Italian (it) and English (en). Text in these additional languages is not required and should be discarded.
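As a minimal sketch of the discard rule, the following Python helper keeps only the four required language codes and serialises one product as a single JSON line in the shape of the examples shown on this page. The names `REQUIRED_LANGUAGES` and `build_record` are hypothetical, and writing non-ASCII characters unescaped is an assumption based on the example records, not a documented requirement.

```python
import json

# The four language codes named in the challenge description.
REQUIRED_LANGUAGES = ("de", "fr", "it", "en")


def build_record(product_id, texts_by_language):
    """Return one JSON line for a product, dropping unwanted languages.

    texts_by_language maps a language code to the OCR text for that
    language. Codes outside REQUIRED_LANGUAGES and empty texts are dropped.
    (Hypothetical helper, not part of any official tooling.)
    """
    ingredients = {
        lang: text
        for lang, text in texts_by_language.items()
        if lang in REQUIRED_LANGUAGES and text.strip()
    }
    record = {"product_id": product_id, "ingredients": ingredients}
    # Assumption: accented characters are written as-is, as in the examples.
    return json.dumps(record, ensure_ascii=False)
```

For example, `build_record(1, {"de": "Zucker", "es": "azúcar"})` would keep only the German entry.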

Example images

product_15665 image

product_15665 json

{"product_id":15665,"ingredients":{"de": "Getreidemehle (Reis 50%, Hafer 26%, Mais 3%), Zucker, Inulin, Salz, Aroma, Farbstoff (Ammoniak-Zuckercouleur, Annatto)", "fr": "farines de céréales (riz 50%, avoine 26%, maïs 3%), sucre, inuline, sel, arôme, colorant (caramel E150c, Rocou)"}}

product_1978 image

product_1978 json

{"product_id":1978,"ingredients":{"de": "Tomatenmark, Zucker, Trinkwasser, Essig (Weisswein, Gerstenmalz, Branntwein / Weingeist), Melasse, modifizierte Stärke, Speisesalz, Gewürze (mit Senfsaat), Rapsöl, Raucharomen, Gerstenmalzextrakt, Stabilisator (Xanthan), Säuerungsmittel (Citronensäure), Konservierungsstoff (Kaliumsorbat), Anchovis, Tamarindenextrakt, Aromen"}}

product_15668 image

product_15668 json

{"product_id":15668,"ingredients":{"de": "Wasser, Kokosnussextrakt 33%, Zucker, Maniokstärke, Verdickungsmittel(E440, E406), Kaffee-Extrakt 0.6%, Zitronensaftkonzentrat", "fr": "eau, extrait de noix de coco 33%, sucre, fécule de manioc, épaississants(E440, E406), extrait de café 0.6%, concentré de jus de citron"}}

Evaluation criteria

  • An evaluation set of 443 ingredients list images will be used.

  • Submissions will be run by crowdAI in a Docker container based on the official Ubuntu 14.04 LTS image, loaded with the test images.

  • Participants must produce a folder of code with an install script called install.sh. When executed in the container this will download and install any necessary code and libraries.

  • Participants must also include a script run.sh which is run against a folder of images.

  • Participants can expect the 443 images to exist in the /home/ubuntu/images/ directory

  • The text classification predictions file is to be written to /home/ubuntu/predictions.json.txt. The training_predictions.json.txt file found under the Dataset link demonstrates the required format, which is one line per entry.

  • The predictions.json.txt file will be evaluated against the answer file using the Python difflib library’s ratio function, which returns a score between 0 and 1 as an indication of similarity. The final score is the average of the scores for the 443 images.

  • Each line of the output file will be processed individually, so the order of JSON entries in the file is irrelevant.
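The scoring described above can be approximated locally with a short Python sketch. The official evaluation script is not shown here, so several details are assumptions: entries are matched by `product_id`, the `ingredients` object is compared as a key-sorted JSON string, and a product missing from the predictions scores 0. The function names are hypothetical; only `difflib.SequenceMatcher.ratio` comes from the description.

```python
import difflib
import json


def load_entries(lines):
    """Parse one JSON object per line into {product_id: ingredients}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        entries[record["product_id"]] = record.get("ingredients", {})
    return entries


def canonical(ingredients):
    """Serialise ingredients with sorted keys (an assumed normalisation)."""
    return json.dumps(ingredients, ensure_ascii=False, sort_keys=True)


def mean_ratio(predicted_lines, answer_lines):
    """Average difflib ratio over all answer entries; line order is ignored."""
    predicted = load_entries(predicted_lines)
    answers = load_entries(answer_lines)
    if not answers:
        return 0.0
    total = 0.0
    for product_id, truth in answers.items():
        guess = predicted.get(product_id)
        if guess is None:
            continue  # assumption: a missing product contributes a score of 0
        matcher = difflib.SequenceMatcher(None, canonical(guess), canonical(truth))
        total += matcher.ratio()
    return total / len(answers)
```

Running this against the provided training_predictions.json.txt file gives a rough local estimate only; the official score may differ if the evaluator normalises the text differently.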

Resources

Participants should use the official Ubuntu 14.04 image for this challenge.

docker run -it ubuntu:trusty /bin/bash

  • Any required libraries and packages may be installed. The participant must define them in an install.sh script, which will be run as bash /home/ubuntu/install.sh

  • The images will be loaded into the /home/ubuntu/images/ directory.

  • The participant will provide a run.sh script which will be executed as bash /home/ubuntu/run.sh

  • The predictions file should be written to /home/ubuntu/predictions.json.txt
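To illustrate how these paths fit together, here is a minimal Python 3 sketch of a driver that a run.sh script could invoke. It assumes one sub-folder per product named `product_<id>`, which is a guess based on the example names on this page rather than a documented layout. The `extract` callable is a hypothetical stand-in for the participant's own OCR and language classification code, and `write_predictions` is likewise an illustrative name.

```python
import json
import os

# Paths taken from the challenge description.
IMAGES_DIR = "/home/ubuntu/images/"
PREDICTIONS_PATH = "/home/ubuntu/predictions.json.txt"


def write_predictions(extract, images_dir=IMAGES_DIR, out_path=PREDICTIONS_PATH):
    """Write one JSON line per product folder found in images_dir.

    extract is a hypothetical callable: it receives the sorted list of
    image paths for one product and returns {language_code: text}.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(images_dir)):
            folder = os.path.join(images_dir, name)
            if not os.path.isdir(folder):
                continue
            # Assumed folder naming: product_<id>, e.g. product_15665.
            try:
                product_id = int(name.rsplit("_", 1)[-1])
            except ValueError:
                continue  # skip folders that do not end in a numeric id
            image_paths = sorted(
                os.path.join(folder, f) for f in os.listdir(folder)
            )
            record = {"product_id": product_id, "ingredients": extract(image_paths)}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Passing the OCR step in as a function keeps the file handling separate from the model, so the output format can be checked with a stub before any OCR library is installed.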

Participants may use any FOSS (free and open source) resources to produce the solution, including programming languages, frameworks and public APIs.

Prizes

The author of the highest-ranked submission scoring above 80% will be invited to the crowdAI winner’s symposium at the 2nd Applied Machine Learning Days (note that the challenge will end after the upcoming 1st Applied Machine Learning Days). The educational award is given to the participant with either the most insightful submission posts or the best tutorial; the crowdAI team will pick the recipient, who will also be invited to the symposium. Expenses for travel and accommodation are covered by crowdAI.

In addition, there is a CHF 2,000 (~ USD 2,000) prize for the highest-ranked submission scoring above 90%.
