
ImageCLEF 2019 Caption - Concept Detection

Identifying relevant concepts in a large corpus of medical images



Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines.

Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. The more image characteristics are known, the more structured the radiology scans become and, hence, the more efficiently radiologists can interpret them. We work on the basis of a large-scale collection of figures from open access biomedical journal articles (PubMed Central). All images in the training data are accompanied by UMLS concepts extracted from the original image caption.

Lessons learned:

  • In the first and second editions of this task, held at ImageCLEF 2017 and ImageCLEF 2018, participants noted a broad variety of content and situations among the training images. This year, the training data is restricted to radiology images.

  • A large number of concepts was used in previous years. This year, the captions are pre-processed before concept extraction, which reduces the number of concepts.

  • As uncertainty regarding additional sources was noted, we will clearly separate systems that use exclusively the official training data from those that incorporate additional sources of evidence.

Challenge description

The first step to automatic image captioning and scene understanding is identifying the presence and location of relevant concepts in a large corpus of medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which captions are composed. The concepts can be further applied for context-based image and information retrieval purposes.

Evaluation is conducted in terms of set coverage metrics such as precision, recall, and combinations thereof. This task will be run using a subset of the Radiology Objects in COntext (ROCO) dataset [1].


From the PubMed Open Access subset containing 1,828,575 archives, a total of 6,031,814 image-caption pairs were extracted. To focus on radiology images and non-compound figures, automatic filtering with deep learning systems as well as manual revisions were applied, reducing the dataset to 72,187 radiology images of several medical imaging modalities.

NOTE: If the usage of an additional source for training is intended, it should not be a subset of PubMed Central Open Access (archiving date: 01.02.2018 - 01.02.2019), to avoid an overlap with the test data.

Submission instructions

As soon as submissions are open, you will find a “Create Submission” button on this page (just next to the tabs).

For the submission we expect the following format:

  • Figure-ID TAB Concept-ID-1;Concept-ID-2;Concept-ID-n

For example:

  • ROCO_41341 C0033785;C0035561
  • ROCO_07563 C0043299;C1306645;C1548003;C1962945

You need to respect the following constraints:

  • The separator between the figure ID and the concepts has to be a tab character
  • The separator between the UMLS concepts has to be a semicolon (;)
  • Each figure ID of the test set must be included in the submitted file exactly once (even if there are no concepts for it)
  • The same concept cannot be specified more than once for a given figure ID
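
The constraints above can be checked locally before uploading a run. The following is a minimal sketch, not an official tool: the function name `validate_submission` and the set `test_ids` are hypothetical, and it assumes that a figure with no concepts may be written either as the bare figure ID or as the ID followed by a tab and an empty concept list, which the rules above do not spell out.

```python
def validate_submission(lines, test_ids):
    """Return a list of problems found in a run file; an empty list means none were found.

    `lines` is an iterable of text lines, `test_ids` the set of figure IDs in the test set.
    """
    problems = []
    seen = set()
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        figure_id, sep, concept_field = line.partition("\t")
        if not sep and " " in line:
            # Probably a space was used where a tab character is required.
            problems.append(f"line {n}: no tab between figure ID and concepts")
            continue
        if figure_id in seen:
            problems.append(f"line {n}: figure ID {figure_id} appears more than once")
        seen.add(figure_id)
        if figure_id not in test_ids:
            problems.append(f"line {n}: unknown figure ID {figure_id}")
        # Assumption: an empty concept field means "no concepts for this figure".
        concepts = [c for c in concept_field.split(";") if c]
        if len(concepts) != len(set(concepts)):
            problems.append(f"line {n}: duplicate concept for {figure_id}")
    for missing in sorted(set(test_ids) - seen):
        problems.append(f"missing figure ID {missing}")
    return problems
```

Called as `validate_submission(open(path, encoding="utf-8"), test_ids)`, it reports every violation it finds rather than stopping at the first one, which makes it easier to repair a run file in a single pass.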




[1] O. Pelka, S. Koitka, J. Rückert, F. Nensa and C. M. Friedrich, "Radiology Objects in COntext (ROCO): A Multimodal Image Dataset", Proceedings of the MICCAI Workshop on Large-scale Annotation of Biomedical data and Expert Label Synthesis (MICCAI LABELS 2018), Granada, Spain, September 16, 2018, Lecture Notes in Computer Science (LNCS) Volume 11043, Pages 180-189, DOI: 10.1007/978-3-030-01364-6_20, Springer Verlag, 2018.

Evaluation criteria

Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:

  • The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring method is used. It is documented here.

  • A Python (3.x) script loads the candidate run file, as well as the ground truth (GT) file, and processes each candidate-GT pair of concept sets.

  • For each candidate-GT pair, the y_pred and y_true arrays are generated. They are binary arrays indicating, for each concept in the union of the candidate and GT sets, whether it is present (1) or not (0).

  • The F1 score is then calculated. The default ‘binary’ averaging method is used.

  • All F1 scores are summed and averaged over the number of elements in the test set (10,000), giving the final score.
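
The steps above can be illustrated with a small dependency-free sketch. It is not the official evaluation script and does not call scikit-learn; it relies on the fact that binary F1 equals 2·TP / (2·TP + FP + FN). The function names and the `empty_score` parameter are hypothetical, and the score assigned when both the candidate and the GT set are empty is an assumption, since the text above does not state how that case is handled.

```python
def f1_for_image(candidate, ground_truth, empty_score=0.0):
    """Binary F1 between the predicted and ground-truth concept sets of one image."""
    candidate, ground_truth = set(candidate), set(ground_truth)
    # Every concept that occurs in the candidate set or in the GT set.
    vocabulary = sorted(candidate | ground_truth)
    if not vocabulary:
        # Assumption: behaviour for two empty sets is not specified above.
        return empty_score
    y_pred = [1 if c in candidate else 0 for c in vocabulary]
    y_true = [1 if c in ground_truth else 0 for c in vocabulary]
    tp = sum(p * t for p, t in zip(y_pred, y_true))
    fp = sum(y_pred) - tp
    fn = sum(y_true) - tp
    # The denominator is never zero here: each concept is in at least one set.
    return 2 * tp / (2 * tp + fp + fn)


def mean_f1(candidates, ground_truths, empty_score=0.0):
    """Average the per-image F1 over all images in the ground truth.

    Both arguments map a figure ID to an iterable of concept IDs; a figure
    missing from `candidates` is treated as having no predicted concepts.
    """
    total = sum(
        f1_for_image(candidates.get(fid, ()), gt, empty_score)
        for fid, gt in ground_truths.items()
    )
    return total / len(ground_truths)
```

For instance, predicting C0033785 and C0035561 for an image whose ground truth contains only C0033785 gives one true positive and one false positive, hence a per-image F1 of 2/3.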

The ground truth for the test set was generated based on the UMLS Full Release 2018AB.

NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.x, on a system where the scikit-learn (>= v0.17.1-2) Python library is installed. The script should be run like this:

/path/to/python3 evaluate-f1.py /path/to/candidate/file /path/to/ground-truth/file

The leaderboard will be visible from 01.05.2019 (official deadline) on. The submission system will remain open a few more days. Results submitted after the deadline will not be part of the official results.


Contact us

We strongly encourage you to use the public channels mentioned above for communication between participants and organizers. In exceptional cases, if you have queries or comments that you would like to send through a private channel, you can email us at:

  • Obioma Pelka: obioma[DOT]pelka[AT]fh-dortmund[DOT]de
  • Christoph M. Friedrich: christoph[DOT]friedrich[AT]fh-dortmund[DOT]de
  • Alba Garcia Seco de Herrera: alba[DOT]garcia[AT]essex[DOT]ac[DOT]uk
  • Henning Müller: henning[DOT]mueller[AT]hevs[DOT]ch

More information

You can find additional information on the challenge here: http://imageclef.org/2019/caption


ImageCLEF 2019 is an evaluation campaign organized as part of the CLEF initiative labs. The campaign offers several research tasks that welcome participation from teams around the world. The results of the campaign appear in the working notes proceedings, published by CEUR Workshop Proceedings (CEUR-WS.org). Selected contributions from the participants will be invited for publication the following year in the Springer Lecture Notes in Computer Science (LNCS), together with the annual lab overviews.

Datasets License