Hi ValAn, the MRR will be computed over all 3097 plant observation ids, and for each observation id you basically have to provide predictions for all the species (almost 1000 species; 996 if I’m not wrong). So your submitted file can contain up to 996 x 3097 lines. However, you may limit the number of species predictions with an arbitrarily chosen probability threshold, in order to keep only the most “relevant” propositions and limit the run file size. I don’t remember mentioning 10 predictions for MRR; maybe you saw that in the fake example of a prediction file we provided? Thank you very much for your interest in the challenge!
Hello, we plan to provide the test set next week, on April 15th, and the total number of submissions allowed will be ten, not ten per day. Thank you for your interest in the challenge!
Hi, we have clarified the following points in the submission instructions:
Please have a look at the main page of the PlantCLEF2020 challenge, or just read below:
As soon as the submission is open, you will find a “Create Submission” button on this page (next to the tabs).
Before being allowed to submit your results, you first have to press the red “Participate” button, which leads you to a page where you have to accept the challenge’s rules.
More practically, the run file to be submitted is a CSV file (with semicolon separators) and has to contain as many lines as there are predictions, each prediction being composed of an ObservationId (the identifier of a specimen, which can itself be composed of several images), a ClassId, a Probability and a Rank (used in case of equal probabilities). Each line should have the following format: <ObservationId;ClassId;Probability;Rank>
Here is a short fake run example respecting this format for only 3 observations: fake_run
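As a sketch of how such a run file could be generated, including the probability threshold mentioned above, the following assumes a hypothetical `predictions` mapping from ObservationId to per-class probabilities (the dict and threshold value are made up for illustration):

```python
import csv

# Hypothetical model output: {observation_id: {class_id: probability}}
predictions = {
    "obs_001": {"1355932": 0.61, "1363227": 0.22, "1392475": 0.005},
    "obs_002": {"1363227": 0.48, "1355932": 0.40},
}

THRESHOLD = 0.01  # arbitrary cut-off to limit the run file size

with open("run.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    for obs_id, probs in predictions.items():
        # Sort classes by decreasing probability; Rank breaks ties
        # and is assigned before thresholding so it stays consistent.
        ranked = sorted(probs.items(), key=lambda kv: -kv[1])
        for rank, (class_id, p) in enumerate(ranked, start=1):
            if p >= THRESHOLD:
                writer.writerow([obs_id, class_id, p, rank])
```

With the toy dictionary above, the prediction below the threshold is dropped, so the written file contains four `<ObservationId;ClassId;Probability;Rank>` lines.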
Evaluation criteria
The primary metric used for the evaluation of the task will be the Mean Reciprocal Rank (MRR). The MRR is a statistical measure for evaluating any process that produces, for a sample of queries, a list of possible responses ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The MRR is the average of the reciprocal ranks over the whole test set:

MRR = (1/|Q|) * sum_{i=1..|Q|} 1/rank_i

where |Q| is the total number of query occurrences in the test set and rank_i is the rank of the first correct answer for the i-th query.
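As a minimal sketch of this metric (the ground-truth and prediction dicts are made up for illustration), the MRR could be computed as follows:

```python
def mean_reciprocal_rank(ground_truth, ranked_predictions):
    """ground_truth: {observation_id: correct class_id};
    ranked_predictions: {observation_id: list of class_ids, best first}."""
    total = 0.0
    for obs_id, true_class in ground_truth.items():
        ranked = ranked_predictions.get(obs_id, [])
        if true_class in ranked:
            total += 1.0 / (ranked.index(true_class) + 1)
        # a query whose correct answer is absent contributes 0
    return total / len(ground_truth)

# Toy example: correct species ranked 1st, 2nd, and absent.
truth = {"obs_a": "42", "obs_b": "17", "obs_c": "99"}
runs = {"obs_a": ["42", "17"], "obs_b": ["42", "17"], "obs_c": ["42", "17"]}
print(mean_reciprocal_rank(truth, runs))  # (1 + 0.5 + 0) / 3 = 0.5
```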
A second metric will again be the MRR, but computed on a subset of observations related to the species that are the least populated in terms of “in the field” photographs, based on the most comprehensive estimates possible from different data sources (iDigBio, GBIF, Encyclopedia of Life, Bing and Google image search engines, and previous datasets from the PlantCLEF and ExpertCLEF challenges).
As a general comment, we expect that classical ConvNet-based approaches using complementary training sets containing photos in the field, such as ExpertCLEF2019, in addition to the PlantCLEF2020 training set, will perform well on the primary metric. However, we expect cross-domain approaches to get better results on the second metric, where in-the-field training photos are lacking.
Given the supremacy of deep learning and transfer learning techniques, it is conceptually difficult to prohibit the use of external training data, notably the training data used during last year’s ExpertCLEF2019 challenge, or pictures that can be found through GBIF, for example. However, we ask participants to provide at least one submission that uses only the training data provided this year.
Participants will be allowed to submit a maximum of 10 run files.
The “domain adaptation” section on Papers with Code is worth a look too.
Dear PlantCLEF2020 participants,
Short of inspiration? Tired of classic ConvNet approaches?
I have just added in the Motivation section some links to papers that can be inspiring.
(many thanks to Juan (aabab)). Let’s start with two links:
If you wish, you can share here with the other participants the references that you think are suitable for the challenge.
Thank you in advance for your contributions, and good luck to all!
And the content is:
This package is organized into three subfolders:
the “herbarium” subdirectory contains the vast majority of the data: a collection of about 327k herbarium scans relating to a selection of 1000 species of Amazonian plants mainly centered on French Guiana. The herbarium sheets come from two sources: the “Herbier IRD de Guyane”, digitized in the context of the e-ReColNat project, and iDigBio, a large international platform aggregating and giving access to millions of images of herbarium specimens hosted by various national museums of natural history and botanical institutes around the world. Pictures and their related metadata XML files are organized into subfolders, one for each species. The names of the subfolders correspond directly to the ClassId field that can be found in the XML content. The XML files contain various information (when available) such as longitude, latitude, place, date, taxonomy, and some tags on the pictures. Some herbarium sheets relate to the same plant observation or “specimen” and can be matched through the ObservationId field. All the pictures were resized to a maximum height of 1024 pixels, but the OriginalUrl field can be used to get pictures with a higher resolution.
the “herbarium_photo_associations” subdirectory contains more than three hundred specimens related to about 250 species, where for each individual plant (identified by the ObservationId field) we are supposed to have some pictures in the field and one or more herbarium sheets. The PhotoType field in the XML can take the value “herbarium” or “Photo” in order to identify whether the content is an herbarium sheet or a picture in the field. The “HerbariumPhotoAssociation” field explicitly indicates whether or not there is an association between pictures in the field and herbarium sheets related to the same specimen (but it is possible that some photos are missing…). As in the previous “herbarium” directory, pictures and their related metadata XML files are organized into subfolders, one for each species identified by a ClassId.
finally, the “photo” subfolder contains a few pictures in the field that were provided by the iDigBio API when the training species were requested.
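The layout described above (one subfolder per ClassId, each image accompanied by an XML metadata file carrying fields such as ClassId and ObservationId) could be traversed along these lines; the exact file naming and the .jpg extension are assumptions for illustration:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def index_subset(root_dir):
    """Group image paths by (ClassId, ObservationId) from the XML metadata.

    Assumes each image has a sibling .xml file containing ClassId and
    ObservationId elements, inside one subfolder per species.
    """
    index = {}
    for xml_path in Path(root_dir).glob("*/*.xml"):
        meta = ET.parse(xml_path).getroot()
        class_id = meta.findtext("ClassId")
        obs_id = meta.findtext("ObservationId")
        image_path = xml_path.with_suffix(".jpg")  # assumed image extension
        index.setdefault((class_id, obs_id), []).append(image_path)
    return index
```

Grouping by ObservationId rather than by file is useful because a single specimen can be composed of several images, and predictions are expected per observation.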
Pictures in the field contained in the “herbarium_photo_associations” and “photos” subdirectories could be used classically as an extra training dataset for directly fine-tuning a ConvNet model for species classification. In the same vein, it would also be possible to use pictures in the field related to Amazonian plants, such as the PlantCLEF2019 training dataset. But we really encourage participants to act as if no data other than herbarium sheets were available in the world (which is actually the case for many species in the training set and the test set). Photos in the “herbarium_photo_associations” and, possibly, “photos” subdirectories are essentially provided to allow learning a mapping between the herbarium sheet domain and the field picture domain.
a direct link is:
Dear heaven, since it is not convenient to add a new file in Zenodo, we now provide a description of the dataset directly in the file in the Resources tab.