EPFL Machine Learning Higgs
Spot the Boson
train.csv - Training set of 250000 events. The file starts with the ID column, then the label column (the y you have to predict), and finally 30 feature columns.
test.csv - The test set of around 568238 events - Everything as above, except the label is missing.
sample-submission.csv - a sample submission file in the correct format. The sample submission always predicts -1, that is ‘background’.
Zip file containing all 3 above files can be downloaded from the resource section.
For detailed information on the semantics of the features, labels, and weights, see the technical documentation from the LAL website on the task. Note that here for the EPFL course, we use a simpler evaluation metric instead (classification error).
Some details to get started:
- all variables are floating point, except PRI_jet_num which is integer
- variables prefixed with PRI (for PRImitives) are “raw” quantities about the bunch collision as measured by the detector.
- variables prefixed with DER (for DERived) are quantities computed from the primitive features, which were selected by the physicists of ATLAS.
- it can happen that for some entries some variables are meaningless or cannot be computed; in this case, their value is −999.0, which is outside the normal range of all variables.