Thanks to SEAM and AIcrowd for this wonderful competition. Thanks to all the explainers, particularly « PyTorch starter 0.857 F1-Score on public LB » from ivan_romanov, that were so helpful for me.
1. Summary of the solution :
- I solved this 3D image interpretation problem as 2D image segmentation, using the DeepLabV3+ architecture with an efficientnet-b3 encoder.
- I trained 3 models (one to identify the first 20x-weighted class, one to identify the second 20x-weighted class, and one to identify the remaining 4 classes) and ensembled them to get the final segmentation.
- I used all 782 vertical images and 590 horizontal images in TRAIN for training, and the 78 vertical images (10% of all vertical images) nearest the East border for validation.
- In post-processing, I applied a moving average of window size 71 along the x-axis.
Github link of the Python code :
2. Separate training for high-weight classes and ensemble :
The initial labels ran from 1 to 6. I subtracted 1 from each label, so in my scripts the labels run from 0 to 5.
In round 2, the F1-score is weighted so that class 4 and class 5 count 20x more than the other classes. As a result, predicting class 4 and class 5 correctly is particularly important. At first, I tried to train a single model to identify all 6 classes, but I found it very difficult to obtain a model that works well for every class. So I came up with the idea of training 3 models : a first model to identify the low-weight classes (0, 1, 2 and 3), a second model to identify class 4 and a third model to identify class 5.
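The write-up does not spell out the exact rule used to combine the three models, so the following is only a minimal sketch of one plausible rule: start from the low-weight model's labels, then overwrite pixels where the class 4 model predicts class 4 and where the class 5 model predicts class 5. The function name, the argument names and the overwrite order (class 5 last) are assumptions made for illustration.

```python
import numpy as np

def ensemble_labels(base_labels, class4_labels, class5_labels):
    """Overlay the two specialist predictions on the base prediction.

    All inputs are integer label maps of the same shape (0-based labels).
    Assumption: the specialist models win wherever they fire, and the
    class 5 model has the last word if both fire on the same pixel.
    """
    final = base_labels.copy()
    final[class4_labels == 4] = 4   # trust the class 4 specialist
    final[class5_labels == 5] = 5   # trust the class 5 specialist
    return final

# Toy 2x2 label maps standing in for one predicted image.
base = np.array([[0, 1], [2, 3]])
from_model4 = np.array([[4, 1], [0, 5]])   # only its class 4 pixels are used
from_model5 = np.array([[0, 5], [0, 4]])   # only its class 5 pixels are used
final = ensemble_labels(base, from_model4, from_model5)
```

Since the labels were shifted to 0..5 for training, adding 1 to `final` would restore the original 1..6 labels.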
I used the script "train_inference_class0123.ipynb" to train the model for classes 0, 1, 2 and 3, and 2 other scripts to train the models for class 4 and class 5. "train_inference_class0123.ipynb", "train_inference_class4.ipynb" and "train_inference_class5.ipynb" are very similar, with 2 differences : the number of classes in the image segmentation, and the class weights given to each class during training.
"train_inference_class0123.ipynb" merges classes 4 and 5 into class 1, so the number of classes is 4. The 4 classes have the same weight.
"train_inference_class4.ipynb" keeps all of the classes during training, so the number of classes is 6. Class 4 is given a larger class weight than the other classes.
"train_inference_class5.ipynb" keeps all of the classes during training, so the number of classes is 6. Class 5 is given a larger class weight than the other classes.
"binary_ensemble.ipynb" combines the inference results of the first 3 scripts and generates the final submission file.
3. Train set and validation set :
As shown in the figure, TRAIN is a 3D image represented as an array of 1006 × 782 × 590 real numbers, stored in the order (Z, X, Y). TEST2 (the test dataset for Round 2) is 1006 × 334 × 841 in size and borders the training image to the East. As a result, training images near the East border are very similar to the test images. I used all 782 vertical images and 590 horizontal images in TRAIN for training, and the 78 vertical images (10% of all vertical images) nearest the East border for validation. I found it important to include the East border images in training, so that the model could learn from images similar to the test set. It is equally important to validate on East border images, so that the selected checkpoint is the one that works well on the test set. The East border images therefore appear in both the training and validation sets. To avoid overfitting, I limited training to 20 epochs.
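The slicing and the validation split can be sketched as follows. This is an illustration under assumptions: the 782 "vertical images" are taken to be the slices at fixed X and the 590 "horizontal images" the slices at fixed Y, and the East border is assumed to sit at the highest X indices. The helper name `east_validation_indices` is hypothetical, and a tiny zero-filled array stands in for TRAIN.

```python
import numpy as np

def east_validation_indices(n_x, fraction=0.1):
    """Indices of the last `fraction` of X positions (assumed East side)."""
    n_val = int(n_x * fraction)
    return np.arange(n_x - n_val, n_x)

# Tiny stand-in volume in (Z, X, Y) order; the real TRAIN is 1006 x 782 x 590.
volume = np.zeros((8, 20, 6), dtype=np.float32)

# One 2D image per X position (Z x Y) and one per Y position (Z x X).
x_images = [volume[:, x, :] for x in range(volume.shape[1])]
y_images = [volume[:, :, y] for y in range(volume.shape[2])]
train_images = x_images + y_images   # every slice is used for training

# With the real X size, 10% gives the 78 slices closest to the East border.
val_idx = east_validation_indices(n_x=782)
```

Note that the validation slices are a subset of the training slices here, which matches the choice described above rather than a conventional hold-out split.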
4. Other training parameters :
- Data augmentation : (package used : albumentations) ShiftScaleRotate + RandomCrop + MultiplicativeNoise + HorizontalFlip
- Architecture & encoder : DeepLabV3+ & efficientnet-b3 (package used : segmentation_models_pytorch). DeepLabV3+ outperformed Unet. Due to GPU memory limitations, I didn’t test bigger efficientnet encoders, such as efficientnet-b4.
- Optimizer : Adam with a learning rate of 1e-3
- Loss : CrossEntropy x 0.3 + GDiceLoss x 0.7
(GDiceLoss : Generalized Dice for multi-class image segmentation)
When using only CrossEntropy, I got high accuracy but a low F1-score : the minority classes, especially class 4, were overestimated. When using only GDiceLoss, the minority classes were underestimated. I got a better F1-score by combining the two.
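To make the weighting concrete, here is a small NumPy sketch of the arithmetic behind a 0.3 / 0.7 mix of cross-entropy and generalized Dice. It is not the training code: the function names are hypothetical, the Dice term assumes the common formulation with class weights equal to the inverse squared label volume (absent classes are given weight 0 here), and the optional `class_weights` argument only stands in for the per-class weights mentioned in section 2, whose actual values are not given.

```python
import numpy as np

def cross_entropy(probs, target, class_weights=None, eps=1e-7):
    """Weighted mean cross-entropy. probs: (N, C), target: (N,) int labels."""
    n_classes = probs.shape[1]
    w = np.ones(n_classes) if class_weights is None else np.asarray(class_weights, dtype=float)
    picked = probs[np.arange(len(target)), target]   # probability of the true class
    per_pixel = -w[target] * np.log(picked + eps)
    return per_pixel.sum() / w[target].sum()

def generalized_dice_loss(probs, target, eps=1e-7):
    """Generalized Dice loss with inverse squared volume weights (assumed form)."""
    onehot = np.eye(probs.shape[1])[target]          # (N, C)
    volume = onehot.sum(axis=0)                      # pixels per class
    # Classes absent from the target get weight 0 (a choice made for this sketch).
    w = np.where(volume > 0, 1.0 / np.maximum(volume, 1.0) ** 2, 0.0)
    intersect = (w * (probs * onehot).sum(axis=0)).sum()
    denom = (w * (probs + onehot).sum(axis=0)).sum()
    return 1.0 - 2.0 * intersect / (denom + eps)

def combined_loss(probs, target, class_weights=None):
    """0.3 x cross-entropy + 0.7 x generalized Dice, as in the list above."""
    return 0.3 * cross_entropy(probs, target, class_weights) + 0.7 * generalized_dice_loss(probs, target)

# Four pixels, three classes: a perfect prediction versus a uniform one.
target = np.array([0, 0, 1, 2])
perfect = np.eye(3)[target]
uniform = np.full((4, 3), 1.0 / 3.0)
loss_perfect = combined_loss(perfect, target)
loss_uniform = combined_loss(uniform, target)
```

In actual training the same two terms would be computed with differentiable PyTorch operations on the network's softmax output.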
5. Post-processing :
- Moving average : In the test set, adjacent vertical images are very similar to each other, so when moving from the leftmost image to the rightmost image, the facies boundaries should change gradually. My inference was done on each vertical image individually, without considering neighboring images. When I looked at my predictions from the leftmost image to the rightmost image, the facies boundaries jittered from one image to the next. To smooth the prediction, I added a moving average of window size 71 along the x-axis, i.e. I averaged the predictions of 71 adjacent vertical images. The F1-score improved.
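The smoothing step could look roughly like the sketch below. Several details are assumptions, since the write-up does not specify them: the average is taken over per-class probabilities (not hard labels) before the argmax, the predictions are stacked as an array of shape (X, C, Z, Y), and the window is simply truncated near the two ends of the x-axis. The function name `smooth_along_x` is hypothetical.

```python
import numpy as np

def smooth_along_x(probs, window=71):
    """Centered moving average over axis 0 of a (X, C, Z, Y) probability stack."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it is centered")
    half = window // 2
    n = probs.shape[0]
    # Prefix sums along x, with a leading zero slice: csum[k] = sum of first k slices.
    csum = np.cumsum(probs, axis=0, dtype=np.float64)
    csum = np.concatenate([np.zeros_like(csum[:1]), csum], axis=0)
    idx = np.arange(n)
    lo = np.clip(idx - half, 0, n)        # window start (inclusive)
    hi = np.clip(idx + half + 1, 0, n)    # window end (exclusive)
    counts = (hi - lo).reshape(-1, 1, 1, 1)
    return (csum[hi] - csum[lo]) / counts

# Toy stack: 9 slices, 2 classes, 1x1 pixels; slice 4 disagrees with its neighbors.
probs = np.tile(np.array([0.9, 0.1]).reshape(1, 2, 1, 1), (9, 1, 1, 1))
probs[4, :, 0, 0] = [0.1, 0.9]
raw_labels = probs.argmax(axis=1)
smoothed_labels = smooth_along_x(probs, window=3).argmax(axis=1)
```

In this toy case the isolated outlier slice is pulled back to the label of its neighbors, which is the boundary-stabilizing effect described above; the real post-processing used a window of 71.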
6. Tested but failed to improve the performance : TTA (test-time augmentation), feeding multiple adjacent images to the input layer
7. Did not implement due to time limitation : During round 1 of the competition, I implemented pseudo-labeling and post-processing such as removing small holes inside a facies, and both improved the performance. I did not implement these two techniques in round 2 for lack of time.