Hey, thank you. I think I have transcription figured out
Thanks. Yeah, that step gave a small boost.
I had 2 ways to get validation scores:
- out-of-fold score for individual models: around 0.39
- average prediction of fold models on the held-out set: around 0.41

LB: around 0.31
It’s possible that the distribution of the test smells was different from what we had in the train set, or that the molecules were structurally different, hence the discrepancy between LB and validation. In previous rounds the results were much closer.
This was one of my favorite challenges so far, because the problem formulation is very simple and it attempts to get insight into one of our most primal yet neglected senses. My solution was far behind the top 2 competitors, so I feel like I was missing some crucial ingredient, and I am looking forward to learning about their approaches.
The core of my approach is a neural net on fingerprints.
Data: union of various fingerprints extracted from the SMILES in the train set:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles(smiles)
fp0 = MACCSkeys.GenMACCSKeys(mol)                         # MACCS keys
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 256)  # Morgan fingerprints
fp2 = Chem.RDKFingerprint(mol)                            # RDKit topological fingerprint
# substructure hits; smarts_inteligands has about 305 SMARTS patterns
fp3 = [len(mol.GetSubstructMatch(Chem.MolFromSmarts(smarts))) > 0
       for smarts in smarts_inteligands]
```
Preprocessing: drop constant and duplicate fingerprints
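A minimal sketch of that preprocessing step, assuming the fingerprints are stacked into a samples × features matrix (the function name is mine):

```python
import numpy as np

def drop_constant_and_duplicate(X):
    """Drop zero-variance columns, then exact duplicate columns (keep first)."""
    X = X[:, X.std(axis=0) > 0]
    _, first = np.unique(X, axis=1, return_index=True)
    return X[:, np.sort(first)]

# toy matrix: column 0 is constant, column 3 duplicates column 1
X = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 1, 1],
              [1, 0, 1, 0, 0]], dtype=float)
X_clean = drop_constant_and_duplicate(X)  # keeps 3 of the 5 columns
```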
Model:

```python
from torch import nn

hidden_size = 512
dropout = 0.3
output_size = 75

model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(inplace=True),
    nn.Dropout(dropout),
    nn.BatchNorm1d(hidden_size),
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(inplace=True),
    nn.Dropout(dropout),
    nn.BatchNorm1d(hidden_size),
    nn.Linear(hidden_size, output_size),
)
```
Training was done over 5 folds, each for 25 epochs, with nn.BCEWithLogitsLoss. The model predicted the probabilities of 75 smells.
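The fold loop could look roughly like this (a sketch of the setup described above, not the author's code; toy data and reduced layer sizes stand in for the real fingerprint matrix and MLP):

```python
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# toy stand-ins for the fingerprint matrix and the 75 binary smell labels
X = torch.tensor(rng.integers(0, 2, (200, 64)), dtype=torch.float32)
y = torch.tensor(rng.integers(0, 2, (200, 75)), dtype=torch.float32)

n_folds, n_epochs = 5, 25
folds = np.array_split(rng.permutation(len(X)), n_folds)

oof = torch.zeros_like(y)  # out-of-fold predicted probabilities
for k in range(n_folds):
    val = torch.from_numpy(folds[k])
    trn = torch.from_numpy(np.concatenate([f for j, f in enumerate(folds) if j != k]))
    model = nn.Sequential(  # shrunk version of the MLP above
        nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.3),
        nn.BatchNorm1d(128), nn.Linear(128, 75),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(n_epochs):
        model.train()
        opt.zero_grad()
        loss = loss_fn(model(X[trn]), y[trn])  # full-batch step on this fold
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        oof[val] = torch.sigmoid(model(X[val]))
```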
The last step was to come up with 5 prediction sequences starting from the individual smell probabilities. For this I sampled smell sets using their predicted probabilities, found the sequence with the best Jaccard score, then found the next sequence with the best incremental Jaccard score, and so on.
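In code, the greedy selection might look like this (my interpretation of the sampling-plus-incremental-Jaccard idea; the function names and sample count are made up):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two label sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def greedy_sequences(probs, n_sequences=5, n_samples=200, seed=0):
    """Pick n_sequences label sets greedily: each new set maximizes the mean
    best-of-chosen Jaccard over Monte-Carlo samples drawn from the per-smell
    probabilities (a sketch of the idea, not the author's exact code)."""
    rng = np.random.default_rng(seed)
    samples = [frozenset(np.flatnonzero(rng.random(len(probs)) < probs))
               for _ in range(n_samples)]
    candidates = list(dict.fromkeys(samples))  # unique sampled sets
    best_so_far = np.zeros(n_samples)
    chosen = []
    for _ in range(n_sequences):
        best = None
        for cand in candidates:
            scores = np.array([jaccard(cand, s) for s in samples])
            gain = np.maximum(best_so_far, scores).mean()
            if best is None or gain > best[0]:
                best = (gain, cand, scores)
        chosen.append(sorted(best[1]))
        best_so_far = np.maximum(best_so_far, best[2])
    return chosen

probs = np.array([0.95, 0.9, 0.1, 0.05])
seqs = greedy_sequences(probs, n_sequences=3, n_samples=100)
```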
Bells and whistles. Some of the things that made small improvements:
- label smoothing
- weighting labels for training
- weighting fingerprints based on their estimated importance
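For example, label smoothing for BCE targets can be as simple as shrinking the hard 0/1 labels toward 0.5 (the smoothing amount here is a hypothetical value; the write-up doesn't state the one used):

```python
import torch

def smooth_targets(y, eps=0.1):
    # 0 -> eps/2, 1 -> 1 - eps/2; feed these to BCEWithLogitsLoss
    return y * (1.0 - eps) + 0.5 * eps

smoothed = smooth_targets(torch.tensor([[0.0, 1.0]]))  # tensor([[0.0500, 0.9500]])
```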
Things that didn’t work:
- PCA on features and on labels
- UMAP on features and on labels
- pretraining on 109 labels
- continuous version of the IoU loss instead of BCE for training
- various learning rate schedulers
- dropping fingerprints with high correlation to others
- trying other dropout/learning rate values
Well, when I submitted the MSE=3333333.333 solution, I already knew the answer, so I just added some noise to make it look cute.
But to your point: I looked at the distribution of fractional values and ran a couple of linear regressions. The fractional values in this data had a distinct pattern. Single-stock prices usually get adjusted for splits and dividends, so the distribution of their fractional values didn’t match the pattern. Indices, on the other hand, don’t get adjusted, so I searched for a combination of indices.
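The index-combination search can be illustrated with a least-squares fit: if the mystery series is an exact linear combination of some candidate indices, the fit recovers the weights almost exactly (synthetic data below; this shows the idea, not the actual series used):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical candidate index series
idx_a = rng.uniform(1000, 2000, 50)
idx_b = rng.uniform(3000, 4000, 50)
idx_c = rng.uniform(100, 200, 50)
# mystery series built from two of them
mystery = 0.5 * idx_a + 0.25 * idx_b

A = np.column_stack([idx_a, idx_b, idx_c])
coef, residual, *_ = np.linalg.lstsq(A, mystery, rcond=None)
# coef recovers [0.5, 0.25, 0.0] up to numeric precision
```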
Overall, I think it was a great puzzle and a lot of fun to solve.
My guess is that the colorizers library opens images using PIL, which uses a different channel-ordering convention (RGB) than cv2 (BGR).
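If that's the cause, a one-line channel swap fixes it (a sketch; `img` stands for an array loaded with cv2.imread):

```python
import numpy as np

def bgr_to_rgb(img):
    """cv2 loads images as BGR; PIL-based code expects RGB, so reverse the channels."""
    return img[:, :, ::-1]

img = np.zeros((2, 2, 3), dtype=np.uint8)
img[..., 0] = 255          # blue channel in cv2's BGR convention
rgb = bgr_to_rgb(img)      # blue now sits in the last (RGB) position
```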