
Learning to Smell

Where to start? 5 ways to learn 2 smell!

We have written a notebook that explores 5 ways to attempt this challenge.

By shraddhaa_mohan

Hi everyone!


@rohitmidha23 and I are undergraduate students studying computer science, and we found this challenge a particularly interesting way to explore the applications of ML in chemistry. We have written a notebook that explores 5 ways to attempt this challenge. It includes baselines for

  • ChemBERTa
  • Graph Conv Networks
  • MultiTaskClassifier using Molecular Fingerprints
  • Sklearn Classifiers (Random Forest etc.) using Molecular Fingerprints
  • Chemception (2D representation of molecules)

Check it out @ https://colab.research.google.com/drive/1-RedHEQSAVKUowOx2p-QoKthxayRshUa?usp=sharing

The most difficult task in this challenge is getting good representations of SMILES strings that ML algorithms can understand, and we have tried to give examples of how this has been done in the past for these kinds of tasks.

We hope that this notebook helps out other beginners like ourselves.

As always we are open to any feedback, suggestions and criticism!

If you found our work helpful, do drop us a :heart:!



AICrowd Learning To Smell Challenge

What is the challenge exactly?

This challenge is about predicting the different smells associated with a molecule. The only information we have for predicting the smells is the SMILES string of the molecule. Each molecule is labelled with multiple smells, and there are 109 distinct smells in total.
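Because each molecule carries several smells at once, this is a multi-label problem, and a common way to feed such labels to a model is a multi-hot vector. The following is a minimal sketch, not the notebook's own code: the label strings are copied from the first rows of train.csv, while the six-entry vocabulary and the `multi_hot` helper are made up for illustration (the real vocabulary.txt has 109 entries).

```python
# Label strings copied from the first rows of train.csv.
sentences = [
    "fruity,rose",
    "fresh,ethereal,fruity",
    "resinous,animalic",
]

# Hypothetical mini-vocabulary built from the rows above.
# In the challenge, the full list of 109 smells comes from vocabulary.txt.
vocabulary = sorted({smell for s in sentences for smell in s.split(",")})
index = {smell: i for i, smell in enumerate(vocabulary)}


def multi_hot(sentence, index):
    """Return a 0/1 vector with a 1 for every smell named in `sentence`."""
    vector = [0] * len(index)
    for smell in sentence.split(","):
        vector[index[smell.strip()]] = 1
    return vector


targets = [multi_hot(s, index) for s in sentences]

for sentence, target in zip(sentences, targets):
    print(sentence, "->", target)
```

With the real vocabulary each target would have length 109, and most entries would be zero.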

What is a SMILES string?

SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that represents a chemical structure in a form a computer can process. A SMILES string describes the structure of a chemical species as a short ASCII string.
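To make the "short ASCII string" idea concrete, here is a rough sketch that splits a SMILES string into atom-level tokens with a regular expression. This is an assumption-laden illustration: the pattern covers only a common subset of SMILES syntax, the `tokenize` helper is hypothetical, and it is not the learned tokenizer that ChemBERTa uses.

```python
import re

# Simplified pattern covering a common subset of SMILES syntax (an assumption,
# not a full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [nH]
    r"|Br|Cl"                # two-letter atoms
    r"|[BCNOPSFI]"           # one-letter aliphatic atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}|\d"            # ring-closure labels
    r"|[=#\-+/\\().:@~*$]"   # bonds, branches, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES: " + smiles)
    return tokens


# Two molecules taken from the training data.
print(tokenize("COC(=O)OC"))
print(tokenize("Cc1cc2c([nH]1)cccc2"))
```

Note that the bracket atom `[nH]` stays together as one token, which a plain character split would not do.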

What is the most important task in this challenge?

The most important task here is obtaining a meaningful representation of each SMILES string. There are several ways to do this, and this notebook shows a few routes to a representation that can then be used in an ML pipeline. The approaches discussed here are:

  • Tokenizing SMILES and using ChemBERTa
  • Graph Conv
  • Molecular Fingerprints
  • 2D representation of molecules (Chemception)

Download the Data

In [ ]:
!gdown --id 1t5be8KLHOz3YuSmiiPQjopb4c_q2U4tG
!unzip olfactorydata.zip 
#thanks mmi333
Downloading...
From: https://drive.google.com/uc?id=1t5be8KLHOz3YuSmiiPQjopb4c_q2U4tG
To: /content/olfactorydata.zip
100% 94.3k/94.3k [00:00<00:00, 36.0MB/s]
Archive:  olfactorydata.zip
  inflating: train.csv               
  inflating: test.csv                
  inflating: sample_submission.csv   
  inflating: vocabulary.txt          
In [ ]:
!mkdir data
!mv train.csv data
!mv test.csv data
!mv vocabulary.txt data
!mv sample_submission.csv data

Install Required Libraries

In [ ]:
import sys
import os
import requests
import subprocess
import shutil
from logging import getLogger, StreamHandler, INFO


logger = getLogger(__name__)
logger.addHandler(StreamHandler())
logger.setLevel(INFO)

def install(
        chunk_size=4096,
        file_name="Miniconda3-latest-Linux-x86_64.sh",
        url_base="https://repo.continuum.io/miniconda/",
        conda_path=os.path.expanduser(os.path.join("~", "miniconda")),
        rdkit_version=None,
        add_python_path=True,
        force=False):
    """install rdkit from miniconda
    ```
    import rdkit_installer
    rdkit_installer.install()
    ```
    """

    python_path = os.path.join(
        conda_path,
        "lib",
        "python{0}.{1}".format(*sys.version_info),
        "site-packages",
    )

    if add_python_path and python_path not in sys.path:
        logger.info("add {} to PYTHONPATH".format(python_path))
        sys.path.append(python_path)

    if os.path.isdir(os.path.join(python_path, "rdkit")):
        logger.info("rdkit is already installed")
        if not force:
            return

        logger.info("force re-install")

    url = url_base + file_name
    python_version = "{0}.{1}.{2}".format(*sys.version_info)

    logger.info("python version: {}".format(python_version))

    if os.path.isdir(conda_path):
        logger.warning("remove current miniconda")
        shutil.rmtree(conda_path)
    elif os.path.isfile(conda_path):
        logger.warning("remove {}".format(conda_path))
        os.remove(conda_path)

    logger.info('fetching installer from {}'.format(url))
    res = requests.get(url, stream=True)
    res.raise_for_status()
    with open(file_name, 'wb') as f:
        for chunk in res.iter_content(chunk_size):
            f.write(chunk)
    logger.info('done')

    logger.info('installing miniconda to {}'.format(conda_path))
    subprocess.check_call(["bash", file_name, "-b", "-p", conda_path])
    logger.info('done')

    logger.info("installing rdkit")
    subprocess.check_call([
        os.path.join(conda_path, "bin", "conda"),
        "install",
        "--yes",
        "-c", "rdkit",
        "python=={}".format(python_version),
        "rdkit" if rdkit_version is None else "rdkit=={}".format(rdkit_version)])
    logger.info("done")

    import rdkit
    logger.info("rdkit-{} installation finished!".format(rdkit.__version__))
install()
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit
done
rdkit-2020.09.1 installation finished!
In [ ]:
!pip install -q transformers
!pip install -q simpletransformers
# !pip install wandb   #Uncomment if you want to use wandb
ERROR: google-colab 1.0.0 has requirement ipykernel~=4.10, but you'll have ipykernel 5.3.4 which is incompatible.
ERROR: seqeval 1.2.1 has requirement numpy==1.19.2, but you'll have numpy 1.18.5 which is incompatible.
ERROR: seqeval 1.2.1 has requirement scikit-learn==0.23.2, but you'll have scikit-learn 0.22.2.post1 which is incompatible.
ERROR: botocore 1.19.2 has requirement urllib3<1.26,>=1.25.4; python_version != "3.4", but you'll have urllib3 1.24.3 which is incompatible.

ChemBERTa

ChemBERTa is a collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. We fine-tune this existing model for our application.

First we visualize the attention heads using the bertviz library. We can use this tool to see whether the model in fact understands the SMILES strings it is processing.

We will be using the pretrained tokenizer. If we trained our own tokenizer, the results would probably be better.

I plan on implementing this soon, but I have included a link in the References section of this notebook if you want to have a crack at it.

In [ ]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});
In [ ]:
def call_html():
  # Inject the require.js paths (d3, jquery) that bertviz needs to render in Colab.
  from IPython.display import HTML, display
  display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

Let's load the training data, look at a few molecules that share the same label, and pass them to the pretrained RoBERTa model (trained on the ZINC 250k dataset).

In [ ]:
import pandas as pd
import numpy as np
train_df = pd.read_csv("data/train.csv")
train_df.head()
Out[ ]:
   SMILES                              SENTENCE
0  C/C=C/C(=O)C1CCC(C=C1C)(C)C         fruity,rose
1  COC(=O)OC                           fresh,ethereal,fruity
2  Cc1cc2c([nH]1)cccc2                 resinous,animalic
3  C1CCCCCCCC(=O)CCCCCCC1              powdery,musk,animalic
4  CC(CC(=O)OC1CC2C(C1(C)CC2)(C)C)C    coniferous,camphor,fruity
In [ ]:
train_df.loc[train_df["SENTENCE"]=="resinous,animalic"]
Out[ ]:
      SMILES                 SENTENCE
2     Cc1cc2c([nH]1)cccc2    resinous,animalic
1108  Cc1nc2c(o1)cccc2       resinous,animalic
3183  Cc1ccc2c(n1)cccc2      resinous,animalic
In [ ]:
import torch
import rdkit
import rdkit.Chem as Chem
from rdkit.Chem import rdFMCS
from matplotlib import colors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import MolToImage
m = Chem.MolFromSmiles('Cc1nc2c(o1)cccc2')
fig = Draw.MolToMPL(m, size=(200, 200))
In [ ]:
m = Chem.MolFromSmiles('Cc1ccc2c(n1)cccc2')
fig = Draw.MolToMPL(m, size=(200,200))
In [ ]:
!git clone https://github.com/jessevig/bertviz.git
In [ ]:
import sys
sys.path.append("bertviz")
In [ ]:
from transformers import RobertaModel, RobertaTokenizer
from bertviz import head_view


model_version = 'seyonec/ChemBERTa_zinc250k_v2_40k'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

sentence_a = "Cc1cc2c([nH]1)cccc2"
sentence_b = "Cc1ccc2c(n1)cccc2"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
attention = model(input_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

call_html()

head_view(attention, tokens)
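When reading the attention view, it helps to know how the two SMILES strings are laid out in a single input. RoBERTa-style tokenizers encode a pair as `<s> A </s></s> B </s>`. The sketch below only mimics that layout in plain Python: `roberta_pair_layout` is a hypothetical helper, and the token lists are hand-written placeholders rather than real tokenizer output.

```python
def roberta_pair_layout(tokens_a, tokens_b, cls="<s>", sep="</s>"):
    """Mimic how a RoBERTa-style tokenizer joins two token sequences."""
    return [cls] + tokens_a + [sep, sep] + tokens_b + [sep]


# Hand-written placeholder tokens (not what the ChemBERTa tokenizer produces).
tokens_a = ["C", "c", "1"]
tokens_b = ["C", "n", "1"]

layout = roberta_pair_layout(tokens_a, tokens_b)
print(layout)
```

Attention between tokens on either side of the doubled `</s>` is attention across the two molecules, which is the part of most interest when comparing molecules that share a label.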