
Emotion Detection

[Getting Started Code] Emotion Detection using spaCy

In this first challenge, we are going to learn the fundamentals of Natural Language Processing.

By Shubhamaicrowd


Starter Code for Emotion Detection

What we are going to Learn

  • Basics of Natural Language Processing
  • Using a very popular & powerful Python library called spaCy for natural language processing, to see how we can preprocess our texts with spaCy and convert them into numbers.
  • Using the Decision Tree Classifier from sklearn to train, validate & test the model for text classification.
  • Testing and submitting the results to the challenge.

Natural Language Processing 🗣️

Natural Language Processing (or NLP for short) allows machines to understand human languages and perform certain kinds of tasks, such as classification. For example:

  • Gmail - Classifying emails as Spam/Not Spam.

  • GPT-3 (a powerful language model by OpenAI) - Generating blogs so good that even humans couldn't accurately tell whether the blog was written by a machine or a human 🤯

and tons of others....

Now in this challenge, we are going to learn a very basic task in Natural Language Processing, which is Text Classification. So let's begin!

Downloading Dataset

AIcrowd recently added a feature that allows you to directly download the dataset of any challenge using the AIcrowd CLI.

So we will first need to install the Python library by AIcrowd that will allow us to download the dataset by just inputting the API key.

In [ ]:
!pip install aicrowd-cli
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.7-py3-none-any.whl (49 kB)
  ... (dependency resolution output trimmed) ...
Successfully installed aicrowd-cli-0.1.7 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.7 gitpython-3.1.17 requests-2.25.1 requests-toolbelt-0.9.1 rich-10.2.2 smmap-4.0.0 tqdm-4.61.0
In [ ]:
API_KEY = 'YOUR_API_KEY' # Please get your API Key from [https://www.aicrowd.com/participants/me]
!aicrowd login --api-key $API_KEY
API Key valid
Saved API Key successfully!
In [ ]:
# Downloading the Dataset
!mkdir data
!aicrowd dataset download --challenge emotion-detection -j 3 -o data
val.csv: 100% 262k/262k [00:00<00:00, 1.41MB/s]
test.csv: 100% 642k/642k [00:00<00:00, 2.01MB/s]
train.csv: 100% 2.30M/2.30M [00:00<00:00, 4.96MB/s]

Downloading & Importing Libraries

Here we are going to use spaCy for our text classification task. spaCy is a very popular Python library for Natural Language Processing, and it has one of the most beautiful documentations I have ever seen 😇. They also offer the Advanced NLP with spaCy course if anyone wants to check it out. But back to the topic.

We are also downloading a Python file, explacy.py, from tylerneylon/explacy, which will help us visualize some NLP concepts.

In [ ]:
!pip install --upgrade spacy rich
!python -m spacy download en_core_web_sm # Downloading the model for the English language, which contains many pretrained preprocessing pipelines
Collecting spacy
  Downloading spacy-3.0.6-cp37-cp37m-manylinux2014_x86_64.whl (12.8 MB)
  ... (dependency resolution output trimmed) ...
Successfully installed catalogue-2.0.4 pathy-0.5.2 pydantic-1.7.4 smart-open-3.0.0 spacy-3.0.6 spacy-legacy-3.0.5 srsly-2.4.1 thinc-8.0.3 typer-0.3.2
Collecting en-core-web-sm==3.0.0
  Downloading en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
  ... (dependency resolution output trimmed) ...
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
In [ ]:
!wget https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
--2021-06-05 07:15:57--  https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6896 (6.7K) [text/plain]
Saving to: 'explacy.py'

explacy.py          100%[===================>]   6.73K  --.-KB/s    in 0s      

2021-06-05 07:15:58 (97.6 MB/s) - 'explacy.py' saved [6896/6896]

In [ ]:
import pandas as pd
import spacy
import explacy
import random
from sklearn import tree
from sklearn.metrics import f1_score, accuracy_score
import os

# To make things more beautiful! 
from rich.console import Console
from rich.table import Table
from rich import pretty
pretty.install()


# Seeding everything to get reproducible results
random.seed(1)
spacy.util.fix_random_seed(1)


# function to display YouTube videos
from IPython.display import YouTubeVideo
In [ ]:
# spaCy v3.0 is the latest version of spaCy
spacy.__version__
Out[ ]:
'3.0.6'

Reading Dataset

Reading the necessary files to train, validate & submit our results!

In [ ]:
train_dataset = pd.read_csv("data/train.csv")
validation_dataset = pd.read_csv("data/val.csv")[1:] # [1:] drops the first row of the validation set
test_dataset = pd.read_csv("data/test.csv")
train_dataset
Out[ ]:
text label
0 takes no time to copy/paste a press release 0
1 You're delusional 1
2 Jazz fan here. I completely feel. Lindsay Mann... 0
3 ah i was also confused but i think they mean f... 0
4 Thank you so much. ♥️ that means a lot. 0
... ... ...
31250 thank you so much! :) 0
31251 That works too. To each their own. 0
31252 Friendly fire dude, I wanted the other criminal 0
31253 Yes, exactly. Fix a date and if he still procr... 0
31254 Ferrets are such good ESA's though! Good for y... 0

31255 rows ร— 2 columns
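
As an optional quick check (plain pandas, not part of the original starter flow), it's worth glancing at how balanced the labels are before training; class imbalance will matter when we interpret the scores later.

In [ ]:
# Optional: inspect the class balance of the training labels
train_dataset['label'].value_counts()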

Text Classification 🧠

In this section, we are going to train a text classifier and do some validation tests.

Word2Vec

Computers don't understand text, they only understand numbers. There are many ways to convert a text into numbers, but here we are going to use word2vec from spaCy.


So how does word2vec actually convert text into numbers? word2vec uses techniques like Skip-gram or CBOW. It learns relations between words by training on a very large amount of text (a corpus).

The output is a text embedding, which is a 1D array.

If you want to learn more about word2vec, I would suggest watching the YouTube video by Computerphile below and reading Illustrated Word2vec by Jay Alammar.
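
To make the "1D array" part concrete, here is a minimal sketch (assuming the en_core_web_sm pipeline installed above) of how spaCy's document vector relates to its token vectors: by default, doc.vector is the average of the individual token vectors.

In [ ]:
import numpy as np
import spacy

nlp_demo = spacy.load('en_core_web_sm')
doc_demo = nlp_demo('I completely feel')

# Each token has its own vector; by default doc.vector averages them,
# so every text maps to one fixed-length 1D array
token_vectors = np.array([token.vector for token in doc_demo])
print(doc_demo.vector.shape)
print(np.allclose(doc_demo.vector, token_vectors.mean(axis=0)))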

In [ ]:
# If you want to learn more about word2vec, check out this YouTube video by Computerphile
YouTubeVideo('gQddtTdmG_8')
Out[ ]:
In [ ]:
nlp = spacy.load('en_core_web_sm')

So, what did we just do here?

  • The spacy.load function loads a pipeline that contains pretty much everything you need for text preprocessing. We are going to unpack this more in the upcoming cells.

  • en is the language of your dataset; spaCy supports many other languages.
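
Now that the pipeline is loaded, we can also try the explacy.py helper we downloaded earlier. A quick illustrative example (this assumes explacy exposes print_parse_info, as in the upstream tylerneylon/explacy repo):

In [ ]:
# Visualizing the dependency parse of a sample sentence with explacy
# (assumes the print_parse_info helper from tylerneylon/explacy)
explacy.print_parse_info(nlp, 'The salad was surprisingly tasty.')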

In [ ]:
# Getting a sample text from training dataset to demonstrate word2vec  
sample_text = train_dataset.iloc[2]['text'] 
sample_text
"Jazz fan here. I completely feel. Lindsay Mann cousins has more votes than Lindsay Mann, and Lindsay Mann hasn't even stepped on the court this year"
Out[ ]:
In [ ]:
# Passing the text through the nlp pipeline
doc = nlp(sample_text)
In [ ]:
# Getting the embeddings from the sample text
doc.vector
array([ 0.5871044 ,  0.10283045,  0.23638554, -0.08171239,  0.02029083,
       -0.12000011, -0.1948305 ,  0.1250714 , -0.01082261, -0.34358275,
       -0.16639529, -0.04950966, -0.01394221,  0.06337869, -0.30135745,
        0.22211872, -0.17156254,  0.03178046,  0.30427024, -0.10826215,
       -0.25342506,  0.30617625, -0.17000276,  0.35598457, -0.00835321,
       -0.11478721, -0.1430562 ,  0.02518663,  0.60922873,  0.19284171,
       -0.23238468, -0.27463096, -0.13183063, -0.27534658,  0.25664884,
        0.05174499,  0.18620381,  0.11441176, -0.10955156,  0.29338667,
       -0.15877348, -0.02914245,  0.1963947 , -0.04410601,  0.12061837,
       -0.0941467 ,  0.27903876, -0.09223508, -0.00497099, -0.25587952,
        0.21098505,  0.01725493, -0.29827487,  0.0894304 ,  0.14340732,
       -0.0376591 , -0.3396481 ,  0.19914041,  0.28556582,  0.18212257,
        0.5140986 ,  0.02056837, -0.18578346, -0.28987882, -0.16651031,
       -0.10539112, -0.05578137,  0.00634063,  0.02737209,  0.14916842,
        0.15076284,  0.31409967, -0.06142968, -0.13555318, -0.08603293,
        0.40901124, -0.07265005, -0.19719984, -0.349496  ,  0.11685906,
        0.20542377,  0.1133521 ,  0.13061962,  0.2739835 , -0.00384022,
       -0.21771309, -0.28375924, -0.41814512, -0.42588463, -0.06813539,
       -0.27145588,  0.17521115, -0.15065633, -0.05529505,  0.06760314,
       -0.1013436 ], dtype=float32)
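
One thing worth noticing: every text, no matter how long, is embedded into a vector of the same fixed size (96 dimensions for this particular pipeline; other pipelines use different widths). That fixed size is what lets us feed the vectors into a standard sklearn classifier below.

In [ ]:
# The embedding has a fixed length regardless of the number of tokens
# (96-d for this pipeline)
doc.vector.shape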

Creating our Dataset

OK, now we are getting started. In this section, we are going to write a function that converts the dataset into the right format, so that we can feed it directly into the machine learning model.

In [ ]:
def create_data(dataset, is_train=True):

  # Getting all text into a python list
  texts = list(dataset['text'].values)

  # Putting the list through the nlp pipeline and converting the output into a list
  preprocessed_texts = list(nlp.pipe(texts))

  # Getting vectors for all texts
  X = [doc.vector for doc in preprocessed_texts]

  # For training/validation data we also return the labels for the corresponding texts
  if is_train:
    y = dataset['label'].tolist()
    return X, y

  return X
In [ ]:
# Creating the training dataset
X_train, y_train = create_data(train_dataset)

# Creating the validation dataset
X_val, y_val = create_data(validation_dataset)

X_train[0], y_train[0]
(
    array([ 0.14151856, -0.05933758, -0.08044346, -0.1107378 ,  0.11325426,
        0.15893325, -0.44572473,  0.21314225,  0.07863858, -0.12423076,
        0.08870672, -0.16083065,  0.03217425, -0.18913928, -0.43932933,
        0.614143  , -0.04348904,  0.15507331, -0.10762676, -0.6841317 ,
       -0.52705824,  0.19307005,  0.5150874 , -0.78364027, -0.5798768 ,
       -0.36855024,  0.261396  ,  0.08007441,  0.13464396, -0.11683945,
       -0.40335947, -0.4810744 , -0.1996509 , -0.40538788,  0.907061  ,
       -0.08425845,  0.41184407, -0.05256712, -0.01928839,  0.67991185,
        0.18288395, -0.00932413,  0.15208176,  0.5218999 , -0.21960959,
       -0.0870916 ,  0.0571377 ,  0.39526838,  0.11505216,  0.03575191,
        0.18940884,  0.35989684, -0.03341857,  0.50211823,  0.25563306,
       -0.45388022,  0.04130013,  0.1111506 ,  0.11391255,  0.12924205,
       -0.44648582, -0.04701398, -0.12538372,  0.11663225,  0.3620197 ,
       -0.00661604, -0.25024778, -0.3998903 , -0.07583453,  0.747198  ,
        0.5959583 , -0.17592818, -0.04306039, -0.52941674, -0.12472894,
       -0.43728942, -0.06499843,  0.0685156 ,  0.23714057, -0.20262413,
        0.41856593,  0.01466806, -0.19088936,  0.36778176,  0.09763627,
       -0.3632294 , -0.15831968, -0.43174067, -0.08282916, -0.09869532,
        0.14616643,  0.02205803, -0.33136448,  0.19892709, -0.50562334,
       -0.14163868], dtype=float32),
    0
)

Creating the Model

Now we are getting close. Here we are using the Decision Tree Classifier from sklearn (a popular machine learning library) to classify our texts (as vectors) into the 2 labels.

In [ ]:
clf = tree.DecisionTreeClassifier()
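
The default DecisionTreeClassifier grows a full, unpruned tree, which can easily overfit 96-dimensional embeddings. If you want to experiment, an untuned, illustrative variation is to cap the depth and fix the seed (both are standard sklearn parameters); uncomment to swap it in before training:

In [ ]:
# Optional, untuned alternative: capping max_depth limits overfitting and
# random_state makes runs reproducible (illustrative values, not tuned)
# clf = tree.DecisionTreeClassifier(max_depth=10, random_state=1)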

Training

And there we go! It's finally the time to start the training!

In [ ]:
clf = clf.fit(X_train, y_train)

Validation

Now that we have trained the model, let's see the results on the unseen validation dataset.

In [ ]:
y_pred = clf.predict(X_val)
In [ ]:
# Getting F1 & Accuracy score of validation predictions
f1 = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)

print(f"Validation F1 Score  : {f1} and Accuracy Score {accuracy}")
Validation F1 Score  : 0.2493333333333333 and Accuracy Score 0.6756912442396313
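
The F1 score being much lower than the accuracy hints that the classes are imbalanced. As an optional follow-up (standard sklearn, not part of the original starter flow), a confusion matrix shows where the model errs:

In [ ]:
# Rows are true labels, columns are predictions; a lopsided matrix points
# at class imbalance or a model biased toward the majority class
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_val, y_pred))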

Submitting Results 📄

Okay, this is the last section 😌. Let's get our test predictions from the model real quick and submit them directly using the AIcrowd CLI.

In [ ]:
# By setting is_train=False, the create_data function will only output the features, as set up in the function
test_data = create_data(test_dataset, is_train=False)

test_predictions = clf.predict(test_data)
In [ ]:
# Applying the predictions to the label column of the sample submission
test_dataset['label'] = test_predictions
test_dataset
Out[ ]:
text label
0 I was already over the edge with Cassie Zamora... 0
1 I think you're right. She has oodles of cash a... 1
2 Haha I love this. I used to give mine phone bo... 0
3 Probably out of desperation as they going no a... 1
4 Sorry !! You're real good at that!! 0
... ... ...
8677 Yeah no...I would find it very demeaning 0
8678 This is how mafia works 1
8679 Ah thanks 👍🏻 0
8680 I ask them straight why they don't respect my ... 1
8681 Annette Acosta also tends to out vote Annette ... 0

8682 rows ร— 2 columns

Note: Please make sure there is a file named submission.csv in the assets folder before submitting.

In [ ]:
!mkdir assets

# Saving the sample submission in assets directory
test_dataset.to_csv(os.path.join("assets", "submission.csv"), index=False)
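
Optionally, a tiny sanity check (just os.path and pandas, nothing challenge-specific) can catch a missing or malformed file before you submit:

In [ ]:
# Sanity check: the submission file exists and has the expected columns
submission_path = os.path.join("assets", "submission.csv")
assert os.path.exists(submission_path)
assert list(pd.read_csv(submission_path).columns) == ['text', 'label']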

Uploading the Results

Note: Please save the notebook before submitting it (Ctrl + S)

In [ ]:
!aicrowd notebook submit -c emotion-detection -a assets --no-verify
Mounting Google Drive 💾
Your Google Drive will be mounted to access the colab notebook
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.activity.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fexperimentsandconfigs%20https%3a%2f%2fwww.googleapis.com%2fauth%2fphotos.native&response_type=code

Enter your authorization code:
4/1AY0e-g78IJym0zBTHET4azZvVKa9yNL2P0u1Pt1xZsyGD-pRwS0uciy0uH4
Mounted at /content/drive
Using notebook: /content/drive/MyDrive/Colab Notebooks/Copy of Text Classification for submission...
Scrubbing API keys from the notebook...
Collecting notebook...
submission.zip ━━━━━━━━━━━━━━━━━━ 100.0% • 460.1/458.4 KB • 735.1 kB/s • 0:00:00

Successfully submitted!

Important links
This submission:  https://www.aicrowd.com/challenges/ai-blitz-9/problems/emotion-detection/submissions/144546
All submissions:  https://www.aicrowd.com/challenges/ai-blitz-9/problems/emotion-detection/submissions?my_submissions=true
Leaderboard:      https://www.aicrowd.com/challenges/ai-blitz-9/problems/emotion-detection/leaderboards
Discussion forum: https://discourse.aicrowd.com/c/ai-blitz-9
Challenge page:   https://www.aicrowd.com/challenges/ai-blitz-9/problems/emotion-detection

Congratulations 🎉 you did it! But there is still a lot of improvement that can be made. Data exploration is one of the most important steps in machine learning, especially in competitions, so maybe check whether there is data imbalance and how to minimize its effects, or take a look at the first few rows of each dataset. Or maybe try improving the score. Have fun!

And btw -

Don't be shy to ask questions related to any errors you are getting, or doubts about any part of this notebook, in the discussion forum or on the AIcrowd Discord server. The AIcrew will be happy to help you :)

Also, wanna give us your valuable feedback for the next blitz, or wanna work with us on creating blitz challenges? Let us know!
