
Emotion Detection

[In Depth Code] Emotion Detection using spaCy

This is a more in-depth code walkthrough of Emotion Detection for Blitz 9.

By Shubhamaicrowd



In-Depth Code for Emotion Detection

What we are going to Learn

  • Basics of Natural Language Processing
  • Using a very popular & powerful Python library called spaCy for natural language processing, to see how we can preprocess our texts with spaCy and convert them into numbers.
  • Using Decision Tree Classifier from sklearn to train, validate & test the model for text classification.
  • Testing and Submitting the Results to the Challenge.

Natural Language Processing 🗣️

Now Natural Language Processing (or NLP for short) allows machines to understand human languages and perform certain kinds of tasks, such as classification. For example --

  • Gmail - Classifying emails as Spam/Not Spam.

  • GPT-3 (a powerful language model by OpenAI) - Generating blogs so good that even humans couldn't accurately tell whether the blog was written by a machine or a human 🤯

and tons of others....

Now in this challenge, we are going to learn a very basic task in Natural Language Processing: Text Classification. So let's begin!

Downloading Dataset

AIcrowd recently added the ability to download the dataset from any challenge directly using the AIcrowd CLI.

So we will first need to install the Python library by AIcrowd that will allow us to download the dataset by just inputting the API key.

In [ ]:
!pip install aicrowd-cli
Collecting aicrowd-cli
  Downloading https://files.pythonhosted.org/packages/1f/57/59b5a00c6e90c9cc028b3da9dff90e242ad2847e735b1a0e81a21c616e27/aicrowd_cli-0.1.7-py3-none-any.whl (49kB)
     |████████████████████████████████| 51kB 1.4MB/s
Collecting gitpython<4,>=3.1.12
  Downloading https://files.pythonhosted.org/packages/27/da/6f6224fdfc47dab57881fe20c0d1bc3122be290198ba0bf26a953a045d92/GitPython-3.1.17-py3-none-any.whl (166kB)
     |████████████████████████████████| 174kB 3.6MB/s
Collecting requests-toolbelt<1,>=0.9.1
  Downloading https://files.pythonhosted.org/packages/60/ef/7681134338fc097acef8d9b2f8abe0458e4d87559c689a8c306d0957ece5/requests_toolbelt-0.9.1-py2.py3-none-any.whl (54kB)
     |████████████████████████████████| 61kB 6.2MB/s
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Collecting rich<11,>=10.0.0
  Downloading https://files.pythonhosted.org/packages/6b/39/fbe8d15f0b017d63701f2a42e4ccb9a73cd4175e5c56214c1b5685e3dd79/rich-10.2.2-py3-none-any.whl (203kB)
     |████████████████████████████████| 204kB 17.7MB/s
Collecting requests<3,>=2.25.1
  Downloading https://files.pythonhosted.org/packages/29/c1/24814557f1d22c56d50280771a17307e6bf87b70727d975fd6b2ce6b014a/requests-2.25.1-py2.py3-none-any.whl (61kB)
     |████████████████████████████████| 61kB 6.3MB/s
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Collecting tqdm<5,>=4.56.0
  Downloading https://files.pythonhosted.org/packages/42/d7/f357d98e9b50346bcb6095fe3ad205d8db3174eb5edb03edfe7c4099576d/tqdm-4.61.0-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 7.6MB/s
Requirement already satisfied: typing-extensions>=3.7.4.0; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from gitpython<4,>=3.1.12->aicrowd-cli) (3.7.4.3)
Collecting gitdb<5,>=4.0.1
  Downloading https://files.pythonhosted.org/packages/ea/e8/f414d1a4f0bbc668ed441f74f44c116d9816833a48bf81d22b697090dba8/gitdb-4.0.7-py3-none-any.whl (63kB)
     |████████████████████████████████| 71kB 7.8MB/s
Collecting commonmark<0.10.0,>=0.9.0
  Downloading https://files.pythonhosted.org/packages/b1/92/dfd892312d822f36c55366118b95d914e5f16de11044a27cf10a7d71bbbf/commonmark-0.9.1-py2.py3-none-any.whl (51kB)
     |████████████████████████████████| 51kB 5.3MB/s
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting colorama<0.5.0,>=0.4.0
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2020.12.5)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (3.0.4)
Collecting smmap<5,>=3.0.1
  Downloading https://files.pythonhosted.org/packages/68/ee/d540eb5e5996eb81c26ceffac6ee49041d473bc5125f2aa995cf51ec1cf1/smmap-4.0.0-py2.py3-none-any.whl
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: smmap, gitdb, gitpython, requests, requests-toolbelt, commonmark, colorama, rich, tqdm, aicrowd-cli
  Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Found existing installation: tqdm 4.41.1
    Uninstalling tqdm-4.41.1:
      Successfully uninstalled tqdm-4.41.1
Successfully installed aicrowd-cli-0.1.7 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.7 gitpython-3.1.17 requests-2.25.1 requests-toolbelt-0.9.1 rich-10.2.2 smmap-4.0.0 tqdm-4.61.0
In [ ]:
API_KEY = 'YOUR_API_KEY' # Please get your API Key from [https://www.aicrowd.com/participants/me]
!aicrowd login --api-key $API_KEY
API Key valid
Saved API Key successfully!
In [ ]:
# Downloading the Dataset
!mkdir data
!aicrowd dataset download --challenge emotion-detection -j 3 -o data
val.csv:   0% 0.00/262k [00:00<?, ?B/s]
test.csv:   0% 0.00/642k [00:00<?, ?B/s]

val.csv: 100% 262k/262k [00:00<00:00, 366kB/s]

test.csv: 100% 642k/642k [00:00<00:00, 748kB/s]


train.csv: 100% 2.30M/2.30M [00:01<00:00, 1.85MB/s]

Downloading & Importing Libraries

Here we are going to use spaCy for our text classification task. spaCy is a very popular Python library for Natural Language Processing, and it has one of the most beautiful documentations I have ever seen 😇. They also have the Advanced NLP with spaCy course if anyone wants to check it out. But back to the topic.

We are also downloading a Python file, explacy.py, from tylerneylon/explacy, which will help us visualize some NLP concepts.

In [ ]:
!pip install --upgrade spacy rich
!python -m spacy download en_core_web_sm # Downloading the model for the English language, which contains many pretrained preprocessing pipelines
Collecting spacy
  Downloading https://files.pythonhosted.org/packages/1b/d8/0361bbaf7a1ff56b44dca04dace54c82d63dad7475b7d25ea1baefafafb2/spacy-3.0.6-cp37-cp37m-manylinux2014_x86_64.whl (12.8MB)
     |████████████████████████████████| 12.8MB 9.6MB/s
Requirement already up-to-date: rich in /usr/local/lib/python3.7/dist-packages (10.2.2)
Requirement already satisfied, skipping upgrade: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.0.5)
Requirement already satisfied, skipping upgrade: wasabi<1.1.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from spacy) (0.8.2)
Collecting typer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/90/34/d138832f6945432c638f32137e6c79a3b682f06a63c488dcfaca6b166c64/typer-0.3.2-py3-none-any.whl
Collecting srsly<3.0.0,>=2.4.1
  Downloading https://files.pythonhosted.org/packages/c3/84/dfdfc9f6f04f6b88207d96d9520b911e5fec0c67ff47a0dea31ab5429a1e/srsly-2.4.1-cp37-cp37m-manylinux2014_x86_64.whl (456kB)
     |████████████████████████████████| 460kB 38.3MB/s
Collecting thinc<8.1.0,>=8.0.3
  Downloading https://files.pythonhosted.org/packages/61/87/decceba68a0c6ca356ddcb6aea8b2500e71d9bc187f148aae19b747b7d3c/thinc-8.0.3-cp37-cp37m-manylinux2014_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 40.0MB/s
Requirement already satisfied, skipping upgrade: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (0.4.1)
Collecting catalogue<2.1.0,>=2.0.3
  Downloading https://files.pythonhosted.org/packages/9c/10/dbc1203a4b1367c7b02fddf08cb2981d9aa3e688d398f587cea0ab9e3bec/catalogue-2.0.4-py3-none-any.whl
Collecting spacy-legacy<3.1.0,>=3.0.4
  Downloading https://files.pythonhosted.org/packages/8d/67/d4002a18e26bf29b17ab563ddb55232b445ab6a02f97bf17d1345ff34d3f/spacy_legacy-3.0.5-py2.py3-none-any.whl
Requirement already satisfied, skipping upgrade: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (3.0.5)
Requirement already satisfied, skipping upgrade: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (20.9)
Requirement already satisfied, skipping upgrade: typing-extensions<4.0.0.0,>=3.7.4; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from spacy) (3.7.4.3)
Requirement already satisfied, skipping upgrade: jinja2 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.11.3)
Collecting pydantic<1.8.0,>=1.7.1
  Downloading https://files.pythonhosted.org/packages/ca/fa/d43f31874e1f2a9633e4c025be310f2ce7a8350017579e9e837a62630a7e/pydantic-1.7.4-cp37-cp37m-manylinux2014_x86_64.whl (9.1MB)
     |████████████████████████████████| 9.1MB 34.4MB/s
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy) (57.0.0)
Requirement already satisfied, skipping upgrade: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.19.5)
Requirement already satisfied, skipping upgrade: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.25.1)
Requirement already satisfied, skipping upgrade: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (4.61.0)
Requirement already satisfied, skipping upgrade: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.0.5)
Collecting pathy>=0.3.5
  Downloading https://files.pythonhosted.org/packages/13/87/5991d87be8ed60beb172b4062dbafef18b32fa559635a8e2b633c2974f85/pathy-0.5.2-py3-none-any.whl (42kB)
     |████████████████████████████████| 51kB 5.8MB/s
Requirement already satisfied, skipping upgrade: colorama<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from rich) (0.4.4)
Requirement already satisfied, skipping upgrade: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich) (2.6.1)
Requirement already satisfied, skipping upgrade: commonmark<0.10.0,>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from rich) (0.9.1)
Requirement already satisfied, skipping upgrade: click<7.2.0,>=7.1.1 in /usr/local/lib/python3.7/dist-packages (from typer<0.4.0,>=0.3.0->spacy) (7.1.2)
Requirement already satisfied, skipping upgrade: zipp>=0.5; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from catalogue<2.1.0,>=2.0.3->spacy) (3.4.1)
Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->spacy) (2.4.7)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->spacy) (2.0.1)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)
Requirement already satisfied, skipping upgrade: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)
Requirement already satisfied, skipping upgrade: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.3)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Collecting smart-open<4.0.0,>=2.2.0
  Downloading https://files.pythonhosted.org/packages/11/9a/ba2d5f67f25e8d5bbf2fcec7a99b1e38428e83cb715f64dd179ca43a11bb/smart_open-3.0.0.tar.gz (113kB)
     |████████████████████████████████| 122kB 41.4MB/s
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Created wheel for smart-open: filename=smart_open-3.0.0-cp37-none-any.whl size=107107 sha256=7ce7240d2abbbe2f505b49e323cb71ef5bbb18b2de9392df3b4f702126ff9062
  Stored in directory: /root/.cache/pip/wheels/18/88/7c/f06dabd5e9cabe02d2269167bcacbbf9b47d0c0ff7d6ebcb78
Successfully built smart-open
Installing collected packages: typer, catalogue, srsly, pydantic, thinc, spacy-legacy, smart-open, pathy, spacy
  Found existing installation: catalogue 1.0.0
    Uninstalling catalogue-1.0.0:
      Successfully uninstalled catalogue-1.0.0
  Found existing installation: srsly 1.0.5
    Uninstalling srsly-1.0.5:
      Successfully uninstalled srsly-1.0.5
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: smart-open 5.0.0
    Uninstalling smart-open-5.0.0:
      Successfully uninstalled smart-open-5.0.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed catalogue-2.0.4 pathy-0.5.2 pydantic-1.7.4 smart-open-3.0.0 spacy-3.0.6 spacy-legacy-3.0.5 srsly-2.4.1 thinc-8.0.3 typer-0.3.2
2021-06-05 07:15:51.865990: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7MB)
     |████████████████████████████████| 13.7MB 304kB/s
Requirement already satisfied: spacy<3.1.0,>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from en-core-web-sm==3.0.0) (3.0.6)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (3.7.4.3)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (0.8.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.11.3)
Requirement already satisfied: typer<0.4.0,>=0.3.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (0.3.2)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (3.0.5)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.25.1)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (20.9)
Requirement already satisfied: pathy>=0.3.5 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (0.5.2)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (0.4.1)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.0.5)
Requirement already satisfied: pydantic<1.8.0,>=1.7.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (1.7.4)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (1.19.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (1.0.5)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (4.61.0)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.4.1)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (3.0.5)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (57.0.0)
Requirement already satisfied: catalogue<2.1.0,>=2.0.3 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.0.4)
Requirement already satisfied: thinc<8.1.0,>=8.0.3 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (8.0.3)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.0.1)
Requirement already satisfied: click<7.2.0,>=7.1.1 in /usr/local/lib/python3.7/dist-packages (from typer<0.4.0,>=0.3.0->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (7.1.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2020.12.5)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (2.4.7)
Requirement already satisfied: smart-open<4.0.0,>=2.2.0 in /usr/local/lib/python3.7/dist-packages (from pathy>=0.3.5->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (3.0.0)
Requirement already satisfied: zipp>=0.5; python_version < "3.8" in /usr/local/lib/python3.7/dist-packages (from catalogue<2.1.0,>=2.0.3->spacy<3.1.0,>=3.0.0->en-core-web-sm==3.0.0) (3.4.1)
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
In [ ]:
!wget https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
--2021-06-05 07:16:00--  https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6896 (6.7K) [text/plain]
Saving to: ‘explacy.py’

explacy.py          100%[===================>]   6.73K  --.-KB/s    in 0s      

2021-06-05 07:16:00 (71.0 MB/s) - ‘explacy.py’ saved [6896/6896]

In [ ]:
# Importing Libraries
import pandas as pd
import spacy
import explacy
import random
from sklearn import tree
from sklearn.metrics import f1_score, accuracy_score
import os

# To make things more beautiful! 
from rich.console import Console
from rich.table import Table
from rich import pretty
pretty.install()


# Seeding everything to get reproducible results
random.seed(1)
spacy.util.fix_random_seed(1)


# function to display YouTube videos
from IPython.display import YouTubeVideo
In [ ]:
# spaCy v3.0 is the latest version of spaCy
spacy.__version__
Out[ ]:
'3.0.6'

Reading Dataset

Reading the necessary files to train, validate & submit our results!

In [ ]:
train_dataset = pd.read_csv("data/train.csv")
validation_dataset = pd.read_csv("data/val.csv")
test_dataset = pd.read_csv("data/test.csv")
train_dataset
Out[ ]:
text label
0 takes no time to copy/paste a press release 0
1 You're delusional 1
2 Jazz fan here. I completely feel. Lindsay Mann... 0
3 ah i was also confused but i think they mean f... 0
4 Thank you so much. ♥️ that means a lot. 0
... ... ...
31250 thank you so much! :) 0
31251 That works too. To each their own. 0
31252 Friendly fire dude, I wanted the other criminal 0
31253 Yes, exactly. Fix a date and if he still procr... 0
31254 Ferrets are such good ESA's though! Good for y... 0

31255 rows × 2 columns

Text Preprocessing 🏭

Now, computers just can't understand raw text; texts need to be converted into numbers for computers to understand them. And before converting to numbers, we also need to clean our texts: removing unnecessary letters, punctuation, special characters and much more. So let's see in more detail what I mean.
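As a tiny taste of the "convert texts into numbers" idea: internally, spaCy interns every string it sees exactly once and refers to it by a 64-bit hash. Below is a minimal sketch of that mapping (the variable name nlp_sketch is ours, just to avoid clashing with the pipeline we load in the next cell; the printed hashes are simply large integers):

In [ ]:
import spacy

nlp_sketch = spacy.load('en_core_web_sm')

# Every string is stored once in the vocabulary's StringStore and mapped to a hash
for token in nlp_sketch("friends around the same age"):
    print(token.text, '->', nlp_sketch.vocab.strings[token.text])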


In [ ]:
nlp = spacy.load('en_core_web_sm')

So, what did we just do here?

  • The spacy.load function loads a trained pipeline that contains pretty much everything you need for text preprocessing. We will go into more detail in the upcoming cells.

  • en stands for English, the language of our dataset; spaCy supports many other languages.

In [ ]:
# Getting a sample text from the training dataset to demonstrate the underlying processes in the nlp object
sample_text = train_dataset.iloc[3]['text'] 
sample_text
Out[ ]:
'ah i was also confused but i think they mean friends around the same age'
In [ ]:
# The different preprocessing pipeline components in spaCy that take our text and apply some preprocessing to it. Let's talk about a few of them.
nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

Tokenization

The first step the nlp object performs after taking the text is, well, tokenization 😄. So what does tokenization mean, and why is it important?

In simple terms, tokenization --

  • Splits our sentences into individual words/subwords that can't be divided any further, while the words/subwords still carry meaning.

  • Uses rule-based methods to --

    • Fine-tune and split the tokens even further where possible (see the quick sketch after this list).
    • Remove non-essential characters like punctuation marks.
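Here is that quick sketch, assuming the nlp pipeline we loaded earlier: spaCy's tokenizer rules split contractions like "Don't" into two meaningful tokens, and punctuation becomes a token of its own:

In [ ]:
# Contractions are split by tokenizer rules; punctuation becomes its own token
print([token.text for token in nlp("Don't stop!")])
# ['Do', "n't", 'stop', '!']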

Let's also see an example with our sample text --

In [ ]:
doc = nlp(sample_text)
In [ ]:
# Creating the table
console = Console()
token_table = Table(show_header=True, header_style="bold magenta")

# Adding the columns
token_table.add_column("Token", width=12)

# Going through each token and adding it to the table
for token in doc:
    token_table.add_row(token.text)

# Showing the table
console.print(token_table)
┏━━━━━━━━━━━━━━┓
┃ Token        ┃
┡━━━━━━━━━━━━━━┩
│ ah           │
│ i            │
│ was          │
│ also         │
│ confused     │
│ but          │
│ i            │
│ think        │
│ they         │
│ mean         │
│ friends      │
│ around       │
│ the          │
│ same         │
│ age          │
└──────────────┘

Part of Speech Tagging

Now Part of Speech Tagging, or POS for short, applies a tag to each token based on its context. spaCy uses statistical models to predict the tag for each token.

You can head over to universaldependencies.org if you want to know what each of these tags means.

Also check out Matcher for interactively learning rule-based pattern matching.
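Since the Matcher just came up, here is a minimal sketch of rule-based pattern matching applied to our doc from above (the pattern itself is only an illustration):

In [ ]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Match an adjective directly followed by a noun, e.g. "same age"
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)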

In [ ]:
# Creating the table
console = Console()
pos_table = Table(show_header=True, header_style="bold magenta")

# Adding the columns
pos_table.add_column("Token", width=12)
pos_table.add_column("POS", width=12)

# Going through each token and its corresponding POS
for token in doc:
    pos_table.add_row(token.text, token.pos_)

# Showing the table
console.print(pos_table)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Token        ┃ POS          ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ ah           │ INTJ         │
│ i            │ PRON         │
│ was          │ AUX          │
│ also         │ ADV          │
│ confused     │ ADJ          │
│ but          │ CCONJ        │
│ i            │ PRON         │
│ think        │ VERB         │
│ they         │ PRON         │
│ mean         │ VERB         │
│ friends      │ NOUN         │
│ around       │ ADP          │
│ the          │ DET          │
│ same         │ ADJ          │
│ age          │ NOUN         │
└──────────────┴──────────────┘

Dependency Labels

Dependency labels show the relations between tokens, assigning a syntactic structure to the sentence.

Let's also see what each column in the table below means --

  • Dep tree - The arrows point out the relations from one token to another

  • Token - The token text

  • Dep type - The type of relation between the two tokens

  • Lemma - The base form of the word (lemmatization simply gets the base form)

  • Part of Sp - The part of speech we talked about earlier

In [ ]:
explacy.print_parse_info(nlp, sample_text)
Dep tree         Token    Dep type Lemma    Part of Sp
──────────────── ──────── ──────── ──────── ──────────
            ┌──► ah       intj     ah       INTJ
            │┌─► i        nsubj    I        PRON
            ├┼── was      ROOT     be       AUX
            │└─► also     advmod   also     ADV
            └──► confused acomp    confused ADJ
            ┌──► but      cc       but      CCONJ
            │┌─► i        nsubj    I        PRON
┌───────────┴┴── think    ROOT     think    VERB
│            ┌─► they     nsubj    they     PRON
└─►┌─────────┴── mean     ccomp    mean     VERB
   └─►┌───────── friends  dobj     friend   NOUN
      └─►┌────── around   prep     around   ADP
         │  ┌──► the      det      the      DET
         │  │┌─► same     amod     same     ADJ
         └─►└┴── age      pobj     age      NOUN
In [ ]:
# To get the full form of a specific dep type, you can pass it here!
spacy.explain("det")
Out[ ]:
'determiner'

Entity Recognition

Entity Recognition is simply recognizing real-world objects, such as dates, products, and companies, in a text.

In [ ]:
# We can also visualize our input text and see the different entity tags spaCy can apply to texts
spacy.displacy.render(doc, style="ent", jupyter=True)
/usr/local/lib/python3.7/dist-packages/spacy/displacy/__init__.py:189: UserWarning: [W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.
  warnings.warn(Warnings.W006)
ah i was also confused but i think they mean friends around the same age
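Our sample sentence simply contains no named entities, hence the warning above. Here is a quick sketch with a made-up sentence that does contain some (the exact labels you get depend on the model):

In [ ]:
# A sentence with recognizable entities (illustrative example text)
doc_ner = nlp('Apple was founded by Steve Jobs in California in 1976.')
for ent in doc_ner.ents:
    print(ent.text, '->', ent.label_)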

Lemmatizer

The Lemmatizer (Lemma for short in the table below) simply gets the base form of each word.

For example, in the table below, the token was is converted to its base form be, and the token friends is converted to its base form friend.

In [ ]:
explacy.print_parse_info(nlp, sample_text)
Dep tree         Token    Dep type Lemma    Part of Sp
──────────────── ──────── ──────── ──────── ──────────
            ┌──► ah       intj     ah       INTJ
            │┌─► i        nsubj    I        PRON
            ├┼── was      ROOT     be       AUX
            │└─► also     advmod   also     ADV
            └──► confused acomp    confused ADJ
            ┌──► but      cc       but      CCONJ
            │┌─► i        nsubj    I        PRON
┌───────────┴┴── think    ROOT     think    VERB
│            ┌─► they     nsubj    they     PRON
└─►┌─────────┴── mean     ccomp    mean     VERB
   └─►┌───────── friends  dobj     friend   NOUN
      └─►┌────── around   prep     around   ADP
         │  ┌──► the      det      the      DET
         │  │┌─► same     amod     same     ADJ
         └─►└┴── age      pobj     age      NOUN

Now, there are many more pipeline components that you can add in spaCy that we didn't mention here. You can also create & add a custom preprocessing pipeline component if you want, check it out here.
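For the curious, a custom component in spaCy v3 is just a decorated function that takes a Doc and returns it. A minimal sketch (this toy component is purely illustrative, and nothing later relies on it):

In [ ]:
from spacy.language import Language

@Language.component('token_counter')
def token_counter(doc):
    # A trivial custom pipeline step: report the token count, then pass the Doc on
    print(f'Doc has {len(doc)} tokens')
    return doc

nlp.add_pipe('token_counter', last=True)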

So, that's all! I know it was a lot, but there's still a lot left. This was just the nlp function part; our main goal, Text Classification, is still to come. Let's do it 💪

If you want to explore more, check out the spaCy documentation, and there's also an interactive web app that you might want to check out :)