
NLP Feature Engineering #2

Solution for submission 171870

A detailed solution for submission 171870 submitted for challenge NLP Feature Engineering #2

mkeywood


Starter Code for NLP Feature Engineering #2

What we are going to Learn

  • How to convert your text into numbers
  • How Bag of Words and TF-IDF work
  • Testing and Submitting the Results to the Challenge.

About this Challenge

Now, this challenge is very different from what we usually do in AIcrowd Blitz. In this challenge, the task is to generate features from text data. Extracting features helps us generate text embeddings that contain more useful information about the text.

Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [1]:
!pip install -q -U aicrowd-cli
%load_ext aicrowd.magic
     |████████████████████████████████| 44 kB 1.1 MB/s 
     |████████████████████████████████| 170 kB 14.8 MB/s 
     |████████████████████████████████| 63 kB 977 kB/s 
     |████████████████████████████████| 54 kB 1.6 MB/s 
     |████████████████████████████████| 214 kB 45.1 MB/s 
     |████████████████████████████████| 1.1 MB 46.7 MB/s 
     |████████████████████████████████| 63 kB 2.0 MB/s 
     |████████████████████████████████| 51 kB 7.8 MB/s 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

How to use this notebook? 📝

notebook overview

  • Update the config parameters. You can define the common variables here:
      • AICROWD_DATASET_PATH: Path to the file containing test data (the data will be available at /data/ on the aridhia workspace). This should be an absolute path.
      • AICROWD_OUTPUTS_PATH: Path to write the output to.
      • AICROWD_ASSETS_DIR: In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
      • AICROWD_API_KEY: In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
  • Installing packages. Please use the Install packages 🗃 section to install the packages
  • Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the predictions phase section

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /data on the workspace.

In [2]:
import os

# Please use the absolute path for the location of the dataset.
# Or you can use a relative path with `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")

Install packages 🗃

We are going to use several different libraries to demonstrate different techniques for converting text into numbers (or, more specifically, vectors).

In [3]:
!pip install --upgrade spacy rich gensim tensorflow scikit-learn
!python -m spacy download en_core_web_sm # Downloading the English language model, which contains many pretrained preprocessing pipelines
Requirement already satisfied: spacy in /usr/local/lib/python3.7/dist-packages (2.2.4)
Collecting spacy
  Downloading spacy-3.2.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
     |████████████████████████████████| 6.0 MB 7.1 MB/s 
Requirement already satisfied: rich in /usr/local/lib/python3.7/dist-packages (10.16.2)
Collecting rich
  Downloading rich-11.0.0-py3-none-any.whl (215 kB)
     |████████████████████████████████| 215 kB 76.2 MB/s 
Requirement already satisfied: gensim in /usr/local/lib/python3.7/dist-packages (3.6.0)
Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 1.2 MB/s 
Requirement already satisfied: tensorflow in /usr/local/lib/python3.7/dist-packages (2.7.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.2)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
     |████████████████████████████████| 10.1 MB 40.1 MB/s 
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from spacy) (0.9.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.11.3)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.0.6)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.19.5)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.27.1)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 1.6 MB/s 
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (21.3)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (0.4.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy) (57.4.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (3.0.6)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
     |████████████████████████████████| 181 kB 63.4 MB/s 
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (4.62.3)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.0.6)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
     |████████████████████████████████| 451 kB 67.3 MB/s 
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy) (3.10.0.2)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
     |████████████████████████████████| 628 kB 78.5 MB/s 
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from catalogue<2.1.0,>=2.0.6->spacy) (3.7.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->spacy) (3.0.6)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in /usr/local/lib/python3.7/dist-packages (from pathy>=0.3.5->spacy) (5.2.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.0.10)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.3)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.7/dist-packages (from typer<0.5.0,>=0.3.0->spacy) (7.1.2)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from rich) (0.9.1)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich) (2.6.1)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from rich) (0.4.4)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.4.1)
Requirement already satisfied: tensorflow-estimator<2.8,~=2.7.0rc0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (2.7.0)
Requirement already satisfied: keras<2.8,>=2.7.0rc0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (2.7.0)
Requirement already satisfied: gast<0.5.0,>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (0.4.0)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (1.13.3)
Requirement already satisfied: libclang>=9.0.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (12.0.0)
Requirement already satisfied: absl-py>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (0.12.0)
Requirement already satisfied: h5py>=2.9.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (3.1.0)
Requirement already satisfied: flatbuffers<3.0,>=1.12 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (2.0)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (1.6.3)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (3.3.0)
Requirement already satisfied: tensorboard~=2.6 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (2.7.0)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.21.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (0.23.1)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (0.2.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (1.43.0)
Requirement already satisfied: protobuf>=3.9.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (3.17.3)
Requirement already satisfied: wheel<1.0,>=0.32.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (0.37.1)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (1.1.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (1.15.0)
Requirement already satisfied: keras-preprocessing>=1.1.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow) (1.1.2)
Requirement already satisfied: cached-property in /usr/local/lib/python3.7/dist-packages (from h5py>=2.9.0->tensorflow) (1.5.2)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow) (1.35.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow) (1.8.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow) (0.4.6)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow) (0.6.1)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow) (1.0.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow) (3.3.6)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow) (4.8)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow) (4.2.4)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow) (1.3.0)
Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.7/dist-packages (from markdown>=2.6.8->tensorboard~=2.6->tensorflow) (4.10.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.7/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow) (3.1.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->spacy) (2.0.1)
Installing collected packages: catalogue, typer, srsly, pydantic, thinc, spacy-loggers, spacy-legacy, pathy, langcodes, spacy, rich, gensim
  Attempting uninstall: catalogue
    Found existing installation: catalogue 1.0.0
    Uninstalling catalogue-1.0.0:
      Successfully uninstalled catalogue-1.0.0
  Attempting uninstall: srsly
    Found existing installation: srsly 1.0.5
    Uninstalling srsly-1.0.5:
      Successfully uninstalled srsly-1.0.5
  Attempting uninstall: thinc
    Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Attempting uninstall: spacy
    Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
  Attempting uninstall: rich
    Found existing installation: rich 10.16.2
    Uninstalling rich-10.16.2:
      Successfully uninstalled rich-10.16.2
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aicrowd-cli 0.1.10 requires rich<11,>=10.0.0, but you have rich 11.0.0 which is incompatible.
Successfully installed catalogue-2.0.6 gensim-4.1.2 langcodes-3.3.0 pathy-0.6.1 pydantic-1.8.2 rich-11.0.0 spacy-3.2.1 spacy-legacy-3.0.8 spacy-loggers-1.0.1 srsly-2.4.2 thinc-8.0.13 typer-0.4.0
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 6.7 MB/s 
Requirement already satisfied: spacy<3.3.0,>=3.2.0 in /usr/local/lib/python3.7/dist-packages (from en-core-web-sm==3.2.0) (3.2.1)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.6)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.0.6)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.4.1)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.10.0.2)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.8)
Requirement already satisfied: pathy>=0.3.5 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.6.1)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.8.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (57.4.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.6)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.4.0)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (0.9.0)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.4.2)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.27.1)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (21.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.11.3)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.19.5)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (4.62.3)
Requirement already satisfied: thinc<8.1.0,>=8.0.12 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (8.0.13)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.6)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.3.0)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.0.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from catalogue<2.1.0,>=2.0.6->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.7.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (3.0.6)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in /usr/local/lib/python3.7/dist-packages (from pathy>=0.3.5->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (5.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (1.24.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.10)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.7/dist-packages (from typer<0.5.0,>=0.3.0->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (7.1.2)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->spacy<3.3.0,>=3.2.0->en-core-web-sm==3.2.0) (2.0.1)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

In [4]:
# Importing Libraries
import pandas as pd
import numpy as np
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
import random
from tqdm.notebook import tqdm
import unicodedata
import re

# Tensorflow 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

# Word2vec Implementation
import spacy
nlp = spacy.load('en_core_web_sm', exclude=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'])

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# To make things more beautiful! 
from rich.console import Console
from rich.table import Table
from rich.segment import Segment
from rich import pretty
pretty.install()

# Seeding everything to get the same results every run
random.seed(42)
np.random.seed(42)

# function to display YouTube videos
from IPython.display import YouTubeVideo
In [5]:
# Latest version of gensim
import gensim
gensim.__version__
Out[5]:
'4.1.2'
In [6]:
# Defining the function for preprocessing the test dataset, which will run after submitting the notebook

def tokenize_sentence(sentences, num_words=10000, maxlen=256, show=False): 

  # Creating the tokenizer; num_words sets the vocabulary size, and we assign an OOV (out of vocabulary) token for unknown tokens,
  # which can arise if we input a sentence containing words that the tokenizer doesn't have in its vocabulary

  tokenizer = Tokenizer(num_words=num_words, oov_token="<OOV>")

  tokenizer.fit_on_texts(sentences)
  
  # Getting the unique ID for each token
  word_index = tokenizer.word_index

  # Convert the sentences into vectors
  sequences = tokenizer.texts_to_sequences(sentences)

  # Padding the vectors so that all vectors have the same length
  padded_sequences = pad_sequences(sequences, padding='post', truncating='pre', maxlen=maxlen)

  word_index = np.asarray(word_index)
  sequences = np.asarray(sequences)
  padded_sequences = np.asarray(padded_sequences)

  if show==True:
    console = Console()

    console.log("Word Index. A unique ID is assigned to each token.")
    console.log(word_index)
    console.log("---"*10)

    console.log("Sequences. senteces converted into vector.")
    console.log(np.array(sequences[0]))
    console.log("---"*10)

    console.log("Padded Sequences. Adding,( 0 in this case ) or removing elements to make all vectors in the samples same.")
    console.log(np.array(padded_sequences[0]))
    console.log("---"*10)



  return tokenizer, word_index, sequences, padded_sequences

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

Downloading Dataset

This must be a pretty familiar thing by now :) In any case, here we are downloading the challenge dataset using the AIcrowd CLI.

In [7]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/cSdanpa28Pf3R5CRfYmZsJLhBF4ft7fBmN_B0eOEsVA
API Key valid
Saved API Key successfully!
In [10]:
# Downloading the Dataset
!mkdir data

# Downloading the programming language classification dataset for testing purposes
!mkdir programming-language-data
data.csv: 100% 8.37k/8.37k [00:00<00:00, 374kB/s]
train.csv: 100% 7.71M/7.71M [00:00<00:00, 33.2MB/s]
test.csv:   0% 0.00/1.50M [00:00<?, ?B/s]
sample_submission.csv: 100% 121k/121k [00:00<00:00, 2.42MB/s]
test.csv: 100% 1.50M/1.50M [00:00<00:00, 11.0MB/s]

Reading Dataset

Reading the necessary files to train, validate & submit our results!

We are also using the Programming Language Classification challenge dataset for testing purposes.

In [11]:
dataset = pd.read_csv("data/data.csv")
train_data = pd.read_csv("programming-language-data/train.csv")

dataset
Out[11]:
id text feature
0 0 Incels can't even get a pleb thing called sex.... [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
1 1 Seriously? I couldn't even remember the dude's... [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, ...
2 2 Seeing [NAME] on Sunday and I'm so fucking stoked [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, ...
3 3 Your not winning their hearts, your just tortu... [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, ...
4 4 This sub needs more Poison memes. [1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, ...
5 5 It is part of the political game unfortunately. [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, ...
6 6 That looks lovely, but what’s the mutant on th... [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, ...
7 7 Who will rid me of these meddlesome Golden Kni... [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, ...
8 8 We were all duped? I’ve seen through her from ... [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, ...
9 9 "Lol, hold my craft beer." [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, ...

Creating our Template

So, with this train_model function we are going to test the various techniques and compare them to see which works best!

In [12]:
def train_model(X, y):

  # Splitting the dataset into training and testing sets; by using stratify, we make sure to keep the same class balance between training and testing.
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

  # Creating and training a scikit-learn classifier (a Random Forest here; a Decision Tree is left commented out)
#  clf = DecisionTreeClassifier(random_state=42)
  clf = RandomForestClassifier(random_state=42)
  clf.fit(X_train, y_train)

  # Getting the predictions on the unseen (testing) dataset
  predictions = clf.predict(X_test)

  # Calculating the metrics
  f1 = f1_score(y_test, predictions, average='weighted')
  accuracy = accuracy_score(y_test, predictions)

  # Creating the table
  console = Console()
  result_table = Table(show_header=False, header_style="bold magenta")

  result_table.add_row("F1 Score", str(f1))
  result_table.add_row("Accuracy Score", str(accuracy))

  # Showing the table
  console.print(result_table)

  return f1, accuracy

Simple Tokenization 🪙

Here, all we are doing is splitting the sentences into tokens/words and assigning a unique ID to each token, and just like that, we have converted the text into a vector. We also use padding to make sure all vectors have the same maxlen, which is 256.

In [13]:
def tokenize_sentence(sentences, num_words=10000, maxlen=256, show=False): 

  # Creating the tokenizer; num_words sets the vocabulary size, and we assign an OOV (out of vocabulary) token for unknown tokens,
  # which can arise if we input a sentence containing words that the tokenizer doesn't have in its vocabulary

  tokenizer = Tokenizer(num_words=num_words, oov_token="<OOV>")

  tokenizer.fit_on_texts(sentences)
  
  # Getting the unique ID for each token
  word_index = tokenizer.word_index

  # Convert the sentences into vectors
  sequences = tokenizer.texts_to_sequences(sentences)

  # Padding the vectors so that all vectors have the same length
  padded_sequences = pad_sequences(sequences, padding='post', truncating='pre', maxlen=maxlen)

  word_index = np.asarray(word_index)
  sequences = np.asarray(sequences)
  padded_sequences = np.asarray(padded_sequences)

  if show==True:
    console = Console()

    console.log("Word Index. A unique ID is assigned to each token.")
    console.log(word_index)
    console.log("---"*10)

    console.log("Sequences. senteces converted into vector.")
    console.log(np.array(sequences[0]))
    console.log("---"*10)

    console.log("Padded Sequences. Adding,( 0 in this case ) or removing elements to make all vectors in the samples same.")
    console.log(np.array(padded_sequences[0]))
    console.log("---"*10)



  return tokenizer, word_index, sequences, padded_sequences
In [14]:
# Sample Sentences
sample_sentences = dataset.iloc[0, 1].split(".")
sample_sentences
[
    "Incels can't even get a pleb thing called sex",
    ' Sex is for [NAME] apparently',
    ''
]
In [15]:
_, _, _, _ = tokenize_sentence(sample_sentences, num_words=50, maxlen=16, show=True)
[21:46:32] Word Index. A unique ID is assigned to each     <ipython-input-13-1f4ed6658b36>:26
           token.                                                                            
           {'<OOV>': 1, 'sex': 2, 'incels': 3, "can't": 4, <ipython-input-13-1f4ed6658b36>:27
           'even': 5, 'get': 6, 'a': 7, 'pleb': 8,                                           
           'thing': 9, 'called': 10, 'is': 11, 'for': 12,                                    
           'name': 13, 'apparently': 14}                                                     
           ------------------------------                  <ipython-input-13-1f4ed6658b36>:28
           Sequences. senteces converted into vector.      <ipython-input-13-1f4ed6658b36>:30
           [ 3  4  5  6  7  8  9 10  2]                    <ipython-input-13-1f4ed6658b36>:31
           ------------------------------                  <ipython-input-13-1f4ed6658b36>:32
           Padded Sequences. Adding,( 0 in this case ) or  <ipython-input-13-1f4ed6658b36>:34
           removing elements to make all vectors in the                                      
           samples same.                                                                     
           [ 3  4  5  6  7  8  9 10  2  0  0  0  0  0  0   <ipython-input-13-1f4ed6658b36>:35
           0]                                                                                
           ------------------------------                  <ipython-input-13-1f4ed6658b36>:36
In [16]:
# Generating the vectors (X) and labels (y) that we will use to train the model

tokenizer, _, _, X = tokenize_sentence(train_data['code'].values)
y = train_data['language'].values
In [17]:
print("Sentence : ", train_data['code'][2])
print("Simple Tokenizer : ", X[2])
Sentence :  /*

     Explanation :- a user gives a String (it can be incomplete uppercase or

         partial uppercase) and then the program would convert it into a

         complete(all characters in lower case) lower case string. The

Simple Tokenizer :  [ 862    8  377 1465    8   33   90  179   49    1 1531  109 3361 1531
   24  247    3  467  821  442   90  295    8 1489  123  450   15  372
  156  372  156   33    3    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
In [18]:
token_id_f1, token_id_accuracy = train_model(X, y)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ F1 Score       ┃ 0.3839573556783904 ┃
│ Accuracy Score │ 0.4425816348893272 │
└────────────────┴────────────────────┘

Now, the advantage of this method is that it is very simple, but one of the major disadvantages is that it doesn't capture the "meaning" of the text, and many of the next methods will also not be able to solve this issue, until we get to Word2Vec.
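
To give a flavour of that, here is a minimal, hedged sketch of training a tiny gensim Word2Vec model. The toy sentences and hyper-parameters below are made up purely for illustration and are not part of this pipeline; the point is that each token gets a dense vector, and nearby vectors correspond to tokens used in similar contexts, which is exactly the "meaning" that plain IDs throw away.

# Tiny Word2Vec sketch (gensim >= 4), purely illustrative.
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

w2v = Word2Vec(sentences=toy_corpus, vector_size=16, window=3, min_count=1, seed=42, workers=1)

print(w2v.wv["cat"].shape)         # (16,): one dense vector per token
print(w2v.wv.most_similar("cat"))  # nearest tokens in the embedding space
# On a corpus this tiny the neighbours are essentially noise; with real data
# they start to reflect co-occurrence and meaning.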

Bag of Words 🎒

In Bag of Words, instead of just assigning a unique ID to each token as we did in simple tokenization, we do things a little differently: each sentence becomes a fixed-length vector that records which vocabulary tokens appear in it.

I find Bag of Words harder to understand from text alone, so this video is really helpful for understanding Bag of Words in a more visual way. Be sure to watch it.
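
To make the idea concrete before touching the challenge data, here is a minimal sketch on made-up toy sentences (only the data is invented; the Keras calls are the same ones used below). texts_to_matrix in its default 'binary' mode turns every sentence into a fixed-length vector that simply marks which vocabulary tokens are present.

# Minimal bag-of-words sketch on toy sentences (illustrative only).
from tensorflow.keras.preprocessing.text import Tokenizer

toy_sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

toy_tokenizer = Tokenizer(num_words=16, oov_token="<OOV>")
toy_tokenizer.fit_on_texts(toy_sentences)

# mode='binary' (the default) marks each known token as present (1) or absent (0),
# ignoring how often it occurs and in which order.
bow = toy_tokenizer.texts_to_matrix(toy_sentences, mode="binary")
print(toy_tokenizer.word_index)
print(bow.shape)  # (2, 16): one row per sentence, one column per vocabulary slot
print(bow)        # column 0 is reserved (word IDs start at 1), so it always stays 0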

In [19]:
tokenizer, _, _, _ = tokenize_sentence(train_data['code'].values, num_words=256)
X = tokenizer.texts_to_matrix(train_data['code'].values)
In [20]:
print("Sentence : ", train_data['code'][0])
print("BOW : ", X[0])
Sentence :              var result = testObj1 | testObj2;

             // Assert

             Assert.AreEqual(expected, result.ToString());

         }

         [TestCase(1, 1, 1, 1, "1")]

         [TestCase(5, 3, 8, 4, "0000")]

BOW :  [0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
In [21]:
bow_f1, bow_accuracy = train_model(X, y)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ F1 Score       ┃ 0.7379933591445309 ┃
│ Accuracy Score │ 0.7447950909489371 │
└────────────────┴────────────────────┘

Yee! Both of the metrics increased! The advantage of Bag of Words is that it is, again, really simple, but it doesn't preserve the order of words in the sentence and doesn't capture the meaning of the sentence.

Count Vectorization 🔢

Count Vectorization is very similar to Bag of Words, but instead of a one-hot presence flag, it also includes the count of each token in a sentence.
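
As a tiny illustration on a made-up sentence (not from the challenge data), switching texts_to_matrix to mode='count' replaces the 0/1 presence flags with the number of times each token appears:

# Count vectorization sketch: counts instead of 0/1 flags (illustrative only).
from tensorflow.keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")
toy_tokenizer.fit_on_texts(["to be or not to be"])

counts = toy_tokenizer.texts_to_matrix(["to be or not to be"], mode="count")
print(toy_tokenizer.word_index)  # e.g. {'<OOV>': 1, 'to': 2, 'be': 3, 'or': 4, 'not': 5}
print(counts[0])                 # 'to' and 'be' get a count of 2, 'or' and 'not' a count of 1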

In [22]:
tokenizer, _, _, _ = tokenize_sentence(train_data['code'].values, num_words=256)
X = tokenizer.texts_to_matrix(train_data['code'].values, mode='count')
In [23]:
print("Sentence : ", train_data['code'][2])
print("CV : ", X[2])
Sentence :  /*

     Explanation :- a user gives a String (it can be incomplete uppercase or

         partial uppercase) and then the program would convert it into a

         complete(all characters in lower case) lower case string. The

CV :  [ 0. 15.  0.  2.  0.  0.  0.  0.  3.  0.  0.  0.  0.  0.  0.  1.  0.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  2.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  2.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  2.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  0.  0.]
In [24]:
count_f1, count_accuracy = train_model(X, y)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ F1 Score       ┃ 0.7406875891199644 ┃
│ Accuracy Score │ 0.7487398641244795 │
└────────────────┴────────────────────┘

TF - IDF 📐

TF-IDF stands for Term Frequency - Inverse Document Frequency. As we saw in the last section, count vectorization has a bit of a flaw: tokens such as "is", "are" and "the" are very common and will generally have bigger counts, but they don't usually help, for example, to classify whether a text is positive or negative. TF-IDF tries to solve this issue by giving lower scores to common tokens and higher scores to rarer tokens.
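
The classic formulation is TF-IDF(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is how often token t appears in document d, N is the total number of documents, and df(t) is how many documents contain t (Keras' mode='tfidf' uses a smoothed variant, so its exact numbers differ). A tiny hand computation on made-up numbers shows why tokens that appear everywhere get pushed towards zero:

# Hand-rolled TF-IDF on made-up numbers, just to show the weighting intuition.
import math

N = 1000  # total documents in a hypothetical corpus

# token: (count in this document, number of documents containing it)
tokens = {
    "the":    (5, 990),  # appears almost everywhere -> tiny IDF -> low score despite a high count
    "python": (2, 40),   # rarer token -> larger IDF -> higher score
}

for token, (tf, df) in tokens.items():
    idf = math.log(N / df)
    print(f"{token:>6}: tf={tf}, idf={idf:.3f}, tf-idf={tf * idf:.3f}")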

In [25]:
tokenizer, _, _, _ = tokenize_sentence(train_data['code'].values, num_words=256)
X = tokenizer.texts_to_matrix(train_data['code'].values, mode='tfidf')
In [26]:
print("Sentence : ", train_data['code'][0])
print("TF-IDF : ", X[0])
Sentence :              var result = testObj1 | testObj2;

             // Assert

             Assert.AreEqual(expected, result.ToString());

         }

         [TestCase(1, 1, 1, 1, "1")]

         [TestCase(5, 3, 8, 4, "0000")]

TF-IDF :  [ 0.         31.60460419  4.23447758  0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          2.57234506  0.          0.          0.          0.
  0.          0.          0.          0.          2.88231493  2.92883682
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          5.82824057  0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          3.58251801  0.          0.          0.
  0.          0.          0.          0.          0.          0.
  6.31346134  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          3.694688    0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          4.48009244
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.        ]
In [27]:
tfidf_f1, tfidf_accuracy = train_model(X, y)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ F1 Score       ┃ 0.7439965221060034 ┃
│ Accuracy Score │ 0.7516984440061363 │
└────────────────┴────────────────────┘

Prediction phase 🔎

Generating the features for the test dataset.

In [28]:
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
Out[28]:
id text feature
0 0 Incels can't even get a pleb thing called sex.... [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
1 1 Seriously? I couldn't even remember the dude's... [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, ...
2 2 Seeing [NAME] on Sunday and I'm so fucking stoked [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, ...
3 3 Your not winning their hearts, your just tortu... [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, ...
4 4 This sub needs more Poison memes. [1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, ...
5 5 It is part of the political game unfortunately. [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, ...
6 6 That looks lovely, but what’s the mutant on th... [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, ...
7 7 Who will rid me of these meddlesome Golden Kni... [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, ...
8 8 We were all duped? I’ve seen through her from ... [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, ...
9 9 "Lol, hold my craft beer." [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, ...
In [29]:
#import nltk
#from nltk.corpus import stopwords
#nltk.download('stopwords')

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = s.replace('?', '')
#    s = " ".join([word for word in s.split(' ') if word not in stopwords.words('english')])
    return s
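
As a quick sanity check of this normalization (the expected output in the comment is taken from the text2 column shown further down; only the call itself is new):

print(normalizeString("Seeing [NAME] on Sunday and I'm so fucking stoked"))
# -> 'seeing name on sunday and i m so fucking stoked' (lowercased, brackets and punctuation stripped)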
In [30]:
test_dataset['text2'] = test_dataset['text'].apply(lambda x: normalizeString(x))
In [37]:
# So, let's tokenize the (normalized) text and generate the features using count vectorization!

# _, _, _, X = tokenize_sentence(test_dataset['text'].values)
tokenizer, _, _, _ = tokenize_sentence(test_dataset['text2'].values, num_words=256)
X = tokenizer.texts_to_matrix(test_dataset['text'].values, mode='count')

for index, row in tqdm(test_dataset.iterrows()):
  test_dataset.iloc[index, 2] = str(X[index].tolist())

test_dataset
Out[37]:
id text feature text2
0 0 Incels can't even get a pleb thing called sex.... [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 2.0, ... incels can t even get a pleb thing called sex ...
1 1 Seriously? I couldn't even remember the dude's... [0.0, 2.0, 2.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, ... seriously i couldn t even remember the dude s...
2 2 Seeing [NAME] on Sunday and I'm so fucking stoked [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... seeing name on sunday and i m so fucking stoked
3 3 Your not winning their hearts, your just tortu... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... your not winning their hearts your just tortur...
4 4 This sub needs more Poison memes. [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... this sub needs more poison memes .
5 5 It is part of the political game unfortunately. [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... it is part of the political game unfortunately .
6 6 That looks lovely, but what’s the mutant on th... [0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... that looks lovely but what s the mutant on the...
7 7 Who will rid me of these meddlesome Golden Kni... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... who will rid me of these meddlesome golden kni...
8 8 We were all duped? I’ve seen through her from ... [0.0, 2.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... we were all duped i ve seen through her from ...
9 9 "Lol, hold my craft beer." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... lol hold my craft beer .
In [38]:
# Saving the submission
test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)

Submit to AIcrowd 🚀

Note : Please save the notebook before submitting it (Ctrl + S)

In [ ]:
%aicrowd notebook submit --assets-dir $AICROWD_ASSETS_DIR --challenge nlp-feature-engineering-2 --no-verify
WARNING: Assets directory is empty

Congratulations 🎉 you did it! But there is still a lot of improvement that can be made; this is a feature engineering challenge after all, which means we have to fit as much information as we can about the text into 256 numbers. We only covered converting text into vectors, but there are many more things you can try, for example unsupervised classification, who knows, maybe it can help :)

And btw -

Don't be shy to ask questions about any errors you are getting, or doubts about any part of this notebook, in the discussion forum or on the AIcrowd Discord server. The AIcrew will be happy to help you :)

Also, wanna give us your valuable feedback for the next Blitz, or wanna work with us on creating Blitz challenges? Let us know!

In [ ]:

