
AI Blitz #9

Different Methods in NLP Feature Engineering

A notebook featuring different ways of extracting features from text


A walkthrough of different methods for feature engineering, from conventional approaches to SOTA methods such as Transformers


Motivation behind Feature Engineering in Natural Language Processing

Let us start with why we are interested in feature engineering.

Say you are given an unstructured dataset (images, text, audio, video, etc.) and you have to apply a machine learning algorithm to it. But how do you convert the dataset into numbers?

For NLP datasets, i.e. text, we are interested in finding a vector representation of the words, which in turn helps the ML algorithm learn better.

So what should we do for this competition?

In this competition, we use the given dataset data.csv to generate features from the text, i.e. we convert the text into its vector representation.

The generated features will be used to train a classical machine learning model in the testing phase, and the results are evaluated based on that.

In this notebook, we will look at some interesting ways to create features for text using NLP techniques.

Note: Some of the methods here cannot be used directly for submission, since they take a long time to create the vectors.

Let's load the Data

Download the dataset using the AIcrowd CLI

We will also be using the previous dataset, for experimentation and learning purposes.

Install packages 🗃

import os

# Please use an absolute path for the location of the dataset,
# or a relative path such as `os.path.join(os.getcwd(), "test_data/test.csv")`.
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")

Peek into the dataset

Remember, this competition is quite different from other ones: the shared dataset contains only 10 samples, and we can use any other dataset to train a model that converts these 10 samples into features.

Here, I will be using the dataset from the previous research paper classification task.

In [ ]:
import pandas as pd

train_data_path = "/content/research-paper-data/train.csv"
val_data_path  = "/content/research-paper-data/val.csv"
test_data_path  = "/content/research-paper-data/test.csv"

train_data = pd.read_csv(train_data_path)
# Make a copy of the original dataset
train = train_data.copy()

Preprocessing in NLP

Before moving on to vectorization, let's look at some methods for preprocessing the sentences, which will help us in later stages.

Converting text into lowercase

Let's convert all the text to lowercase, which will help us in the later preprocessing steps.
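
As a minimal sketch of this step, here is lowercasing applied to a couple of made-up sentences (they are hypothetical stand-ins, not rows from the dataset). With a pandas DataFrame that has a `text` column, as assumed here for `train`, the same idea would be `train["text"].str.lower()`.

```python
# Hypothetical sample sentences standing in for the `text` column
texts = ["We Propose Deep Network Models", "NLP for Feature Engineering"]

# str.lower() returns a lowercased copy of each string
lowered = [t.lower() for t in texts]
print(lowered)
```

Lowercasing first means that "Network" and "network" are later treated as the same word.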

Tokenization

Tokenization simply means splitting each sentence into smaller units called tokens. There are many types of tokenization; the most important are:

  • Word level tokenization (splitting by words)
  • Character level tokenization (splitting by characters)
  • Subword based tokenization (splitting by subwords)

We will implement each of these for experimental purposes

Word tokenization

As the word_tokenize column in the table further below shows, the sentence is split into words, and punctuation marks become separate tokens.
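
Here is a rough sketch of word tokenization using only a regular expression. It is a simplified stand-in for a library tokenizer, not the exact tokenizer used to produce the outputs in this notebook; the function name `word_tokenize_simple` and the sample sentence are made up for illustration.

```python
import re

def word_tokenize_simple(sentence):
    """Split a sentence into word tokens, keeping each punctuation mark as its own token."""
    # \w+(?:-\w+)* matches words, including hyphenated ones such as "context-based";
    # [^\w\s] matches any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

tokens = word_tokenize_simple("in this paper, we aim to incorporate context-based features.")
print(tokens)
```

Note how the comma and the full stop come out as separate tokens, while the hyphenated word stays in one piece.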

Char tokenization

Character tokenization splits the text into individual characters, including spaces.

In [ ]:
text = "NLP for feature engineering"
# Iterating over a string yields its characters, so list() gives one token per character
lst = list(text)
lst

Out[ ]:
['N', 'L', 'P', ' ', 'f', 'o', 'r', ' ', 'f', 'e', 'a', 't', 'u', 'r', 'e', ' ', 'e', 'n', 'g', 'i', 'n', 'e', 'e', 'r', 'i', 'n', 'g']

Subword based tokenization

Subword based tokenization is commonly employed in transformer based models such as BERT, GPT, etc.

There are many types of subword based tokenization:

  • Byte-Pair Encoding (BPE)
  • WordPiece
  • SentencePiece

I recommend reading this article to learn more about this topic.
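
To give a feel for how BPE builds subwords, here is a toy sketch of its merge loop: start from characters and repeatedly merge the most frequent adjacent pair. The word frequencies are hypothetical, and this simplified version leaves out details that real tokenizers add (for example end-of-word markers and byte-level handling), so treat it as an illustration only.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word frequencies; each word starts as a tuple of characters
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(3):  # learn three merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
print(list(vocab))
```

After a few merges, the frequent suffix "est" becomes a single symbol, which is how common word pieces end up in a subword vocabulary.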

Stopwords Removal

Stopword removal is one of the essential preprocessing steps in NLP projects. It involves removing very common words such as and, is, was, and the. We remove these words because they have little impact on the topic of a sentence.

Out[ ]:
'propose deep network models learning algorithms learning binary hash codes given image representations unsupervised supervised manners . novelty network design constrain one hidden layer directly output binary codes . resulting optimizations involving binary, independence, balance constraints difficult solve .'

A stopword list is a list of very frequent words.

Notice that the text above no longer contains any stopwords. For our competition, we can now finalize our approach by applying the word tokenizer to our main text field.

Out[ ]:
      | id    | text                                              | label | word_tokenize                                     | text_without_stopwords
0     | 0     | we propose deep network models and learning al... | 3     | [we, propose, deep, network, models, and, lear... | propose deep network models learning algorithm...
1     | 1     | multi-distance information computed by the mdl... | 3     | [multi-distance, information, computed, by, th... | multi-distance information computed mdlp aids ...
2     | 2     | traditional solutions consider dense pedestria... | 2     | [traditional, solutions, consider, dense, pede... | traditional solutions consider dense pedestria...
3     | 3     | in this paper, is used the lagrangian classica... | 2     | [in, this, paper, ,, is, used, the, lagrangian... | paper, used lagrangian classical mechanics mod...
4     | 4     | the aim of this work is to determine how vulne... | 3     | [the, aim, of, this, work, is, to, determine, ... | aim work determine vulnerable different iris c...
...   | ...   | ...                                               | ...   | ...                                               | ...
31495 | 31495 | the proposed method is easily programmed by ki... | 2     | [the, proposed, method, is, easily, programmed... | proposed method easily programmed kinesthetic ...
31496 | 31496 | research in unpaired video translation has foc... | 3     | [research, in, unpaired, video, translation, h... | research unpaired video translation focused sh...
31497 | 31497 | deep learning models exhibit limited generaliz... | 3     | [deep, learning, models, exhibit, limited, gen... | deep learning models exhibit limited generaliz...
31498 | 31498 | in this paper, we aim to incorporate global se... | 3     | [in, this, paper, ,, we, aim, to, incorporate,... | paper, aim incorporate global semantic context...
31499 | 31499 | to precisely calculate context-based probabili... | 3     | [to, precisely, calculate, context-based, prob... | precisely calculate context-based probabilitie...

31500 rows × 5 columns

Out[ ]:
0        propose deep network models learning algorithm...
1        multi-distance information computed mdlp aids ...
2        traditional solutions consider dense pedestria...
3        paper , used lagrangian classical mechanics mo...
4        aim work determine vulnerable different iris c...
31495    proposed method easily programmed kinesthetic ...
31496    research unpaired video translation focused sh...
31497    deep learning models exhibit limited generaliz...
31498    paper , aim incorporate global semantic contex...
31499    precisely calculate context-based probabilitie...
Name: text, Length: 31500, dtype: object
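
The stopword removal step described earlier can be sketched as follows. The stopword set here is a tiny hand-picked list for illustration (an assumption, not the list used for the outputs above), and the sample sentence is a made-up one in the style of the dataset.

```python
import re

# A tiny hand-picked stopword set for illustration only; real lists
# (for example the English list that ships with NLTK) are much longer.
STOPWORDS = {"we", "and", "for", "the", "is", "was", "of", "to", "in", "this", "by", "a"}

def remove_stopwords(sentence):
    """Lowercase, tokenize into words and punctuation, then drop stopwords."""
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence.lower())
    kept = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(kept)

# Hypothetical sentence, not an actual row of the dataset
sample = "We propose deep network models and learning algorithms for learning binary hash codes."
cleaned = remove_stopwords(sample)
print(cleaned)
```

Because the tokens are joined back with spaces, punctuation ends up separated from the preceding word, just like in the outputs above.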

Other types of Preprocessing

There are a couple of other methods that are common in NLP preprocessing, including:

  • Stemming
  • Lemmatization

To learn more about these, I recommend checking out these blogs.
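
As a rough illustration of the difference between the two, here is a toy sketch: stemming chops off suffixes by rule, while lemmatization maps a word to its dictionary form. Both functions below are deliberately simplistic and hypothetical (the stemmer is not the Porter algorithm, and the lemma table has only three entries); real projects would use a library for this.

```python
def simple_stem(word):
    """Crude stemming: strip one of a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; a real lemmatizer relies on a full dictionary
LEMMA_TABLE = {"was": "be", "better": "good", "studies": "study"}

def simple_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

words = ["learning", "models", "studies", "was"]
stems = [simple_stem(w) for w in words]
lemmas = [simple_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

Notice that a stem does not have to be a real word ("studies" becomes "studi"), whereas a lemma always is ("studies" becomes "study").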

Representing words as vectors

Let's start our workflow for the competition. Here I will walk you through methods for representing words as vectors.

As always, there are many ways to do this.

We will start with the simplest one.

One Hot Encoding

Let's say we collect all the unique words in this dataset; this collection is called the vocabulary (vocab).

Example: [cat, dog, word, text, research, ...]

After applying one-hot encoding, each word is represented by a vector as long as the vocabulary, with a 1 at the word's own position and 0 everywhere else.

cat: [1, 0, 0, 0, 0, ...]
dog: [0, 1, 0, 0, 0, ...]

The same goes for every other word.
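
A minimal sketch of this scheme, using a hypothetical five-word vocabulary taken from the example (the names `word_to_index` and `one_hot` are made up for illustration):

```python
# Hypothetical toy vocabulary, following the example in the text
vocab = ["cat", "dog", "word", "text", "research"]

# Map each word to its position in the vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))  # [1, 0, 0, 0, 0]
print(one_hot("dog"))  # [0, 1, 0, 0, 0]
```

Each vector is as long as the vocabulary, so for a real dataset with tens of thousands of unique words these vectors become very large and almost entirely zero.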

