AI Blitz #9
Different Methods in NLP Feature Engineering
A notebook featuring different ways of extracting features from text
A walkthrough of different methods for feature engineering, from conventional approaches to SOTA methods such as Transformers
Motivation behind Feature Engineering in Natural Language Processing¶
Let us start with why we are interested in feature engineering.
Say you are given an unstructured dataset (images, text, audio, videos, etc.) and you have to apply a machine learning algorithm to it. But how do you convert the dataset into numbers!?
For NLP datasets such as text, we are interested in finding a vector representation of the words, which in turn helps the ML algorithm learn better.
So what should we do for this competition?¶
In this competition, we should use the given dataset data.csv to generate features from the text, i.e., we should convert the text into its vector representation.
The generated features will then be used to train a classical machine learning model in the testing phase, and the results are evaluated based on that.
In this notebook, we will look at some interesting ways to create features for text using NLP techniques.
Note: Some of the methods here cannot be directly used for submission, since they take a long time to create the vectors.
Let's load the Data¶
Download the dataset using the AIcrowd CLI.
We will also be using the dataset from a previous challenge, for experimentation and learning purposes.
Install packages 🗃¶
!pip install aicrowd-cli -q
!pip install gensim zeugma pandas numpy -q
API_KEY = "" # Please enter your API Key from [https://www.aicrowd.com/participants/me]
!aicrowd login --api-key $API_KEY
import os
# Please use the absolute path for the location of the dataset,
# or a relative path such as `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")
# Downloading the Dataset
!mkdir -p data
!aicrowd dataset download --challenge nlp-feature-engineering -j 3 -o data
# Downloading the research paper classification dataset for training purposes
!mkdir -p research-paper-data
!aicrowd dataset download --challenge research-paper-classification -j 3 -o research-paper-data
Peek into the dataset¶
Remember, this competition is quite different from the other ones: the shared dataset contains only 10 samples, and we can use any other dataset to train a model for converting these 10 samples into features.
Here, I will be using the dataset from the previous research paper classification task.
import pandas as pd
train_data_path = "/content/research-paper-data/train.csv"
val_data_path = "/content/research-paper-data/val.csv"
test_data_path = "/content/research-paper-data/test.csv"
train_data = pd.read_csv(train_data_path)
#make a copy of the original dataset
train = train_data.copy()
train.head()
train.shape
Preprocessing in NLP¶
Before moving on to vectorization, let's look at some methods for preprocessing the sentences that will help us in later stages.
Converting text into lowercase¶
Let's convert all the text into lowercase; this keeps tokens consistent in the later steps (for example, the NLTK stopword list is all lowercase).
def to_lowercase(text):
    return text.lower()
train["text"] = train["text"].apply(to_lowercase)
train.head()
Tokenization¶
Tokenization is nothing but splitting each sentence into smaller units, and there are many types of tokenization. The most important types are:
- Word-level tokenization (splitting by words)
- Character-level tokenization (splitting by characters)
- Subword-based tokenization (splitting by subwords)
We will implement each of these for experimental purposes (a subword sketch follows the character example below).
Word tokenization¶
import nltk
from nltk import word_tokenize

nltk.download('punkt')

# apply word-level tokenization to every row of the text column
train['word_tokenize'] = train['text'].apply(word_tokenize)
From the output below, you can notice the sentence has been split into words, with punctuation kept as separate tokens.
train['word_tokenize'][0]
Char tokenization¶
Character tokenization splits the text into individual characters.
text = "NLP for feature engineering"
chars = list(text)  # split the string into individual characters
print(chars)
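Subword tokenization¶
Subword tokenization requires a learned vocabulary, so in practice a library is used. Below is a minimal sketch using a pretrained WordPiece tokenizer from Hugging Face's transformers library; note that transformers is not installed in the cells above, so this is an optional extra.
# Subword tokenization sketch using a pretrained WordPiece tokenizer.
# Assumes: !pip install transformers -q  (not installed in the cells above)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("NLP for feature engineering"))
# A rare word like "nlp" may be split into subword pieces such as ['nl', '##p']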
Stopwords Removal¶
Stopword removal is one of the essential preprocessing steps in NLP projects. It involves removing very common words such as "and", "is", "was", and "the"; we remove these words since they have little impact on the topic of the sentences.
# nltk provides stopword lists for several languages
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword = stopwords.words('english')
def remove_stopwords(text):
    """Custom function to remove the stopwords."""
    # (for large corpora, convert `stopword` to a set for faster membership tests)
    return " ".join([word for word in str(text).split() if word not in stopword])
# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
train['text_without_stopwords'] = train['text'].apply(remove_stopwords)
train['text_without_stopwords'][0]
`stopword` is simply a list of frequent English words:
stopword
You may notice that the text above no longer contains any stopwords. Now, for our competition, we can finalize our approach: apply the word tokenizer to the main text field, join the tokens back into a string, and remove the stopwords.
train
# apply word tokenization
# alternative: df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
train['text'] = train['text'].apply(word_tokenize)
def list_to_str(text):
    """Join a list of tokens back into a single string."""
    return ' '.join([str(elem) for elem in text])
train['text'] = train['text'].apply(list_to_str)
#print(train['text'][0])
#remove all the stopwords
train['text'] = train['text'].apply(remove_stopwords)
train['text']
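For reuse, the three steps above (lowercasing, tokenization, stopword removal) can be folded into a single helper; here is a minimal sketch using the same functions defined earlier:
def preprocess(text):
    """Lowercase, word-tokenize, and remove stopwords in one pass."""
    tokens = word_tokenize(str(text).lower())
    return " ".join(word for word in tokens if word not in stopword)

# e.g. preprocess("Transformers are a popular method in NLP")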
Representing words as vectors¶
Let's start our workflow for the competition. Here I will walk you through several methods for representing words as vectors.
As always, there are many ways to do this.
We will start with the simplest one.
One Hot Encoding¶
Let's say we have a corpus consisting of all the unique words from the dataset; this collection is called the vocabulary (vocab).
Example: [cat, dog, word, text, research, ...]. After applying one-hot encoding, each word is represented by a vector that is 1 at that word's own position and 0 everywhere else:
cat: [1,0,0,0,0,...], dog: [0,1,0,0,...], and the same goes for all the other words!
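To make this concrete, below is a minimal word-level one-hot sketch over a toy vocabulary (the vocab and words here are illustrative, not taken from the dataset):
import numpy as np

# Toy vocabulary (illustrative only)
vocab = ["cat", "dog", "word", "text", "research"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's vocab index and 0 elsewhere."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print("cat ->", one_hot("cat"))   # [1 0 0 0 0]
print("dog ->", one_hot("dog"))   # [0 1 0 0 0]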
train
# Note: get_dummies treats each full text string as a single category,
# so this creates one column per unique row rather than per word.
pd.get_dummies(train['text'])
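A caveat to close this section: one-hot vectors are sparse and grow with the vocabulary size, and any two of them are equally distant, so they carry no notion of similarity between words. This is the main motivation for denser representations, such as the word embeddings provided by the gensim package installed earlier, which we will look at later in this notebook.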