[Getting Started Code] Research Paper Classification
In this second challenge of Blitz 9, we are going to use an LSTM for multi-class text classification.
Starter Code for Research Paper Classification
Ok, we learned the fundamentals of Natural Language Processing in our first challenge, where we classified different emotions. The task here is still the same - classification. But the main point of this challenge isn't the task itself, it is how we complete the task. word2vec has several shortcomings, which we are trying to address here.
What we are going to Learn¶
- What is an LSTM & why use an LSTM?
- Using TensorFlow to create the dataset, converting texts into tokens and encoding them using vectorization.
- Creating & training a TensorFlow model with LSTM layers.
- Testing and Submitting the Results to the Challenge.
Setup AIcrowd Utilities 🛠¶
Here we are installing the AIcrowd CLI to download the challenge dataset.
!pip install -q -U aicrowd-cli
Downloading Dataset¶
So first, as in the previous challenge, we log in using the AIcrowd CLI by inputting our API key, and then download the dataset.
API_KEY = '61d7dd898be9a4343531783c2ca4a402' # Please get your API Key from https://www.aicrowd.com/participants/me
!aicrowd login --api-key $API_KEY
# Downloading the dataset ( removing the data and assets folders if they already exist, then recreating them )
!rm -rf data
!mkdir data
!rm -rf assets
!mkdir assets
!aicrowd dataset download --challenge research-paper-classification -j 3 -o data # Downloading the dataset and saving it in data folder
Define preprocessing code 💻¶
As you have probably guessed, we will be using TensorFlow mainly for creating the dataset and training the LSTM model.
# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score, accuracy_score
import os
# Tensorflow
import tensorflow as tf
tf.random.set_seed(42) # Setting a seed for reproducibility
# To make things more beautiful!
from rich.console import Console
from rich.table import Table
from rich import pretty
pretty.install()
# function to display YouTube videos
from IPython.display import YouTubeVideo
Reading Dataset¶
Reading the files we need to train, validate & submit our results!
train_df = pd.read_csv("data/train.csv")
val_df = pd.read_csv("data/val.csv")
train_df
train_df['label'].value_counts().plot(kind='bar')
Ok, we can see there is quite a big dataset imbalance problem here. But I will leave fixing it to you; the starter code will not contain solutions to everything 🙂
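If you want a head start, one common fix (a sketch under my own assumptions, not part of the official solution) is to weight the loss by inverse class frequency; the resulting dictionary can later be passed to model.fit via its class_weight argument.
# Sketch: class weights inversely proportional to class frequency.
# Assumes the one-hot columns created later follow sorted label order,
# which matches pd.get_dummies' behaviour.
counts = train_df['label'].value_counts()
total, num_classes = counts.sum(), len(counts)
class_weight = {i: total / (num_classes * counts[label])
                for i, label in enumerate(sorted(counts.index))}
print(class_weight)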
Creating the Dataset 📁¶
From here, we will be using TensorFlow extensively to create the dataset, and in the next section, to train our model and submit results.
One-hot encoding is a technique that converts a categorical column ( in this case, the label column ) into numerical columns which we can input into the model. There are many other techniques for this; one-hot encoding is a popular and effective one. Simply put, here's what one-hot encoding does --
train_one_hot_label = pd.get_dummies(train_df['label'])
val_one_hot_label = pd.get_dummies(val_df['label'])
train_one_hot_label[:10]
train_df.head(10)
Can you detect the pattern?
The from_tensor_slices function converts the dataset from numpy arrays into a TensorFlow Dataset, which gives us access to tons of other functions for creating batches and feeding our dataset into the model.
X_train, y_train = train_df['text'].values.astype(str), np.asarray(train_one_hot_label.values).astype(np.float32)
X_val, y_val = val_df['text'].values.astype(str), np.asarray(val_one_hot_label.values).astype(np.float32)
# Inputting the X ( features ) and y ( labels )
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
train_dataset
# Setting up batch size
BATCH_SIZE = 64
train_dataset = train_dataset.batch(BATCH_SIZE)
validation_dataset = validation_dataset.batch(BATCH_SIZE)
train_dataset
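One optional tweak (an addition of mine, not in the original starter code): shuffling the examples before batching and prefetching batches is a common tf.data idiom that can speed up and stabilise training. A hypothetical alternative pipeline would look like this:
# Sketch: shuffle before batching, then prefetch so data prep overlaps training
train_dataset = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
                 .shuffle(10000)
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.AUTOTUNE))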
# Reading sample text and labels from the dataset
for example, label in train_dataset.take(1):
    print('Text : ', example.numpy()[0])
    print('Label : ', label.numpy()[0])
There will be a lot going on in the upcoming cells, so let's debrief here.
The TextVectorization layer converts your texts into vectors ( as you can probably guess from the name ). There are several steps inside TextVectorization:
- Doing a little bit of preprocessing/cleaning of the text.
- Splitting each sentence into words ( tokens ).
- Assigning a unique numerical ID to each token and outputting the vector.
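As a quick illustration (the two mini-sentences below are invented for this example, not taken from the challenge data), you can watch all three steps happen at once:
# Toy demo: TextVectorization lowercases & strips punctuation, tokenizes,
# and maps each token to an integer ID
demo_encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=20)
demo_encoder.adapt(tf.constant(["Deep learning is fun!", "Deep networks learn features."]))
print(demo_encoder.get_vocabulary())  # ['', '[UNK]', 'deep', ...] : padding, unknown, then tokens by frequency
print(demo_encoder(tf.constant(["Deep learning"])).numpy())  # the corresponding integer IDs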
VOCAB_SIZE = 10000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=VOCAB_SIZE)
# adapt builds the vocabulary from the training text. Note that calling adapt
# again would reset the vocabulary, so we adapt on the training set only.
encoder.adapt(train_dataset.map(lambda text, label: text))
# Printing the individual tokens in the vocabulary ( first 10 )
vocab = np.array(encoder.get_vocabulary())
print("Tokens : ", vocab[:10])
print("Number of tokens : ", len(vocab))
The [UNK] token stands for an unknown word: if a word in the text is not in the vocabulary ( for example, in the test dataset ), the [UNK] token is used in its place.
# Vectorization
text = example[0].numpy()
encoded_text = encoder(example)[0].numpy()
print("Text : ", text, "\n", "Encoded Text : ", encoded_text)
Creating the Model¶
We are getting close. Here we create our model with layers like Embedding and LSTM, plus simple layers like Dense and Dropout. Let's dig in and learn more about LSTMs.
Now, you might be asking: why not just do the same as in the first challenge, why go so advanced?
LSTM networks solve a problem we never discussed in the previous challenge: texts are sequences. If we want to predict the next word in a text, we need to know about the previous word(s). That's exactly what LSTMs do: they take the words one by one, process each, and then give the output. The sklearn model from the first challenge didn't have that capability.
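To make the "one word at a time" idea concrete, here is a tiny standalone sketch (dummy random data, not our challenge dataset) showing that an LSTM emits one output per timestep when return_sequences=True, and only the final state otherwise:
# Dummy batch: 1 sequence, 5 timesteps, 8 features per timestep
dummy = tf.random.normal((1, 5, 8))
per_step = tf.keras.layers.LSTM(16, return_sequences=True)(dummy)
final_only = tf.keras.layers.LSTM(16)(dummy)
print(per_step.shape)    # (1, 5, 16) : one output per timestep
print(final_only.shape)  # (1, 16) : just the last hidden state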
But how does an LSTM actually work? Good question. Here is a really good video on how LSTMs work.
YouTubeVideo('QciIcRxJvsM')
If you want to go a bit deeper into LSTMs, Understanding LSTM Networks is a really good blog post by colah.
# Creating a Sequential Model
model = tf.keras.Sequential([
encoder,
# Word embeddings are very similar to the word2vec we used in the previous challenge, but these will train as the model trains
tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
# Creating the LSTM layers; return_sequences is set to True when another LSTM layer follows it.
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dropout(0.2),
# Output layer with 4 neurons. ( 4 classes )
tf.keras.layers.Dense(4)
])
# Configuring the model and setting up parameters, including optimizer, loss and metrics.
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
optimizer=tf.keras.optimizers.Adam(),
metrics=['accuracy'])
# predict on a sample text
sample_text = train_df.text[0]
predictions = model.predict(np.array([sample_text]))
print(predictions[0])
These are the raw scores ( logits ) for each class; the final Dense layer has no activation, so they are not probabilities yet.
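If you want actual probabilities, apply a softmax to the logits (a small add-on, not in the original notebook):
# Softmax turns logits into probabilities that sum to 1
probabilities = tf.nn.softmax(predictions[0]).numpy()
print(probabilities, probabilities.sum())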
Training the Model 🚆¶
# Let's goo!
history = model.fit(train_dataset, epochs=10)
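One optional tweak (my suggestion, not part of the original starter code): pass the validation set to fit and stop early once validation loss stops improving, keeping the best weights:
# Hypothetical alternative training call with early stopping
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=10,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)],
)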
Validation¶
Now that we have done the training, let's test our model on unseen data ( the validation dataset ) to see how well it performs!
validation_predictions = model.predict(validation_dataset, verbose=1)
validation_predictions[0]
# Converting the predictions from scores into class indices
y_pred_encoded = np.argmax(validation_predictions, axis=1)
# Mapping the class indices back to label names using train_one_hot_label's columns
y_pred = [train_one_hot_label.columns[i] for i in y_pred_encoded]
print("F1 Score : ", f1_score(val_df['label'], y_pred, average="weighted"))
Prediction phase 🔎¶
Again! Let's make our predictions just like in the previous challenge!
# Loading the test dataset
test_df = pd.read_csv("data/test.csv")
# Making the predictions and converting them into actual labels
X_test = test_df['text'].values.astype(str)
model_results = model.predict(X_test)
encoded_results = np.argmax(model_results, axis=1)
results = [train_one_hot_label.columns[i] for i in encoded_results]
# Putting the results into the column of test dataset
test_df['label'] = results
test_df
Note: Please make sure there is a file named submission.csv in the assets folder before submitting.
# Saving our results in submission.csv
test_df.to_csv(os.path.join("assets", 'submission.csv'), index=False)
Submit to AIcrowd 🚀¶
Note: Please save the notebook before submitting it (Ctrl + S)
!aicrowd notebook submit -c research-paper-classification -a assets --no-verify
Congratulations 🎉 you did it! But there is still a lot of improvement that can be made; here are some suggestions -
- Try to solve the dataset imbalance issue.
- Try changing parameters, or adding more LSTM layers to the TensorFlow model.
And btw -
Don't be shy to ask questions about any errors you are getting, or doubts about any part of this notebook, in the discussion forum or on the AIcrowd Discord server; the AIcrew will be happy to help you :)
Also, wanna give us your valuable feedback for the next Blitz, or wanna work with us creating Blitz challenges? Let us know!