# BERT for research paper classification

Fine-tuning a BERT model pretrained with masked language modeling (MLM) for a paper classification task.


# Introduction

Google's BERT is a widely used model across NLP tasks. In this notebook we use the HuggingFace transformers library, which provides pretrained BERT models, to fine-tune BERT for classifying research papers.


# Setup

In [ ]:
# Install huggingface library
!pip install transformers

!pip install aicrowd-cli

Requirement already satisfied: transformers in /opt/conda/lib/python3.7/site-packages (4.5.1)
Requirement already satisfied: aicrowd-cli in /opt/conda/lib/python3.7/site-packages (0.1.7)

In [ ]:
import os
import re
import random
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

from transformers import BertTokenizer
from transformers import BertModel
from tqdm import tqdm

%matplotlib inline

In [ ]:
API_KEY = '<YOUR_API_KEY>'  # Get your API key from https://www.aicrowd.com/participants/me
!aicrowd login --api-key $API_KEY

!mkdir data
!aicrowd dataset download --challenge research-paper-classification -j 3 -o data

API Key valid
Saved API Key successfully!
mkdir: cannot create directory ‘data’: File exists
val.csv: 100%|███████████████████████████████| 883k/883k [00:00<00:00, 1.26MB/s]
test.csv: 100%|████████████████████████████| 3.01M/3.01M [00:00<00:00, 3.13MB/s]
train.csv: 100%|███████████████████████████| 8.77M/8.77M [00:01<00:00, 7.15MB/s]

In [ ]:
train_dataset = pd.read_csv("data/train.csv")
validation_dataset = pd.read_csv("data/val.csv")

In [ ]:
X_train = train_dataset.text.values
y_train = train_dataset.label.values
X_val = validation_dataset.text.values
y_val = validation_dataset.label.values

In [ ]:
# Let's look at some examples
print(X_train[:10])
print(y_train[:10])

['we propose deep network models and learning algorithms for learning binary hash codes given image representations under both unsupervised and supervised manners . the novelty of our network design is that we constrain one hidden layer to directly output the binary codes . resulting optimizations involving these binary, independence, and balance constraints are difficult to solve .'
'multi-distance information computed by the MDLP aids in robust extraction of the texture arrangement . a novel color feature descriptor is proposed in this manuscript .'
"traditional solutions consider dense pedestrians as passive/active moving obstacles that are the cause of all troubles . crowd flow locally observed can be treated as a sensory measurement about the surrounding scenario, encoding the scene's traversability but also its social navigation preference ."
'in this paper, is used the lagrangian classical mechanics for modeling the dynamics of an underactuated system . a basic design of the system is proposed in SOLIDWORKS 3D CAD software, which provides some physical variables necessary for modeling .'
'the aim of this work is to determine how vulnerable different iris coding methods are in relation to biometric template aging phenomenon . this is considered to be particularly important when the time lapse between gallery and probe samples extends significantly, to more than a few years .'
'classification is one of the most well studied models of machine learning . in many scenarios there is the opportunity to label critical items for manual revision, instead of trying to automatically classify every item .'
'denoising autoencoders (DAE) are trained to reconstruct clean inputs with noise injected at the input level . variational autoencodingrs are trained with noise injection in their stochastic hidden layer . we propose a modified variational lower bound as an improved objective function in this setup .'
'we present a novel haptic teleoperation approach that considers not only the safety but also the stability of a robot . this approach uses control barrier functions (CBFs) to generate reference feedback that informs the human operator on the internal state of the system . previous work in the area neglected to consider the feedback loop through the user, possibly resulting in unstable closed trajectories .'
'deep convolutional neural networks (CNN) have shown great success in supervised classification tasks such as character classification or dating . traditional methods are often better than or equivalent to deep learning methods .'
'the focus of this work is sign spotting - given a video of an isolated sign . we train a model using multiple types of available supervision . these tasks are integrated into a unified learning framework .']
[3 3 2 2 3 3 1 2 3 3]
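Before training, it can help to check how evenly the labels are spread. The following is a minimal sketch that only uses the ten labels printed above (copied by hand); `label_distribution` is a hypothetical helper, and in the notebook the same idea would be applied to the full `y_train` array.

```python
from collections import Counter

# The ten labels printed above, copied by hand for illustration.
sample_labels = [3, 3, 2, 2, 3, 3, 1, 2, 3, 3]

def label_distribution(labels):
    """Return {label: fraction of examples}, ordered by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

distribution = label_distribution(sample_labels)
print(distribution)  # {1: 0.1, 2: 0.3, 3: 0.6}
```

Ten examples are far too few to draw conclusions; the sketch only shows the mechanics.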


## Set up GPU

Select GPU accelerator on colab

Runtime -> Change runtime type -> Hardware accelerator: GPU

In [ ]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla P100-PCIE-16GB


# Fine-tuning BERT

## Tokenization and Input Formatting

Before tokenizing, we apply only light preprocessing: the text is lowercased. Heavier cleaning is unnecessary because BERT was pretrained on complete sentences.

In [ ]:
def text_preprocessing(text):
    """
    Lowercase the text; no further cleaning is applied.
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Lowercase the text
    text = text.lower()

    return text

In [ ]:
# Print sentence 0
print('Original: ', X_train[0])
print('Processed: ', text_preprocessing(X_train[0]))

Original:  we propose deep network models and learning algorithms for learning binary hash codes given image representations under both unsupervised and supervised manners . the novelty of our network design is that we constrain one hidden layer to directly output the binary codes . resulting optimizations involving these binary, independence, and balance constraints are difficult to solve .
Processed:  we propose deep network models and learning algorithms for learning binary hash codes given image representations under both unsupervised and supervised manners . the novelty of our network design is that we constrain one hidden layer to directly output the binary codes . resulting optimizations involving these binary, independence, and balance constraints are difficult to solve .


### BERT Tokenizer

In order to apply the pre-trained BERT, we must use the tokenizer provided by the library. This is because (1) the model has a specific, fixed vocabulary and (2) the BERT tokenizer has a particular way of handling out-of-vocabulary words.
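To make the out-of-vocabulary handling concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting, the idea behind WordPiece tokenization. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and written for illustration only; the real tokenizer uses the full `bert-base-uncased` vocabulary and handles more edge cases.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_vocab = {"un", "##su", "##per", "##vis", "##ed", "deep", "network"}

print(wordpiece_tokenize("unsupervised", toy_vocab))  # ['un', '##su', '##per', '##vis', '##ed']
print(wordpiece_tokenize("network", toy_vocab))       # ['network']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

A word missing from the vocabulary is therefore not discarded: it is broken into known pieces, and only falls back to the unknown token when no split exists.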

In addition, we must add special tokens to the start and end of each sentence, pad or truncate all sentences to a single constant length, and mark which positions are padding with the "attention mask".
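The formatting steps just described can be sketched without the library. The helper `format_for_bert` below is hypothetical and operates on token ids that are assumed to be already looked up; the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def format_for_bert(token_ids, max_len):
    """Add special tokens, truncate/pad to max_len, and build the attention mask."""
    # Reserve two positions for [CLS] and [SEP], truncating the text if needed.
    body = list(token_ids[: max_len - 2])
    input_ids = [CLS_ID] + body + [SEP_ID]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

ids, mask = format_for_bert([2057, 16599, 2784], max_len=8)
print(ids)   # [101, 2057, 16599, 2784, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every output has exactly `max_len` entries, which is what lets a batch of sentences be stacked into one tensor.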

In [ ]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # encode_plus will:
        #    (1) Tokenize the sentence
        #    (2) Add the [CLS] and [SEP] token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create the attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=text_preprocessing(sent),  # Preprocess sentence
            add_special_tokens=True,        # Add [CLS] and [SEP]
            max_length=MAX_LEN,             # Max length to truncate/pad
            padding='max_length',           # Pad sentence to max length
            truncation=True,                # Truncate sentence to max length
            return_attention_mask=True)     # Return attention mask

        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks



Before tokenizing, we need to specify the maximum length of our sentences.

In [ ]:
# Concatenate train data and validation data
all_data = np.concatenate([X_train, X_val])

# Encode our concatenated data
encoded_data = [tokenizer.encode(sent, add_special_tokens=True) for sent in all_data]

# Find the maximum length
max_len = max([len(sent) for sent in encoded_data])
print('Max length: ', max_len)

Max length:  112
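The maximum alone does not say how many texts a smaller cap would cut short. A minimal sketch of that check follows; `truncation_report` is a hypothetical helper and `toy_lengths` are made-up token counts, whereas in the notebook the lengths would come from `[len(sent) for sent in encoded_data]`.

```python
def truncation_report(lengths, max_len):
    """Return (number of sequences longer than max_len, their fraction of all sequences)."""
    n_truncated = sum(1 for n in lengths if n > max_len)
    return n_truncated, n_truncated / len(lengths)

# Hypothetical token counts, for illustration only.
toy_lengths = [68, 45, 80, 112, 97, 101, 60, 75]

for candidate in (64, 100, 112):
    n, frac = truncation_report(toy_lengths, candidate)
    print(f"max_len={candidate}: {n} truncated ({frac:.1%})")
```

A cap that truncates only a small fraction of texts keeps sequences short, which reduces padding and computation per batch.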

In [ ]:
# len_dist = [len(sent) for sent in encoded_data]
# plt.hist(len_dist, 50)
# # print(sorted(len_dist)[::-2])


Now let's tokenize our data.

In [ ]:
# Specify MAX_LEN (below the observed maximum of 112, so the longest texts are truncated)
MAX_LEN = 100

# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X_train[0]])[0].squeeze().numpy())
print('Original: ', X_train[0])
print('Token IDs: ', token_ids)

Original:  we propose deep network models and learning algorithms for learning binary hash codes given image representations under both unsupervised and supervised manners . the novelty of our network design is that we constrain one hidden layer to directly output the binary codes . resulting optimizations involving these binary, independence, and balance constraints are difficult to solve .
Token IDs:  [101, 2057, 16599, 2784, 2897, 4275, 1998, 4083, 13792, 2005, 4083, 12441, 23325, 9537, 2445, 3746, 15066, 2104, 2119, 4895, 6342, 4842, 11365, 2098, 1998, 13588, 14632, 1012, 1996, 21160, 1997, 2256, 2897, 2640, 2003, 2008, 2057, 9530, 20528, 2378, 2028, 5023, 6741, 2000, 3495, 6434, 1996, 12441, 9537, 1012, 4525, 20600, 2015, 5994, 2122, 12441, 1010, 4336, 1010, 1998, 5703, 14679, 2024, 3697, 2000, 9611, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]



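Next we tokenize the training and validation sets and convert the labels to tensors. These tensors are later wrapped in the torch DataLoader class, which iterates over the dataset in mini-batches so that only one batch at a time has to be moved to the GPU.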

In [ ]:
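# Run function preprocessing_for_bert on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)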

Tokenizing data...
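As a rough, framework-free picture of what `RandomSampler` combined with `DataLoader` provides, the sketch below shuffles example indices and yields fixed-size mini-batches. `iterate_minibatches` and `toy_examples` are hypothetical; the real classes additionally stack tensors and can load data in parallel.

```python
import random

def iterate_minibatches(examples, batch_size, shuffle=True, seed=None):
    """Yield lists of at most batch_size examples, optionally in random order."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

# Toy (text, label) pairs standing in for tokenized inputs and their labels.
toy_examples = [(f"text {i}", i % 4) for i in range(10)]

batches = list(iterate_minibatches(toy_examples, batch_size=4, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each example appears exactly once per pass, and the last batch may be smaller than `batch_size`.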


## Create BertClassifier

BERT-base consists of 12 transformer layers. Each layer takes in a list of token embeddings and produces the same number of embeddings, with the same hidden size, on the output. The final layer's output for the [CLS] token is used as the feature vector of the sequence and fed to a classifier.
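The path from the `[CLS]` vector to class scores can be sketched in plain Python with toy sizes. All names here (`linear`, `relu`, `classify_from_cls`) and all weights are hypothetical; the notebook uses a hidden size of 768, a classifier width of 16 and 4 labels, implemented with `torch.nn` layers.

```python
def linear(x, weights, bias):
    """Compute weights @ x + bias for one input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

def classify_from_cls(hidden_states, w1, b1, w2, b2):
    """hidden_states holds one sequence as a list of per-token vectors."""
    cls_vector = hidden_states[0]              # position 0 holds the [CLS] token
    hidden = relu(linear(cls_vector, w1, b1))  # hidden layer of the classifier
    return linear(hidden, w2, b2)              # one logit per class

# Toy sizes: hidden size 3, classifier width 2, 2 classes.
hidden_states = [[1.0, -2.0, 0.5], [9.0, 9.0, 9.0]]  # [CLS] vector, then one other token
w1, b1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0], [-1.0, 2.0]], [0.5, 0.0]

logits = classify_from_cls(hidden_states, w1, b1, w2, b2)
print(logits)  # [1.5, -1.0]
```

Only the first position feeds the classifier; the vectors of the other tokens influence the result indirectly, through the attention layers inside BERT.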

In [ ]:
%%time
from transformers import get_linear_schedule_with_warmup

# Create the BertClassifier class
class BertClassifier(nn.Module):
    """BERT model for classification tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    freeze_bert (bool): Set False to fine-tune the BERT model
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 16, 4

        # Instantiate BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Instantiate a feed-forward classifier with one hidden layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            #nn.Dropout(0.5),
            nn.Linear(H, D_out)
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that holds attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)

        # Extract the last hidden state of the token [CLS] for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)

    # Tell PyTorch to run the model on the selected device
    bert_classifier.to(device)

    # Create the optimizer
    optimizer = torch.optim.AdamW(bert_classifier.parameters(),
                                  lr=5e-5,    # Learning rate
                                  eps=1e-8    # Epsilon value
                                  )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler

CPU times: user 49 µs, sys: 0 ns, total: 49 µs
Wall time: 54.8 µs
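The scheduler created in `initialize_model` changes the learning rate at every optimizer step. A minimal sketch of a linear warmup followed by linear decay is shown below; `linear_schedule_lr` is a hypothetical stand-alone function, and the total of 1000 steps is a made-up figure, while `5e-5` and zero warmup steps are the values used in this notebook.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

# lr=5e-5 and no warmup as in the notebook; 1000 total steps is an assumed figure.
for step in (0, 250, 500, 1000):
    print(step, linear_schedule_lr(step, 5e-5, 0, 1000))
```

With zero warmup steps the rate starts at its full value and shrinks linearly, reaching zero at the last training step.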


## Train/Eval Loop

The script below is commented with the details of our training, evaluation and predict steps.

In [ ]:
# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the BertClassifier model.
    """
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts += 1
            # Load batch to the device
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            # (optimizer and scheduler are the globals returned by initialize_model)
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch

            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")

    print("Training complete!")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to the device
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits without tracking gradients
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to the device
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits without tracking gradients
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)

    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)

    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()

    return probs
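The gradient clipping step in the training loop rescales all gradients together when their combined norm exceeds a threshold. The sketch below is a simplified, framework-free version of that idea; `clip_by_global_norm` is a hypothetical helper working on a flat list of numbers, and it omits the small epsilon that library implementations typically add for numerical safety.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Toy gradient values flattened into a single list.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is capped.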


Now, let's start training our BertClassifier!

### Train Our Model

In [ ]:
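# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 32

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

# Concatenate the train set and the validation set
full_train_data = torch.utils.data.ConcatDataset([train_data, val_data])
full_train_sampler = RandomSampler(full_train_data)
full_train_dataloader = DataLoader(full_train_data, sampler=full_train_sampler, batch_size=batch_size)

# Train the Bert Classifier on the entire training data
set_seed(42)
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, full_train_dataloader, epochs=2)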

Start training...

Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed
----------------------------------------------------------------------
1    |   20    |   0.985625   |     -      |     -     |   6.90
1    |   40    |   0.765241   |     -      |     -     |   6.50
1    |   60    |   0.650742   |     -      |     -     |   6.52
1    |   80    |   0.519994   |     -      |     -     |   6.63
1    |   100   |   0.550070   |     -      |     -     |   6.53
1    |   120   |   0.527957   |     -      |     -     |   6.54
1    |   140   |   0.489843   |     -      |     -     |   6.51
1    |   160   |   0.482148   |     -      |     -     |   6.59
1    |   180   |   0.471007   |     -      |     -     |   6.60
1    |   200   |   0.493979   |     -      |     -     |   6.54
1    |   220   |   0.395955   |     -      |     -     |   6.55
1    |   240   |   0.458692   |     -      |     -     |   6.69
1    |   260   |   0.448997   |     -      |     -     |   6.55
1    |   280   |   0.454144   |     -      |     -     |   6.62
1    |   300   |   0.406686   |     -      |     -     |   6.53
1    |   320   |   0.411704   |     -      |     -     |   6.54
1    |   340   |   0.431477   |     -      |     -     |   6.51
1    |   360   |   0.467282   |     -      |     -     |   6.61
1    |   380   |   0.437731   |     -      |     -     |   6.59
1    |   400   |   0.382060   |     -      |     -     |   6.55
1    |   420   |   0.397287   |     -      |     -     |   6.56
1    |   440   |   0.340312   |     -      |     -     |   6.50
1    |   460   |   0.402937   |     -      |     -     |   6.54
1    |   480   |   0.395239   |     -      |     -     |   6.59
1    |   500   |   0.387243   |     -      |     -     |   6.56
1    |   520   |   0.421651   |     -      |     -     |   6.54
1    |   540   |   0.404790   |     -      |     -     |   6.51
1    |   560   |   0.431920   |     -      |     -     |   6.52
1    |   580   |   0.409982   |     -      |     -     |   6.59
1    |   600   |   0.406167   |     -      |     -     |   6.56
1    |   620   |   0.391683   |     -      |     -     |   6.55
1    |   640   |   0.389269   |     -      |     -     |   6.51
1    |   660   |   0.370293   |     -      |     -     |   6.59
1    |   680   |   0.374124   |     -      |     -     |   6.64
1    |   700   |   0.372981   |     -      |     -     |   6.53
1    |   720   |   0.364327   |     -      |     -     |   6.55
1    |   740   |   0.404715   |     -      |     -     |   6.50
1    |   760   |   0.371905   |     -      |     -     |   6.55
1    |   780   |   0.417554   |     -      |     -     |   6.60
1    |   800   |   0.393825   |     -      |     -     |   6.58
1    |   820   |   0.434368   |     -      |     -     |   6.53
1    |   840   |   0.406136   |     -      |     -     |   6.51
1    |   860   |   0.393687   |     -      |     -     |   6.54
1    |   880   |   0.338327   |     -      |     -     |   6.57
1    |   900   |   0.377238   |     -      |     -     |   6.57
1    |   920   |   0.328799   |     -      |     -     |   6.53
1    |   940   |   0.390369   |     -      |     -     |   6.50
1    |   960   |   0.339908   |     -      |     -     |   6.54
1    |   980   |   0.305509   |     -      |     -     |   6.57
1    |   984   |   0.298841   |     -      |     -     |   1.14
----------------------------------------------------------------------

Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed
----------------------------------------------------------------------
2    |   20    |   0.259033   |     -      |     -     |   6.87
2    |   40    |   0.213879   |     -      |     -     |   6.56
2    |   60    |   0.293290   |     -      |     -     |   6.50
2    |   80    |   0.258832   |     -      |     -     |   6.54
2    |   100   |   0.293762   |     -      |     -     |   6.58
2    |   120   |   0.249684   |     -      |     -     |   6.55
2    |   140   |   0.278557   |     -      |     -     |   6.56
2    |   160   |   0.252613   |     -      |     -     |   6.51
2    |   180   |   0.270853   |     -      |     -     |   6.55
2    |   200   |   0.256150   |     -      |     -     |   6.59
2    |   220   |   0.264495   |     -      |     -     |   6.57
2    |   240   |   0.239394   |     -      |     -     |   6.54
2    |   260   |   0.212492   |     -      |     -     |   6.53
2    |   280   |   0.229465   |     -      |     -     |   6.54
2    |   300   |   0.258033   |     -      |     -     |   6.57
2    |   320   |   0.219788   |     -      |     -     |   6.57
2    |   340   |   0.295560   |     -      |     -     |   6.53
2    |   360   |   0.234664   |     -      |     -     |   6.51
2    |   380   |   0.280145   |     -      |     -     |   6.58
2    |   400   |   0.275102   |     -      |     -     |   6.56
2    |   420   |   0.299080   |     -      |     -     |   6.53
2    |   440   |   0.251985   |     -      |     -     |   6.54
2    |   460   |   0.242607   |     -      |     -     |   6.50
2    |   480   |   0.258352   |     -      |     -     |   6.57
2    |   500   |   0.280211   |     -      |     -     |   6.59
2    |   520   |   0.262910   |     -      |     -     |   6.54
2    |   540   |   0.272464   |     -      |     -     |   6.55
2    |   560   |   0.242203   |     -      |     -     |   6.50
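The Train Loss column is a cross-entropy value, so it helps to know its scale. The sketch below computes cross-entropy for a single example in plain Python; `cross_entropy` is a hypothetical helper written for illustration (the notebook uses `nn.CrossEntropyLoss`, which averages this quantity over a batch). For 4 classes, a model that assigns equal scores to every class has a loss of ln 4, about 1.386, so the first logged value of 0.985625 is already below that baseline.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    # Subtract the max logit for numerical stability (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Equal logits over 4 classes give ln(4).
print(cross_entropy([0.0, 0.0, 0.0, 0.0], label=2))
# A confident, correct prediction gives a loss close to zero.
print(cross_entropy([0.0, 0.0, 8.0, 0.0], label=2))
```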


## Predictions on Test Set

In [ ]:
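# Run preprocessing_for_bert on the test set
test_data = pd.read_csv("data/test.csv")
print('Tokenizing data...')
test_inputs, test_masks = preprocessing_for_bert(test_data.text.values)

# Create the DataLoader for our test set
test_dataset = TensorDataset(test_inputs, test_masks)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)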

In [ ]:
# Compute predicted probabilities on the test set
probs = bert_predict(bert_classifier, test_dataloader)

# Get predictions from the probabilities
# Since it is multi class, we take argmax for label
preds = np.argmax(probs, axis=1)
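The step from logits to a predicted label can be sketched for a single example in plain Python. `softmax` and `predict_label` are hypothetical helpers and the logits are made up; the notebook performs the same computation on whole arrays with `F.softmax` and `np.argmax`.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (index of the most probable class, all class probabilities)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Made-up logits for one example with 4 classes.
label, probs = predict_label([0.2, -1.0, 3.1, 0.5])
print(label, [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities selects the same class as taking the argmax of the logits.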

In [ ]:
test_data['label'] = preds
print(test_data)

In [ ]:
!mkdir assets

# Saving the sample submission in assets directory
test_data.to_csv(os.path.join("assets", "submission.csv"), index=False)

In [ ]:
!aicrowd notebook submit -c research-paper-classification -a assets --no-verify
