Solution for submission 147040

A detailed solution for submission 147040 submitted for challenge NLP Feature Engineering

sean_benhur

Solution for NLP Feature Engineering LB: 0.772

This solution uses a count vectorizer followed by TF-IDF weighting for feature engineering.

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /data on the workspace.

In [1]:
import os

# Please use an absolute path for the location of the dataset.
# Or you can use a relative path with `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")

Install packages 🗃

We are going to use sklearn for count vectorization and TF-IDF.

In [2]:
!pip install --upgrade scikit-learn
!pip install -q -U aicrowd-cli
Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/a8/eb/a48f25c967526b66d5f1fa7a984594f0bf0a5afafa94a8c4dbc317744620/scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.2 threadpoolctl-2.1.0
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

In [3]:
from glob import glob
import os
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
import sklearn

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

For this solution approach there is no training needed! 🙂

In [4]:

API Key valid
Saved API Key successfully!
In [5]:
# Downloading the Dataset
!mkdir data
data.csv: 100% 110k/110k [00:00<00:00, 1.54MB/s]

Prediction phase 🔎

Generate the features for the test dataset.

In [6]:
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
Out[6]:
id text feature
0 0 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0.3745401188473625, 0.9507143064099162, 0.731...
1 1 This paper is an exposition of the so-called i... [0.9327284833540133, 0.8660638895004084, 0.045...
2 2 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0.9442664891134339, 0.47421421665746377, 0.86...
3 3 We calculate the equation of state of dense hy... [0.18114934953468032, 0.6811178539649828, 0.18...
4 4 The Donald-Flanigan conjecture asserts that fo... [0.5435382173426461, 0.08172534574677826, 0.45...
5 5 Let $E$ be a primarily quasilocal field, $M/E$... [0.7945155444907487, 0.7070864772666982, 0.050...
6 6 The paper deals with the study of labor market... [0.3129073942136482, 0.27109625376406576, 0.59...
7 7 Axisymmetric equilibria with incompressible fl... [0.40680480095172356, 0.3282331056783394, 0.45...
8 8 This paper analyses the possibilities of perfo... [0.013682414760681105, 0.08159872000483837, 0....
9 9 I show that an (n+2)-dimensional n-Lie algebra... [0.9562918815133613, 0.37667644042946247, 0.33...

Count Vectorizer 🔢

A count vectorizer converts a text into a vector of word counts.
It has a vocabulary that includes every word present in the data (capped here at the 512 most frequent words via max_features). When it converts a text into a vector, it first counts all the words.
For example, if the first position of the vector corresponds to the word "hello" and "hello" occurs 2 times in the text, then that position holds the number 2.
The advantage of this method is that the vector always has the same size, independent of the length of the input text.
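
To make the idea concrete, here is a minimal pure-Python sketch of count vectorization. It is only an illustration and not what sklearn does internally: the helper names `build_vocabulary` and `count_vector` are hypothetical, and splitting on whitespace is a simplifying assumption (sklearn's CountVectorizer uses its own tokenizer).

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign a fixed column index to every distinct lowercase token (hypothetical helper)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order (hypothetical helper)."""
    counts = Counter(text.lower().split())
    # Dicts keep insertion order, so iterating yields tokens by column index.
    return [counts.get(tok, 0) for tok in vocabulary]

texts = ["hello world hello", "goodbye world"]
vocabulary = build_vocabulary(texts)
vectors = [count_vector(text, vocabulary) for text in texts]

print(vocabulary)  # {'goodbye': 0, 'hello': 1, 'world': 2}
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```

Both vectors have length 3 even though the texts differ in length, which is the fixed-size property described above.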

TF-IDF 📐

Here is a very in-depth explanation:
https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a
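
In short, TF-IDF down-weights words that appear in many documents and up-weights words that are specific to a few. The following is a rough pure-Python sketch of that weighting, assuming the smoothed IDF formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 and L2 row normalization, which are the defaults documented for sklearn's TfidfTransformer. The function name `tfidf_rows` and the toy count matrix are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """Turn rows of raw word counts into L2-normalized TF-IDF rows (hypothetical helper)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed to match sklearn's defaults).
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        # Guard against division by zero for rows that contain no known words.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

# Toy counts for the columns ["goodbye", "hello", "world"].
counts = [[0, 2, 1], [1, 0, 1]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the second row, "goodbye" and "world" each occur once, yet "goodbye" receives the larger weight because it appears in only one document while "world" appears in both.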

For a submission to be accepted, the features apparently need to be integers; otherwise the evaluation fails. The code below therefore scales the TF-IDF weights by 100 and rounds them.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
# Keep only the 512 most frequent words as the vocabulary.
count_vect = CountVectorizer(max_features=512)
X_train_counts = count_vect.fit_transform(test_dataset.text.tolist())

from sklearn.feature_extraction.text import TfidfTransformer
# Reweight the raw counts with TF-IDF.
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
# Scale by 100 and round so that the features are integers.
X_train_tf = np.round(X_train_tf.toarray() * 100).astype(int)

# Store each feature vector as a string in the "feature" column.
test_dataset.feature = [str(i) for i in X_train_tf.tolist()]
test_dataset
Out[7]:
id text feature
0 0 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 1 This paper is an exposition of the so-called i... [0, 6, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 2 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 3 We calculate the equation of state of dense hy... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 4 The Donald-Flanigan conjecture asserts that fo... [0, 0, 0, 0, 0, 0, 0, 11, 0, 25, 0, 0, 0, 0, 0...
5 5 Let $E$ be a primarily quasilocal field, $M/E$... [9, 0, 0, 0, 7, 0, 0, 0, 0, 7, 0, 0, 0, 0, 9, ...
6 6 The paper deals with the study of labor market... [0, 0, 0, 13, 0, 0, 0, 0, 13, 0, 0, 0, 0, 0, 0...
7 7 Axisymmetric equilibria with incompressible fl... [0, 0, 0, 0, 7, 0, 9, 0, 0, 0, 0, 0, 0, 9, 0, ...
8 8 This paper analyses the possibilities of perfo... [0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 30, 6, 0, 0,...
9 9 I show that an (n+2)-dimensional n-Lie algebra... [0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 29, 0, 0, 0, 0...
In [8]:
# Saving the sample submission
test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)

Submit to AIcrowd 🚀

Note: Please save the notebook before submitting it (Ctrl + S)

In [ ]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd -v notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge nlp-feature-engineering
WARNING: No assets directory at assets... Creating one...
WARNING: Assets directory is empty
Mounting Google Drive 💾
Your Google Drive will be mounted to access the colab notebook
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.activity.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fexperimentsandconfigs%20https%3a%2f%2fwww.googleapis.com%2fauth%2fphotos.native&response_type=code

Enter your authorization code:

Congratulations 🎉!

Now you have an understanding of how to do simple feature engineering in NLP.
If you liked it please leave a like.

PS: This notebook is adapted from the getting-started notebook by Shubhamaicrowd.
