
NLP Feature Engineering

Solution for submission 146912

A detailed solution for submission 146912 submitted for challenge NLP Feature Engineering

falak

Solution for NLP Feature Engineering LB: 0.781

This solution uses a count vectorizer followed by a TF-IDF transform for feature engineering.

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /data on the workspace.

In [1]:
import os
# import nltk
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path with `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")

Install packages 🗃

We are going to use sklearn to do Count Vectorization and TF IDF.

In [2]:
!pip install --upgrade scikit-learn
!pip install -q -U aicrowd-cli
! pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
# ! pip install clean-text
Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.24.2)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (2.1.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.2.5)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from nltk) (1.15.0)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[2]:
True

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

In [3]:
from glob import glob
import os
import pandas as pd
import numpy as np
# from sklearn import model_selection
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import f1_score, accuracy_score
import sklearn
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
porter=PorterStemmer()

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

For this solution approach there is no training needed! 🙂

In [4]:

API Key valid
Saved API Key successfully!
In [5]:
# Downloading the Dataset
!mkdir data
mkdir: cannot create directory ‘data’: File exists
data.csv: 100% 110k/110k [00:00<00:00, 663kB/s]

Prediction phase 🔎

Generating the features in test dataset.

In [6]:
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
Out[6]:
id text feature
0 0 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0.3745401188473625, 0.9507143064099162, 0.731...
1 1 This paper is an exposition of the so-called i... [0.9327284833540133, 0.8660638895004084, 0.045...
2 2 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0.9442664891134339, 0.47421421665746377, 0.86...
3 3 We calculate the equation of state of dense hy... [0.18114934953468032, 0.6811178539649828, 0.18...
4 4 The Donald-Flanigan conjecture asserts that fo... [0.5435382173426461, 0.08172534574677826, 0.45...
5 5 Let $E$ be a primarily quasilocal field, $M/E$... [0.7945155444907487, 0.7070864772666982, 0.050...
6 6 The paper deals with the study of labor market... [0.3129073942136482, 0.27109625376406576, 0.59...
7 7 Axisymmetric equilibria with incompressible fl... [0.40680480095172356, 0.3282331056783394, 0.45...
8 8 This paper analyses the possibilities of perfo... [0.013682414760681105, 0.08159872000483837, 0....
9 9 I show that an (n+2)-dimensional n-Lie algebra... [0.9562918815133613, 0.37667644042946247, 0.33...

Count Vectorizer 🔢

A count vectorizer turns a text into a vector of word counts; applied to a whole dataset, it yields a matrix with one row per text.
It builds a vocabulary containing every word present in the data and assigns each word a fixed position. To convert a text into a vector, it counts how often each vocabulary word occurs in that text.
For example, if the first position of the vector corresponds to the word "hello" and "hello" occurs 2 times in the text, then this position holds the number 2.
The advantage of this method is that the vector always has the same size, namely the vocabulary size, and is therefore independent of the length of the input.
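
To make the idea concrete, here is a minimal pure-Python sketch of what a count vectorizer does. It is an illustration only, not sklearn's implementation: the helper names `build_vocabulary` and `count_vector` and the toy texts are made up for this example, and tokens are split on whitespace, whereas sklearn's CountVectorizer by default also lowercases the text and ignores single-character tokens.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct whitespace-separated token to a fixed column index."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy data, not taken from the challenge dataset.
toy_texts = ["hello world hello", "goodbye world"]
vocabulary = build_vocabulary(toy_texts)
vectors = [count_vector(t, vocabulary) for t in toy_texts]

print(vocabulary)
print(vectors)
```

Both vectors have length 3 (the vocabulary size) even though the texts differ in length, which is the fixed-size property described above.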

TF-IDF 📐

Here is a very in-depth explanation:
https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

For a submission to be accepted, the features apparently need to be integers; otherwise the evaluation fails. The TF-IDF values are therefore scaled by 100 and rounded to integers below.
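
As a rough illustration of the two steps (TF-IDF weighting, then conversion to integers), here is a small pure-Python sketch. It assumes the default settings of sklearn's TfidfTransformer, i.e. the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The function names `tfidf_rows` and `to_int_features` and the toy counts are hypothetical and only serve this example.

```python
import math


def tfidf_rows(count_rows):
    """Apply smoothed IDF weighting and L2-normalize each row of counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many rows does each term occur at least once?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result


def to_int_features(rows, scale=100):
    """Scale the floats and round them so every feature is an integer."""
    return [[int(round(x * scale)) for x in row] for row in rows]


# Hypothetical toy counts: 2 documents, 3 vocabulary terms.
toy_counts = [[0, 2, 1], [1, 0, 1]]
toy_tfidf = tfidf_rows(toy_counts)
toy_features = to_int_features(toy_tfidf)

print(toy_features)
```

The third term occurs in both documents, so it gets the lowest IDF weight. The scaling factor of 100 mirrors the notebook's choice; a larger factor would keep more precision at the cost of larger integers.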

In [7]:
import re
tlist = test_dataset.text.tolist()
In [8]:
def stemSentence(sentence):
    # Despite its name, this helper lemmatizes each token (WordNet) rather than stemming it.
    token_words = word_tokenize(sentence)
    # Each lemma is followed by a single space, so the result ends with a trailing space.
    return "".join(lemmatizer.lemmatize(word) + " " for word in token_words)
    
tlist_updated = []
for sent in tlist:
  sent = sent.replace('\n', ' ')
  # sent = clean(sent)
  sent = re.sub(r"\$.*?\$", "", sent)  # drop inline LaTeX such as $E$
  sent = re.sub("-", " ", sent)        # split hyphenated words
  sent = re.sub(r'[^\w\s]', '', sent)  # remove remaining punctuation
  sent = re.sub(r"\(", "", sent)       # no-op: parentheses were already removed above
  sent = re.sub(r"\)", "", sent)

  sent = re.sub(' +', ' ', sent)       # collapse repeated spaces
  sent = sent.lower()
  sent = stemSentence(sent)
  tlist_updated.append(sent)
In [9]:
tlist_updated
Out[9]:
['zero divisor zds derived by cayley dickson process cdp from n dimensional hypercomplex number n a power of 2 at least 4 can represent singularity and a n approach infinite fractal and therebyscale free network any integer greater than 8 and not a power of 2 generates a meta fractal or sky when it is interpreted a the strut constant s of an ensemble of octahedral vertex figure called box kite the fundamental building block of zds remarkably simple bit manipulation rule or recipe provide tool for transforming one fractal genus into others within the context of wolfram class 4 complexity ',
 'this paper is an exposition of the so called injective morita context in which the connecting bimodule morphisms are injective and morita context in which the connecting bimodules enjoy some local projectivity in the sense of zimmermann huisgen motivated by situation in which only one trace ideal is in action or the compatibility between the bimodule morphisms is not needed we introduce the notion of morita semi context and morita data and investigate them injective morita data will be used with the help of static and adstatic module to establish equivalence between some intersecting subcategories related to subcategories of module that are localized or colocalized by trace ideal of a morita datum we end up with application of morita context to module and injective right wide morita context ',
 'zero divisor zds derived by cayley dickson process cdp from n dimensional hypercomplex number n a power of 2 at least 4 can represent singularity and a n approach infinite fractal and therebyscale free network any integer greater than 8 and not a power of 2 generates a meta fractal or sky when it is interpreted a the strut constant s of an ensemble of octahedral vertex figure called box kite the fundamental building block of zds remarkably simple bit manipulation rule or recipe provide tool for transforming one fractal genus into others within the context of wolfram class 4 complexity ',
 'we calculate the equation of state of dense hydrogen within the chemical picture fluid variational theory is generalized for a multi component system of molecule atom electron and proton chemical equilibrium is supposed for the reaction dissociation and ionization we identify the region of thermodynamic instability which is related to the plasma phase transition the reflectivity is calculated along the hugoniot curve and compared with experimental result the equation of state data is used to calculate the pressure and temperature profile for the interior of jupiter ',
 'the donald flanigan conjecture asserts that for any finite group and for any field the corresponding group algebra can be deformed to a separable algebra the minimal unsolved instance namely the quaternion group over a field of characteristic 2 wa considered a a counterexample we present here a separable deformation of the quaternion group algebra in a sense the conjecture for any finite group is open again ',
 'let be a primarily quasilocal field a finite galois extension and a central division algebra of index divisible by in addition to the main result of part i this part of the paper show that if the galois group is not nilpotent then doe not necessarily embed in a an subalgebra when is quasilocal we find the structure of the character group of it absolute galois group this enables u to prove that if is strictly quasilocal and almost perfect then the divisible part of the multiplicative group equal the intersection of the norm group of finite galois extension of ',
 'the paper deal with the study of labor market dynamic and aim to characterize it equilibrium and possible trajectory the theoretical background is the theory of the segmented labor market the main idea is that this theory is well adapted to interpret the observed trajectory due to the heterogeneity of the work situation ',
 'axisymmetric equilibrium with incompressible flow of arbitrary direction are studied in the framework of magnetohydrodynamics under a variety of physically relevant side condition to this end a set of pertinent non linear ode are transformed to quasilinear one and the respective initial value problem is solved numerically with appropriately determined initial value near the magnetic axis several equilibrium are then constructed surface by surface the non field aligned flow result in novel configuration with a single magnetic axis toroidal shell configuration in which the plasma is confined within a couple of magnetic surface and double shell like configuration in addition the flow affect the elongation and triangularity of the magnetic surface ',
 'this paper analysis the possibility of performing parallel transaction oriented simulation with a special focus on the space parallel approach and discrete event simulation synchronisation algorithm that are suitable for transaction oriented simulation and the target environment of ad hoc grid to demonstrate the finding a java based parallel transaction oriented simulator for the simulation language gpssh is implemented on the basis of the promising shock resistant time warp synchronisation algorithm and using the grid framework proactive the validation of this parallel simulator show that the shock resistant time warp algorithm can successfully reduce the number of rolled back transaction move but it also reveals circumstance in which the shock resistant time warp algorithm can be outperformed by the normal time warp algorithm the conclusion of this paper suggests possible improvement to the shock resistant time warp algorithm to avoid such problem ',
 'i show that an n2 dimensional n lie algebra over an algebraically closed field must have a subalgeba of codimension 1 ']
In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()

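# Keep at most the 512 most frequent terms as the vocabulary.
count_vect = CountVectorizer(max_features = 512)
X_train_counts = count_vect.fit_transform(tlist_updated)
# Reweight the raw counts with TF-IDF.
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
# Scale by 100 and round so that the submitted features are integers.
X_train_tf = np.round(X_train_tf.toarray()*100).astype(int)

# Store each feature vector as its string representation.
test_dataset.feature = [str(i) for i in X_train_tf.tolist()]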
test_dataset
Out[10]:
id text feature
0 0 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 1 This paper is an exposition of the so-called i... [0, 7, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 2 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 3 We calculate the equation of state of dense hy... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,...
4 4 The Donald-Flanigan conjecture asserts that fo... [0, 0, 0, 0, 0, 0, 0, 11, 0, 26, 0, 0, 0, 0, 0...
5 5 Let $E$ be a primarily quasilocal field, $M/E$... [9, 0, 0, 0, 7, 0, 0, 0, 0, 6, 0, 0, 0, 9, 0, ...
6 6 The paper deals with the study of labor market... [0, 0, 0, 14, 0, 0, 0, 0, 14, 0, 0, 0, 0, 0, 0...
7 7 Axisymmetric equilibria with incompressible fl... [0, 0, 0, 0, 7, 0, 8, 0, 0, 0, 0, 0, 8, 0, 0, ...
8 8 This paper analyses the possibilities of perfo... [0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 35, 0, 0, 0,...
9 9 I show that an (n+2)-dimensional n-Lie algebra... [0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 28, 0, 0, 0, 0...
In [11]:
# Saving the sample submission
test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)

Submit to AIcrowd 🚀

Note : Please save the notebook before submitting it (Ctrl + S)

In [ ]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd -v notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge nlp-feature-engineering
WARNING: No assets directory at assets... Creating one...
WARNING: Assets directory is empty
Mounting Google Drive 💾
Your Google Drive will be mounted to access the colab notebook
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.activity.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fexperimentsandconfigs%20https%3a%2f%2fwww.googleapis.com%2fauth%2fphotos.native&response_type=code

Enter your authorization code:

Congratulations 🎉!

Now you have an understanding of how to do simple feature engineering in NLP.
If you liked it, please leave a like.

PS: This notebook is adapted from the getting-started notebook by Shubhamaicrowd.

In [ ]:

