Solution for NLP Feature Engineering LB: 0.803
This solution uses a count vectorizer, a TF-IDF transform, and a stopword filter for feature engineering.
AIcrowd Runtime Configuration 🧷
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /data on the workspace.
import os

# Please use the absolute path for the location of the dataset.
# Or you can use a relative path with `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")
!pip install --upgrade scikit-learn gensim
!pip install -q -U aicrowd-cli
Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.24.2)
Requirement already up-to-date: gensim in /usr/local/lib/python3.7/dist-packages (4.0.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (2.1.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied, skipping upgrade: smart-open>=1.8.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (5.1.0)
Define preprocessing code 💻
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
from glob import glob
import os
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
import sklearn
Training phase ⚙️
You can define your training code here. This section will be skipped during evaluation.
For this solution approach there is no training needed! 🙂
# Downloading the Dataset
!mkdir data
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
Count Vectorizer 🔢
A count vectorizer converts each text into a vector of word counts, so a collection of texts becomes a matrix with one row per text.
It builds a vocabulary from the words present in the data (capped here at the 512 most frequent words via max_features). To convert a text into a vector, it counts how often each vocabulary word occurs in that text.
For example, if the first position of the vector corresponds to the word "hello" and "hello" occurs twice in the text, then that position holds the number 2.
The advantage of this method is that the vector always has the same size, regardless of how long the input text is.
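The idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (whitespace tokenization, lowercase words, an alphabetically sorted vocabulary); the helper names `build_vocabulary` and `count_vector` are made up for this sketch and are not part of scikit-learn, whose CountVectorizer handles tokenization and sparse storage on its own.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct lowercase word in the texts to a fixed vector position."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


texts = ["hello world hello", "see you world"]
vocabulary = build_vocabulary(texts)  # {'hello': 0, 'see': 1, 'world': 2, 'you': 3}
vectors = [count_vector(text, vocabulary) for text in texts]

print(vectors)  # [[2, 0, 1, 0], [0, 1, 1, 1]]
```

Both vectors have four entries even though the texts differ, and "hello" sits in the first position with a count of 2 for the first text.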
Here is a very in-depth explanation:
Stopwords are very common words, such as "the" or "and", that carry little meaning. Removing them leaves more room in the limited vocabulary for informative words, which should help compress the text data into smaller vectors.
For a submission to be accepted, the feature values apparently need to be integers; otherwise the evaluation will fail.
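As a rough illustration of stopword filtering, here is a minimal sketch in plain Python. The `STOPWORDS` set and the `strip_stopwords` helper are hypothetical and deliberately tiny; the notebook itself relies on gensim's `remove_stopwords`, which uses a much larger built-in list.

```python
# Hypothetical, deliberately small stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def strip_stopwords(text):
    """Drop every whitespace-separated word found in STOPWORDS (case-insensitive)."""
    return " ".join(word for word in text.split() if word.lower() not in STOPWORDS)


sentence = "The cat is on the roof of a house"
filtered = strip_stopwords(sentence)

print(filtered)  # cat on roof house
```

The filtered sentence keeps the content words, so fewer vocabulary slots are spent on words that appear in almost every text.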
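To see why the weights are scaled before rounding, consider the small sketch below. The weights are made-up values in the range 0 to 1, similar in spirit to TF-IDF output, and `to_integer_features` is a hypothetical helper; the factor 6 is the one used in the notebook's code.

```python
def to_integer_features(weights, scale=6):
    """Scale each float weight, then round it to the nearest integer."""
    return [int(round(weight * scale)) for weight in weights]


weights = [0.0, 0.1, 0.45, 0.9]  # made-up example weights

scaled = to_integer_features(weights)             # scale by 6 before rounding
unscaled = to_integer_features(weights, scale=1)  # round the raw weights

print(scaled)    # [0, 1, 3, 5]
print(unscaled)  # [0, 0, 0, 1]
```

Rounding the raw weights collapses most of them to 0, while scaling first keeps several distinct integer levels.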
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Count the 512 most frequent words after removing stopwords
count_vect = CountVectorizer(max_features=512)
X_train_counts = count_vect.fit_transform([remove_stopwords(i) for i in test_dataset.text.tolist()])

# Reweight the raw counts with TF-IDF
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# Scale before rounding to integers; a small factor works better than 100
X_train_tf = np.round(X_train_tf.toarray()*6).astype(int)

# Use item assignment so the column is written even if it does not exist yet
test_dataset["feature"] = [str(i) for i in X_train_tf.tolist()]
test_dataset
# Saving the sample submission
test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd -v notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge nlp-feature-engineering
Now you have an understanding of how to do simple feature engineering in NLP.
If you liked it please leave a like.
PS: This notebook is based on the getting-started notebook by Shubhamaicrowd.