AI Blitz #9: Completed #educational #blitz Weight: 10.0

📝 Don't forget to participate in the Community Contribution Prize!

# Introduction

Natural Language Processing (NLP) is a field of Artificial Intelligence focusing on the interaction between computers and human languages. With the rise of virtual assistants like Amazon Alexa, Siri, and Google Home, NLP has become more mainstream. Recently, GPT-3, an advanced NLP model, generated blog posts comparable to those written by humans.

We humans are intuitive at identifying emotions. For instance, if you look at these GIFs, you can easily tell which one portrays a positive emotion and which one a negative emotion.

We often make important purchasing decisions by looking at the review section on online shopping sites: we buy products with positive reviews and reject the ones with negative reviews.

This problem aims to teach a computer to distinguish between positive and negative emotions. You will be given sentences as input. Your model should be able to accurately label those sentences as positive or negative and output 0 or 1 respectively. For this challenge, all puzzles will contain a dataset in the English language.

Don’t know how to start? We got you covered! Jump to see the starter code-kit.

## 💪 Getting Started

For this problem, you will be using the powerful Python NLP library spaCy. Install this library to perform all the necessary pre-processing. What's pre-processing, you ask? Here's a breakdown of important NLP vocabulary.

1. Tokenization

Simply put, tokenization is the process of segmenting text into sentences and words: the task of cutting a text into pieces called tokens. On the surface it might seem as simple as splitting on spaces and punctuation, but it is more nuanced than that (for example, spaCy splits "don't" into the two tokens "do" and "n't"). Read more about using spaCy to perform tokenization here.
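As a minimal sketch, tokenization with spaCy needs only a blank English pipeline, so no trained model download is required for this step:

```python
import spacy

# A blank English pipeline carries the tokenizer only --
# sufficient for tokenization, no trained model required.
nlp = spacy.blank("en")

doc = nlp("Don't judge a book by its cover!")
tokens = [token.text for token in doc]
print(tokens)  # "Don't" is split into "Do" and "n't"; "!" is its own token
```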

2. Stop Words

This process gets rid of common words such as articles, pronouns, and prepositions ("and", "the", or "to" in English). These words appear frequently but carry little information about the text, so they provide little value when building an objective NLP model. Refer to this link on how to use spaCy to remove stop words.
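As a sketch, spaCy exposes its built-in English stop-word list and a per-token `is_stop` flag; a blank pipeline suffices here too:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.blank("en")  # tokenizer only; no trained model needed

doc = nlp("This Apple product seems really good and I want to buy it")
filtered = [token.text for token in doc if not token.is_stop]
print(filtered)               # common words like "and", "to", "it" are dropped
print(len(STOP_WORDS))        # size of spaCy's built-in English stop-word list
```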

3. Stemming and Lemmatization

Stemming is the process of chopping off the prefixes and suffixes of words. Due to the nature of the English language, this can sometimes distort the meaning of a word, but a reliable model will account for the issue. Overall, stemming helps improve the speed of an NLP pipeline. (Note that spaCy itself ships a lemmatizer rather than a stemmer.)

Lemmatization reduces words to their dictionary form, which requires detailed dictionaries that the algorithm can look into to link words to their corresponding lemmas. This process also takes the context of the word into account, which helps resolve ambiguity. Here's the spaCy guide on how to perform this.

4. Part-of-Speech tagging

This refers to the process of marking up a word in a text (corpus) as corresponding to a particular part of speech based on both its definition and its context. An example of this is the popular school activity of identifying whether a word is a noun, pronoun, verb, adjective, adverb, etc. Find the spaCy documentation on this feature here.

5. Named Entity Recognition

The NER process locates named entities in text and classifies them into pre-defined categories such as names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, and more. This process can help answer many real-world questions. Check out spaCy's powerful NER support and its various categories here.
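spaCy's trained pipelines perform statistical NER out of the box; as a dependency-free illustration of the same `Doc.ents` API, here is a toy rule-based version using spaCy's EntityRuler (the patterns are made up for this sketch):

```python
import spacy

# A blank pipeline plus a rule-based EntityRuler -- a toy stand-in
# for the statistical NER that ships with spaCy's trained models.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "New York"},
])

doc = nlp("Apple opened a new store in New York")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Apple', 'ORG'), ('New York', 'GPE')]
```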

These and many other steps and tools required to make your submissions are included in the starter code-kit. Check out the notebook here.

Now that you have covered some of the basics of Natural Language Processing and prepared the dataset, it’s time to classify your data into positive or negative emotions.

One of the methods to perform this is using a Decision Tree Classifier. It is defined as "a tree constructed by asking a series of questions with respect to the dataset". At each step, after receiving an answer, a follow-up question is asked until a conclusion about the class label is reached. "The series of questions and their possible answers can be organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges". In this challenge, we employ Scikit-Learn's Decision Tree Classifier in the starter code-kit.
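A minimal sketch of the idea with Scikit-Learn, using a made-up four-sentence dataset in the same text/label shape as the challenge data (plain bag-of-words features stand in for the starter kit's spaCy preprocessing):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up mini-dataset mirroring the text/label shape of the challenge.
texts = [
    "This product is really good",
    "I love this, works great",
    "This PC lags a lot",
    "Terrible quality, very bad",
]
labels = [0, 0, 1, 1]  # 0 = positive, 1 = negative

# Turn sentences into bag-of-words count vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Each internal node of the tree asks a question about a word count;
# each leaf yields a class label.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["works really great"])))
```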

Get cracking with the help of the starter code-kit.

## 💾 Dataset

The dataset is fairly easy to understand: every training/validation file has two columns, text and label. The text column holds sentences from humans expressing emotions about various things like products, services, entertainment, etc. The label column indicates whether the emotion is positive (0) or negative (1); neutral sentences are counted as positive.

| text | label |
| --- | --- |
| This Apple product seems really good, I want to buy this! | 0 |
| This PC lags a lot! | 1 |

## 📁 Files

The following files are available in the resources section:

• train.csv - (31255 samples) A CSV file with a text column containing the sentence and a label column indicating whether the emotion of the sentence is positive or negative.
• val.csv - (3475 samples) A CSV file with the same text and label columns, for validation.
• test.csv - (8683 samples) A CSV file with a text column containing the sentence and a label column pre-filled with random labels that you must replace with your predictions. This file also serves the purpose of sample_submission.csv.
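Loading and inspecting these files is straightforward with pandas; the sketch below uses an inline two-row stand-in for train.csv with the same columns:

```python
import io
import pandas as pd

# Two-row stand-in for train.csv; in the notebook you would call
# pd.read_csv("train.csv") instead.
csv_data = io.StringIO(
    'text,label\n'
    '"This Apple product seems really good, I want to buy this!",0\n'
    '"This PC lags a lot!",1\n'
)

train_df = pd.read_csv(csv_data)
print(train_df.shape)
print(train_df["label"].value_counts())
```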

## 🚀  Submission

• Create a submission directory.
• Use test.csv and fill in the corresponding labels.
• Save the filled-in test.csv in the submission directory and name it submission.csv.
• Inside the submission directory, also put the .ipynb notebook from which you trained the model and made inferences, saved as original_notebook.ipynb.
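The packaging steps above can be sketched as follows; the placeholder file contents stand in for the real outputs of your notebook:

```python
import os
import zipfile

# Build the submission directory layout described above.
os.makedirs("submission", exist_ok=True)

with open("submission/submission.csv", "w") as f:
    f.write("text,label\n")  # placeholder: really test.csv with labels filled in
with open("submission/original_notebook.ipynb", "w") as f:
    f.write("{}")  # placeholder: really your training/inference notebook

# Zip the whole directory.
with zipfile.ZipFile("submission.zip", "w") as zf:
    for name in os.listdir("submission"):
        zf.write(os.path.join("submission", name))

names = zipfile.ZipFile("submission.zip").namelist()
print(names)
```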

Overall, this is what your submission directory should look like

Zip the submission directory!

Make your first submission here 🚀 !!

## 🖊 Evaluation Criteria

During the evaluation, the F1 score (weighted average) and the accuracy score will be used to test the efficiency of the model, where

$F1 = 2 * \frac{precision*recall}{precision+recall}$
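Scikit-Learn provides both metrics directly; a small made-up example:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up ground truth and predictions for six sentences.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)               # 4 of 6 correct
f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1, weighted by support
print(acc, f1)
```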
