Round 1: Completed #educational Weight: 30.0

AIcrowd

Welcome to AI Blitz XII! 🚀 | Starter Kit For This Challenge! 🛠

Community Contribution Prizes 📓 | Find Teammates 👯‍♀️

Discord AI Community 🎧

Introduction

Feature engineering is the process of creating features for machine learning algorithms using domain knowledge of the data. Just like teaching a child to talk, feeding precise information to a model helps the model comprehend information correctly. A data-focused strategy often gives a better outcome than a model-focused one. Feature engineering allows us to build better data that the model can understand, resulting in better outcomes.

Let us take a look at how the participant can get started in this puzzle!

💪 Getting Started

Word2Vec is an approach for learning word embeddings. Word embeddings give us a deeper grasp of the text: they are a vector representation of the vocabulary used in a document.
To offer a richer context for the data, they capture semantic and syntactic similarities, as well as relationships between words. Word2Vec uses two common training approaches: skip-gram and continuous bag of words (CBOW).
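To make the difference between the two training approaches concrete, here is a minimal sketch of how training pairs can be formed from a tokenized sentence. The function names and the `window` parameter are illustrative assumptions, not part of the starter kit, and a real Word2Vec implementation adds sampling and a neural network on top of these pairs.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def skipgram_pairs(tokens, window=2):
    # Skip-gram: predict each context word from the target word.
    return [
        (target, ctx)
        for target, context in context_windows(tokens, window)
        for ctx in context
    ]


def cbow_pairs(tokens, window=2):
    # CBOW: predict the target word from its whole context.
    return [
        (context, target)
        for target, context in context_windows(tokens, window)
    ]


tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=1)[:4])
print(cbow_pairs(tokens, window=1)[:3])
```

Skip-gram produces one pair per (target, context word) combination, while CBOW produces one pair per position with all context words grouped together.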

Bag of Words implementation

We use the Bag of Words approach in our starter kit, followed by count vectorization and TF-IDF. Finally, the Word2Vec technique is trained, tested, and processed.
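As a rough illustration of count vectorization and TF-IDF, the sketch below builds both by hand with the standard library. It is not the starter kit's code: the helper names are hypothetical, tokenization is a plain whitespace split, and the IDF uses one common smoothed formulation, `log((1 + n) / (1 + df)) + 1`, without any normalization.

```python
import math
from collections import Counter


def build_vocab(docs):
    # Sorted list of every distinct lowercase token in the corpus.
    return sorted({word for doc in docs for word in doc.lower().split()})


def count_vector(doc, vocab):
    # Bag of Words: how often each vocabulary word occurs in the document.
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


def tfidf_vectors(docs, vocab):
    n_docs = len(docs)
    counts = [count_vector(doc, vocab) for doc in docs]
    # Document frequency: number of documents containing each word.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in counts]


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocab(docs)
print(vocab)
print(count_vector(docs[0], vocab))
print(tfidf_vectors(docs, vocab)[0])
```

Words that appear in every document (here "the" and "sat") get the lowest weight, while words unique to one document are weighted up.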

Check out the starter kit here! 🎉

💾 Dataset

The following file is available in the resources section:

data.csv - (10 samples) This CSV file contains a text column holding the sentence and a feature column holding the vector for the corresponding text. It is meant only for testing your code/notebook.
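A file of this shape could be read roughly as follows. This is only a sketch under assumptions: the column names `text` and `feature` are taken from the description above, and the idea that each vector is stored as a stringified Python list is a guess about the file format, so check the real data.csv before relying on it. The sample rows are made up for illustration.

```python
import ast
import csv
import io

# Made-up stand-in for data.csv; the on-disk vector format is an assumption.
SAMPLE_CSV = '''text,feature
"a first sentence","[0.1, 0.2, 0.3]"
"a second sentence","[0.4, 0.5, 0.6]"
'''


def load_rows(handle):
    """Return a list of (text, vector) tuples from an open CSV handle."""
    rows = []
    for record in csv.DictReader(handle):
        # literal_eval safely turns the string "[0.1, 0.2]" into a list.
        vector = ast.literal_eval(record["feature"])
        rows.append((record["text"], vector))
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0])
```

With a file on disk, the same function can be called on a handle opened with `open("data.csv", newline="")`.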

🚀 Submission

This challenge accepts the notebook as a submission.
During the evaluation, the Define preprocessing code 💻 and Prediction phase 🔎 parts of the notebook will be run, so please make sure they run without any errors before submitting.
The notebook follows a particular format; please stick to it.
Do not delete the header of the cells in the notebook.

Let us know in the Discussion section if you have any doubts or issues :)

Make your first submission here 🚀 !!

🖊 Evaluation

We are using a very different evaluation pipeline from the one we usually use in other Blitz challenges. In this evaluator, after you submit your notebook, the notebook is run with the actual data.csv.

The resulting submission file is then split into 3 parts: 50% for training, 25% for the public score, and the other 25% for the private score. The first 50% split is used to train a machine learning model on your features and the corresponding labels (categories) of each text/abstract.

The second split (25%) is used for public evaluation, and the third split (25%) is used for private evaluation.
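The 50/25/25 split described above can be sketched as below. The function name and the optional `seed` argument are hypothetical: the page does not say whether rows are shuffled before splitting, so by default this sketch keeps the original order and only shuffles (deterministically) when a seed is passed.

```python
import random


def split_rows(rows, seed=None):
    """Split rows into (train, public, private) parts of 50%, 25%, 25%."""
    rows = list(rows)
    if seed is not None:
        # A fixed seed makes the shuffle reproducible across runs.
        random.Random(seed).shuffle(rows)
    n = len(rows)
    train_end = n // 2
    public_end = train_end + n // 4
    # Any remainder from integer division ends up in the private part.
    return rows[:train_end], rows[train_end:public_end], rows[public_end:]


train, public, private = split_rows(range(8))
print(train, public, private)
```

The training part would then be used to fit the classifier on the submitted features, and the public and private parts to score it.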

The F1 score and the accuracy score are used to measure the performance of the model, where

$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

$Accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$
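The two metrics named above can be computed by hand as in this minimal sketch. It assumes a binary problem with a single positive class; the page does not state how F1 is averaged across the actual categories, so treat the `positive` parameter and the function names as illustrative only.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(accuracy(y_true, y_pred))
print(f1_binary(y_true, y_pred))
```

In this example there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and F1 is 0.75, while 4 of the 6 predictions are correct.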

We use a fixed seed to make sure that no randomness affects the training or splitting process!

Here's the sample evaluation code. Functions such as CLASSIFIED_SKLEARN_MODEL are intentionally left undefined.

📱 Contact

Aditya Jha
Shubhamai