Round 1: Completed #educational Weight: 10.0

Welcome to AI Blitz XII! 🚀 | Starter Kit For This Challenge! 🛠

Community Contribution Prizes 📓 | Find Teammates 👯‍♀️

Discord AI Community 🎧

Introduction

Open-source project repositories need thorough documentation of their own. However, with the wide range of programming languages that organizations use today, writing that documentation means spending a tedious amount of time just figuring out which languages are present in the codebase.

Can NLP help solve this problem by classifying programming languages? The first AI Blitz puzzle looks for an answer.

You are presented with a training corpus of 45,628 code snippets written in 15 different programming languages. Your model should distinguish between these languages as accurately as possible.

Check out the Starter Code for context on the problem statement and clear steps for solving it.

💪 Getting Started

This puzzle is a classification problem and has similarities with the Emotion Detection problem from AI Blitz 9. The Emotion Detection problem aims to teach a computer to distinguish between positive and negative emotions. Can you use the resources and tools of that problem to come up with a unique solution for this puzzle?

Here’s how you can classify the corpus into various programming languages. Our Starter Kit comes with an implementation of a Multinomial Naive Bayes classifier paired with a Count Vectorizer and a TF-IDF Transformer. Multinomial Naive Bayes is a popular probabilistic learning method used mostly in NLP. Derived from the classic Naive Bayes algorithm, it calculates the probability of each tag (here, a language) for a given sample and outputs the tag with the highest probability.
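To make the "highest probability wins" idea concrete, here is a minimal sketch of multinomial Naive Bayes in plain Python. It is not the Starter Kit's code: it swaps the Count Vectorizer and TF-IDF Transformer for a simple regex tokenizer with raw token counts, and the class name `TinyMultinomialNB`, the tokenizer, and the four toy snippets are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifier-like words, or single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(code):
    """Split a code snippet into word and punctuation tokens (digits are dropped)."""
    return TOKEN_RE.findall(code)


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, snippets, labels):
        self.class_counts = Counter(labels)        # number of snippets per language
        self.token_counts = defaultdict(Counter)   # token frequencies per language
        self.vocab = set()
        for code, label in zip(snippets, labels):
            tokens = tokenize(code)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, code):
        """Return log P(language) + sum of log P(token | language) for each language."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(code) if t in self.vocab]  # skip unseen tokens
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, code):
        """Pick the language with the highest (log) probability score."""
        scores = self.log_scores(code)
        return max(scores, key=scores.get)


# Toy training data (hypothetical, not taken from train.csv).
train_code = [
    "def add(a, b): return a + b",
    "import os\nprint(os.getcwd())",
    "function add(a, b) { return a + b; }",
    "const x = 1; console.log(x);",
]
train_lang = ["python", "python", "javascript", "javascript"]

model = TinyMultinomialNB().fit(train_code, train_lang)
pred_py = model.predict("def square(n): return n * n")
pred_js = model.predict("function f() { console.log(1); }")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a token that never appeared for one language from forcing that language's probability to zero.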

💾 Dataset

The dataset contains code snippets written by developers across the world in different programming languages, together with the language each snippet is written in. There are snippets from a total of 15 programming languages. The columns present in the dataset are:

id: unique identifier of the sample
code: the code snippet
language (target): the programming language the code snippet is written in

📁 Files

The following files are available in the resources section:

train.csv: (45628 samples) This CSV file contains all three columns: id, code, and language.
test.csv: (4277 samples) This CSV file contains two columns: sample_id and the code. You need to predict the language that the code corresponds to.
sample_submission.csv: This CSV file contains random labels for the data in test.csv in the desired submission format.

🚀 Submission

Creating a submission directory
Use sample_submission.csv to create your submission. The headers of the columns should be "id" and "prediction".
Save the CSV in the submission directory. The file must be named submission.csv.
Inside the submission directory, put the .ipynb notebook you used to train the model and run inference, saved as original_notebook.ipynb.

Overall, your submission directory should contain these two files:

submission.csv
original_notebook.ipynb

Zip the submission directory!
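The packaging steps above can be sketched with the Python standard library alone. The file names `submission.csv` and `original_notebook.ipynb` and the headers "id" and "prediction" come from the instructions; the helper name `build_submission`, the directory name `submission`, the placeholder notebook, and the choice to put both files at the root of the archive are assumptions, so check them against the Starter Kit before uploading.

```python
import csv
import shutil
import tempfile
import zipfile
from pathlib import Path


def build_submission(predictions, notebook_path, out_dir="submission"):
    """Write submission.csv, copy the notebook, and zip the directory.

    predictions: iterable of (id, predicted_language) pairs.
    Returns the path of the created zip archive.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The CSV headers must be "id" and "prediction".
    with open(out / "submission.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prediction"])
        writer.writerows(predictions)

    # The training/inference notebook is stored under the required name.
    shutil.copyfile(notebook_path, out / "original_notebook.ipynb")

    # Assumption: files sit at the root of the archive, not in a subfolder.
    zip_path = out.parent / (out.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(out.iterdir()):
            zf.write(path, arcname=path.name)
    return zip_path


# Demo in a temporary directory with a placeholder notebook and made-up predictions.
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "my_notebook.ipynb"
    notebook.write_text("{}", encoding="utf-8")
    archive = build_submission(
        [(0, "python"), (1, "javascript")], notebook, Path(tmp) / "submission"
    )
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
        header = zf.read("submission.csv").decode("utf-8").splitlines()[0]
```

In a real run, `predictions` would pair each test sample's identifier with your model's predicted language, and `notebook_path` would point at your own notebook.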

Make your first submission here 🚀 !!

🖊 Evaluation Criteria

During evaluation, the weighted-average F1 score and the accuracy score are used to measure the performance of the model, where

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
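As a rough illustration of both metrics, the sketch below computes accuracy and a weighted-average F1 score in plain Python. It assumes the usual meaning of "weighted": each language's F1 is weighted by the number of true samples of that language. The function names and the five toy labels are hypothetical, and the official evaluator may differ in details.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true-sample count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy labels (made up): one python sample is misclassified as java.
y_true = ["python", "python", "java", "java", "go"]
y_pred = ["python", "java", "java", "java", "go"]

acc = accuracy(y_true, y_pred)
f1w = weighted_f1(y_true, y_pred)
```

Here python gets an F1 of 2/3 (precision 1, recall 1/2), java gets 0.8 (precision 2/3, recall 1), and go gets 1, so the weighted average is (2 · 2/3 + 2 · 0.8 + 1 · 1) / 5 ≈ 0.787, while accuracy is 4/5 = 0.8.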

📱 Contact

Aditya Jha
Shubhamai