
Programming Language Classification

Solution for submission 172010

A detailed solution for submission 172010 to the Programming Language Classification challenge

mkeywood

Getting Started with fastai NLP

In this puzzle, we have to classify the programming language from code: given a code snippet, we need to identify which programming language it is written in.

In this notebook:

For tokenization: We will use TextDataLoaders.

For classification: We will use text_classifier_learner.
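
A minimal sketch of that pipeline (assuming a hypothetical DataFrame df with 'code' and 'language' columns; the actual cells below also fine-tune a language model on the code corpus first):

# Minimal sketch only - df is a hypothetical DataFrame with 'code' and 'language' columns
from fastai.text.all import *

dls = TextDataLoaders.from_df(df, text_col='code', label_col='language', valid_pct=0.2)
learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fine_tune(4, 1e-2)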

AIcrowd code utilities for downloading data for Language Classification

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which will help you download the data and later make a submission directly from the notebook.

We'll install fastai too.

In [1]:
!pip install aicrowd-cli
# run this, then restart the runtime
! [ -e /content ] && pip install -Uqq fastai
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.10-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 2.1 MB/s 
Collecting requests<3,>=2.25.1
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.7 MB/s 
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 3.5 MB/s 
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
     |████████████████████████████████| 214 kB 47.4 MB/s 
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     |████████████████████████████████| 170 kB 63.8 MB/s 
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 48.6 MB/s 
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.1 MB/s 
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 5.4 MB/s 
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Installing collected packages: smmap, requests, gitdb, commonmark, colorama, rich, requests-toolbelt, pyzmq, GitPython, aicrowd-cli
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 22.3.0
    Uninstalling pyzmq-22.3.0:
      Successfully uninstalled pyzmq-22.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.10 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.9 pyzmq-22.1.0 requests-2.27.1 requests-toolbelt-0.9.1 rich-10.16.2 smmap-5.0.0
     |████████████████████████████████| 189 kB 8.4 MB/s 
     |████████████████████████████████| 56 kB 5.9 MB/s 

Login to AIcrowd ㊗

In [2]:
%load_ext aicrowd.magic
In [3]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/vZ0VPb2aPSeGoSGQ3DqB7LLw0s-fwMzNAQ1gMERaiXg
API Key valid
Saved API Key successfully!

Download Dataset

We will create a folder named data and download the files there.

In [4]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data

Importing Libraries:

In [5]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn

from fastai.text.all import *

# TODO: remove unused imports?
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score

from sklearn import set_config
set_config(display="diagram")

plt.rcParams["figure.figsize"] = (15,6)

Diving into the dataset 🕵️‍♂️

In [6]:
train_df = pd.read_csv("data/train.csv")
In [7]:
test_df = pd.read_csv("data/test.csv")
In [8]:
len(train_df), len(test_df)
Out[8]:
(45628, 9277)

TODO

look for duplicates (especially in over-rep langs)

oversample under-rep with ...

  • remove leading/trailing whitespace

assume "part"s are separated by line break

  • remove 1st n parts
  • remove last n parts
  • remove n parts at random
  • shuffle parts?
  • remove duplicated parts (this and the random removal are sketched after this list)

if we do this ↑ - we should augment and use TTA

double/triple

  • ruby 1117
  • dart 1023
  • julia 1005

4x?

  • php 260
  • swift 260
  • f-sharp 246
  • R 160
  • scala 147
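
The random-removal and duplicate-removal ideas above are not implemented in the helper below; a rough sketch of what they could look like (hypothetical helpers, not used later):

import random

def drop_random_parts(s, n=1):
    # Sketch only: drop n randomly chosen line-break-separated parts
    parts = s.split('\n')
    if len(parts) <= n:
        return s
    for idx in sorted(random.sample(range(len(parts)), n), reverse=True):
        del parts[idx]
    return '\n'.join(parts)

def drop_duplicate_parts(s):
    # Sketch only: drop repeated parts while keeping the original order
    seen, out = set(), []
    for part in s.split('\n'):
        if part not in seen:
            seen.add(part)
            out.append(part)
    return '\n'.join(out)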
In [9]:
def augment_train_df(df, language, n_times):
    lang_df = df[df['language'] == language].copy()
    print(language, 'has', len(lang_df), 'samples')
    if n_times == 0:
        return df

    _df = lang_df.copy() # strip whitespace
    _df['code'] = _df['code'].str.strip()
    df = pd.concat([df, _df])
    if n_times == 1:
        return df
    
    _df = lang_df.copy() # remove 1st part
    def _do(s):
        if '\n' in s:
            return s[s.index('\n')+1:]
        return s
    _df['code'] = _df['code'].apply(_do)
    df = pd.concat([df, _df])
    
    _df = lang_df.copy() # remove last part
    def _do(s):
        if '\n' in s:
            return s[:s.rindex('\n')]
        return s
    _df['code'] = _df['code'].apply(_do)
    df = pd.concat([df, _df])
    
    _df = lang_df.copy() # shuffle parts
    def _do(s):
        if '\n' in s:
            ss = s.split('\n')
            random.shuffle(ss)
            return '\n'.join(ss)
        return s
    _df['code'] = _df['code'].apply(_do)
    df = pd.concat([df, _df])
    
    return df
In [10]:
for language in ['ruby', 'dart', 'julia']:
    train_df = augment_train_df(train_df, language, 1)
for language in ['php', 'swift', 'f-sharp', 'R', 'scala']:
    train_df = augment_train_df(train_df, language, 4)
len(train_df), len(test_df)
ruby has 1117 samples
dart has 1023 samples
julia has 1005 samples
php has 260 samples
swift has 260 samples
f-sharp has 246 samples
R has 160 samples
scala has 147 samples
Out[10]:
(53065, 9277)

↓ reduce the amount of data we're training on to make it quicker to get a trained classifier

don't do this for final submission

In [11]:
# train_df["RANK"] = train_df.groupby("language")["id"].rank(method="first", ascending=True)
# train_df = train_df[train_df['RANK']<1000]
# test_df = test_df[::10] # use every 10th row of the test data
In [12]:
# len(train_df), len(test_df)

Data processing

replace all numeric literals with special tokens

hope this will make it easier for the model to learn the concept of numbers without having to deal with all of the different actual values

try to keep important whitespace

whitespace is usually compressed into single spaces when working with natural languages - I think whitespace might have semantic meaning (and hopefully predictive power) in code

replace 2 spaces with a special token

do we want to add some kind of repetition marker - like xxrep?

replace tabs and linebreaks with special tokens

LAST STEP: replace any consecutive whitespace with a single space

TODO: check that we now get fastai xxwrep for yy2space etc
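
A quick way to check that (a sketch; replace_wrep is one of fastai's default text-processing rules and compresses runs of a repeated word into xxwrep):

from fastai.text.all import replace_wrep

sample = "if (x) { yy2space yy2space yy2space yy2space return x; }"
print(replace_wrep(sample))
# expected to contain something like 'xxwrep 4 yy2space'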

In [13]:
def processs_df(df):
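    # Note: str.replace treats these patterns as regexes (the pandas default here),
    # and order matters - the final '\s+' -> ' ' collapse must run after the
    # yytab/yylinebreak substitutions so only leftover runs of spaces get squashed.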
    for pat, repl in [
                      [r'(?<!\w)\d+\.\d+(?!\w)', 'yyfloat'], 
                      [r'(?<!\w)\d+(?!\w)', 'yyint'],
                    #   ['    ', ' yy4space '],
                    #   ['   ', ' yy3space '],
                      ['  ', ' yy2space '],
                      ['\t', ' yytab '],
                      ['\n', ' yylinebreak '],
                      ['\s+', ' ']]:
        df['code'] = df['code'].str.replace(pat, repl)
    return df
In [14]:
for df in [train_df, test_df]:
    processs_df(df)
In [15]:
train_df[train_df['code'].str.contains('yyfloat')]
Out[15]:
id code language
55 19346 void main() { yylinebreak yylinebreak yy2space double a = yyfloat, b = -yyfloat, c = yyfloat; yylinebreak yylinebreak yy2space List p = shreedharacharya(a, b, c); yylinebreak yylinebreak yy2space print(p); yylinebreak dart
57 29910 version = "yyfloat.yyint" yylinebreak yylinebreak [[NaNMath]] yylinebreak yylinebreak git-tree-sha1 = "bfe47e760d60b82b66b61d2d44128b62e3a369fb" yylinebreak yylinebreak uuid = "77ba4419-2d1f-58cd-9bb1-8ffee604a2e3" yylinebreak julia
84 59289 git-tree-sha1 = "bfdf9532c33db35d2ce9df4828330f0e92344a52" yylinebreak yylinebreak uuid = "476501e8-09a2-5ece-yyint-fb82de89a1fa" yylinebreak yylinebreak version = "yyfloat.yyint" yylinebreak yylinebreak [[SafeTestsets]] yylinebreak julia
109 22771 uuid = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7" yylinebreak yylinebreak version = "yyfloat.yyint" yylinebreak yylinebreak [[Distributed]] yylinebreak julia
114 25043 uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" yylinebreak yylinebreak [[BoundaryValueDiffEq]] yylinebreak yylinebreak deps = ["BandedMatrices", "DiffEqBase", "FiniteDiff", "ForwardDiff", "LinearAlgebra", "NLsolve", "Reexport", "SparseArrays"] yylinebreak yylinebreak git-tree-sha1 = "fe34902ac0c3a35d016617ab7032742865756d7d" yylinebreak yylinebreak uuid = "764a87c0-6b3e-53db-yyint-fe964310641d" yylinebreak yylinebreak version = "yyfloat.yyint" yylinebreak julia
... ... ... ...
30222 68414 yylinebreak yylinebreak sum(y.test == (y.pred > yyfloat)) / length(y.test) yylinebreak yylinebreak lgbm.booster.save(handle.booster, filename = "/tmp/model.txt") yylinebreak #Save model (can be loaded again via lgbm.booster.load(filename)) R
33417 41713 ## yyint yy2space yy2space yyint yy2space NA yy2space yyint yy2space yyfloat yy2space yyfloat yylinebreak yylinebreak ## yyint yy2space yy2space yyint yy2space yy2space yyint yy2space yyint -yyfloat yy2space yyfloat yylinebreak yylinebreak ## yyint yy2space yy2space yyint yy2space yy2space yyint yy2space yyint yy2space yyfloat yy2space yyfloat yylinebreak yylinebreak ## yy2space var1 var2 var3 yy2space yy2space var4 rnorm(yyint) yylinebreak yylinebreak yylinebreak ## yyint yy2space yy2space yyint yy2space NA yy2space yyint yy2space yyfloat yy2space yyfloat yylinebreak ## yyint yy2space yy... R
38685 14539 yylinebreak pred <- lgbm.booster.predict(handle.booster, x.test) yylinebreak sum(y.test == (y.pred > yyfloat)) / length(y.test) yylinebreak yylinebreak yylinebreak yylinebreak #Save model (can be loaded again via lgbm.booster.load(filename)) yylinebreak yylinebreak #Predict yylinebreak #Test accuracy R
42737 17430 library(xgboost) yylinebreak library(tidyverse) yylinebreak yylinebreak yylinebreak yylinebreak ind<-sample(yyint,nrow(diamonds),replace = T,prob = c(yyfloat,yyfloat)) R
45400 32375 yylinebreak yy2space while (gap > yyint && swaps == yyint) { yylinebreak yy2space swaps <- yyint yylinebreak yylinebreak yylinebreak yy2space yy2space gap = floor(gap / yyfloat) R

2362 rows × 3 columns

Quick fastai LSTM classifier

https://github.com/fastai/fastai/blob/master/nbs/38_tutorial.text.ipynb

TextDataLoaders.from_df(
    df, path='.', valid_pct=0.2, seed=None, text_col=0, label_col=1, 
    label_delim=None, y_block=None, text_vocab=None, is_lm=False, 
    valid_col=None, tok_tfm=None, tok_text_col='text', seq_len=72, 
    backwards=False, bs=64, val_bs=None, shuffle=True, device=None)

language model

Start by training a language model - the pre-trained model (trained on Wikipedia text) doesn't know much about code ...

Notes:

  • to give us as much code to learn from as possible
    • we combine unlabelled test data with training data
    • we use a low valid percent
      • this is fine for LM training: the objective is self-supervised next-token prediction, so the unlabelled test code adds no label leakage, and the small validation split only needs to monitor loss/perplexity

TODO:

  • add some logic to preserve white space semantics
In [16]:
lm_df = pd.concat([train_df[['code']], test_df[['code']]])
In [17]:
len(train_df), len(test_df), len(lm_df)
Out[17]:
(53065, 9277, 62342)
In [18]:
dls_lm = TextDataLoaders.from_df(lm_df, text_col='code', is_lm=True, valid_pct=0.1)
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
In [19]:
dls_lm.show_batch()
text text_
0 xxbos if node.next xxrep 3 = tail { yylinebreak yylinebreak xxwrep 5 yy2space tail = node yylinebreak yylinebreak xxwrep 4 yy2space } yylinebreak yylinebreak xxwrep 4 yy2space node.next = node.next?.next yylinebreak yylinebreak xxwrep 3 yy2space } xxbos yy2space for ( const auto & q : queries ) { yylinebreak yylinebreak yy2space yy2space int type = q[yyint ] ; yylinebreak yylinebreak yy2space yy2space if ( type = = yyint ) { yylinebreak xxbos if node.next xxrep 3 = tail { yylinebreak yylinebreak xxwrep 5 yy2space tail = node yylinebreak yylinebreak xxwrep 4 yy2space } yylinebreak yylinebreak xxwrep 4 yy2space node.next = node.next?.next yylinebreak yylinebreak xxwrep 3 yy2space } xxbos yy2space for ( const auto & q : queries ) { yylinebreak yylinebreak yy2space yy2space int type = q[yyint ] ; yylinebreak yylinebreak yy2space yy2space if ( type = = yyint ) { yylinebreak xxbos xxwrep
1 , from_rod , aux_rod ) ; yylinebreak yylinebreak yy2space yy2space stringbuilder.setlength(yyint ) ; yylinebreak yylinebreak yy2space yy2space stringbuilder.append(integer.tostring(n ) ) ; yylinebreak xxbos yy2space yy2space assert ( ! isarmstrong(yyint ) ) ; yylinebreak yylinebreak yy2space } yylinebreak yylinebreak yy2space / * * yylinebreak yylinebreak yy2space yy2space * xxmaj checks whether a given number is an armstrong number or not . yylinebreak yylinebreak yy2space yy2space * yylinebreak xxbos yy2space yy2space yyint yylinebreak yylinebreak from_rod , aux_rod ) ; yylinebreak yylinebreak yy2space yy2space stringbuilder.setlength(yyint ) ; yylinebreak yylinebreak yy2space yy2space stringbuilder.append(integer.tostring(n ) ) ; yylinebreak xxbos yy2space yy2space assert ( ! isarmstrong(yyint ) ) ; yylinebreak yylinebreak yy2space } yylinebreak yylinebreak yy2space / * * yylinebreak yylinebreak yy2space yy2space * xxmaj checks whether a given number is an armstrong number or not . yylinebreak yylinebreak yy2space yy2space * yylinebreak xxbos yy2space yy2space yyint yylinebreak yylinebreak yy2space
2 yy2space testmod ( ) yylinebreak xxbos yy2space return xxunk / yyint ) ) + xxmaj string(num % yyint ) yylinebreak yylinebreak } yylinebreak xxbos # [ yyint , ] yy2space yyfloat yy2space yyfloat yy2space yyfloat yy2space yyfloat yylinebreak yylinebreak # standardize yylinebreak yylinebreak apply(as.matrix(iris),yyint , standardization ) yylinebreak yylinebreak # sepal.length sepal.width petal.length yy2space petal.width xxbos yy2space yy2space xxmaj check if the triangle given by the points xxmaj xxunk , y1 ) testmod ( ) yylinebreak xxbos yy2space return xxunk / yyint ) ) + xxmaj string(num % yyint ) yylinebreak yylinebreak } yylinebreak xxbos # [ yyint , ] yy2space yyfloat yy2space yyfloat yy2space yyfloat yy2space yyfloat yylinebreak yylinebreak # standardize yylinebreak yylinebreak apply(as.matrix(iris),yyint , standardization ) yylinebreak yylinebreak # sepal.length sepal.width petal.length yy2space petal.width xxbos yy2space yy2space xxmaj check if the triangle given by the points xxmaj xxunk , y1 ) ,
3 ] < < endl ; yylinebreak xxbos * @brief xxmaj main function yylinebreak yylinebreak yy2space * xxmaj xxunk a dummy graph of a small size with yylinebreak yylinebreak yy2space * a few edges between random nodes . yylinebreak yylinebreak yy2space * xxmaj on applying the algorithm , it checks if the instantiated yylinebreak yylinebreak yy2space * graph is bipartite or not . yylinebreak yylinebreak yy2space * @returns yyint on exit yylinebreak xxbos < < endl ; yylinebreak xxbos * @brief xxmaj main function yylinebreak yylinebreak yy2space * xxmaj xxunk a dummy graph of a small size with yylinebreak yylinebreak yy2space * a few edges between random nodes . yylinebreak yylinebreak yy2space * xxmaj on applying the algorithm , it checks if the instantiated yylinebreak yylinebreak yy2space * graph is bipartite or not . yylinebreak yylinebreak yy2space * @returns yyint on exit yylinebreak xxbos yy2space
4 if xxunk ) = = yyint : yylinebreak yylinebreak xxwrep 8 yy2space if self.is_left ( ) and xxunk ( ) : yylinebreak yylinebreak xxwrep 10 yy2space self.parent.rotate_right ( ) yylinebreak xxbos } ; yylinebreak yylinebreak class xxmaj b : public xxup a { yylinebreak yylinebreak public : yylinebreak yylinebreak yy2space yy2space void f ( ) { xxunk : xxunk ; } yylinebreak yylinebreak } ; yylinebreak yylinebreak int main ( ) { xxunk ) = = yyint : yylinebreak yylinebreak xxwrep 8 yy2space if self.is_left ( ) and xxunk ( ) : yylinebreak yylinebreak xxwrep 10 yy2space self.parent.rotate_right ( ) yylinebreak xxbos } ; yylinebreak yylinebreak class xxmaj b : public xxup a { yylinebreak yylinebreak public : yylinebreak yylinebreak yy2space yy2space void f ( ) { xxunk : xxunk ; } yylinebreak yylinebreak } ; yylinebreak yylinebreak int main ( ) { yylinebreak
5 yylinebreak def lcm(a , b ) yylinebreak xxbos xxwrep 4 yy2space } yylinebreak yylinebreak xxwrep 4 yy2space xxrep 3 / < summary > yylinebreak yylinebreak xxwrep 4 yy2space xxrep 3 / yy2space yy2space xxmaj single term for e^x function approximation : xxunk / i ! . yylinebreak yylinebreak xxwrep 4 yy2space xxrep 3 / < / summary > yylinebreak xxbos * xxmaj second way to reverses the string yylinebreak yylinebreak yy2space * def lcm(a , b ) yylinebreak xxbos xxwrep 4 yy2space } yylinebreak yylinebreak xxwrep 4 yy2space xxrep 3 / < summary > yylinebreak yylinebreak xxwrep 4 yy2space xxrep 3 / yy2space yy2space xxmaj single term for e^x function approximation : xxunk / i ! . yylinebreak yylinebreak xxwrep 4 yy2space xxrep 3 / < / summary > yylinebreak xxbos * xxmaj second way to reverses the string yylinebreak yylinebreak yy2space * /
6 > ) ) yylinebreak yylinebreak / / xxunk : < ) yylinebreak xxbos yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk xxup xxunk yylinebreak yylinebreak yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk xxup xxunk yylinebreak yylinebreak yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk xxup xxunk yylinebreak yylinebreak yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup ) ) yylinebreak yylinebreak / / xxunk : < ) yylinebreak xxbos yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk xxup xxunk yylinebreak yylinebreak yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk xxup xxunk yylinebreak yylinebreak yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk xxup xxunk yylinebreak yylinebreak yy2space yy2space xxmaj decryption using xxmaj key # yyint : xxup xxunk
7 yylinebreak yylinebreak yytab yytab if len(s ) = = maxlen { yylinebreak xxbos yy2space it('expects to reverse a string with spaces in between ' , ( ) = > { yylinebreak yylinebreak yy2space yy2space xxunk xxunk xxunk ' ) yylinebreak yylinebreak yy2space } ) yylinebreak xxbos } yylinebreak yylinebreak void destroyqueue ( ) { q.front = q.rear = xxup null ; } yylinebreak yylinebreak int main ( ) yylinebreak yylinebreak { yylinebreak yylinebreak yytab yytab if len(s ) = = maxlen { yylinebreak xxbos yy2space it('expects to reverse a string with spaces in between ' , ( ) = > { yylinebreak yylinebreak yy2space yy2space xxunk xxunk xxunk ' ) yylinebreak yylinebreak yy2space } ) yylinebreak xxbos } yylinebreak yylinebreak void destroyqueue ( ) { q.front = q.rear = xxup null ; } yylinebreak yylinebreak int main ( ) yylinebreak yylinebreak { yylinebreak yylinebreak
8 does n't matter what you leave beyond the new length . yylinebreak yylinebreak # yylinebreak yylinebreak # xxmaj example yylinebreak yylinebreak # yylinebreak yylinebreak # xxmaj input : nums = [ yyint , yyint , yyint , yyint ] , val = yyint xxbos xxwrep 6 yy2space var r = right.read ( ) ; yylinebreak yylinebreak xxwrep 6 yy2space while ( true ) yylinebreak yylinebreak xxwrep 6 yy2space { yylinebreak yylinebreak xxwrep n't matter what you leave beyond the new length . yylinebreak yylinebreak # yylinebreak yylinebreak # xxmaj example yylinebreak yylinebreak # yylinebreak yylinebreak # xxmaj input : nums = [ yyint , yyint , yyint , yyint ] , val = yyint xxbos xxwrep 6 yy2space var r = right.read ( ) ; yylinebreak yylinebreak xxwrep 6 yy2space while ( true ) yylinebreak yylinebreak xxwrep 6 yy2space { yylinebreak yylinebreak xxwrep 8
In [20]:
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()
100.00% [105070592/105067061 00:02<00:00]

Note: we follow the training "protocol" from the fastai text tutorial; lr_find is run just FYI.

In [21]:
learn_lm.lr_find()
Out[21]:
SuggestedLRs(valley=0.0063095735386013985)
In [22]:
learn_lm.fit_one_cycle(1, 1e-2)
epoch train_loss valid_loss accuracy perplexity time
0 2.984010 2.698834 0.480730 14.862394 02:17
In [23]:
learn_lm.save('lm_1epoch')
Out[23]:
Path('models/lm_1epoch.pth')
In [24]:
learn_lm = learn_lm.load('lm_1epoch')
learn_lm.unfreeze()
learn_lm.lr_find()
Out[24]:
SuggestedLRs(valley=0.0003981071640737355)

Note: we use SaveModelCallback in case we train for too many epochs (not great with one-cycle but better than getting stuck with an over-cooked model) - this might save us having to re-train from "lm_1epoch"

In [25]:
learn_lm = learn_lm.load('lm_1epoch')
learn_lm.unfreeze()
learn_lm.fit_one_cycle(5, 1e-3, cbs=SaveModelCallback(fname='lm_best_model'))
epoch train_loss valid_loss accuracy perplexity time
0 2.276998 2.079367 0.595968 7.999401 02:21
1 1.940612 1.783637 0.640432 5.951463 02:21
2 1.752768 1.646227 0.662328 5.187373 02:21
3 1.635882 1.578700 0.672922 4.848650 02:21
4 1.591390 1.564359 0.676720 4.779610 02:21
Better model found at epoch 0 with valid_loss value: 2.079366683959961.
Better model found at epoch 1 with valid_loss value: 1.7836370468139648.
Better model found at epoch 2 with valid_loss value: 1.6462273597717285.
Better model found at epoch 3 with valid_loss value: 1.5787004232406616.
Better model found at epoch 4 with valid_loss value: 1.5643589496612549.
In [26]:
learn_lm.recorder.plot_loss()
In [27]:
learn_lm.save('lm_finetuned')
learn_lm.save_encoder('lm_encoder_finetuned')
torch.save(dls_lm, 'models/lm_dls.pkl')

For comparison: using just the training data and the default valid_pct, learn_lm.fine_tune(4, 1e-2) gave us

epoch train_loss valid_loss accuracy perplexity time
0 4.127742 3.569793 0.337202 35.509224 04:24
epoch train_loss valid_loss accuracy perplexity time
0 3.252794 2.976321 0.425069 19.615513 05:02
1 2.826681 2.602097 0.480698 13.492004 05:08
2 2.573639 2.458117 0.500367 11.682790 05:10
3 2.453192 2.424428 0.506975 11.295764 05:04

classifier

calculate class weights with

n_samples / (n_classes * np.bincount(y))

see: https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
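
For reference, the same weights can be obtained with scikit-learn's compute_class_weight (a sketch; the next cell hand-rolls the formula instead):

# Sketch: 'balanced' class weights via scikit-learn
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes_ = np.array(sorted(train_df['language'].unique()))
weights_ = compute_class_weight(class_weight='balanced', classes=classes_, y=train_df['language'].to_numpy())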

In [28]:
n_samples = len(train_df)
classes = sorted(train_df['language'].unique())
n_classes = len(classes)
class_weight_map = {}
for language, bincount in train_df['language'].value_counts().iteritems():
    class_weight_map[language] = n_samples / (n_classes * bincount)
class_weights = tensor([class_weight_map[c] for c in classes]).cuda()
print(classes)
print(class_weights)
['R', 'c', 'c-plus-plus', 'c-sharp', 'dart', 'f-sharp', 'go', 'java', 'javascript', 'julia', 'php', 'python', 'ruby', 'scala', 'swift']
tensor([4.4221, 0.7750, 0.3137, 0.9094, 1.7291, 2.8762, 1.7858, 0.8193, 1.3098,
        1.7600, 2.7213, 0.2790, 1.5836, 4.8132, 2.7213], device='cuda:0')
In [29]:
dls_clas = TextDataLoaders.from_df(train_df, text_col='code', label_col='language', text_vocab=dls_lm.vocab) # TODO: valid_col for a stratified/repeatable split (see the sketch below)
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
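
One way to address that TODO (a sketch, not what was run here): build a stratified, repeatable split up front and pass it via valid_col.

# Sketch: stratified, repeatable split passed via an explicit valid_col
import numpy as np
from sklearn.model_selection import train_test_split

_df = train_df.reset_index(drop=True)  # the augmentation concat left duplicate index values
_, valid_idx = train_test_split(np.arange(len(_df)), test_size=0.2,
                                stratify=_df['language'], random_state=42)
_df['is_valid'] = np.isin(np.arange(len(_df)), valid_idx)
# dls_clas = TextDataLoaders.from_df(_df, text_col='code', label_col='language',
#                                    valid_col='is_valid', text_vocab=dls_lm.vocab)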
In [30]:
dls_clas.show_batch()
text category
0 xxbos } yylinebreak yylinebreak / * yylinebreak yylinebreak yy2space * xxmaj test bubblesort yylinebreak yylinebreak yy2space * / yylinebreak yylinebreak $ array = [ yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , php
1 xxbos yy2space yy2space { ' a ' , yyint } , yy2space { ' b ' , yyint } , yy2space { ' c ' , yyint } , yy2space { 'd ' , yyint } , yy2space { ' e ' , yyint } , yy2space { ' f ' , yyint } , yy2space { ' g ' , yyint } , yylinebreak yylinebreak yy2space yy2space { ' h ' , yyint } , yy2space { ' i ' , yyint } , yy2space { ' j ' , yyint } , { ' k ' , yyint } , { ' l ' , yyint } , { ' m ' , yyint } , { ' n ' , yyint } , yylinebreak yylinebreak yy2space yy2space { ' o ' , yyint } , { ' p ' , yyint } , { ' q ' c-plus-plus
2 xxbos yy2space yy2space and map vertically yylinebreak yylinebreak yy2space yy2space xxrep 3 > xxunk " , " university " ) yy2space # doctest : + normalize_whitespace yylinebreak yylinebreak yy2space yy2space { ' a ' : ' c ' , ' b ' : ' a ' , ' c ' : ' i ' , ' d ' : ' p ' , ' e ' : ' u ' , ' f ' : ' z ' , ' g ' : ' o ' , ' h ' : ' b ' , yylinebreak yylinebreak xxwrep 3 yy2space ' i ' : ' j ' , ' j ' : ' q ' , ' k ' : ' v ' , ' l ' : ' l ' , ' m ' : ' d ' , ' n ' : ' k ' , ' o python
3 xxbos xxwrep 6 yy2space 0x8d , 0x01 , 0x02 , 0x04 , 0x08 , 0x10 , 0x20 , 0x40 , 0x80 , 0x1b , 0x36 , 0x6c , 0xd8 , 0xab , 0x4d , 0x9a , yylinebreak yylinebreak xxwrep 6 yy2space 0x2f , 0x5e , 0xbc , 0x63 , 0xc6 , 0x97 , 0x35 , 0x6a , 0xd4 , 0xb3 , 0x7d , 0xfa , 0xef , 0xc5 , 0x91 , 0x39 , yylinebreak yylinebreak xxwrep 6 yy2space 0x72 , 0xe4 , 0xd3 , 0xbd , 0x61 , 0xc2 , 0x9f , 0x25 , 0x4a , 0x94 , 0x33 , 0x66 , 0xcc , 0x83 , 0x1d , 0x3a , yylinebreak yylinebreak xxwrep 6 yy2space 0x74 , 0xe8 , 0xcb , 0x8d , 0x01 , 0x02 , 0x04 , 0x08 , 0x10 , 0x20 , 0x40 , 0x80 , 0x1b , 0x36 , 0x6c , 0xd8 , yylinebreak yylinebreak xxwrep java
4 xxbos xxwrep 14 yy2space yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yy2space yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yy2space yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yylinebreak yylinebreak xxwrep 14 yy2space yyint , yy2space yyint , yyint , yy2space yyint , yy2space yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yylinebreak yylinebreak xxwrep 14 yy2space yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space yyint , yyint , yy2space c-sharp
5 xxbos xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , c-sharp
6 xxbos # xxwrep 4 yy2space backtrack return with ( ) ( ) yyint yyint [ " xxrep 3 ( xxrep 3 ) " , " ( ( ) ( ) ) " , " ( ( ) ) ( ) " , " ( ) ( ( ) ) " , " ( ) ( ) ( ) " ] yylinebreak yylinebreak # xxwrep 3 yy2space backtrack return with ( ) ( yyint yyint [ " xxrep 3 ( xxrep 3 ) " , " ( ( ) ( ) ) " , " ( ( ) ) ( ) " , " ( ) ( ( ) ) " , " ( ) ( ) ( ) " ] yylinebreak yylinebreak # yy2space yy2space backtrack return with ( ) yyint yyint [ " xxrep 3 ( xxrep 3 ) " , " ( ( ) ( ) ) " ruby
7 xxbos yy2space yy2space " yyint " : [ " yyint " , " yyint " , " yyint " , " yyint " , " yyint " ] , yylinebreak yylinebreak yy2space yy2space " yyint " : [ " yyint " , " yyint " , " yyint " , " yyint " , " yyint " ] , yylinebreak yylinebreak yy2space yy2space " yyint " : [ " yyint " , " yyint " , " yyint " , " yyint " , " yyint " ] , yylinebreak yylinebreak yy2space yy2space " yyint " : [ " yyint " , " yyint " , " yyint " , " yyint " , " yyint " ] , yylinebreak yylinebreak yy2space yy2space " yyint " : [ " yyint " , " yyint " , " yyint " , " yyint " , " yyint " ] , yylinebreak python
8 xxbos xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yyint , yylinebreak yylinebreak xxwrep 6 yy2space yyint , yyint , yyint , c-sharp
In [31]:
# NOTE: not using FocalLossFlat(weight=class_weights) - the weighted run scored worse on the leaderboard (see the results list below)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy], loss_func=FocalLossFlat(), wd=1e-2)
In [32]:
learn_clas.loss_func
Out[32]:
FlattenedLoss of FocalLoss()
In [33]:
learn_clas = learn_clas.load_encoder('lm_encoder_finetuned')
In [34]:
learn_clas.lr_find()
Out[34]:
SuggestedLRs(valley=0.0014454397605732083)
In [35]:
learn_clas.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy time
0 0.273634 0.188316 0.867804 01:11
In [36]:
learn_clas.save('clas_step1')
Out[36]:
Path('models/clas_step1.pth')
In [37]:
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
epoch train_loss valid_loss accuracy time
0 0.176099 0.130678 0.906718 01:15
In [38]:
learn_clas.save('clas_step2')
Out[38]:
Path('models/clas_step2.pth')
In [39]:
learn_clas.freeze_to(-3)
learn_clas.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
epoch train_loss valid_loss accuracy time
0 0.118745 0.102697 0.923961 01:30
In [58]:
learn_clas.save('clas_step3')
Out[58]:
Path('models/clas_step3.pth')
In [131]:
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
epoch train_loss valid_loss accuracy time
0 0.050023 0.079542 0.944219 01:50
1 0.042437 0.078133 0.944125 01:50
Results so far (label, metrics from the final epoch after unfreezing, leaderboard score):

  • full dataset: final after unfreeze 0.888329; leaderboard 0.785
  • baseline (small dataset): final after unfreeze 0.490387 0.492859 0.822333; leaderboard 0.678
  • yyint: final after unfreeze 0.530182 0.499005 0.813743
  • yyfloat and yyint: final after unfreeze 0.476658 0.544016 0.810579; leaderboard 0.685
  • yyfloat and yyint and whitespace: final after unfreeze 0.411087 0.378865 0.860759; leaderboard 0.849
  • yyfloat and yyint and whitespace (weighted focal loss): final after unfreeze 0.243585 0.270471 0.825045; leaderboard 0.814
  • full dataset with under-rep aug: yyfloat and yyint and whitespace (focal loss)

In [132]:
learn_clas.save('clas_step4_finetuned')
torch.save(dls_clas, 'models/clas_dls.pkl')
In [133]:
learn_clas.recorder.plot_loss()
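
Not part of the original run, but a quick way to see which languages the model confuses on the validation split is fastai's ClassificationInterpretation (a sketch):

# Sketch: per-language confusion on the validation set
interp = ClassificationInterpretation.from_learner(learn_clas)
interp.plot_confusion_matrix(figsize=(10, 10), dpi=80)
interp.most_confused(min_val=5)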

Prediction Phase ✈

In [134]:
test_df = processs_df(pd.read_csv("data/test.csv"))
test_df.shape, test_df.columns
Out[134]:
((9277, 2), Index(['id', 'code'], dtype='object'))
In [135]:
test_df.iloc[1]['code']
Out[135]:
' yy2space yy2space yy2space this.path = path; yylinebreak yylinebreak yy2space yy2space yy2space this.estimated = estimated; yylinebreak yylinebreak yy2space yy2space } yylinebreak yylinebreak yy2space yy2space public int getDistance() { yylinebreak yylinebreak yy2space yy2space yy2space return distance; yylinebreak yylinebreak yy2space yy2space } yylinebreak '
In [136]:
learn_clas.predict(test_df.iloc[1]['code'])[0]
Out[136]:
'java'
In [137]:
test_dl = learn_clas.dls.test_dl(test_df['code'])
In [138]:
preds_with_decoded = learn_clas.get_preds(dl=test_dl, with_decoded=True)
In [139]:
preds_with_decoded[2]
Out[139]:
TensorText([11,  7,  3,  ...,  3,  2,  7])
In [140]:
labels = dls_clas.vocab[1] # sorted(train_df['language'].unique())
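
A quick sanity check (a sketch, not in the original): fastai sorts the category vocab, so it should match the sorted class list computed for the class weights earlier.

# Sketch: the label vocab should equal `classes` from the class-weight cell
assert list(labels) == classes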
In [141]:
target = preds_with_decoded[2].detach().cpu().numpy()
target
Out[141]:
array([11,  7,  3, ...,  3,  2,  7])
In [142]:
test_df['target'] = target
test_df.head()
Out[142]:
id code target
0 10684 yyint = yyint + yyint + yyint yylinebreak yylinebreak yyint = yyint + yyint + yyint yylinebreak yylinebreak yyint = yyint + yyint + yyint yylinebreak yylinebreak yyint = yyint + yyint + yyint yylinebreak yylinebreak How many numbers below fifty million can be expressed as the sum of a prime square, yylinebreak 11
1 17536 yy2space yy2space yy2space this.path = path; yylinebreak yylinebreak yy2space yy2space yy2space this.estimated = estimated; yylinebreak yylinebreak yy2space yy2space } yylinebreak yylinebreak yy2space yy2space public int getDistance() { yylinebreak yylinebreak yy2space yy2space yy2space return distance; yylinebreak yylinebreak yy2space yy2space } yylinebreak 7
2 26383 yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space { yylinebreak yylinebreak yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space tmp += 'yyint'; yylinebreak yylinebreak yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space } yylinebreak 3
3 29090 /** yylinebreak yylinebreak yy2space * Class for converting from "any" base to "any" other base, when "any" means from yyint-yyint. Works by yylinebreak yylinebreak yy2space * going from base yyint to decimal to base yyint. Includes auxiliary method for determining whether a yylinebreak yylinebreak yy2space * number is valid for a given base. yylinebreak yylinebreak yy2space * yylinebreak 8
4 10482 yy2space yy2space yy2space yy2space { cout<<"Destructing base \n"; } yy2space yy2space yy2space yylinebreak yylinebreak }; yylinebreak yylinebreak class derived: public base { yylinebreak yylinebreak yy2space yy2space public: yylinebreak yylinebreak yy2space yy2space yy2space yy2space derived() yy2space yy2space yy2space yylinebreak yylinebreak yy2space yy2space yy2space yy2space { cout<<"Constructing derived \n"; } yylinebreak 2
In [143]:
prediction = [labels[t] for t in target]
In [144]:
test_df["prediction"] = prediction
test_df.head()
Out[144]:
id code target prediction
0 10684 yyint = yyint + yyint + yyint yylinebreak yylinebreak yyint = yyint + yyint + yyint yylinebreak yylinebreak yyint = yyint + yyint + yyint yylinebreak yylinebreak yyint = yyint + yyint + yyint yylinebreak yylinebreak How many numbers below fifty million can be expressed as the sum of a prime square, yylinebreak 11 python
1 17536 yy2space yy2space yy2space this.path = path; yylinebreak yylinebreak yy2space yy2space yy2space this.estimated = estimated; yylinebreak yylinebreak yy2space yy2space } yylinebreak yylinebreak yy2space yy2space public int getDistance() { yylinebreak yylinebreak yy2space yy2space yy2space return distance; yylinebreak yylinebreak yy2space yy2space } yylinebreak 7 java
2 26383 yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space { yylinebreak yylinebreak yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space tmp += 'yyint'; yylinebreak yylinebreak yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space yy2space } yylinebreak 3 c-sharp
3 29090 /** yylinebreak yylinebreak yy2space * Class for converting from "any" base to "any" other base, when "any" means from yyint-yyint. Works by yylinebreak yylinebreak yy2space * going from base yyint to decimal to base yyint. Includes auxiliary method for determining whether a yylinebreak yylinebreak yy2space * number is valid for a given base. yylinebreak yylinebreak yy2space * yylinebreak 8 javascript
4 10482 yy2space yy2space yy2space yy2space { cout<<"Destructing base \n"; } yy2space yy2space yy2space yylinebreak yylinebreak }; yylinebreak yylinebreak class derived: public base { yylinebreak yylinebreak yy2space yy2space public: yylinebreak yylinebreak yy2space yy2space yy2space yy2space derived() yy2space yy2space yy2space yylinebreak yylinebreak yy2space yy2space yy2space yy2space { cout<<"Constructing derived \n"; } yylinebreak 2 c-plus-plus
In [145]:
pd.concat([test_df, pd.read_csv("data/test.csv")], axis='columns').to_csv('submission2.csv', index=False)

Generating Prediction File

In [146]:
# TODO: why would we sample test_df?? <- this just shuffles the data - so why do we want to shuffle?
# test_df = test_df.sample(frac=1)
# test_df.head()
In [ ]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [ ]:
%aicrowd notebook submit -c programming-language-classification -a assets --no-verify