
ADDI Alzheimers Detection Challenge

Simple EDA and Baseline - LB 0.66 (0.616 with a magic)

moto

This notebook contains 1) a simple analysis, 2) simple feature engineering, and 3) a simple k-fold model whose CV log loss is about 0.76 and whose LB score is 0.66.

The magic is the sampling ratio in cell 15: changing "nb_neg = nb_pos" to "nb_neg = nb_pos*2" scores 0.616 on the LB.

 

Simple EDA and baseline models

The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:

1) Pre-Alzheimer’s (Early Warning) 2) Post-Alzheimer’s (Detection) 3) Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.

How to use this notebook? 📝

  • Update the config parameters. You can define the common variables here
    Variable                   Description
    AICROWD_DATASET_PATH       Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
    AICROWD_PREDICTIONS_PATH   Path to write the output to.
    AICROWD_ASSETS_DIR         In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
    AICROWD_API_KEY            In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
  • Installing packages. Please use the Install packages 🗃 section to install the packages
  • Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the predictions phase section

Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [1]:
!pip install -q -U aicrowd-cli
In [2]:
%load_ext aicrowd.magic

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /ds_shared_drive on the workspace.

In [3]:
import os

# Please use an absolute path for the location of the dataset.
# Or build one from a relative path with `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
AICROWD_API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me
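
The training cells further down write model files into AICROWD_ASSETS_DIR with joblib.dump, which fails if that directory does not exist yet. A minimal sketch, assuming the directory may be missing on a fresh workspace; the helper name ensure_dir is hypothetical and not part of the AIcrowd tooling:

```python
import os

def ensure_dir(path):
    """Create `path` (and any missing parents) and return its absolute form."""
    # exist_ok=True makes the call safe to repeat when the notebook is re-run.
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Calling ensure_dir(AICROWD_ASSETS_DIR) once before the first joblib.dump should be enough.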

Install packages 🗃

Please add all package installations in this section

In [4]:
!pip install numpy pandas
!pip install seaborn lightgbm scikit-learn
Requirement already satisfied: numpy in ./conda/lib/python3.8/site-packages (1.20.2)
Requirement already satisfied: pandas in ./conda/lib/python3.8/site-packages (1.2.4)
Requirement already satisfied: python-dateutil>=2.7.3 in ./conda/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: six>=1.5 in ./conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: seaborn in ./conda/lib/python3.8/site-packages (0.11.1)
Requirement already satisfied: lightgbm in ./conda/lib/python3.8/site-packages (3.2.1)
Requirement already satisfied: scikit-learn in ./conda/lib/python3.8/site-packages (0.24.2)
Requirement already satisfied: matplotlib>=2.2 in ./conda/lib/python3.8/site-packages (from seaborn) (3.4.1)
Requirement already satisfied: numpy>=1.15 in ./conda/lib/python3.8/site-packages (from seaborn) (1.20.2)
Requirement already satisfied: scipy>=1.0 in ./conda/lib/python3.8/site-packages (from seaborn) (1.6.3)
Requirement already satisfied: pandas>=0.23 in ./conda/lib/python3.8/site-packages (from seaborn) (1.2.4)
Requirement already satisfied: wheel in ./conda/lib/python3.8/site-packages (from lightgbm) (0.35.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./conda/lib/python3.8/site-packages (from scikit-learn) (2.1.0)
Requirement already satisfied: joblib>=0.11 in ./conda/lib/python3.8/site-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: pillow>=6.2.0 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)
Requirement already satisfied: cycler>=0.10 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas>=0.23->seaborn) (2021.1)
Requirement already satisfied: six>=1.5 in ./conda/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
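
One way to follow this advice is to keep the feature engineering in a single function that both phases call, so the training and prediction columns cannot drift apart. The sketch below mirrors the steps used later in this notebook (float cast, "NA" fill, one-hot encoding, NaN count, -1 fill); the function name add_features and its signature are assumptions made for illustration, and the -1 fill is restricted to the numeric columns.

```python
import numpy as np
import pandas as pd

def add_features(df, numeric_features, cat_cols, dummy_cols=None):
    """Return (engineered copy of df, list of dummy column names).

    dummy_cols: dummy column names seen at training time. When given,
    any of them missing from this frame is added as an all-zero column.
    """
    out = df.copy()
    for c in numeric_features:
        out[c] = out[c].astype(float)
    for c in cat_cols:
        out[c] = out[c].fillna("NA")
    # One-hot encode the categorical columns, prefixed as in the notebook.
    dummies = pd.get_dummies(out[cat_cols], columns=cat_cols,
                             dummy_na=True).add_prefix("CAT_")
    out = pd.concat([out, dummies], axis=1)
    # Count missing numeric values per row before they are overwritten.
    out["cnt_NaN"] = out[numeric_features].isna().sum(axis=1)
    out[numeric_features] = out[numeric_features].fillna(-1)
    for c in (dummy_cols or []):
        if c not in out.columns:
            out[c] = 0
    return out, dummies.columns.to_list()
```

At training time the returned column list would be saved with the models; at prediction time it is passed back in as dummy_cols.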

Import common packages

Please import packages that are common for training and prediction phases here.

In [5]:
import numpy as np
import pandas as pd
In [6]:
# some preprocessing code
In [7]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

import joblib

import warnings
warnings.filterwarnings("ignore")

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

In [8]:
# model = define_your_model

Load training data

In [9]:
# load your data
In [10]:
AICROWD_DATASET_PATH
Out[10]:
'/ds_shared_drive/validation.csv'
In [11]:
target_col = "diagnosis"
key_col = "row_id"
cat_cols = ['intersection_pos_rel_centre']
seed = 2021

target_values = ["normal", "post_alzheimer", "pre_alzheimer"]

train = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "train"))
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)


print(train.shape)
features = train.columns[1:-1].to_list()

numeric_features = [c for c in features if c not in cat_cols]
for c in numeric_features:
    train[c] = train[c].astype(float)

print(train[target_col].value_counts())
train.tail(3)
(32777, 122)
normal            31208
post_alzheimer     1149
pre_alzheimer       420
Name: diagnosis, dtype: int64
Out[11]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hand_count_dummy hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre intersection_pos_rel_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit final_rotation_angle ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error 
time_diff centre_dot_detect diagnosis
32774 YKCXR5L3HUEXI9129 11.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 NaN 415.609492 NaN 404.931167 343.933496 350.160677 411.903508 408.573127 388.002900 342.225437 372.795185 383.355253 NaN 6.460009 5.59 13.248901 0.086013 0.39 1.474322 9.459343 4.29 1.136238 NaN 10.27 NaN 2520.0 1815.0 4823.0 4440.0 2970.0 2166.0 3096.0 2360.0 3250.0 NaN 5150.0 NaN 40.0 55.0 91.0 74.0 66.0 57.0 72.0 59.0 65.0 NaN 50.0 NaN 63.0 33.0 53.0 60.0 45.0 38.0 43.0 40.0 50.0 NaN 103.0 362.272727 181.690909 1216108.964 5.135000 360.0 29.466579 360.0 301.658467 NaN 9727.113012 1.0 0.0 2.0 2.0 77.481745 98.597951 NaN 1.272531 21.116206 75.380406 14.119013 TL NaN 26.572249 10.0 11.0 2.0 2.0 0.0 86.631195 125.0 1.000000 0.0 125.735462 115.510618 118.221225 122.488967 0.910695 0.089303 0.690915 0.015356 2.0 2.0 0.0 1.0 60.0 1.0 normal
32775 0MFBMF7ZRBSAH8ASA 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 433.959099 441.208568 408.001225 405.226171 417.833101 365.256759 339.988970 364.785553 397.165583 451.738863 NaN NaN 5.641735 7.531402 20.67 9.179425 9.302904 25.87 48.801311 NaN NaN 20.980509 27.633149 8.19 1881.0 1505.0 2537.0 2911.0 3477.0 3337.0 2640.0 NaN NaN 2800.0 2376.0 3050.0 57.0 43.0 59.0 71.0 61.0 71.0 60.0 NaN NaN 56.0 54.0 50.0 33.0 35.0 43.0 41.0 57.0 47.0 44.0 NaN NaN 50.0 44.0 61.0 76.944444 73.511111 377425.600 18.243333 360.0 2357.648327 NaN 148.585224 NaN 148.585224 1.0 0.0 1.0 1.0 NaN NaN 46.699503 NaN NaN NaN NaN NaN NaN NaN NaN 11.0 NaN 2.0 0.0 77.348992 84.0 1.000000 1.0 129.456141 112.602114 121.981285 118.401611 0.538221 0.461447 0.525810 0.473859 0.0 1.0 0.0 1.0 NaN NaN normal
32776 NMOMZBPRJMJCFOONV 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 439.300581 488.530961 NaN 438.130688 453.176014 424.410474 380.000000 371.031333 307.591450 358.568334 377.794256 429.549182 111.504807 109.183899 98.93 85.563310 87.320320 79.95 92.758482 98.168063 109.85 NaN NaN 113.75 1680.0 1927.0 2256.0 2992.0 3366.0 7755.0 2870.0 2795.0 5655.0 1833.0 NaN 1681.0 28.0 41.0 47.0 44.0 51.0 55.0 41.0 43.0 65.0 39.0 NaN 41.0 60.0 47.0 48.0 68.0 66.0 141.0 70.0 65.0 87.0 47.0 NaN 41.0 777.618182 91.800000 3610308.273 100.620000 360.0 493.127248 NaN NaN NaN NaN NaN NaN 2.0 2.0 94.780253 100.060823 NaN 1.055714 5.280570 81.236698 22.476805 BL NaN 0.096989 10.0 11.0 2.0 2.0 270.0 84.200577 124.0 0.714286 0.0 126.783309 110.895126 111.380356 126.069024 0.498043 0.501635 0.541883 0.457771 2.0 0.0 0.0 1.0 60.0 0.0 normal

Target

In [12]:
sns.countplot(x=target_col, data=train);

Numerical features

In [13]:
nb_shown = len(numeric_features)
fig, ax = plt.subplots(nb_shown, 1, figsize=(20,5*nb_shown))

colors = ["Green", "Blue", "Red"]
for i, col in enumerate(numeric_features[:nb_shown]):
    for value, color in zip(target_values, colors):
        sns.distplot(train.loc[train[target_col]==value, col], 
                     ax=ax[i], color=color, norm_hist=True)
        ax[i].set_title("Train {}".format(col))
    ax[i].set_xlabel("")
    ax[i].set_ylabel("")

Categorical features

There is only one categorical feature.

In [14]:
sns.countplot(x=cat_cols[0], hue=target_col, data=train[cat_cols+[target_col]].fillna("NA"));

Balance the dataset and look at the distribution again

In [15]:
df_pos = train[train[target_col].isin(target_values[1:])]
nb_pos = df_pos.shape[0]
nb_neg = nb_pos
df_neg = train[train[target_col] == "normal"].sample(n=nb_neg, random_state=seed)
df_samples = pd.concat([df_pos, df_neg]).sample(frac=1, random_state=seed).reset_index(drop=True)

sns.countplot(x=cat_cols[0], hue=target_col, data=df_samples[cat_cols+[target_col]].fillna("NA"));
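
The undersampling in the cell above can be wrapped in a small helper so that the negative-to-positive ratio (the "magic" mentioned in the introduction) becomes a parameter. This is a sketch rather than the notebook's own code: the name undersample_majority and its arguments are hypothetical, and it treats every row that is not the majority label as positive, which matches the cell above once the frame has been filtered to the three target values.

```python
import pandas as pd

def undersample_majority(df, target_col, majority_label, ratio=1.0, seed=2021):
    """Keep all minority rows and sample ratio * n_minority majority rows."""
    minority = df[df[target_col] != majority_label]
    majority = df[df[target_col] == majority_label]
    # Never ask for more majority rows than are available.
    n_major = min(len(majority), int(len(minority) * ratio))
    sampled = majority.sample(n=n_major, random_state=seed)
    # Shuffle with a fixed seed so the later fold assignment is repeatable.
    return (pd.concat([minority, sampled])
              .sample(frac=1, random_state=seed)
              .reset_index(drop=True))
```

With ratio=1.0 this corresponds to "nb_neg = nb_pos", and with ratio=2.0 to "nb_neg = nb_pos*2".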

Train your model

In [16]:
# model.fit(train_data)
In [17]:
# some custom code block

Simple FE

In [18]:
print(cat_cols)
for c in cat_cols:
    df_samples[c] = df_samples[c].fillna("NA")
    
df_dummies = pd.get_dummies(df_samples[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
dummy_cols = df_dummies.columns.to_list()
print(dummy_cols)

df_samples = pd.concat([df_samples, df_dummies], axis=1)
df_samples['cnt_NaN'] = df_samples[numeric_features].isna().sum(axis=1)

df_samples.fillna(-1, inplace=True)
df_samples.head(3)
['intersection_pos_rel_centre']
['CAT_intersection_pos_rel_centre_BL', 'CAT_intersection_pos_rel_centre_BR', 'CAT_intersection_pos_rel_centre_NA', 'CAT_intersection_pos_rel_centre_TL', 'CAT_intersection_pos_rel_centre_TR', 'CAT_intersection_pos_rel_centre_nan']
Out[18]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hand_count_dummy hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre intersection_pos_rel_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit final_rotation_angle ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error 
time_diff centre_dot_detect diagnosis CAT_intersection_pos_rel_centre_BL CAT_intersection_pos_rel_centre_BR CAT_intersection_pos_rel_centre_NA CAT_intersection_pos_rel_centre_TL CAT_intersection_pos_rel_centre_TR CAT_intersection_pos_rel_centre_nan cnt_NaN
0 XSR3WB69PLAS5HY96 8.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 -1.000000 353.839653 -1.000000 286.768635 306.894118 263.427409 247.449490 255.840282 315.791466 375.300213 -1.000000 -1.000000 -1.000000 6.868341 3.90 5.471075 15.703004 27.17 11.775069 -1.000000 -1.00 56.485580 -1.000000 53.56 -1.0 5040.0 5568.0 7254.0 12782.0 7636.0 7884.0 -1.0 -1.0 15943.0 -1.0 4104.0 -1.0 72.0 96.0 93.0 83.0 92.0 108.0 -1.0 -1.0 107.0 -1.0 57.0 -1.0 70.0 58.0 78.0 154.0 83.0 73.0 -1.0 -1.0 149.0 -1.0 72.0 1395.839286 300.857143 1.655789e+07 28.210000 360.0 6911.674299 360.0 1699.618351 -1.0 1699.618351 1.0 0.0 2.0 2.0 43.776619 49.741608 -1.00000 1.136260 5.964989 90.610112 44.245656 BL -1.000000 126.904753 12.0 11.0 10.0 2.0 0.0 68.819736 82.0 1.0 1.0 117.409654 91.673970 105.874053 98.859675 0.469047 0.530561 0.619334 0.380319 0.0 0.0 0.0 1.0 -100.0 0.0 pre_alzheimer 1 0 0 0 0 0 23
1 PIAYSCOQO68RFJBWJ 11.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 314.650600 322.321656 282.287974 313.882940 -1.000000 319.712762 319.845275 315.300254 330.351706 338.691969 341.685894 319.531689 16.803673 -1.000000 18.72 1.532490 1.311990 28.60 26.586651 18.092173 1.17 4.061573 11.432072 4.03 2772.0 -1.0 5472.0 4960.0 6188.0 6776.0 5440.0 6370.0 2623.0 13020.0 5005.0 9125.0 66.0 -1.0 96.0 80.0 91.0 121.0 85.0 130.0 61.0 105.0 65.0 73.0 42.0 -1.0 57.0 62.0 68.0 56.0 64.0 49.0 43.0 124.0 77.0 125.0 841.218182 525.272727 8.402984e+06 13.130000 360.0 506.146665 360.0 178.435934 -1.0 178.435934 1.0 0.0 2.0 2.0 61.442021 101.526588 -1.00000 1.652397 40.084566 66.705532 16.297846 BL 6.619794 -1.000000 11.0 11.0 1.0 2.0 0.0 77.391542 79.0 1.0 1.0 106.447204 99.524720 105.483966 100.332536 0.462181 0.537421 0.539200 0.460415 1.0 0.0 0.0 1.0 5.0 0.0 normal 1 0 0 0 0 0 8
2 YU1BFHD48SJV3ARKE 10.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 289.084763 383.589950 364.310993 351.458746 350.508916 -1.000000 228.985807 -1.000000 174.997857 348.621571 222.946182 348.686464 10.794822 46.127917 -1.00 40.293903 -1.000000 41.47 22.570652 12.023504 36.01 0.362155 6.851571 10.79 2944.0 6715.0 -1.0 13221.0 -1.0 6642.0 5580.0 13056.0 5518.0 7070.0 9315.0 15030.0 92.0 85.0 -1.0 113.0 -1.0 82.0 90.0 102.0 89.0 70.0 81.0 90.0 32.0 79.0 -1.0 117.0 -1.0 81.0 62.0 128.0 62.0 101.0 115.0 167.0 1532.044444 138.266667 1.592449e+07 29.423333 360.0 153.635052 -1.0 11020.682810 -1.0 11020.682810 0.0 0.0 1.0 1.0 -1.000000 -1.000000 72.24913 -1.000000 -1.000000 -1.000000 -1.000000 NA -1.000000 -1.000000 -1.0 11.0 -1.0 2.0 210.0 75.590420 83.0 1.0 1.0 109.178156 92.266413 97.093513 102.445313 0.554095 0.445525 0.571322 0.428276 1.0 1.0 0.0 1.0 -1.0 -1.0 normal 0 0 1 0 0 0 24
In [19]:
model_features = df_samples.columns.to_list()
model_features = [c for c in model_features if c not in [key_col, target_col] + cat_cols]

unique_value_cols = []
for c in model_features:
    if df_samples[c].unique().shape[0] == 1:
        unique_value_cols.append(c)
        
print(unique_value_cols)
model_features = [c for c in model_features if c not in unique_value_cols]
print(len(model_features))
['actual_hour_digit', 'actual_minute_digit', 'CAT_intersection_pos_rel_centre_nan']
123
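
The constant-column filter above can also be written with DataFrame.nunique. A minimal sketch, assuming the same definition of "constant" (a single distinct value, with NaN counted as a value); the helper name constant_columns is hypothetical:

```python
import pandas as pd

def constant_columns(df, candidates):
    """Return the candidate columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, as Series.unique() does.
    n_unique = df[candidates].nunique(dropna=False)
    return n_unique[n_unique <= 1].index.to_list()
```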

Train models with 5 folds

In [20]:
X_train = df_samples[model_features]
y_train = df_samples[target_col].map(dict(zip(target_values, list(range(len(target_values))))))

skf = StratifiedKFold(n_splits=5, random_state=2021, shuffle=True)
preds = 0.0

params = {
          "objective" : "multiclass",
          "num_class" : len(target_values),
          "bagging_seed" : 2021,
          "verbosity" : 1 }

clfs = []
for fold, (itrain, ivalid) in enumerate(skf.split(X_train, y_train)):
    print("-"*40)
    print(f"Running for fold {fold}")
    lgb_train = lgb.Dataset(X_train.iloc[itrain], y_train.iloc[itrain])
    lgb_eval  = lgb.Dataset(X_train.iloc[ivalid], y_train.iloc[ivalid], reference = lgb_train)
    clf = lgb.train(params, lgb_train, 1000, valid_sets=[lgb_eval], 
                    early_stopping_rounds=100, verbose_eval=200)

    clfs.append(clf)
----------------------------------------
Running for fold 0
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006633 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 18673
[LightGBM] [Info] Number of data points in the train set: 2510, number of used features: 123
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Info] Start training from score -0.693147
[LightGBM] [Info] Start training from score -1.004752
[LightGBM] [Info] Start training from score -2.010927
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[18]	valid_0's multi_logloss: 0.767179
----------------------------------------
Running for fold 1
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001831 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18692
[LightGBM] [Info] Number of data points in the train set: 2510, number of used features: 123
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Info] Start training from score -0.693147
[LightGBM] [Info] Start training from score -1.004752
[LightGBM] [Info] Start training from score -2.010927
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[17]	valid_0's multi_logloss: 0.76898
----------------------------------------
Running for fold 2
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001798 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18674
[LightGBM] [Info] Number of data points in the train set: 2510, number of used features: 123
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Info] Start training from score -0.693147
[LightGBM] [Info] Start training from score -1.004752
[LightGBM] [Info] Start training from score -2.010927
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[25]	valid_0's multi_logloss: 0.72574
----------------------------------------
Running for fold 3
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002106 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18680
[LightGBM] [Info] Number of data points in the train set: 2511, number of used features: 123
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Info] Start training from score -0.693546
[LightGBM] [Info] Start training from score -1.004063
[LightGBM] [Info] Start training from score -2.011325
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[23]	valid_0's multi_logloss: 0.751315
----------------------------------------
Running for fold 4
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002589 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 18684
[LightGBM] [Info] Number of data points in the train set: 2511, number of used features: 123
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Info] Start training from score -0.692749
[LightGBM] [Info] Start training from score -1.005150
[LightGBM] [Info] Start training from score -2.011325
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[15]	valid_0's multi_logloss: 0.77959
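
To put the validation scores in context, the multi_logloss values above can be compared with a model that always predicts the class priors. The sketch below implements multi-class log loss in plain Python and evaluates that baseline on the class counts of the balanced sample (1569 normal, 1149 post_alzheimer, 420 pre_alzheimer, which follow from cell 15 with "nb_neg = nb_pos"). The function name is hypothetical, and the clipping constant eps is an assumption.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a class gets zero probability.
        p = min(max(row[label], eps), 1.0)
        total -= math.log(p)
    return total / len(y_true)

# Class counts of the balanced sample, ordered as in target_values.
counts = [1569, 1149, 420]
priors = [c / sum(counts) for c in counts]
y_true = [k for k, c in enumerate(counts) for _ in range(c)]

# Predicting the priors for every row gives the no-information baseline.
baseline = multiclass_log_loss(y_true, [priors] * len(y_true))
print(f"prior baseline log loss: {baseline:.3f}")
```

The baseline comes out at roughly 0.98, so the per-fold scores of about 0.73 to 0.78 are a clear but modest improvement over ignoring the features.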

Let's look at the feature importance of one model (clf, the model from the last fold)

In [21]:
lgb.plot_importance(clf, max_num_features=20);

Save your trained model

In [22]:
# model.save()
In [23]:
for i, clf in enumerate(clfs):
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{i}.pkl'
    joblib.dump(clf, model_filename)
In [24]:
meta = {
    "numeric_features": numeric_features,
    "cat_cols": cat_cols,
    "dummy_cols": dummy_cols,
    "model_features": model_features
}
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
joblib.dump(meta, meta_filename)
Out[24]:
['assets/model_lgb_meta.pkl']

Prediction phase 🔎

Please make sure to save the weights from the training section in your assets directory and load them in this section

In [25]:
# model = load_model_from_assets_dir(AIcrowdConfig.ASSETS_DIR)
In [26]:
nb_folds = 5 # skf.n_splits
clfs = []
for fold in range(nb_folds):
    print("-"*40)
    print(f"Running for fold {fold}")
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{fold}.pkl'
    
    clf = joblib.load(model_filename)
    clfs.append(clf)
    
print("-"*40)
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
meta = joblib.load(meta_filename)
print(meta.keys())

numeric_features = meta['numeric_features']
cat_cols = meta['cat_cols']
dummy_cols = meta['dummy_cols']
model_features = meta['model_features']
----------------------------------------
Running for fold 0
----------------------------------------
Running for fold 1
----------------------------------------
Running for fold 2
----------------------------------------
Running for fold 3
----------------------------------------
Running for fold 4
----------------------------------------
dict_keys(['numeric_features', 'cat_cols', 'dummy_cols', 'model_features'])

Load test data

In [27]:
test_data = pd.read_csv(AICROWD_DATASET_PATH)
test_data.head()
Out[27]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hand_count_dummy hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre intersection_pos_rel_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit final_rotation_angle ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error 
time_diff centre_dot_detect
0 LA9JQ1JZMJ9D2MBZV 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 314.649805 NaN 408.240125 323.348110 321.706776 264.496219 203.330396 205.081082 282.015070 343.657169 416.716030 435.900218 6.119758 25.267069 17.29 6.006505 10.246421 14.43 4.778738 43.124586 46.80 NaN 67.293643 3.90 2001.0 4180.0 6318.0 6528.0 6370.0 8127.0 5610.0 3312.0 9372.0 NaN 3500.0 6336.0 69.0 95.0 117.0 128.0 98.0 129.0 102.0 69.0 142.0 NaN 70.0 72.0 29.0 44.0 54.0 51.0 65.0 63.0 55.0 48.0 66.0 NaN 50.0 88.0 225.618182 730.963636 4.773900e+06 20.605000 360.0 854.199907 NaN 8623.343673 NaN 8623.343673 0.0 0.0 3.0 3.0 NaN NaN 183.844962 NaN NaN NaN NaN NaN NaN NaN NaN 11 NaN 2 0.0 84.753550 106 1.000000 0 118.971780 106.379109 111.720745 112.581495 0.500272 0.499368 0.553194 0.446447 0 0 0 1 NaN NaN
1 PSSRCWAPTAG72A1NT 6.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 NaN NaN 235.663425 NaN NaN 325.616722 NaN NaN 288.257264 292.027396 334.951116 370.648756 NaN NaN 22.88 NaN NaN 72.80 72.787316 20.133319 96.33 NaN 60.955820 NaN NaN NaN 12390.0 NaN NaN 8848.0 5632.0 10434.0 7739.0 NaN 11834.0 NaN NaN NaN 118.0 NaN NaN 79.0 64.0 94.0 71.0 NaN 97.0 NaN NaN NaN 105.0 NaN NaN 112.0 88.0 111.0 109.0 NaN 122.0 NaN 126.166667 391.766667 6.631428e+06 64.003333 NaN 5998.258485 NaN 16273.285540 NaN 16273.285540 0.0 0.0 1.0 1.0 NaN NaN 99.180032 NaN NaN NaN NaN NaN NaN NaN NaN 11 NaN 2 180.0 73.359021 99 1.000000 0 123.968624 99.208099 104.829045 114.955335 0.572472 0.427196 0.496352 0.503273 0 1 0 1 NaN NaN
2 GCTODIZJB42VCBZRZ 11.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 438.627689 429.789774 447.455305 447.033835 409.185166 361.946474 359.824957 NaN 345.937133 366.201106 375.225266 427.154831 112.333641 100.371900 86.45 86.234478 NaN 89.57 94.556399 97.331146 111.02 111.411562 116.061975 116.22 3182.0 4473.0 4554.0 5032.0 NaN 5355.0 4148.0 4320.0 4420.0 7290.0 2726.0 5184.0 43.0 71.0 69.0 68.0 NaN 51.0 68.0 48.0 52.0 81.0 47.0 81.0 74.0 63.0 66.0 74.0 NaN 105.0 61.0 90.0 85.0 90.0 58.0 64.0 228.072727 192.618182 1.418911e+06 100.815000 360.0 315.683251 NaN 257.619483 NaN 257.619483 1.0 0.0 2.0 2.0 42.707325 78.437307 NaN 1.836624 35.729983 106.779868 55.597531 BL 6.15111 0.57766 11.0 11 2.0 2 270.0 86.346225 120 1.000000 0 124.134670 120.392100 122.909870 121.542463 0.494076 0.505583 0.503047 0.496615 1 0 0 0 0.0 0.0
3 7YMVQGV1CDB1WZFNE 3.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 NaN NaN NaN 408.827592 272.472476 NaN 195.714716 NaN NaN NaN NaN NaN NaN 2.506574 NaN 4.353660 NaN NaN NaN NaN NaN NaN NaN 12.48 NaN 1794.0 NaN 3416.0 NaN NaN NaN NaN NaN NaN NaN 3360.0 NaN 39.0 NaN 56.0 NaN NaN NaN NaN NaN NaN NaN 56.0 NaN 46.0 NaN 61.0 NaN NaN NaN NaN NaN NaN NaN 60.0 70.333333 96.333333 8.477293e+05 12.480000 360.0 NaN 360.0 11194.405100 NaN 11194.405100 1.0 0.0 3.0 3.0 NaN NaN 204.987534 NaN NaN NaN NaN NaN NaN NaN NaN 11 NaN 2 30.0 51.132436 16 0.800000 1 69.766987 53.627186 53.983727 69.002438 0.555033 0.444633 0.580023 0.419575 0 1 0 1 NaN NaN
4 PHEQC6DV3LTFJYIJU 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 436.069089 NaN NaN NaN NaN NaN NaN NaN NaN 113.252059 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25542.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 129.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 198.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN 0.0 NaN 0.0 0.0 2.0 2.0 77.405367 92.911356 NaN 1.200322 15.505989 100.478258 8.853306 TR NaN NaN 8.0 11 8.0 2 30.0 54.115853 18 0.666667 1 112.043734 87.607876 94.088846 101.540792 0.603666 0.395976 0.494990 0.504604 0 0 0 1 150.0 0.0

Generate predictions

In [28]:
test_data = test_data.copy()

for c in numeric_features:
    test_data[c] = test_data[c].astype(float)
    
for c in cat_cols:
    test_data[c] = test_data[c].fillna("NA")
    
df_test_dummies = pd.get_dummies(test_data[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
test_data = pd.concat([test_data, df_test_dummies], axis=1)
test_data['cnt_NaN'] = test_data[numeric_features].isna().sum(axis=1)

test_data.fillna(-1, inplace=True)

for c in dummy_cols:
    if c not in test_data.columns:
        test_data[c] = 0

print("Missing columns:", [c for c in model_features if c not in test_data.columns])
test_data.head(3)

X_test = test_data[model_features]

preds = 0.0
nb_folds = 5 # skf.n_splits
for fold, clf in enumerate(clfs):
    print("-"*40)
    print(f"Running for fold {fold}")
    pred = clf.predict(X_test)
    preds += pred/nb_folds
    
print(preds.shape)
Missing columns: []
----------------------------------------
Running for fold 0
----------------------------------------
Running for fold 1
----------------------------------------
Running for fold 2
----------------------------------------
Running for fold 3
----------------------------------------
Running for fold 4
(362, 3)
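
The cell above does two things: it makes the test frame carry every column the models expect, then averages the per-fold class probabilities. A minimal sketch of both steps on toy data follows. The names `align_columns`, `average_fold_predictions`, and the stand-in `ConstantModel` are hypothetical and used only for illustration; the real `clfs` and `model_features` are whatever the training section produced.

```python
import numpy as np
import pandas as pd


def align_columns(df, feature_names, fill_value=0):
    """Return df restricted to feature_names, adding absent columns as fill_value."""
    return df.reindex(columns=list(feature_names), fill_value=fill_value)


def average_fold_predictions(models, X):
    """Average the (n_samples, n_classes) outputs of each model's predict()."""
    fold_preds = [np.asarray(m.predict(X), dtype=float) for m in models]
    return np.mean(fold_preds, axis=0)


class ConstantModel:
    """Hypothetical stand-in for a trained fold model."""

    def __init__(self, probs):
        self.probs = np.asarray(probs, dtype=float)

    def predict(self, X):
        # Repeat the same probability row once per sample.
        return np.tile(self.probs, (len(X), 1))


toy_test = pd.DataFrame({"a": [1.0, 2.0], "CAT_x_BL": [1, 0]})
toy_features = ["a", "CAT_x_BL", "CAT_x_TR"]  # CAT_x_TR never occurs in toy_test

X_toy = align_columns(toy_test, toy_features)
toy_models = [ConstantModel([0.6, 0.3, 0.1]), ConstantModel([0.4, 0.3, 0.3])]
toy_preds = average_fold_predictions(toy_models, X_toy)

print(X_toy.columns.tolist())
print(toy_preds.shape)
```

Unlike the explicit loop, `reindex` also drops extra columns and fixes the column order in one call, and averaging over the list of models avoids hard-coding the number of folds.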
In [29]:
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": preds[:,0],
    "post_alzheimer_diagnosis_probability": preds[:,1],
    "pre_alzheimer_diagnosis_probability": preds[:,2]
}

predictions_df = pd.DataFrame.from_dict(predictions)
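
Before writing the file, it can help to confirm that the frame really holds class probabilities. The sketch below is one possible check, not part of the original notebook: `check_predictions` is a hypothetical helper, and the requirement that each row sums to 1 is an assumption about what the models output rather than a documented rule of the challenge.

```python
import numpy as np
import pandas as pd

# Column names taken from the predictions dict in the cell above.
PROB_COLS = [
    "normal_diagnosis_probability",
    "post_alzheimer_diagnosis_probability",
    "pre_alzheimer_diagnosis_probability",
]


def check_predictions(df, prob_cols=PROB_COLS, atol=1e-6):
    """Raise ValueError if df does not look like valid class probabilities."""
    missing = [c for c in ["row_id", *prob_cols] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    probs = df[prob_cols].to_numpy(dtype=float)
    if np.isnan(probs).any():
        raise ValueError("NaN in probability columns")
    if (probs < 0).any() or (probs > 1).any():
        raise ValueError("probabilities outside [0, 1]")
    # Assumption: the three classes are exhaustive, so each row sums to 1.
    if not np.allclose(probs.sum(axis=1), 1.0, atol=atol):
        raise ValueError("rows do not sum to 1")
    return True


toy = pd.DataFrame({
    "row_id": ["A", "B"],
    PROB_COLS[0]: [0.5, 0.2],
    PROB_COLS[1]: [0.3, 0.3],
    PROB_COLS[2]: [0.2, 0.5],
})
print(check_predictions(toy))
```

In the notebook itself the call would be `check_predictions(predictions_df)`.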

Save predictions 📨

In [30]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)

Submit to AIcrowd 🚀

NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)

In [31]:
!aicrowd login --api-key $AICROWD_API_KEY
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge
Error: --api-key option requires an argument
Using notebook: /home/desktop0/public_baseline.ipynb for submission...
Removing existing files from submission directory...
Scrubbing API keys from the notebook...
Collecting notebook...
Validating the submission...
Executing install.ipynb...
[NbConvertApp] Converting notebook /home/desktop0/submission/install.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 4345 bytes to /home/desktop0/submission/install.nbconvert.ipynb
Executing predict.ipynb...
[NbConvertApp] Converting notebook /home/desktop0/submission/predict.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 54460 bytes to /home/desktop0/submission/predict.nbconvert.ipynb
submission.zip ━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 8.2/8.2 MB • 2.7 MB/s • 0:00:00
                                                 ╭─────────────────────────╮                                                 
                                                 │ Successfully submitted! │                                                 
                                                 ╰─────────────────────────╯                                                 
                                                       Important links                                                       
┌──────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/submissions/133893              │
│                  │                                                                                                        │
│  All submissions │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/submissions?my_submissions=true │
│                  │                                                                                                        │
│      Leaderboard │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/leaderboards                    │
│                  │                                                                                                        │
│ Discussion forum │ https://discourse.aicrowd.com/c/addi-alzheimers-detection-challenge                                    │
│                  │                                                                                                        │
│   Challenge page │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge                                 │
└──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘
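
The log above starts with `Error: --api-key option requires an argument`, which suggests that `$AICROWD_API_KEY` expanded to an empty string when the login command ran, even though the submission itself went through in this run. A small guard like the one below could catch that before shelling out. `require_nonempty` is a hypothetical helper, not part of the AIcrowd CLI, and the toy value stands in for the real key.

```python
def require_nonempty(name, value):
    """Return the stripped value, or raise if it is missing or blank."""
    if value is None or not str(value).strip():
        raise ValueError(f"{name} is empty; set it in the config cell before submitting")
    return str(value).strip()


# Toy value for illustration only; in the notebook, pass AICROWD_API_KEY instead.
toy_key = require_nonempty("AICROWD_API_KEY", "  example-key  ")
print(toy_key)
```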

Comments

sai_bhargav
Almost 3 years ago

Hey, on submission I am receiving a KeyError on this line:

train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)

KeyError: 'diagnosis'

I have used your notebook as a reference for loading the data. Can you help me resolve the issue?
