
ADDI Alzheimer’s Detection Challenge

Simple EDA and Baseline - LB 0.66 (0.616 with a magic)


moto

This notebook contains: 1) a simple analysis, 2) simple feature engineering, and 3) a simple k-fold model whose CV is 0.76 and LB is 0.66.

The magic is the ratio in cell 15: changing it from "nb_neg = nb_pos" to "nb_neg = nb_pos*2" scores 0.616 on the LB.
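Cell 15 itself appears further down the notebook; as a minimal sketch of the idea (the helper name downsample_negatives and the sampling details are assumptions, not the exact cell), the "magic" is simply how many normal rows are kept per positive row when the training set is rebalanced:

import pandas as pd

def downsample_negatives(train: pd.DataFrame, target_col: str = "diagnosis",
                         neg_ratio: int = 1, seed: int = 2021) -> pd.DataFrame:
    """Keep every positive row and neg_ratio * nb_pos sampled normal rows."""
    pos = train[train[target_col] != "normal"]      # pre/post Alzheimer's rows
    neg = train[train[target_col] == "normal"]      # majority class
    nb_pos = len(pos)
    nb_neg = nb_pos * neg_ratio                     # the "magic": 1 -> LB 0.66, 2 -> LB 0.616
    sampled_neg = neg.sample(n=nb_neg, random_state=seed)
    return pd.concat([pos, sampled_neg]).sample(frac=1, random_state=seed).reset_index(drop=True)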

 


Simple EDA and baseline models

The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:

1) Pre-Alzheimer’s (Early Warning)
2) Post-Alzheimer’s (Detection)
3) Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.
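As a tiny illustration (the class ordering is an assumption; any consistent mapping works), the three diagnosis strings in the data map onto the three classes:

# The three diagnosis values form the class set of this problem.
target_values = ["normal", "post_alzheimer", "pre_alzheimer"]
label_to_class = {label: idx for idx, label in enumerate(target_values)}
# -> {"normal": 0, "post_alzheimer": 1, "pre_alzheimer": 2}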

How to use this notebook? 📝

[notebook overview diagram]

  • Update the config parameters. You can define the common variables here:
      - AICROWD_DATASET_PATH: Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
      - AICROWD_PREDICTIONS_PATH: Path to write the output to.
      - AICROWD_ASSETS_DIR: In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
      - AICROWD_API_KEY: In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
  • Installing packages. Please use the Install packages 🗃 section to install the packages.
  • Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the prediction phase section (see the sketch below).
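As a hedged sketch of that save/load round trip through the assets directory (the file name lgb_fold0.pkl and the DummyClassifier stand-in are illustrative assumptions, not the notebook's actual models):

import os
import joblib
from sklearn.dummy import DummyClassifier  # stand-in for the real fold models

AICROWD_ASSETS_DIR = "assets"  # must match the config cell below
os.makedirs(AICROWD_ASSETS_DIR, exist_ok=True)

# Training phase: persist each fitted model into the assets directory.
model = DummyClassifier(strategy="most_frequent").fit([[0.0], [1.0]], [0, 1])
joblib.dump(model, os.path.join(AICROWD_ASSETS_DIR, "lgb_fold0.pkl"))

# Prediction phase: reload it from the same relative path.
model = joblib.load(os.path.join(AICROWD_ASSETS_DIR, "lgb_fold0.pkl"))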

Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [1]:
!pip install -q -U aicrowd-cli
In [2]:
%load_ext aicrowd.magic

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /ds_shared_drive on the workspace.

In [3]:
import os

# Please use the absolute path for the location of the dataset.
# Or you can build it from the working directory, e.g. os.getcwd() + "/test_data/validation.csv"
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
AICROWD_API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me

Install packages 🗃

Please add all package installations in this section.

In [4]:
!pip install numpy pandas
!pip install seaborn lightgbm scikit-learn
Requirement already satisfied: numpy in ./conda/lib/python3.8/site-packages (1.20.2)
Requirement already satisfied: pandas in ./conda/lib/python3.8/site-packages (1.2.4)
Requirement already satisfied: python-dateutil>=2.7.3 in ./conda/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: six>=1.5 in ./conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: seaborn in ./conda/lib/python3.8/site-packages (0.11.1)
Requirement already satisfied: lightgbm in ./conda/lib/python3.8/site-packages (3.2.1)
Requirement already satisfied: scikit-learn in ./conda/lib/python3.8/site-packages (0.24.2)
Requirement already satisfied: matplotlib>=2.2 in ./conda/lib/python3.8/site-packages (from seaborn) (3.4.1)
Requirement already satisfied: numpy>=1.15 in ./conda/lib/python3.8/site-packages (from seaborn) (1.20.2)
Requirement already satisfied: scipy>=1.0 in ./conda/lib/python3.8/site-packages (from seaborn) (1.6.3)
Requirement already satisfied: pandas>=0.23 in ./conda/lib/python3.8/site-packages (from seaborn) (1.2.4)
Requirement already satisfied: wheel in ./conda/lib/python3.8/site-packages (from lightgbm) (0.35.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./conda/lib/python3.8/site-packages (from scikit-learn) (2.1.0)
Requirement already satisfied: joblib>=0.11 in ./conda/lib/python3.8/site-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: pillow>=6.2.0 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)
Requirement already satisfied: cycler>=0.10 in ./conda/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas>=0.23->seaborn) (2021.1)
Requirement already satisfied: six>=1.5 in ./conda/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

Import common packages

Please import packages that are common for training and prediction phases here.

In [5]:
import numpy as np
import pandas as pd
In [6]:
# some preprocessing code
In [7]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

import joblib

import warnings
warnings.filterwarnings("ignore")
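The notebook's own feature handling appears in the cells below; as a minimal sketch of what a shared preprocessing helper for this section could look like (the function name preprocess and the categorical encoding choice are assumptions, not the exact code used later):

import pandas as pd

def preprocess(df: pd.DataFrame,
               cat_cols=("intersection_pos_rel_centre",),
               key_col="row_id", target_col="diagnosis") -> pd.DataFrame:
    """Cast numeric features to float and integer-encode the categorical column."""
    df = df.copy()
    feature_cols = [c for c in df.columns if c not in (key_col, target_col)]
    for c in feature_cols:
        if c in cat_cols:
            # pandas assigns -1 to NaN when converting to category codes
            df[c] = df[c].astype("category").cat.codes
        else:
            df[c] = df[c].astype(float)
    return df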

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

In [8]:
# model = define_your_model
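As a minimal sketch of a model definition for this placeholder (the hyperparameter values are illustrative assumptions, not the settings behind the reported CV/LB scores):

import lightgbm as lgb

# The sklearn wrapper infers the multiclass objective from the labels at fit time.
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    random_state=2021,
)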

Load training data

In [9]:
# load your data
In [10]:
AICROWD_DATASET_PATH
Out[10]:
'/ds_shared_drive/validation.csv'
In [11]:
target_col = "diagnosis"
key_col = "row_id"
cat_cols = ['intersection_pos_rel_centre']
seed = 2021

target_values = ["normal", "post_alzheimer", "pre_alzheimer"]

train = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "train"))
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)


print(train.shape)
features = train.columns[1:-1].to_list()

numeric_features = [c for c in features if c not in cat_cols]
for c in numeric_features:
    train[c] = train[c].astype(float)

print(train[target_col].value_counts())
train.tail(3)
(32777, 122)
normal            31208
post_alzheimer     1149
pre_alzheimer       420
Name: diagnosis, dtype: int64
Out[11]:
[train.tail(3): 3 rows × 122 columns — row_id, per-digit placement and size features, clock-hand, ellipse and error features, and the diagnosis label]

Target

In [12]:
sns.countplot(x=target_col, data=train);

Numerical features

In [13]:
nb_shown = len(numeric_features)
fig, ax = plt.subplots(nb_shown, 1, figsize=(20,5*nb_shown))

colors = ["Green", "Blue", "Red"]
for i, col in enumerate(numeric_features[:nb_shown]):
    for value, color in zip(target_values, colors):
        sns.distplot(train.loc[train[target_col]==value, col], 
                     ax=ax[i], color=color, norm_hist=True)
    ax[i].set_title("Train {}".format(col))
    ax[i].set_xlabel("")