
ADDI Alzheimers Detection Challenge

Grid search + voting classifier

Perform a grid search over a voting classifier built from random forest and gradient boosting models.

sany

This is just a starter notebook for scikit-learn. The sampling and parameters must be tuned to achieve a better score.


What is the notebook about?

The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm to predict whether each participant is in one of three phases:

1) Pre-Alzheimer’s (Early Warning)
2) Post-Alzheimer’s (Detection)
3) Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.

How to use this notebook? 📝

[Image: notebook overview]

  • Update the config parameters. You can define the common variables here.

    Variable                  Description
    AICROWD_DATASET_PATH      Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
    AICROWD_PREDICTIONS_PATH  Path to write the output to.
    AICROWD_ASSETS_DIR        In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
    AICROWD_API_KEY           In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
  • Installing packages. Please use the Install packages 🗃 section to install the packages you need.
  • Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the Prediction phase 🔎 section.

Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [79]:
!pip install -q -U aicrowd-cli
In [80]:
%load_ext aicrowd.magic
The aicrowd.magic extension is already loaded. To reload it, use:
  %reload_ext aicrowd.magic

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /ds_shared_drive on the workspace.

In [81]:
import os

# Please use an absolute path for the location of the dataset.
# To build one from the working directory, use e.g. os.path.join(os.getcwd(), "test_data/validation.csv")
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"

Install packages 🗃

Please add all package installations in this section.

In [82]:
pip install numpy pandas scikit-learn
Requirement already satisfied: numpy in ./conda/lib/python3.8/site-packages (1.20.2)
Requirement already satisfied: pandas in ./conda/lib/python3.8/site-packages (1.2.4)
Requirement already satisfied: scikit-learn in ./conda/lib/python3.8/site-packages (0.24.2)
Requirement already satisfied: python-dateutil>=2.7.3 in ./conda/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./conda/lib/python3.8/site-packages (from scikit-learn) (2.1.0)
Requirement already satisfied: joblib>=0.11 in ./conda/lib/python3.8/site-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: scipy>=0.19.1 in ./conda/lib/python3.8/site-packages (from scikit-learn) (1.6.3)
Requirement already satisfied: six>=1.5 in ./conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Note: you may need to restart the kernel to use updated packages.

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

Import common packages

Please import packages that are common for training and prediction phases here.

In [83]:
import numpy as np
import pandas as pd
import pickle

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from joblib import parallel_backend
In [84]:
# Common preprocessing: drop the row_id column, map string labels to
# integer codes, and fill missing values with 0.
def read(uri):
    x_df = pd.read_csv(uri)
    x_df.drop('row_id', axis=1, inplace=True)

    cmap = {'normal': 0, 'post_alzheimer': 1, 'pre_alzheimer': 2, 'TL': 1, 'BL': 2, 'TR': 3, 'BR': 4}
    x_df = x_df.applymap(lambda s: cmap.get(s, s))
    x_df = x_df.fillna(0)
    return x_df
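
As a quick, optional sanity check (a minimal sketch, not part of the evaluated pipeline), you can confirm that the label mapping and NaN fill behave as expected on the training file used later in this notebook:

In [ ]:
# Optional sanity check: after read(), the diagnosis column should only
# contain the integer codes 0 (normal), 1 (post) and 2 (pre).
check_df = read('/ds_shared_drive/train.csv')
print(check_df.shape)
print(check_df['diagnosis'].value_counts())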

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

In [85]:
clf0 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.2, max_depth=4, random_state=0)
clf1 = RandomForestClassifier(n_estimators=100)
clf2 = RandomForestClassifier(n_estimators=100)

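# Hyperparameter grid for the gradient boosting model ('c0') and the
# first random forest ('c1'); the second random forest ('c2') keeps its
# default parameters.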
params = {'c0__learning_rate': [0.1, 0.2, 0.3],
          'c0__n_estimators': [50, 100, 150, 200],
          'c0__max_depth': [3, 4, 5],
          'c0__subsample': [1.0, 0.9, 0.7],
          'c1__n_estimators': [50, 100, 150, 200],
          'c1__min_samples_split': [2, 3]}

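# Soft voting averages the predicted class probabilities, giving the
# first random forest ('c1') twice the weight of the other two models.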
eclf = VotingClassifier(estimators=[('c0', clf0), ('c1', clf1), ('c2', clf2)], voting='soft', weights=[1, 2, 1])
clf = GridSearchCV(estimator=eclf, param_grid=params, cv=5, verbose=1)
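
Note on grid size: the c0 grid has 3 × 4 × 3 × 3 = 108 combinations and the c1 grid has 4 × 2 = 8, giving 108 × 8 = 864 candidates in total; with cv=5 that amounts to 4,320 fits, as the training log below confirms.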

Load training data

In [86]:
x_df = read('/ds_shared_drive/train.csv')

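# Balance the classes: keep every pre/post-Alzheimer's row and undersample
# an equal number of 'normal' rows, then shuffle the combined frame.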
s_low = x_df[x_df['diagnosis'].isin([1, 2])]
s_high = x_df[x_df['diagnosis'] == 0].sample(n=s_low.shape[0], random_state=2048)
x_df = pd.concat([s_low, s_high]).sample(frac=1).reset_index(drop=True)

y_df = x_df['diagnosis']
x_df.drop('diagnosis', axis=1, inplace=True)

# scaler = preprocessing.MinMaxScaler()
# x_df = scaler.fit_transform(x_df)

# print(list(train_df.columns))
# train_df.describe()

# x_train, x_test, y_train, y_test = train_test_split(x_df, y_df ,test_size = 0.2, shuffle=True)
x_train = x_df
y_train = y_df

Train your model

In [ ]:
with parallel_backend('threading', n_jobs=4):
    clf.fit(x_train, y_train)
Fitting 5 folds for each of 864 candidates, totalling 4320 fits
In [ ]:
# some custom code block
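
Once the search finishes, you can inspect the winning configuration (an optional sketch; clf is the fitted GridSearchCV from above):

In [ ]:
# Report the best hyperparameter combination and its mean cross-validated score.
print(clf.best_params_)
print(clf.best_score_)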

Save your trained model

In [ ]:
with open(os.path.join(AICROWD_ASSETS_DIR, 'model.pkl'), 'wb') as file:
    pickle.dump(clf, file)

Prediction phase 🔎

Please make sure to save the weights from the training section in your assets directory and load them in this section

In [ ]:
with open(os.path.join(AICROWD_ASSETS_DIR, 'model.pkl'), 'rb') as file:
    clf = pickle.load(file)

Load test data

In [ ]:
test_data = pd.read_csv(AICROWD_DATASET_PATH)
test_data.head()

x_df = read(AICROWD_DATASET_PATH)
# x_df = scaler.transform(x_df)

Generate predictions

In [ ]:
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": np.zeros(len(test_data["row_id"].values)),
    "post_alzheimer_diagnosis_probability": np.zeros(len(test_data["row_id"].values)),
    "pre_alzheimer_diagnosis_probability": np.zeros(len(test_data["row_id"].values)),
}

prd = clf.predict_proba(x_df)

# predict_proba columns follow clf.classes_ (0, 1, 2 after the label
# mapping in read()), so column j corresponds to r_map[j].
r_map = {0: 'normal_diagnosis_probability', 1: 'post_alzheimer_diagnosis_probability', 2: 'pre_alzheimer_diagnosis_probability'}
for j, col in r_map.items():
    predictions[col] = prd[:, j]

predictions_df = pd.DataFrame.from_dict(predictions)

Save predictions 📨

In [ ]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
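
Before submitting, it can help to eyeball the saved file (optional):

In [ ]:
# Optional: re-read the predictions file to verify the column names and
# that the probabilities were written.
pd.read_csv(AICROWD_PREDICTIONS_PATH).head()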

Submit to AIcrowd 🚀

NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)

In [ ]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge
In [ ]:

