ADDI Alzheimer's Detection Challenge
EDA, FE, HPO - All you need (LB: 0.640)
Detailed EDA, FE with Class Balancing, Hyper-Parameter Optimization of XGBoost using Optuna
This notebook explains feature-level exploratory data analysis along with observation comments, simple feature engineering including class balancing and XGBoost hyper-parameter optimization using HPO framework Optuna.
What is the notebook about?¶
The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:
1) Pre-Alzheimer’s (Early Warning)
2) Post-Alzheimer’s (Detection)
3) Normal (Not an Alzheimer’s patient)
In machine learning terms: this is a 3-class classification task.
How to use this notebook? 📝¶
- Update the config parameters. You can define the common variables here
| Variable | Description |
|---|---|
| AICROWD_DATASET_PATH | Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path. |
| AICROWD_PREDICTIONS_PATH | Path to write the output to. |
| AICROWD_ASSETS_DIR | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation. |
| AICROWD_API_KEY | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me |
- Installing packages. Please use the Install packages 🗃 section to install the packages
- Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the predictions phase section (a minimal save/load sketch is shown below).
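As an illustration, here is a minimal sketch of saving and loading model weights with joblib; the helper names save_model/load_model and the file name model.pkl are just examples, not part of the official template.
import os
import joblib

def save_model(model, assets_dir="assets", name="model.pkl"):
    # Persist the fitted model inside the assets directory so it can be reloaded at prediction time
    os.makedirs(assets_dir, exist_ok=True)
    joblib.dump(model, os.path.join(assets_dir, name))

def load_model(assets_dir="assets", name="model.pkl"):
    # Load the model that was saved during the training phase
    return joblib.load(os.path.join(assets_dir, name))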
Content:¶
- Exploratory Data Analysis
- Feature Engineering
- Hyper-parameter Optimization
- Training Best Parameters Model
- Final Prediction and submission
Introduction:¶
Hello, I am Jyot Makadiya, a pre-final-year student pursuing a Bachelor of Technology in Computer Science & Engineering. I have been experimenting with data for a year now, and so far the journey has been smooth and I have learned a lot along the way.
This challenge can be treated as a multiclass classification problem with 3 classes (Normal, Pre-Alzheimer’s, Post-Alzheimer’s). The main tasks for achieving a good score are a solid cross-validation setup with a balanced dataset, good feature engineering, and fine-tuning of hyper-parameters, along with ensembling.
This notebook covers my approach for this competition, starting with exploratory data analysis. It then covers simple feature engineering for a few features (I'll expand on the FE and ensembling ideas in the next part/walkthrough blog). Finally, we use Optuna for hyper-parameter optimization.
The aim of this notebook is to introduce you to a variety of concepts, including but not limited to hyper-parameter optimization (AutoML tools) and simple but feature-level EDA and FE.
For a better view of the graphs and plots, open this notebook in Colab using the "Open in Colab" button.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -q -U aicrowd-cli
%load_ext aicrowd.magic
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /ds_shared_drive on the workspace.
import os
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path, e.g. `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "Z:/challenge-data/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "Z:/challenge-data/predictions.csv")
AICROWD_ASSETS_DIR = "assets"
Install packages 🗃¶
Please add all package installations in this section
!pip install -q numpy pandas
!pip install -q xgboost scikit-learn seaborn lightgbm optuna
Define preprocessing code 💻¶
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
Import common packages¶
Please import packages that are common for training and prediction phases here.
import xgboost as xgb
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.color_palette("rocket_r")
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 1000)
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss, f1_score
import joblib
import warnings
warnings.filterwarnings("ignore")
# df
# with open(AICROWD_DATASET_PATH) as f:
# f.read()
# some preprocessing code
# os.listdir('Z:/challenge-data/')
#Pre Processing functions
Training phase ⚙️¶
You can define your training code here. This section will be skipped during evaluation.
# model = define_your_model
Load training data¶
df_orig = pd.read_csv("Z:/challenge-data/train.csv")
df_valid = pd.read_csv("Z:/challenge-data/validation.csv")
df_valid_target = pd.read_csv("Z:/challenge-data/validation_ground_truth.csv")
df = df_orig.copy()
df.describe()
# list(df.columns)
Exploratory Data Analysis¶
# Final Rotation Angle in degrees
feat_col = df['final_rotation_angle']
feat_col.fillna(-5,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(x = 'final_rotation_angle',data=df, palette='rocket_r', hue='diagnosis')
fig.set_xlabel("Rotation Angle in Degree",size=15)
fig.set_ylabel("Angle Frequency",size=15)
plt.title('Angle frequencies for all samples',size = 20)
plt.show()
We can notice that there are only 13 discrete values for the rotation angle. Instead of using these directly, we can re-encode them as 4 binary columns, each representing a 90-degree range (one quarter of the circle).
print(f"number of unique values for rotation angles: {feat_col.nunique()}")
#now we can change that to 4 different quarter columns
df['rotation_angle_90'] = (feat_col <= 90).astype('int') #NaN values (filled with -5 above) also fall into this column
df['rotation_angle_180'] = ((90 < feat_col) & (feat_col <= 180)).astype('int')
df['rotation_angle_270'] = ((180 < feat_col) & (feat_col <= 270)).astype('int')
df['rotation_angle_360'] = (feat_col > 270).astype('int')
#We are not using these currently; instead we will use two columns for below 180 and above 180
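The two-column variant mentioned in the comment above is not shown in the original cell; a minimal sketch of what it could look like is given below (the column names rotation_angle_below_180 and rotation_angle_above_180 are illustrative, and the -5 fill value for missing angles falls into the below-180 column):
df['rotation_angle_below_180'] = (feat_col <= 180).astype('int')  # includes the -5 placeholder for missing angles
df['rotation_angle_above_180'] = (feat_col > 180).astype('int')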
# number of digits
feat_col = df['number_of_digits']
feat_col.fillna(-1,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(data=df, x="number_of_digits",palette='rocket', hue="diagnosis" )
fig.set_xlabel("number of digits",size=15)
fig.set_ylabel("Digits Frequency",size=15)
plt.title('Num Digits frequencies for all samples',size = 20)
plt.show()
print(f"number of unique values for number digits: {df['number_of_digits'].nunique()}")
We can notice that most of the values lie in the 10, 11, 12 count range, which is a good indicator for the large normal portion of our dataset. So a new binary feature that is true when the count is 10, 11 or 12 may be useful.
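A minimal sketch of that indicator feature (the column name digits_10_to_12 is illustrative, not from the original notebook):
df['digits_10_to_12'] = df['number_of_digits'].isin([10, 11, 12]).astype('int')  # 1 if the drawing has 10-12 digits, else 0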
#Let's look at some of the features with categorical values that repeat across all 12 digit positions
#For missing digit values
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"missing_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
sns.countplot(data=df, x=feature,palette='rocket' )
plt.xlabel(f"Count of values for {feature}", fontsize=12);# plt.legend()
plt.show()
The ratio is about the same for almost all the digits, with around 5000 values missing for each. We can notice a somewhat larger portion of missing values in the missing_digit_1 and missing_digit_5 variables.
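Since the per-digit indicators behave similarly, an aggregate count of missing digits per drawing could be a useful summary feature; a minimal sketch, assuming the missing_digit_* columns are 0/1 indicators (the column name total_missing_digits is illustrative):
missing_cols = [f"missing_digit_{i}" for i in range(1, 13)]
df['total_missing_digits'] = df[missing_cols].sum(axis=1)  # how many of the 12 digits are missing in each drawing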
#Let's look at the Euclidean distance for each digit
#this feature can be calculated using the Euclidean distance formula between the ideal and the detected digit positions, i.e. sqrt(a^2 + b^2)
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"euc_dist_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-10,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency of values for {feature}", fontsize=12);# plt.legend()
plt.show()
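For reference, a minimal self-contained sketch of how such a Euclidean distance could be computed; the coordinates below are made up for illustration and are not columns of this dataset:
# Hypothetical ideal vs. detected position of a single digit
ideal = np.array([512.0, 112.0])   # made-up ideal (x, y) position
found = np.array([540.0, 150.0])   # made-up detected (x, y) position
a, b = found - ideal               # offsets along x and y
dist = np.sqrt(a**2 + b**2)        # sqrt(a^2 + b^2)
print(dist)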
#Let's look at the Euclidean distance from the center (512, 512) to each digit
#this feature can be calculated using the Euclidean distance formula between the center and the detected digit position, i.e. sqrt(a^2 + b^2)
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"{i} dist from cen" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-10,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution of values for {feature}", fontsize=12);# plt.legend()
plt.show()
The distributions look roughly Gaussian and fairly symmetric, with a spread of around 200. Another thing to notice is that there are a lot of missing values in these variables.
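To quantify that, a small sketch that counts the missing values per distance column; it uses df_orig because the plotting loop above has already filled the NaNs in df with -10:
dist_cols = [f"{i} dist from cen" for i in range(1, 13)]
print(df_orig[dist_cols].isna().sum())  # missing-value count per "dist from cen" column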
#Next set of variables are area for each digit bounding boxes
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"area_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
We can notice that the distributions have large variance and appear skewed. We may use some feature engineering to correct for this.
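One common option, shown here only as a sketch (it is not used later in this notebook), is a log transform to reduce the skew; the _log column suffix is illustrative:
area_cols = [f"area_digit_{i}" for i in range(1, 13)]
for col in area_cols:
    # log1p compresses the long right tail of the skewed area distributions;
    # the -1 placeholder for missing areas is clipped to 0 first so it maps to log1p(0) = 0
    df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))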
#Next set of variables are height of each digit bounding boxes
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"height_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
There is a lot of variance in the heights of the bounding boxes. This makes sense, as the box size will differ between digits, especially for the two-character numbers 11 and 12.
#Next set of variables are width for each digit bounding boxes
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"width_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
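Since both height and width vary with how many characters each number has, a per-digit width-to-height aspect ratio could partially normalize the box sizes; a minimal sketch (the aspect_ratio_digit_* column names are illustrative, and rows where either dimension carries the -1 placeholder are set to -1):
for i in range(1, 13):
    h = df[f"height_digit_{i}"]
    w = df[f"width_digit_{i}"]
    valid = (h > 0) & (w > 0)  # ignore rows where height/width were filled with -1
    df[f"aspect_ratio_digit_{i}"] = (w / h).where(valid, -1)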