ADDI Alzheimer's Detection Challenge
EDA, FE, HPO - All you need (LB: 0.640)
Detailed EDA, FE with Class Balancing, Hyper-Parameter Optimization of XGBoost using Optuna
This notebook explains feature-level exploratory data analysis along with observation comments, simple feature engineering including class balancing and XGBoost hyper-parameter optimization using HPO framework Optuna.
What is the notebook about?¶
The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:
1) Pre-Alzheimer’s (Early Warning)
2) Post-Alzheimer’s (Detection)
3) Normal (Not an Alzheimer’s patient)
In machine learning terms: this is a 3-class classification task.
How to use this notebook? 📝¶
- Update the config parameters. You can define the common variables here
| Variable | Description |
|---|---|
| AICROWD_DATASET_PATH | Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path. |
| AICROWD_PREDICTIONS_PATH | Path to write the output to. |
| AICROWD_ASSETS_DIR | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation. |
| AICROWD_API_KEY | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me |
- Installing packages. Please use the Install packages 🗃 section to install the packages
- Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the predictions phase section (a minimal save/load sketch is shown below).
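As an illustration, here is a minimal sketch of saving and loading model weights with joblib; the helper names save_model/load_model and the file name model.pkl are just examples, not part of the official template.
import os
import joblib

def save_model(model, assets_dir="assets", name="model.pkl"):
    # Persist the fitted model inside the assets directory so it can be reloaded at prediction time
    os.makedirs(assets_dir, exist_ok=True)
    joblib.dump(model, os.path.join(assets_dir, name))

def load_model(assets_dir="assets", name="model.pkl"):
    # Load the model that was saved during the training phase
    return joblib.load(os.path.join(assets_dir, name))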
Content:¶
- Exploratory Data Analysis
- Feature Engineering
- Hyper-parameter Optimization
- Training Best Parameters Model
- Final Prediction and submission
Introduction:¶
Hello, I am Jyot Makadiya, a pre-final-year student pursuing a Bachelor of Technology in Computer Science & Engineering. I have been experimenting with data for a year now, and so far the journey has been smooth and I have learned a lot along the way.
This challenge can be treated as a multiclass classification problem with 3 classes (Normal, Pre-Alzheimer’s, Post-Alzheimer’s). The main tasks for achieving a good score are a solid cross-validation setup with a balanced dataset, good feature engineering, and fine-tuning of hyper-parameters, along with ensembling.
This notebook covers my approach for this competition, starting with exploratory data analysis. It then covers simple feature engineering for a few features (I'll expand on the FE and ensembling ideas in the next part/walkthrough blog). Finally, we use Optuna for hyper-parameter optimization.
The aim of this notebook is to introduce you to a variety of concepts, including but not limited to hyper-parameter optimization (AutoML tools) and simple but feature-level EDA and FE.
For a better view of the graphs and plots, open this notebook in Colab using the "Open in Colab" button.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -q -U aicrowd-cli
%load_ext aicrowd.magic
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /ds_shared_drive on the workspace.
import os
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path, e.g. `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "Z:/challenge-data/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "Z:/challenge-data/predictions.csv")
AICROWD_ASSETS_DIR = "assets"
Install packages 🗃¶
Please add all package installations in this section
!pip install -q numpy pandas
!pip install -q xgboost scikit-learn seaborn lightgbm optuna
Define preprocessing code 💻¶
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
Import common packages¶
Please import packages that are common for training and prediction phases here.
import xgboost as xgb
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.color_palette("rocket_r")
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 1000)
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss, f1_score
import joblib
import warnings
warnings.filterwarnings("ignore")
# df
# with open(AICROWD_DATASET_PATH) as f:
# f.read()
# some preprocessing code
# os.listdir('Z:/challenge-data/')
#Pre Processing functions
Training phase ⚙️¶
You can define your training code here. This section will be skipped during evaluation.
# model = define_your_model
Load training data¶
df_orig = pd.read_csv("Z:/challenge-data/train.csv")
df_valid = pd.read_csv("Z:/challenge-data/validation.csv")
df_valid_target = pd.read_csv("Z:/challenge-data/validation_ground_truth.csv")
df = df_orig.copy()
df.describe()
# list(df.columns)
Exploratory Data Analysis¶
# Final Rotation Angle in degrees
feat_col = df['final_rotation_angle']
feat_col.fillna(-5,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(x = 'final_rotation_angle',data=df, palette='rocket_r', hue='diagnosis')
fig.set_xlabel("Rotation Angle in Degree",size=15)
fig.set_ylabel("Angle Frequency",size=15)
plt.title('Angle frequencies for all samples',size = 20)
plt.show()
We can notice that there are only 13 discrete values for the rotation angle. Instead of using these directly, we can re-encode them as 4 binary columns, each representing a 90-degree range (one quarter of the circle).
print(f"number of unique values for rotation angles: {feat_col.nunique()}")
#now we can change that to 4 different quarter columns
df['rotation_angle_90'] = (feat_col <= 90).astype('int') #NaN values (filled with -5 above) also fall into this column
df['rotation_angle_180'] = ((90 < feat_col) & (feat_col <= 180)).astype('int')
df['rotation_angle_270'] = ((180 < feat_col) & (feat_col <= 270)).astype('int')
df['rotation_angle_360'] = (feat_col > 270).astype('int')
#We are not using these currently; instead we will use two columns for below 180 and above 180
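The two-column variant mentioned in the comment above is not shown in the original cell; a minimal sketch of what it could look like is given below (the column names rotation_angle_below_180 and rotation_angle_above_180 are illustrative, and the -5 fill value for missing angles falls into the below-180 column):
df['rotation_angle_below_180'] = (feat_col <= 180).astype('int')  # includes the -5 placeholder for missing angles
df['rotation_angle_above_180'] = (feat_col > 180).astype('int')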
# number of digits
feat_col = df['number_of_digits']
feat_col.fillna(-1,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(data=df, x="number_of_digits",palette='rocket', hue="diagnosis" )
fig.set_xlabel("number of digits",size=15)
fig.set_ylabel("Digits Frequency",size=15)
plt.title('Num Digits frequencies for all samples',size = 20)
plt.show()
print(f"number of unique values for number digits: {df['number_of_digits'].nunique()}")
We can notice that most of the values lie in the 10, 11, 12 count range, which is a good indicator for the large normal portion of our dataset. So a new binary feature that is true when the count is 10, 11 or 12 may be useful.
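A minimal sketch of that indicator feature (the column name digits_10_to_12 is illustrative, not from the original notebook):
df['digits_10_to_12'] = df['number_of_digits'].isin([10, 11, 12]).astype('int')  # 1 if the drawing has 10-12 digits, else 0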
#Let's look at some of the features with categorical values that repeat across all 12 digit positions
#For missing digit values
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"missing_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
sns.countplot(data=df, x=feature,palette='rocket' )
plt.xlabel(f"Count of values for {feature}", fontsize=12);# plt.legend()
plt.show()
The ratio is about the same for almost all the digits, with around 5000 values missing for each. We can notice a somewhat larger portion of missing values in the missing_digit_1 and missing_digit_5 variables.
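Since the per-digit indicators behave similarly, an aggregate count of missing digits per drawing could be a useful summary feature; a minimal sketch, assuming the missing_digit_* columns are 0/1 indicators (the column name total_missing_digits is illustrative):
missing_cols = [f"missing_digit_{i}" for i in range(1, 13)]
df['total_missing_digits'] = df[missing_cols].sum(axis=1)  # how many of the 12 digits are missing in each drawing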
#Let's look at the Euclidean distance for each digit
#this feature can be calculated using the Euclidean distance formula between the ideal and the detected digit positions, i.e. sqrt(a^2 + b^2)
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"euc_dist_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-10,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency of values for {feature}", fontsize=12);# plt.legend()
plt.show()
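For reference, a minimal self-contained sketch of how such a Euclidean distance could be computed; the coordinates below are made up for illustration and are not columns of this dataset:
# Hypothetical ideal vs. detected position of a single digit
ideal = np.array([512.0, 112.0])   # made-up ideal (x, y) position
found = np.array([540.0, 150.0])   # made-up detected (x, y) position
a, b = found - ideal               # offsets along x and y
dist = np.sqrt(a**2 + b**2)        # sqrt(a^2 + b^2)
print(dist)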
#Let's look at the Euclidean distance from the center (512, 512) to each digit
#this feature can be calculated using the Euclidean distance formula between the center and the detected digit position, i.e. sqrt(a^2 + b^2)
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"{i} dist from cen" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-10,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution of values for {feature}", fontsize=12);# plt.legend()
plt.show()
The distributions look roughly Gaussian and fairly symmetric, with a spread of around 200. Another thing to notice is that there are a lot of missing values in these variables.
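To quantify that, a small sketch that counts the missing values per distance column; it uses df_orig because the plotting loop above has already filled the NaNs in df with -10:
dist_cols = [f"{i} dist from cen" for i in range(1, 13)]
print(df_orig[dist_cols].isna().sum())  # missing-value count per "dist from cen" column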
#Next set of variables are area for each digit bounding boxes
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"area_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
We can notice that the distributions have large variance and appear skewed. We may use some feature engineering to correct for this.
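One common option, shown here only as a sketch (it is not used later in this notebook), is a log transform to reduce the skew; the _log column suffix is illustrative:
area_cols = [f"area_digit_{i}" for i in range(1, 13)]
for col in area_cols:
    # log1p compresses the long right tail of the skewed area distributions;
    # the -1 placeholder for missing areas is clipped to 0 first so it maps to log1p(0) = 0
    df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))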
#Next set of variables are height of each digit bounding boxes
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"height_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
There is a lot of variance in the heights of the bounding boxes. This makes sense, as the box size will differ between digits, especially for the two-character numbers 11 and 12.
#Next set of variables are width for each digit bounding boxes
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"width_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
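Since both height and width vary with how many characters each number has, a per-digit width-to-height aspect ratio could partially normalize the box sizes; a minimal sketch (the aspect_ratio_digit_* column names are illustrative, and rows where either dimension carries the -1 placeholder are set to -1):
for i in range(1, 13):
    h = df[f"height_digit_{i}"]
    w = df[f"width_digit_{i}"]
    valid = (h > 0) & (w > 0)  # ignore rows where height/width were filled with -1
    df[f"aspect_ratio_digit_{i}"] = (w / h).where(valid, -1)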