# End-to-End Simple Solution (9 Models + Data Imbalance)

In this simple notebook I provide an end-to-end solution with 9 models and several options for handling data imbalance. The main idea is for you to have FUN while plugging in different models, trying different balancing methods, and choosing the combination you like best!

## WELCOME to my Simple Solution!

### Hello stranger, do I know you from before?

Hi! My name is Santiago, I am a geophysicist from Argentina who is starting his Data Science journey, maybe just like you!

This notebook is intended to bring a smile to your face while achieving an end-to-end solution to this wonderful challenge. I will try to make this time fun and at the same time learn something new.

Nothing explained here follows a rigorous approach; I tried to explain everything the way I would have liked to have it explained to me.

#### A little story before we go in.

A few years back my grandmother was diagnosed with Alzheimer's. This disease is a slow, crippling machine. You get to see people who are bright, happy and active suddenly lose their brightness and activeness, but you still sometimes get glimpses of their happiness. You are able to recognize these moments by the look in their eyes. My grandma went through the test; it was actually the first test that she went through. Her results, as my father told me, had all of the numbers (more than just twelve) grouped up at the top right of the clock: "It was as if for her, the time didn't pass". I am a very optimistic, positive person, and humor helped me cope with this journey, and I will try to transmit this in this notebook.

## What is the notebook about?

You probably already know this by now but:

The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:

1) Pre-Alzheimer’s (Early Warning)
2) Post-Alzheimer’s (Detection)
3) Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.

## THIS NOTEBOOK HAS 9 MODELS YOU CAN TRY

WAIT, WHAT? Yes, I have put 9 models you can try to optimize. Nothing fancy on their own, but I guess you didn't expect me to do your job, right?

I also provide a few methods to address the imbalance situation, so go ahead and try those too!

### Explain it to me like I'm five years old:

Imagine you are a fan of hotdogs and you like to eat them with several toppings. You try many combinations of 5 different toppings (called A, B, C, D and E) and notice that some combinations taste similar, so you decide to rank them. Finally you label them into 3 groups:

• The Puke-nators: you hate these. The ones that fell in this category had the following combination of toppings: A; A and C; A and D; and A,D and E.
• The MehMehs: these are like those memes that your parents send you, a big MEH. Same as before, these were the topping combinations that fell in this category: B; B and C; and B and E.
• The ThisNotebookShouldWinAPS5s: yes, you like these, so press the like button below. The combinations you tried here had: C and E; C and D; C; and D.

Now a good friend of yours brings you one hotdog. You look at it and see that it has the following toppings: C, B, D and E. When you look at the toppings, your refined palate tells you that there is a good chance that you'll have to press the like butt...err I mean that you would like that hotdog.

This is the basic idea behind this challenge: we train our computer to learn, from the combination of the values in all of the dataset's columns together with the Alzheimer's stage label, which label corresponds to a (maybe) new combination of values for which it has no label.
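To make the hotdog story concrete, here is a minimal sketch under a very naive assumption: every topping a new hotdog shares with a known hotdog counts as one vote for that hotdog's group. The voting rule and the function names `score_groups` and `predict_group` are made up for illustration; none of the 9 models in this notebook works exactly like this.

```python
from collections import Counter

# Labeled examples from the story: each hotdog is a set of toppings.
TRAINING_HOTDOGS = {
    "Puke-nators": [{"A"}, {"A", "C"}, {"A", "D"}, {"A", "D", "E"}],
    "MehMehs": [{"B"}, {"B", "C"}, {"B", "E"}],
    "ThisNotebookShouldWinAPS5s": [{"C", "E"}, {"C", "D"}, {"C"}, {"D"}],
}


def score_groups(new_hotdog, training=TRAINING_HOTDOGS):
    """Give each group one vote per topping shared with one of its known hotdogs."""
    votes = Counter()
    for group, hotdogs in training.items():
        for toppings in hotdogs:
            votes[group] += len(new_hotdog & toppings)
    return votes


def predict_group(new_hotdog):
    """Return the group that collected the most votes."""
    return score_groups(new_hotdog).most_common(1)[0][0]


# The hotdog your friend brings: C, B, D and E.
friend_hotdog = {"C", "B", "D", "E"}
print(score_groups(friend_hotdog))
print(predict_group(friend_hotdog))
```

The real models do something similar in spirit: they look at labeled examples and learn which combinations of column values point to which label.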

# How to use this notebook? 📝

### SHIFT + ENTER and SHIFT + ENTER and SHIFT + ENTER...

up to where you need to modify the config parameters for your AIC Key.

• Update the config parameters. You can define the common variables here:

| Variable | Description |
| --- | --- |
| `AICROWD_DATASET_PATH` | Path to the file containing test data (the data will be available at `/ds_shared_drive/` on the Aridhia workspace). This should be an absolute path. |
| `AICROWD_PREDICTIONS_PATH` | Path to write the output to. |
| `AICROWD_ASSETS_DIR` | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation. |
| `AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me |

• Installing packages. Please use the Install packages 🗃 section to install the packages.
• Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the prediction phase section.

# Setup AIcrowd Utilities 🛠

### dum dumdum dum, dum dum CAN'T TOUCH THIS

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In :
```!pip install -q -U aicrowd-cli
```
In :
```%load_ext aicrowd.magic
```

# AIcrowd Runtime Configuration 🧷

### dum dumdum dum, dum dum DO TOUCH THIS...

Define configuration parameters. Please include any files needed for the notebook to run under `ASSETS_DIR`. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under `/ds_shared_drive` on the workspace.

## Basically: be sure to enter your AIC Key below!

In :
```import os

# Please use an absolute path for the location of the dataset.
# Or you can build one from a relative path with `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
AICROWD_API_KEY = "" # HEY! Get your own key from https://www.aicrowd.com/participants/me
```
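Before moving on, a quick sanity check of these values can save a failed submission. The sketch below is only an illustration: `check_config` is a hypothetical helper (it is not part of the AIcrowd tooling), and it only checks the three rules stated above (absolute dataset path, relative assets directory, non-empty key).

```python
import os


def check_config(dataset_path, assets_dir, api_key):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if not os.path.isabs(dataset_path):
        problems.append("dataset path should be absolute: " + dataset_path)
    if os.path.isabs(assets_dir):
        problems.append("assets dir should be relative: " + assets_dir)
    if not api_key:
        problems.append("API key is empty")
    return problems


# Same defaults as the configuration cell, with the key left blank on purpose.
problems = check_config(
    dataset_path="/ds_shared_drive/validation.csv",
    assets_dir="assets",
    api_key="",
)
print(problems)
```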

# Install packages 🗃

In :
```# Installing me packahertz

!pip install numpy pandas
!pip install -U scikit-learn
!pip install -U imblearn catboost
!pip install -U imbalanced-learn
```
```Requirement already satisfied: numpy in ./conda/lib/python3.8/site-packages (1.20.2)
Requirement already satisfied: pandas in ./conda/lib/python3.8/site-packages (1.2.4)
Requirement already satisfied: python-dateutil>=2.7.3 in ./conda/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: six>=1.5 in ./conda/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already up-to-date: scikit-learn in ./conda/lib/python3.8/site-packages (0.24.2)
Requirement already up-to-date: imblearn in ./conda/lib/python3.8/site-packages (0.0)
Requirement already up-to-date: catboost in ./conda/lib/python3.8/site-packages (0.25.1)
```

# Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. Here you take the time to do your own stuff, create your own stuff and make your own solution, the one that will run the world.

### Import common packages

Please import packages that are common for training and prediction phases here.

In :
```import numpy as np
import pandas as pd
import sklearn as sk
import pickle
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score, log_loss
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import NearMiss
from imblearn.under_sampling import TomekLinks
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
```

# Training phase ⚙️

Cue Rocky's movie theme music

You can define your training code here. Remember that these code chunks are skipped during evaluations so don't leave anything behind.

In :
```# Look at me, look how I expand my horizons to show all of the data

pd.set_option('display.max_columns', None)
```
In :
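```# We will load our train data, choo choo
# NOTE: the training file name below is an assumption, point it at your own copy
df = pd.read_csv('/ds_shared_drive/train.csv')

# Now we load our validation data, this dataset will help us to compare
df_val = pd.read_csv('/ds_shared_drive/validation.csv')

# Still not done, here we smash the previous dataset with the ground-truth
df_val = pd.merge(df_val, pd.read_csv('/ds_shared_drive/validation_ground_truth.csv'), how='left', on='row_id')
```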
In :
```#column_list = df.columns.tolist()
#print(column_list)
```
In :
```# May want to try something with this, I don't know..

#df = df.drop(['missing_digit_1', 'missing_digit_2', 'missing_digit_3', 'missing_digit_4', 'missing_digit_5', 'missing_digit_6', 'missing_digit_7', 'missing_digit_8', 'missing_digit_9', 'missing_digit_10', 'missing_digit_11', 'missing_digit_12', '1 dist from cen', '10 dist from cen', '11 dist from cen', '12 dist from cen', '2 dist from cen', '3 dist from cen', '4 dist from cen', '5 dist from cen', '6 dist from cen', '7 dist from cen', '8 dist from cen', '9 dist from cen', 'euc_dist_digit_1', 'euc_dist_digit_2', 'euc_dist_digit_3', 'euc_dist_digit_4', 'euc_dist_digit_5', 'euc_dist_digit_6', 'euc_dist_digit_7', 'euc_dist_digit_8', 'euc_dist_digit_9', 'euc_dist_digit_10', 'euc_dist_digit_11', 'euc_dist_digit_12', 'area_digit_1', 'area_digit_2', 'area_digit_3', 'area_digit_4', 'area_digit_5', 'area_digit_6', 'area_digit_7', 'area_digit_8', 'area_digit_9', 'area_digit_10', 'area_digit_11', 'area_digit_12', 'height_digit_1', 'height_digit_2', 'height_digit_3', 'height_digit_4', 'height_digit_5', 'height_digit_6', 'height_digit_7', 'height_digit_8', 'height_digit_9', 'height_digit_10', 'height_digit_11', 'height_digit_12', 'width_digit_1', 'width_digit_2', 'width_digit_3', 'width_digit_4', 'width_digit_5', 'width_digit_6', 'width_digit_7', 'width_digit_8', 'width_digit_9', 'width_digit_10', 'width_digit_11', 'width_digit_12'],axis=1)
#df_val = df_val.drop(['missing_digit_1', 'missing_digit_2', 'missing_digit_3', 'missing_digit_4', 'missing_digit_5', 'missing_digit_6', 'missing_digit_7', 'missing_digit_8', 'missing_digit_9', 'missing_digit_10', 'missing_digit_11', 'missing_digit_12', '1 dist from cen', '10 dist from cen', '11 dist from cen', '12 dist from cen', '2 dist from cen', '3 dist from cen', '4 dist from cen', '5 dist from cen', '6 dist from cen', '7 dist from cen', '8 dist from cen', '9 dist from cen', 'euc_dist_digit_1', 'euc_dist_digit_2', 'euc_dist_digit_3', 'euc_dist_digit_4', 'euc_dist_digit_5', 'euc_dist_digit_6', 'euc_dist_digit_7', 'euc_dist_digit_8', 'euc_dist_digit_9', 'euc_dist_digit_10', 'euc_dist_digit_11', 'euc_dist_digit_12', 'area_digit_1', 'area_digit_2', 'area_digit_3', 'area_digit_4', 'area_digit_5', 'area_digit_6', 'area_digit_7', 'area_digit_8', 'area_digit_9', 'area_digit_10', 'area_digit_11', 'area_digit_12', 'height_digit_1', 'height_digit_2', 'height_digit_3', 'height_digit_4', 'height_digit_5', 'height_digit_6', 'height_digit_7', 'height_digit_8', 'height_digit_9', 'height_digit_10', 'height_digit_11', 'height_digit_12', 'width_digit_1', 'width_digit_2', 'width_digit_3', 'width_digit_4', 'width_digit_5', 'width_digit_6', 'width_digit_7', 'width_digit_8', 'width_digit_9', 'width_digit_10', 'width_digit_11', 'width_digit_12'],axis=1)
```

We have one categorical column that we will turn into numerical (actually boolean) columns, one per category; for this we use the `get_dummies` function from pandas.

In :
```# we turn our cat columns to non-cat

# missing categories get their own label 'N'
df['intersection_pos_rel_centre'] = df['intersection_pos_rel_centre'].fillna('N')
df_val['intersection_pos_rel_centre'] = df_val['intersection_pos_rel_centre'].fillna('N')

# one 0/1 column per category
df_dummies = pd.get_dummies(df['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre')

df_val_dummies = pd.get_dummies(df_val['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre')

#and then we drop the original ones from the datasets
df = df.drop('intersection_pos_rel_centre', axis=1)
df_val = df_val.drop('intersection_pos_rel_centre', axis=1)

#our new sets are the concatenation of the last ones
df = pd.concat([df, df_dummies], axis=1)
df_val = pd.concat([df_val, df_val_dummies], axis=1)
```
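One thing to watch out for: when `get_dummies` runs separately on each dataset, a category that is missing from one of them produces a different set of columns. The sketch below shows one possible way to guard against that with `reindex`; the category values (`TL`, `BL`, `TR`) are made up for the example and are not taken from the real data.

```python
import pandas as pd

# Toy data: the validation set is missing one category ("BL") seen in training.
train = pd.DataFrame({"intersection_pos_rel_centre": ["TL", "BL", "N", "TR"]})
val = pd.DataFrame({"intersection_pos_rel_centre": ["TR", "N", "TL"]})

# dtype=int gives plain 0/1 columns.
train_dummies = pd.get_dummies(train["intersection_pos_rel_centre"], dtype=int)
val_dummies = pd.get_dummies(val["intersection_pos_rel_centre"], dtype=int)

# Reuse the training columns: categories missing here are added as all-zero
# columns, and categories never seen in training would be dropped.
val_dummies = val_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(val_dummies.columns))
```

If both lists match, the model will see the features in the same order in every dataset.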
In :
```# We create our training vectors

# We drop the ids and the diagnosis
X = df.drop(['row_id', 'diagnosis'], axis=1)
# we save the diagnosis as our target
y = df['diagnosis']

# And we create our validation vectors
X_val = df_val.drop(['row_id', 'diagnosis'], axis=1)
y_val = df_val['diagnosis']
```

We will impute our missing values, represented by nanas, I mean NaNs, with the constant value 999.

In :
```# We will impute our missing values, represented by nanas, I mean NaNs,
# with the constant value 999 (strategy='most_frequent' is another option
# you may want to try)

# we create the imputer object and fit it on the train and validation sets together
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=999)
imputer.fit(pd.concat([X,X_val]))
X_imputed = imputer.transform(X)
X_val = imputer.transform(X_val)
```
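If you want to see what the constant strategy does on its own, here is a tiny sketch on made-up numbers (the arrays are invented for the example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0]])
X_new = np.array([[np.nan, 3.0]])

# Constant strategy: every NaN becomes 999, whatever column it sits in.
imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=999)
imputer.fit(X_train)

X_train_imputed = imputer.transform(X_train)
X_new_imputed = imputer.transform(X_new)
print(X_train_imputed)
print(X_new_imputed)
```

Keep in mind that a large filler like 999 may suit tree-based models better than distance-based ones such as KNN or SVC, where it can dominate the distances, so it is worth comparing strategies when you switch models.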

## Bringing balance to the force

This dataset is highly unbalanced as you will see after running this code chunk, therefore, the algorithm that we choose might end up disregarding the minority labels and always predicting the majority label.

I provide several methods that you can try to enhance your model:

• Random undersampling (with imblearn) = Randomly slicing a whole bunch of the "normal" labels
• Using SMOTE + undersampling
• Random oversampling (with imblearn)
• NearMiss

I'll let you choose which one you use, I'll go for the good old "slicing the hell out of the dataset"

In :
```# Number of samples of each label
'''
print('# of "Normal" samples: {} \n# of "Post Alzheimer" samples: {} \n# of "Pre Alzheimer" samples: {}'.format(len(df.loc[df['diagnosis'] == 'normal']),
len(df.loc[df['diagnosis'] == 'post_alzheimer']),
len(df.loc[df['diagnosis'] == 'pre_alzheimer'])))

#Trying random slicing of normal values

df = pd.concat([
df.loc[df['diagnosis'] == 'pre_alzheimer'],
df.loc[df['diagnosis'] == 'post_alzheimer'],
df.loc[df['diagnosis'] == 'normal'].sample(frac=0.3)]
)

'''
#saving the amount of samples for each class
n_pre = len(df.loc[df['diagnosis'] == 'pre_alzheimer'])
n_post = len(df.loc[df['diagnosis'] == 'post_alzheimer'])
n_nor = len(df.loc[df['diagnosis'] == 'normal'])

print('After slicing the Dataset:\n# of "Normal" samples: {} \n# of "Post Alzheimer" samples: {} \n# of "Pre Alzheimer" samples: {}'.format(n_nor, n_post, n_pre))
```
```After slicing the Dataset:
# of "Normal" samples: 31208
# of "Post Alzheimer" samples: 1149
# of "Pre Alzheimer" samples: 420
```
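To get a feeling for how lopsided these counts are, the sketch below turns them into class shares and into "balanced" class weights, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn uses for `class_weight='balanced'`. It is only meant as an illustration of the imbalance, not as a step of this notebook's pipeline.

```python
# Class counts printed by the cell above.
counts = {"normal": 31208, "post_alzheimer": 1149, "pre_alzheimer": 420}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of the dataset taken by each class.
shares = {label: n / n_samples for label, n in counts.items()}

# "Balanced" weights: the rarer the class, the larger its weight.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.2f}")
```

Penalizing mistakes on rare classes with such weights is an alternative to resampling for the models that accept a `class_weight` argument.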
In :
```under = RandomUnderSampler(sampling_strategy={
'normal':int(n_nor*0.3),
'post_alzheimer':n_post,
'pre_alzheimer':n_pre
}, replacement = True)

#and we fit it
X_imputed, y = under.fit_resample(X_imputed, y)

'''
###
# Here I give you the code for using SMOTE

over = SMOTE(sampling_strategy={
'normal':n_nor,
'post_alzheimer':n_post*4,
'pre_alzheimer':n_pre*4

})

under = RandomUnderSampler(sampling_strategy={
'normal':int(n_nor*0.4),
'post_alzheimer':n_post*4,
'pre_alzheimer':n_pre*4
})

# We put both steps into a pipeline

steps = [('o', over), ('u', under)]

#we run it
pipeline = Pipeline(steps=steps)

#and we fit it
X_imputed, y = pipeline.fit_resample(X_imputed, y)

# This section is Random Oversampling

over = RandomOverSampler(sampling_strategy={
'normal':n_nor,
'post_alzheimer':n_post*4,
'pre_alzheimer':n_pre*4
},)

#and we fit it
X_imputed, y = over.fit_resample(X_imputed, y)

#This section is for the Tomek links

tomek = TomekLinks()

#and we fit it
X_imputed, y = tomek.fit_resample(X_imputed, y)

#This section is for the NearMiss

nm = NearMiss()

X_imputed, y = nm.fit_resample(X_imputed,y)'''
```
In :
```# Let's see how many we have of each one:

unique, counts = np.unique(y, return_counts=True)

balanced_counts = dict(zip(unique, counts))

print(balanced_counts)
```
```{'normal': 9362, 'post_alzheimer': 1149, 'pre_alzheimer': 420}
```

## Now the fun part!

Here we can define our model/s, I have added a few so you can try them and pick the one you want.

Always take into account that you need to tune your models properly, and the data balance section also plays an important role here.

For all of the models that come from imblearn or sklearn you may use the `test_my_model` function defined in the next cell to compute a cross-validation score!
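As a rough illustration of what repeated stratified cross-validation looks like, here is a self-contained sketch on synthetic data. The dataset from `make_classification`, the small decision tree and the names `X_toy`, `y_toy` and `toy_model` are stand-ins chosen for the example, not the real features or a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class data standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

# Stratified folds keep the class proportions similar in every split.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=1)
toy_model = DecisionTreeClassifier(max_depth=5, random_state=0)

# "f1_macro" averages the per-class F1, so the small classes count as much
# as the big one; plain "f1" only supports binary targets.
cv_scores = cross_val_score(toy_model, X_toy, y_toy, scoring="f1_macro", cv=cv)
print(cv_scores.mean(), cv_scores.std())
```

With 3 splits and 2 repeats you get 6 scores; their spread gives a hint of how stable the model is across different splits.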

In :
```'''
model = RandomForestClassifier(n_estimators=1000)
model.fit(X_imputed, y)

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_imputed, y)

model.fit(X_imputed, y)

model.fit(X_imputed, y)

model = BaggingClassifier(n_estimators=1000)
model.fit(X_imputed, y)

model = GaussianProcessClassifier()
model.fit(X_imputed, y)

model = CatBoostClassifier(verbose=False, depth=10)
model.fit(X_imputed, y, eval_set= (X_val, y_val), early_stopping_rounds=30)

model = SVC(gamma='auto', probability=True) #you may use penalization by removing the undersampling and using class_weight='balanced'
model.fit(X_imputed, y)

model = GaussianNB()
model.fit(X_imputed, y)

def test_my_model(X, y, model, splits=3, repeats=3, state=1):
    X, y = shuffle(X, y, random_state=state)
    model_cv = RepeatedStratifiedKFold(n_splits=splits, n_repeats=repeats, random_state=state)
    # 'f1_macro' because plain 'f1' only supports binary targets
    score = cross_val_score(model, X, y, scoring='f1_macro', cv=model_cv, n_jobs=4)
    return score

scores = test_my_model(X_imputed, y, model)
print("Mean macro F1 Kfold score = {}".format(np.mean(scores)))
'''

model = RandomForestClassifier(n_estimators=1000)
model.fit(X_imputed, y)

log_loss_value = log_loss(y_val, model.predict_proba(X_val))
f1_value = f1_score(y_val, model.predict(X_val), average='macro')

print("log_loss_value over Validation = {}\nf1_value over Validation = {}".format(log_loss_value, f1_value))
```
```log_loss_value over Validation = 0.6434322071297881
f1_value over Validation = 0.35820132856412884
```
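When reading these numbers, it helps to know what a "lazy" model would score. The sketch below uses made-up labels (90 / 7 / 3, not the real distribution) and a model that always answers "normal", to show why accuracy is misleading on imbalanced data and macro F1 is not.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with a similar flavour of imbalance: 90 / 7 / 3.
y_true = ["normal"] * 90 + ["post_alzheimer"] * 7 + ["pre_alzheimer"] * 3

# A "lazy" model that always answers with the majority class.
y_lazy = ["normal"] * len(y_true)

accuracy = accuracy_score(y_true, y_lazy)
# zero_division=0 scores the never-predicted classes as 0 without warnings.
macro_f1 = f1_score(y_true, y_lazy, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

The lazy model looks great on accuracy but its macro F1 ends up around one third, because two of the three classes score zero. A validation macro F1 that is not far above that level suggests the model is still leaning heavily on the majority class.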

After you have tuned, trained, tested, licked, tasted, loved and worshiped the quality of your model, be sure to save it with the next code chunk

In :
```with open('assets/model.pkl', 'wb') as file:
    pickle.dump(model, file)
```

# Prediction phase 🔎

Please make sure to save the weights from the training section in your assets directory and load them in this section.

Here the guys from AIC will take your precious model and test it against their hidden dataset, the one that really matters! Your first results will be scored against 60% of the dataset, and after the challenge finishes you will be able to see your score against the whole dataset.

In :
```with open('assets/model.pkl', 'rb') as file:
    model = pickle.load(file)
```
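If your preprocessing has fitted pieces of its own (an imputer, a scaler, a list of columns), one option is to pickle them together with the model so the prediction phase loads everything from a single file. The sketch below assumes a hypothetical bundle layout (a dict with `"model"` and `"imputer"` keys) and uses tiny stand-in objects and a temporary directory instead of the real assets folder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-ins for the real imputer and model.
X_toy = np.array([[0.0, np.nan], [1.0, 1.0], [np.nan, 0.0], [1.0, 0.0]])
y_toy = np.array(["normal", "pre_alzheimer", "normal", "pre_alzheimer"])

toy_imputer = SimpleImputer(strategy="constant", fill_value=999).fit(X_toy)
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_imputer.transform(X_toy), y_toy)

# Hypothetical bundle layout: everything the prediction phase needs in one file.
bundle = {"model": toy_model, "imputer": toy_imputer}

with tempfile.TemporaryDirectory() as assets_dir:
    bundle_path = os.path.join(assets_dir, "bundle.pkl")
    with open(bundle_path, "wb") as file:
        pickle.dump(bundle, file)
    with open(bundle_path, "rb") as file:
        restored = pickle.load(file)

# The restored objects behave like the originals.
restored_predictions = restored["model"].predict(restored["imputer"].transform(X_toy))
print(restored_predictions)
```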

In this section you will load the test data. YOU HAVE TO BE SURE THAT YOU APPLY THE SAME WORKFLOW that you applied for your train set!

In :
```test_data = pd.read_csv(AICROWD_DATASET_PATH)

# same steps as in the training phase: fill, one-hot encode, drop, concatenate
test_data['intersection_pos_rel_centre'] = test_data['intersection_pos_rel_centre'].fillna('N')

test_dummies = pd.get_dummies(test_data['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre')

test = test_data.drop(['row_id','intersection_pos_rel_centre'], axis=1)

test = pd.concat([test, test_dummies], axis=1)

# and the same constant imputation
imputer_test = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=999)
imputer_test.fit(test)
test = imputer_test.transform(test)
```
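One way to make sure the workflow really is the same everywhere is to wrap it in a single function and call it for every dataset. The sketch below is an assumption about how that could look: `preprocess`, `some_feature` and the category values are made up for the example, and the column list learned on the training data is passed along so every frame ends up with the same layout.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

CAT_COL = "intersection_pos_rel_centre"


def preprocess(frame, feature_columns=None):
    """One-hot encode the categorical column and return the feature frame.

    `feature_columns` is the column list learned on the training set; pass it
    for validation and test data so every frame has the same columns.
    """
    frame = frame.copy()
    frame[CAT_COL] = frame[CAT_COL].fillna("N")
    dummies = pd.get_dummies(frame[CAT_COL], dtype=int)
    features = pd.concat([frame.drop(columns=["row_id", CAT_COL]), dummies], axis=1)
    if feature_columns is not None:
        features = features.reindex(columns=feature_columns, fill_value=0)
    return features


# Toy frames; "some_feature" and the category values are made up.
train_df = pd.DataFrame({"row_id": [1, 2, 3], "some_feature": [0.5, np.nan, 1.5],
                         CAT_COL: ["TL", None, "BR"]})
test_df = pd.DataFrame({"row_id": [4, 5], "some_feature": [np.nan, 2.0],
                        CAT_COL: ["BR", None]})

train_features = preprocess(train_df)
test_features = preprocess(test_df, feature_columns=list(train_features.columns))

# Fit the imputer once on the training features and reuse it on the test set.
imputer = SimpleImputer(strategy="constant", fill_value=999).fit(train_features)
train_ready = imputer.transform(train_features)
test_ready = imputer.transform(test_features)
print(train_ready.shape, test_ready.shape)
```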

## Generate predictions

Go model, go!

In :
```predict = model.predict_proba(test)

predictions = {
"row_id": test_data["row_id"].values,
"normal_diagnosis_probability": predict[:,0],
"post_alzheimer_diagnosis_probability": predict[:,1],
"pre_alzheimer_diagnosis_probability": predict[:,2],
}

predictions_df = pd.DataFrame.from_dict(predictions)
```
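The positions 0, 1 and 2 above work because scikit-learn style classifiers order the `predict_proba` columns like `model.classes_`, which for these three labels is alphabetical. If you would rather not rely on positions, a sketch like the one below names the columns straight from `classes_`; the tiny model and data are stand-ins for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model trained on made-up data with the three labels.
X_toy = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y_toy = np.array(["pre_alzheimer", "pre_alzheimer", "normal", "normal",
                  "post_alzheimer", "post_alzheimer"])
toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)

# predict_proba columns follow toy_model.classes_, so name them from it
# instead of relying on positions 0, 1, 2.
proba_df = pd.DataFrame(
    proba,
    columns=[f"{label}_diagnosis_probability" for label in toy_model.classes_],
)
proba_df.insert(0, "row_id", np.arange(len(X_toy)))
print(proba_df.columns.tolist())
```

It is also worth checking that each row of probabilities sums to 1 before saving the file.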

## Save predictions 📨

In :
```predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
```

# Submit to AIcrowd 🚀

## DON'T FORGET TO SAVE THE NOTEBOOK BEFORE YOU LAUNCH BUDDY (Ctrl + S)

In [ ]:
```!aicrowd login --api-key $AICROWD_API_KEY
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
--assets-dir $AICROWD_ASSETS_DIR \
```