Insurance Pricing Game: EDA
Hey there! Inspired by this post, here is a notebook with some univariate/bivariate EDA, and a brief examination of some metrics that could be considered when modelling.
Setup the notebook 🛠
!bash <(curl -sL https://gitlab.aicrowd.com/jyotish/pricing-game-notebook-scripts/raw/master/python/setup.sh)
from aicrowd_helpers import *
Configure static variables 📎
In order to submit using this notebook, visit https://aicrowd.com/participants/me and copy your API key. Then set the value of AICROWD_API_KEY below to that key.
import sklearn
class Config:
    TRAINING_DATA_PATH = 'training.csv'
    MODEL_OUTPUT_PATH = 'model.pkl'
    AICROWD_API_KEY = ''  # You can get the key from https://aicrowd.com/participants/me
    ADDITIONAL_PACKAGES = [
        'numpy',  # you can define versions as well, numpy==0.19.2
        'pandas',
        'scikit-learn==' + sklearn.__version__,
    ]
Download dataset files 💾
# Make sure to officially join the challenge and accept the challenge rules! Otherwise you will not be able to download the data
%download_aicrowd_dataset
Loading the data 📲
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
df = pd.read_csv(Config.TRAINING_DATA_PATH)
X_train = df.drop(columns=['claim_amount'])
y_train = df['claim_amount']
Data Exploration
The dataset provided consists of 228216 observations corresponding to 57054 unique policies (panel data), with 17 features:

- A mixture of numerical and categorical features
- 204924 of these entries are non-claims, so the dataset has imbalanced classes
- There is some missing vehicle information: vh_speed, vh_value, vh_weight, vh_age
- drv_age2 and drv_age_lic2 depend on having info for a second driver
print(df.shape)
df.count()
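The missing-value pattern behind these counts can be tabulated directly with `isna()`; a minimal sketch on a toy frame (column names are borrowed from the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the missingness pattern: vehicle info and
# second-driver fields can be NaN.
toy = pd.DataFrame({
    "vh_speed": [170.0, np.nan, 150.0, 180.0],
    "drv_age2": [np.nan, 35.0, np.nan, 42.0],
    "pol_duration": [1, 5, 2, 7],
})

# Count and rank missing values per column.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

Running the same two lines on `df` gives the per-feature missing counts for the real data.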
Univariate
- We can look at the distributions of each feature and the target y, claim_amount
- We can also consider making a new variable claimed = claim_amount > 0, which indicates whether there was a claim
- We omit vh_make_model as it has over 900 categories, but it may be a useful feature
# define categorical and numerical feats
cat_feats = ["pol_coverage","pol_payd","pol_usage","drv_sex1","vh_fuel",
"vh_type","drv_sex2","pol_pay_freq","drv_drv2", "year"] #+ ["vh_make_model"]
num_feats = ["pol_no_claims_discount","pol_duration", "pol_sit_duration","drv_age1",
"drv_age_lic1","drv_age2","drv_age_lic2","vh_age","vh_speed","population",
"town_surface_area","vh_value","vh_weight"]
# partition data
df2 = df.loc[df['claim_amount'] > 0].copy().reset_index(drop=True)
df3 = df.loc[df['claim_amount'] == 0].copy().reset_index(drop=True)
df['claimed'] = df['claim_amount'] > 0
df[cat_feats].astype(str).describe()
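As a quick check on the imbalance, the claim rate follows directly from the claimed indicator; a sketch on toy data (the real counts, 204924 non-claims out of 228216 rows, imply a claim rate of roughly 10%):

```python
import pandas as pd

# Toy target column standing in for the real claim_amount.
toy = pd.DataFrame({"claim_amount": [0.0, 0.0, 0.0, 1200.0, 0.0, 530.0]})
toy["claimed"] = toy["claim_amount"] > 0

# Share of rows with a claim — a one-line imbalance check.
claim_rate = toy["claimed"].mean()
print(f"claim rate: {claim_rate:.2%}")
```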
Categorical Features
Some of the categories have very low counts, so perhaps they could be binned together to reduce the number of categories. The number of observations per year appears to be roughly the same (one thing to do could be to examine whether the distribution of features are approximately constant over the years).
vh_make_model is omitted as it has over 900 categories.
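A minimal sketch of the binning idea: a hypothetical `lump_rare` helper (not from the starter kit) that lumps categories below a count threshold into a single "other" level:

```python
import pandas as pd

# Hypothetical helper: lump categories below a count threshold into "other".
def lump_rare(s: pd.Series, min_count: int = 100) -> pd.Series:
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    # Keep values whose category is frequent enough; replace the rest.
    return s.where(~s.isin(rare), "other")

# Toy column: one frequent and one rare category.
toy = pd.concat(
    [pd.Series(["Diesel"] * 300), pd.Series(["Hybrid"] * 5)],
    ignore_index=True,
)
print(lump_rare(toy).value_counts())
```

The same helper could be applied per categorical column before one-hot encoding, with the threshold tuned per feature.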
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(25, 10))
for i, feat in enumerate(cat_feats):
    sns.countplot(x=df[feat], ax=ax[i // 5, i % 5])
Numerical Features
None of the features look normal-like, with the exception of drv_age1, drv_age_lic1, drv_age2, and drv_age_lic2; this suggests that a log or power transform, in addition to normalization, could be applied.
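A sketch of the transform-plus-normalization idea using scikit-learn's PowerTransformer (Yeo-Johnson, which also handles zeros, with standardization built in) on synthetic right-skewed data rather than the real columns:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed feature, standing in for e.g. vh_value.
x = rng.lognormal(mean=10, sigma=1, size=(1000, 1))

# Yeo-Johnson power transform + standardization in one step.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)

print(x_t.mean(), x_t.std())  # roughly 0 and 1 after standardization
```

In a pipeline this would be fitted on the training split only and applied to the numerical features.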
fig, ax = plt.subplots(ncols=5, nrows=3, figsize=(25, 12.5))
for i, feat in enumerate(num_feats):
    sns.histplot(x=df[feat], ax=ax[i // 5, i % 5])
sns.histplot(df['claim_amount'], ax=ax[2, 3])
The distribution of claim_amount | claim_amount > 0 looks more "normal-like" after applying a log transform, suggesting that it may be Gamma- or log-normally distributed, which could be a consideration for linear models.
sns.histplot(np.log(df2['claim_amount']))
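One way to probe the log-normal hypothesis beyond the histogram is to fit scipy's lognorm with the location pinned at zero and inspect the recovered shape (sigma) parameter; a sketch on synthetic positive amounts, not the competition data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic positive "claim amounts": log-normal draws as a stand-in,
# with true sigma = 1.
claims = rng.lognormal(mean=7, sigma=1.0, size=2000)

# Fitting a log-normal (floc=0) recovers sigma via the shape parameter;
# comparing such fits is one way to choose a severity distribution.
shape, loc, scale = stats.lognorm.fit(claims, floc=0)
print(f"estimated sigma: {shape:.3f}")
```

The same fit on `df2['claim_amount']` (and a Gamma fit for comparison) would indicate which severity distribution matches better.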
Bivariate
Correlations
We can consider the correlations for:

- All the features: $\mathbf{X}$
- The distribution of features for claim_amount > 0: $\mathbf{X}|y > 0$
In terms of both Spearman and Pearson correlation, the correlation between the covariates and the targets is low: $|Corr(X_{i}, Y)|< 0.11$.
The Spearman correlations seem higher than the Pearson ones, which might suggest nonlinear dependence. There are also clusters of correlated covariates, which can be a problem for linear models. This is less of a problem for tree-based and other nonlinear models, though for tree-based models it suggests the same information (carried by the correlated features) will be used repeatedly in tree construction.
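Rather than eyeballing the heatmaps, the clusters can be surfaced by ranking feature pairs by absolute correlation; a sketch on a toy frame with a deliberately collinear pair mimicking drv_age1/drv_age_lic1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 500)
# drv_age_lic1 is age minus ~18 plus noise, so the pair is near-collinear.
toy = pd.DataFrame({
    "drv_age1": age,
    "drv_age_lic1": age - 18 + rng.normal(0, 1, 500),
    "population": rng.uniform(0, 1e5, 500),
})

# Absolute correlations, keeping only the upper triangle (each pair once).
corr = toy.corr(numeric_only=True).abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
print(pairs.sort_values(ascending=False).head())
```

Applied to `df` (or `df2`), the top of this ranking identifies the clusters worth dropping or combining before fitting a linear model.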
# pearson, all features
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(df.corr(method='pearson', numeric_only=True), annot=True, ax=ax)
ax.set_title("Pearson Correlations", fontsize=30);
# spearman, all features
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(df.corr(method='spearman', numeric_only=True), annot=True, ax=ax)
ax.set_title("Spearman Correlations", fontsize=30);
# pearson, claims only (Y > 0)
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(df2.corr(method='pearson', numeric_only=True), annot=True, ax=ax)
ax.set_title("Pearson Correlations: Y>0", fontsize=30);
# spearman, claims only (Y > 0)
fig, ax = plt.subplots(figsize=(30, 30))
sns.heatmap(df2.corr(method='spearman', numeric_only=True), annot=True, ax=ax)
ax.set_title("Spearman Correlations: Y>0", fontsize=30);