AIcrowd | Insurance pricing game

Market simulation competition : Completed

Bonus round: Ageless dataset: Completed #supervised_learning #educational #insurance

Imperial CPG

🏆 Announcing our community engagement prizes!

🗨 Join the office hours on Discord! (Wednesday 2PM CET)

❓ Have a question? Visit the discussion forum

Python Starter Notebook

R Starter Notebook

</> Code based starter kit

🔦 Overview

In this challenge, you will act as an insurance company, where you build a pricing model and compete against other players (other insurance companies) for profit. In other words, the player that maximises competitive profit is the winner.

The market in this challenge will be a cheapest-wins market. That means every insurance company offers every customer an annual premium price, and the customer will always pick the company that offers the cheapest price to them (e.g., using a price comparison website).

To create your pricing model, you are given historical insurance data: 60K real historical car insurance policies for 4 consecutive years. Each policy concerns 1 vehicle, its drivers and an accident history over 4 years. This data has been provided by a large car insurance provider in a European country and is a uniform sample from their entire portfolio.

You are asked to produce a model to price contracts for incoming policies for the 5th year.

The company that makes the most profit in this market wins the challenge.

💵 Cheapest-wins market

As a player or team, you represent one insurance company. You must provide a premium quote for every policy (or customer) you encounter, and so must every other company. Each customer then picks the cheapest price offered to them. This is illustrated below:

Now that companies 1 and 2 each have a set amount of revenue, they have to pay out the cost of the claims associated with policies 1-4. So it will look like:

So, once the claims are taken into account in the cheapest-wins market, we can see that Company 1 wins as it has the most competitive profit.
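The mechanism above can be sketched in a few lines. This is a minimal illustration, not the organisers' simulator: the function name `cheapest_wins_profit`, the premium and claim numbers, and the tie-breaking rule (the lowest company index wins a tie) are all assumptions made for the example.

```python
import numpy as np

def cheapest_wins_profit(premiums, claims):
    """Profit per company in a cheapest-wins market.

    premiums: array of shape (n_companies, n_policies), one quote per company and policy
    claims:   array of shape (n_policies,), the claim cost realised on each policy
    """
    premiums = np.asarray(premiums, dtype=float)
    claims = np.asarray(claims, dtype=float)
    n_companies, n_policies = premiums.shape

    # Each customer picks the cheapest quote (assumption: ties go to the lowest index).
    winners = premiums.argmin(axis=0)
    won_premium = premiums[winners, np.arange(n_policies)]

    # Revenue is the sum of premiums won; losses are the claims on those policies.
    revenue = np.bincount(winners, weights=won_premium, minlength=n_companies)
    losses = np.bincount(winners, weights=claims, minlength=n_companies)
    return revenue - losses

# Hypothetical quotes from two companies for four policies.
premiums = [[100, 120, 90, 150],
            [110, 100, 95, 140]]
claims = [0, 300, 0, 50]

profits = cheapest_wins_profit(premiums, claims)
print(profits)  # first company: 190.0, second company: -110.0
```

In this made-up example the second company wins the two policies that go on to claim, so it ends with a loss even though it collected more premium.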

🏇 Leaderboards

As an insurance company in this market, you will be responsible for a portfolio of policies that pay you annual premiums. In exchange, you cover their risk. If they make a claim, you will have to pay.

Therefore, to make money, you must:

Estimate the expected loss for your portfolio, so your contracts are profitable
Come up with a pricing strategy that allows you to compete with others, so that you win contracts

To reach these goals we provide you with two leaderboards.

Root Mean Squared Error (RMSE) leaderboard

This leaderboard always displays your best RMSE submission.

It uses your predict_expected_claim function and measures how well you can estimate the risk of each policy. It does this by computing the root mean squared error (RMSE) between your predicted expected claims and the actual claim costs.

RMSE is minimised by the model that predicts the expected claim for each contract most accurately.

This allows you to compare the quality of your loss prediction model with your competitors. However, there is no explicit reward for performing well on this leaderboard.

This leaderboard is refreshed instantaneously upon submissions, and will always use the same data, disjoint from the other leaderboard.

Please note: your RMSE score is computed in 4 stages. When you submit a model, your model makes predictions for:

Year 1 with access to data from year 1.
Year 2 with access to data from years 1 - 2.
Year 3 with access to data from years 1 - 3.
Year 4 with access to data from years 1 - 4.

Predictions from steps 1 - 4 are then used in the standard RMSE formula to compute your final RMSE score.

In this way, you use past data to inform the present. For example, predictions for year 3 can be informed by what happened in years 1 - 2.
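The four stages can be sketched as a loop over years. This is a rough sketch under stated assumptions, not the actual scoring code: the column names `year` and `claim_amount`, the helper `staged_rmse`, and the callable signature `predict_fn(features, year)` are hypothetical, and the sketch assumes the model sees features (not claim amounts) for all years up to the one being predicted.

```python
import numpy as np
import pandas as pd

def staged_rmse(data, predict_fn, year_col="year", target_col="claim_amount"):
    """Pool predictions made year by year, then compute one RMSE over all of them."""
    preds, actuals = [], []
    for year in sorted(data[year_col].unique()):
        # Stage `year`: the model only sees rows from years 1..year.
        history = data[data[year_col] <= year]
        current = history[history[year_col] == year]
        features = history.drop(columns=[target_col])  # assumption: claims are hidden
        # predict_fn must return one prediction per row of the current year, in row order.
        preds.append(np.asarray(predict_fn(features, year), dtype=float))
        actuals.append(current[target_col].to_numpy(dtype=float))
    preds = np.concatenate(preds)
    actuals = np.concatenate(actuals)
    return float(np.sqrt(np.mean((preds - actuals) ** 2)))

# Toy data: two policies observed over four years, with a single claim of 200.
data = pd.DataFrame({
    "id_policy": ["A", "B"] * 4,
    "year": [1, 1, 2, 2, 3, 3, 4, 4],
    "claim_amount": [0, 0, 0, 200, 0, 0, 0, 0],
})

# A deliberately naive model: predict the same expected claim for every policy.
def constant_model(features, year):
    n_rows = int((features["year"] == year).sum())
    return np.full(n_rows, 25.0)

score = staged_rmse(data, constant_model)
print(round(score, 2))  # 66.14
```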

Competitive profit leaderboard (Updated every Saturday at 10pm CET)

By default this leaderboard uses your most recent successful submission, but you can choose another submission through this form, which is also displayed at the top of the leaderboards.

This leaderboard uses your predict_premium function. It measures your average competitive profit in a market of size 10 when playing against other players.

Each week uses a new set of data to ensure that you don't price the same policy many times, like a real market.
This measures competitive profit. That means your profit is averaged over many markets that you play in.
To make sure that results are stable, we keep putting you in markets until your leaderboard rank no longer changes from market to market.

Note: This leaderboard is updated every Saturday at 10PM CET with your most recent submission.

If you do consistently well on this leaderboard, you will likely do well in the final evaluation.

⚖ Evaluation metric

The final evaluation metric for this challenge is competitive profit: how much money your company makes in a realistic market. However, we also provide a leaderboard using root mean squared error.

Root mean squared error (RMSE)

Given a list of actual claims and a list of expected claims predicted by your model, the RMSE is computed as:
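For reference, the standard definition can be written as follows. The symbols are notation chosen here for illustration rather than taken from the original page: $n$ is the number of rows scored, $y_i$ is the actual claim cost for row $i$, and $\hat{y}_i$ is the expected claim predicted by your model for that row.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$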

Competitive profit (the final metric)

The evaluation process is as follows:

Compute average profit rank. First the average competitive profit that your model makes in a market of size 10 (i.e. with 9 other random players) is computed. This gives your model a profit rank.
Compute realistic competitive profit. In a realistic market, models that perform poorly do not survive (i.e. they go bankrupt). So to compute the realistic competitive profit, we place your model in a market of size 10 with 9 other models picked from the top 10% of the ranking obtained in step 1.

Two important notes:

The profit rank in step 1 is not used in the leaderboard. Only the ranking in step 2 is used in the leaderboard.
Rankings from both steps are generated from a large number of runs of different random markets. On average you can expect your model to have competed against every other model present at least once.
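The two steps can be sketched roughly as below. This is an illustrative simplification, not the organisers' procedure: the function names, the number of markets per model, the synthetic premiums and claims, and the choice to reuse the same policies in every market are all assumptions. The sketch also pads the top group to at least 10 models so that a market can always be filled, which the original description does not specify.

```python
import numpy as np

def market_profits(premiums, claims):
    """Cheapest quote wins each policy; profit is premium won minus claim paid."""
    winners = premiums.argmin(axis=0)
    won = premiums[winners, np.arange(premiums.shape[1])]
    return np.bincount(winners, weights=won - claims, minlength=premiums.shape[0])

def average_profit(model, pool, premiums, claims, rng, market_size=10, n_markets=100):
    """Mean profit of `model` over random markets with opponents drawn from `pool`."""
    candidates = pool[pool != model]
    total = 0.0
    for _ in range(n_markets):
        opponents = rng.choice(candidates, size=market_size - 1, replace=False)
        players = np.concatenate(([model], opponents))
        total += market_profits(premiums[players], claims)[0]
    return total / n_markets

def two_stage_profit(premiums, claims, seed=0, top_fraction=0.10, market_size=10):
    rng = np.random.default_rng(seed)
    everyone = np.arange(premiums.shape[0])
    # Step 1: average profit against random opponents gives a profit rank.
    stage1 = np.array([average_profit(m, everyone, premiums, claims, rng)
                       for m in everyone])
    # Step 2: opponents now come only from the top of the step 1 ranking.
    n_top = max(market_size, int(np.ceil(top_fraction * len(everyone))))
    top = np.argsort(-stage1)[:n_top]
    stage2 = np.array([average_profit(m, top, premiums, claims, rng)
                       for m in everyone])
    return stage1, stage2

# Synthetic example: 30 hypothetical models quoting on 400 policies.
data_rng = np.random.default_rng(42)
n_models, n_policies = 30, 400
claims = data_rng.binomial(1, 0.1, n_policies) * data_rng.gamma(2.0, 500.0, n_policies)
loadings = data_rng.uniform(0.8, 1.5, n_models)
premiums = 100.0 * loadings[:, None] * data_rng.uniform(0.9, 1.1, (n_models, n_policies))

stage1, stage2 = two_stage_profit(premiums, claims)
final_ranking = np.argsort(-stage2)  # only the step 2 profit decides the ranking
```

Only `stage2` feeds the final ranking here, mirroring the note that the step 1 rank is not used in the leaderboard.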

🚓 Market rules

There are two rules that your submissions to the profit leaderboard and your final submissions must follow:

Non-negative training profit. Your models must be profitable on the training data. That is, the sum of your premiums must not be less than the sum of the claims.
Participation rule. Your model must participate (i.e. win at least 1 policy) in 5% or more of the markets it is placed in.
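Both rules are simple to check locally before submitting. The helpers below are a sketch with hypothetical names and made-up numbers. How the organisers count a market as "participated in" beyond "win at least 1 policy" is not specified here, so the second check is an assumption based on that wording.

```python
import numpy as np

def training_profit_ok(premiums, claims):
    """Non-negative training profit: total premiums must cover total claims."""
    return float(np.sum(premiums)) >= float(np.sum(claims))

def participation_ok(policies_won_per_market, min_share=0.05):
    """Participation rule: win at least 1 policy in 5% or more of the markets."""
    won = np.asarray(policies_won_per_market)
    return float(np.mean(won >= 1)) >= min_share

# Hypothetical training premiums and claims for three policies.
profit_rule = training_profit_ok([100.0, 100.0, 100.0], [0.0, 250.0, 0.0])

# Hypothetical counts of policies won in each of 20 markets.
active = participation_ok([0] * 18 + [1, 2])   # wins policies in 10% of markets
inactive = participation_ok([0] * 99 + [1])    # wins policies in 1% of markets

print(profit_rule, active, inactive)  # True True False
```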

📊 Weekly Market Feedback

In a real insurance market, every time you participate, you get some feedback. In this game, thousands of markets are run each week, and you will get feedback about your performance in those markets.

You get two types of feedback:

A plot and some KPIs
Summary statistics about policies you have won

Below you can see examples of both of these. For more details on how they are computed please see here.

Example feedback plot (See here for details)

One example of the six feedback tables (see here for details)

💾 Dataset

You can download the dataset from the resources tab.

The dataset contains a total of 100K real historical car insurance policies over 5 years in the recent past.

This has been provided by a large car insurance provider in a European country and is a uniform sample from their entire portfolio.

You can find the data dictionary under the resources tab.

The majority of the data concerns third-party liability but there are also other types of car insurance (e.g. theft) present.

For this challenge, the data is split in the following way:

Training data

This is 60K policies with 4 years of history (~240K rows). It can be downloaded from the resources tab.

RMSE leaderboard

This contains 5K policies with 4 years of history (~20K rows).

10 weekly profit leaderboards

This contains a total of 30K policies with 4 years of history (~115K rows). It is split into 10 weeks such that:

Weeks 1 - 5 each use approximately 7K rows of data from 15K policies with 4 years of history
Weeks 6 - 10 each use approximately 20K rows of data from 30K policies with 4 years of history

No row of the data appears twice throughout the 10 weeks of leaderboards.

Test data

The final test dataset, where the final evaluation takes place, includes 100K policies for the 5th year (100K rows). To simulate a real insurance company, your training data will contain the history for some of these policies, while others will be entirely new to you.