Data Purchasing Challenge 2022

Ways to Select Which Data to Purchase - Episode 1

Active Learning Methods




So here are some implementation notebooks using this challenge's data:

  1. LeastConfidence

Basically, for each unlabelled sample, take the highest predicted probability across the labels, then pick the samples where that value is lowest (the ones the model is least confident about).
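Here is a minimal sketch of that idea, assuming `probs` is a NumPy array of shape (n_samples, n_labels) holding the model's predicted probabilities. The function name and the toy numbers are made up for illustration and are not taken from the notebook.

```python
import numpy as np

def least_confidence(probs: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n samples whose top probability is lowest."""
    top_prob = probs.max(axis=1)       # confidence of the most likely label
    return np.argsort(top_prob)[:n]    # least confident samples first

# Toy predictions for three unlabelled samples (illustrative values only).
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
])
picked = least_confidence(probs, n=2)
print(picked)
```

Sample 1 comes first because its best guess is only 0.40, followed by sample 2 with 0.60.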

  1. Bayesian Active Learning Disagreement (BALD)

Compute the difference between the average entropy of the individual predictions and the entropy of the averaged prediction, then pick the samples with the lowest value.
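A rough sketch of the scoring step, assuming you already have predictions from several stochastic forward passes (for example with dropout left on) stacked into `mc_probs` with shape (n_passes, n_samples, n_labels), and that each row sums to 1. The names and toy values are hypothetical, and the notebook may organise this differently.

```python
import numpy as np

def bald_scores(mc_probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Mean of per-pass entropies minus entropy of the mean prediction.

    The score is <= 0; the more the passes disagree, the lower it gets.
    """
    mean_probs = mc_probs.mean(axis=0)
    entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
    mean_of_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(axis=2).mean(axis=0)
    return mean_of_entropy - entropy_of_mean

def bald(mc_probs: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n samples with the lowest score."""
    return np.argsort(bald_scores(mc_probs))[:n]

# Two passes, two samples, two labels (illustrative values only).
mc_probs = np.array([
    [[0.5, 0.5], [0.9, 0.1]],   # pass 1
    [[0.5, 0.5], [0.1, 0.9]],   # pass 2
])
scores = bald_scores(mc_probs)
picked = bald(mc_probs, n=1)
print(scores, picked)
```

Sample 0 is uncertain but both passes agree, so its score is about 0. Sample 1 gets confident but contradictory predictions, so its score is negative and it is picked first.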


  1. MarginSampling

Sort each sample's label probabilities from highest to lowest, then take the difference between the top ones; the samples with the smallest margin are the ones to pick. The formula is quite weird, and I'm not confident about using this one.
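For reference, the textbook form of margin sampling can be sketched as below: the margin is the gap between the two highest probabilities of each sample. This assumes `probs` has shape (n_samples, n_labels); the function name and toy numbers are made up, and the notebook's formula may differ.

```python
import numpy as np

def margin_sampling(probs: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n samples with the smallest top-2 margin."""
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]      # descending per sample
    margin = sorted_probs[:, 0] - sorted_probs[:, 1]    # best minus runner-up
    return np.argsort(margin)[:n]                       # smallest margin first

# Toy predictions for three unlabelled samples (illustrative values only).
probs = np.array([
    [0.50, 0.45, 0.05],
    [0.40, 0.30, 0.30],
    [0.70, 0.20, 0.10],
])
picked = margin_sampling(probs, n=2)
print(picked)
```

Sample 0 has the smallest margin (about 0.05) even though sample 1 has the lowest top probability, so the ranking is not the same as picking by lowest confidence.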

  1. KmeansSampling

It's the slowest one! Basically: collect the embeddings, cluster them, calculate the distance from each unlabelled sample to the cluster centres, and pick the ones farthest from any cluster.
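The steps above could look roughly like this. It is only a sketch: it assumes the embeddings of the unlabelled images are already extracted into a NumPy array, and it uses a tiny hand-written k-means as a stand-in (scikit-learn's `KMeans` would be the usual choice). All function names, the `init_idx` parameter, and the toy points are hypothetical.

```python
import numpy as np

def kmeans_centers(x, k, n_iter=20, seed=0, init_idx=None):
    """Plain Lloyd's k-means; returns the k cluster centres."""
    rng = np.random.default_rng(seed)
    if init_idx is None:
        init_idx = rng.choice(len(x), size=k, replace=False)
    centers = x[np.asarray(init_idx)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):                 # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def kmeans_sampling(embeddings, k, n, init_idx=None):
    """Return the indices of the n samples farthest from their nearest centre."""
    embeddings = np.asarray(embeddings, dtype=float)
    centers = kmeans_centers(embeddings, k, init_idx=init_idx)
    dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)              # distance to the closest centre
    return np.argsort(nearest)[::-1][:n]     # farthest samples first

# Two tight blobs plus one outlier (illustrative 2-D "embeddings").
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [2.5, 9.0],
])
picked = kmeans_sampling(embeddings, k=2, n=1, init_idx=[0, 3])
print(picked)
```

With one starting centre in each blob, the outlier at index 6 ends up farthest from its centre and is picked. The pairwise distance matrix is what makes this slow on the real data, since it grows with both the number of samples and the number of clusters.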


Continuing from my previous experiment here: https://www.aicrowd.com/showcase/lb-0-880-my-experiment-results-baseline-too-i-guess

Here are the results from each method!

| Method | % Score Increase* |
|---|---|
| Random | 0.12% |
| LeastConfidence | 2.33% |
| BALD | 0.28% |
| MarginSampling | -0.59% |
| KmeansSampling | 1.29% |

*I'll rerun it multiple times to get the standard deviation (±) of each result.


In the papers' implementations, there are usually multiple 'rounds' of buying labels to get a good end result, which I think is difficult to achieve within this competition's limited runtime (well, that's the challenge). So make sure to balance your training epochs against the number of rounds however you like, and still pay attention to the 3-hour running time. The notebook's default settings are obviously not the best ones!
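To make the 'rounds' idea concrete, here is a small sketch of how a purchase budget could be spread over several rounds, retraining before each selection. The `train` and `select` callables are placeholders for your own training and sampling code, and the even split of the budget is just one assumption among many possible schedules.

```python
from typing import Callable, List, Sequence

def split_budget(total: int, rounds: int) -> List[int]:
    """Split `total` purchases as evenly as possible over `rounds`."""
    base, extra = divmod(total, rounds)
    return [base + (1 if r < extra else 0) for r in range(rounds)]

def purchase_in_rounds(
    unlabelled_ids: Sequence[int],
    budget: int,
    rounds: int,
    train: Callable,    # hypothetical: train(bought_ids) -> model
    select: Callable,   # hypothetical: select(model, pool, n) -> list of ids
) -> List[int]:
    """Buy labels over several rounds, retraining before each selection."""
    bought: List[int] = []
    pool = list(unlabelled_ids)
    for n_buy in split_budget(budget, rounds):
        model = train(bought)                  # retrain on everything bought so far
        picked = list(select(model, pool, n_buy))
        bought.extend(picked)
        picked_set = set(picked)
        pool = [i for i in pool if i not in picked_set]
    return bought

# Dummy stand-ins so the sketch runs: no real model, just take the first ids.
bought = purchase_in_rounds(
    unlabelled_ids=range(10),
    budget=5,
    rounds=2,
    train=lambda bought_ids: len(bought_ids),
    select=lambda model, pool, n: pool[:n],
)
print(bought)
```

More rounds mean more retraining, so each extra round costs training time that has to fit inside the same 3-hour limit.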

I'm planning to add more methods soon.

Feel free to comment or correct me if there's an improvement, correction, or anything for this implementation!

Hope this will help you guys!

Pls leave some likes 💖 too, thanks!

