
darthgera123 258

Name: Pulkit Gera
Location: Hyderabad, IN

Badges

Gold 0 · Silver 0 · Bronze 3

Activity

(Activity heatmap for the past year omitted)

Ratings Progression

(chart not captured)

Challenge Categories

(chart not captured)

Challenges Entered

A benchmark for image-based food recognition
  Latest submissions: failed #60620

Online News Prediction
  Latest submissions: graded #60330, graded #60329

Crowdsourced Map Land Cover Prediction
  Latest submissions: graded #67492, graded #60084

5 Problems 15 Days. Can you solve it all?
  Latest submissions: graded #67356, graded #67322, graded #67310

Predict Labor Class
  Latest submissions: graded #72442, graded #72439, graded #72438

Real Time Mask Detection
  Latest submissions: graded #74284, graded #74283, graded #74199

Predict Power Consumption
  Latest submissions: graded #67496

Predict Wine Quality
  Latest submissions: graded #67498

Student Evaluation
  Latest submissions: graded #67497

Predict if an AD will be clicked
  Latest submissions: graded #67499

Classify Scrambled Text
  Latest submissions: graded #74258, graded #74177, graded #74105
Gold 0 · Silver 0 · Bronze 3

Badges

  • Trustable (May 16, 2020)
  • Newtonian (May 16, 2020)
  • Newtonian (May 16, 2020)
  • Has filled their profile page (May 16, 2020)
  • Bronze badge in challenge ORIENTME: "Kudos! You've won a bronze badge in this challenge. Keep up the great work!" (May 16, 2020)
  • Bronze badge in challenge droneRL: "Kudos! You've won a bronze badge in this challenge. Keep up the great work!" (May 16, 2020)

MASKD

MMdetection unable to form final test file

2 months ago

I am trying to use MMdetection 2.0 for MASKD object detection. However, I am facing difficulty in creating the test file.
Here is the code that I have written

import torch
from mmdet.datasets import build_dataloader, build_dataset
from mmdet.models import build_detector
cfg.data.test.test_mode = True
distributed = False
val_dataset = build_dataset(cfg.data.val)
data_loader = build_dataloader(
    val_dataset,
    samples_per_gpu=1,
    workers_per_gpu=1,
    dist=distributed,
    shuffle=False)
from mmcv.runner import load_checkpoint
from mmcv.parallel import MMDataParallel, MMDistributedDataParallel
from mmdet.apis import single_gpu_test
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = build_detector(cfg.model, train_cfg=None, test_cfg=cfg.test_cfg)
checkpoint = load_checkpoint(model, WEIGHTS_FILE, map_location='cpu')

model.CLASSES = val_dataset.CLASSES

model = MMDataParallel(model, device_ids=[0])
outputs = single_gpu_test(model, data_loader, False, None, 0.5)
val_dataset.format_results(outputs)

However, I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-122-4158962a1af4> in <module>()
----> 1 val_dataset.format_results(outputs)

2 frames
/content/mmdetection/mmdet/datasets/coco.py in format_results(self, results, jsonfile_prefix, **kwargs)
    359         else:
    360             tmp_dir = None
--> 361         result_files = self.results2json(results, jsonfile_prefix)
    362         return result_files, tmp_dir
    363 

/content/mmdetection/mmdet/datasets/coco.py in results2json(self, results, outfile_prefix)
    291         result_files = dict()
    292         if isinstance(results[0], list):
--> 293             json_results = self._det2json(results)
    294             result_files['bbox'] = f'{outfile_prefix}.bbox.json'
    295             result_files['proposal'] = f'{outfile_prefix}.bbox.json'

/content/mmdetection/mmdet/datasets/coco.py in _det2json(self, results)
    228                     data['bbox'] = self.xyxy2xywh(bboxes[i])
    229                     data['score'] = float(bboxes[i][4])
--> 230                     data['category_id'] = self.cat_ids[label]
    231                     json_results.append(data)
    232         return json_results

IndexError: list index out of range

I guess I am unable to get the category_id, but I can't figure out how to fix that.
Please help.
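One thing I suspect (not verified) is that the test dataset is being built without the challenge's class list, so cat_ids ends up shorter than the number of classes the model predicts. A rough sketch of the kind of fix I have in mind (the class names below are placeholders, not the actual MASKD categories):

# Rough sketch, unverified: pass the class list explicitly when building the dataset
# so that CocoDataset.cat_ids lines up with the model's output classes.
cfg.data.test.classes = ('class_a', 'class_b')  # placeholder names
cfg.data.test.test_mode = True
test_dataset = build_dataset(cfg.data.test)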

NeurIPS 2020: Procgen Competition

Team Up for the challenge

4 months ago

Hi,
I am a newcomer and I have done some small projects in deep RL before. If anyone is interested in teaming up, DM me at testandplayalltime@gmail.com.

MINILEAVES

Train and testSet Size is Different

5 months ago

Hi Bhavesh, thanks for pointing that out; the changes have been made.

Starter Notebook

5 months ago

Hi, thanks for sharing. We more than welcome community contributions.

FOODC

Need to download datasets is necessary? or else any other way

5 months ago

Hey, you can mount your Drive and save the data there. You can then load it from there at your convenience; a quick sketch is below.
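A minimal sketch of what I mean, assuming you are working in Colab:

# Minimal sketch (Colab only): mount Google Drive, then keep the dataset there.
from google.colab import drive
drive.mount('/content/drive')

# e.g. copy the downloaded file into Drive once, and read it back in later sessions:
# !cp train.csv '/content/drive/My Drive/train.csv'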

DIBRD

Baseline - DIBRD

5 months ago

Hi, thanks for pointing that out. We have made the changes.

Food Recognition Challenge

Approach to solving

5 months ago

Hi,
Currently I am going over the model zoo provided by mmdetection and wanted to ask: what are some metrics for deciding on a model? Also, what are some hyperparameters (pre- and post-processing) that could be used to improve the score?

SPCRT

A way to improve Decision Tree

6 months ago

Hi,
Yes, the secondary score for the challenge is Root Mean Squared Error (RMSE); a quick sketch of computing it is below.
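For reference, a minimal sketch of computing RMSE with scikit-learn (y_true and y_pred stand in for your actual arrays):

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is just the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print('RMSE:', rmse)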

Baseline Submission

6 months ago

Baseline Submission for the Challenge SPCRT

Open In Colab

Import necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Download Dataset

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/train.csv

Load Data

In [2]:
train_data = pd.read_csv('train.csv')

Clean and analyse the data

In [4]:
train_data.head()
Out[4]:
number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
0 3 86.299100 65.789610 64.984139 49.765400 0.836621 1.013759 146.88130 20.950610 63.713516 ... 3.500000 3.301927 3.464102 1.088900 0.971342 1 1.400000 0.471405 0.500000 4.50
1 5 72.952854 56.414763 59.186241 35.639703 1.445795 1.041520 122.90607 35.383159 40.250192 ... 2.257143 2.168944 2.219783 1.594167 1.087480 1 1.131429 0.400000 0.437059 7.60
2 6 82.318112 99.033554 53.069787 71.259834 1.427749 1.324091 192.98100 40.196140 70.933858 ... 4.300000 3.203101 3.772087 1.647214 1.510613 5 1.580000 1.950783 1.791647 3.01
3 4 57.444449 60.476650 56.067907 58.936797 1.362775 1.128041 34.84360 27.021980 12.367487 ... 3.650000 3.309751 3.442623 1.333736 1.089489 3 1.800000 1.118034 1.194780 14.10
4 4 76.517718 56.808817 59.310096 35.773432 1.197273 0.981880 122.90607 34.833160 44.289459 ... 2.264286 2.213364 2.226222 1.368922 1.048834 1 1.100000 0.433013 0.440952 36.80

5 rows × 82 columns

In [5]:
train_data.describe()
Out[5]:
number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
count 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 ... 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000
mean 4.116527 87.495853 72.915281 71.193951 58.444208 1.165612 1.064409 115.732133 33.213727 44.442844 ... 3.152312 3.056546 3.054714 1.296028 1.054028 2.044708 1.481685 0.841078 0.676041 34.492796
std 1.439625 29.586564 33.320437 30.920472 36.470563 0.365019 0.401233 54.718595 26.886071 20.068666 ... 1.189356 1.043451 1.172383 0.392761 0.380274 1.242861 0.976455 0.485247 0.455984 34.307997
min 1.000000 6.941000 6.941000 5.685033 3.193745 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000210
25% 3.000000 72.451240 52.177725 58.001648 35.258590 0.969858 0.777619 78.353150 16.830450 32.890369 ... 2.118056 2.279705 2.092115 1.060857 0.778998 1.000000 0.920286 0.471405 0.308515 5.400000
50% 4.000000 84.841880 60.786693 66.361592 39.898482 1.199541 1.146366 122.906070 26.658401 45.129500 ... 2.618182 2.615321 2.433589 1.368922 1.165410 2.000000 1.062667 0.800000 0.500000 20.000000
75% 5.000000 100.351275 85.994130 78.019689 73.097796 1.444537 1.360442 155.006000 38.360375 59.663892 ... 4.030000 3.741657 3.920517 1.589027 1.331926 3.000000 1.920000 1.200000 1.021023 63.000000
max 9.000000 208.980400 208.980400 208.980400 208.980400 1.983797 1.958203 207.972460 205.589910 101.019700 ... 7.000000 7.000000 7.000000 2.141963 1.949739 6.000000 6.992200 3.000000 3.000000 185.000000

8 rows × 82 columns

Split Data for Train and Validation

In [6]:
X = train_data.drop('critical_temp',1)
y = train_data['critical_temp']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Define the Regressor and Train

In [7]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[7]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Check which variables have the most impact

In [8]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()
Out[8]:
Coefficient
number_of_elements -4.202422
mean_atomic_mass 0.833105
wtd_mean_atomic_mass -0.881193
gmean_atomic_mass -0.510610
wtd_gmean_atomic_mass 0.642180

Predict on validation

In [9]:
y_pred = regressor.predict(X_val)
In [11]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)

Evaluate the Performance

In [12]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 13.42086725495139
Mean Squared Error: 323.28465055058496
Root Mean Squared Error: 17.98011820179681

Load Test Set

In [13]:
test_data = pd.read_csv('test.csv')
In [14]:
test_data.head()
Out[14]:
number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... mean_Valence wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence
0 2 82.768190 87.837285 82.144935 87.360109 0.685627 0.509575 20.27638 51.522285 10.138190 ... 4.50 4.750000 4.472136 4.728708 0.686962 0.514653 1 2.750000 0.500000 0.433013
1 4 76.444563 81.456750 59.356672 68.229617 1.199541 1.108189 121.32760 36.950657 43.823354 ... 2.25 2.142857 2.213364 2.119268 1.368922 1.309526 1 0.571429 0.433013 0.349927
2 5 88.936744 51.090431 70.358975 34.783991 1.445824 1.525092 122.90607 10.438667 46.482335 ... 2.40 2.114679 2.352158 2.095193 1.589027 1.314189 1 0.967890 0.489898 0.318634
3 4 76.517718 56.149432 59.310096 35.562124 1.197273 1.042132 122.90607 31.920690 44.289459 ... 2.25 2.251429 2.213364 2.214646 1.368922 1.078855 1 1.074286 0.433013 0.433834
4 3 104.608490 89.558979 101.719818 88.481210 1.070258 0.944284 59.94547 33.541423 25.225148 ... 5.00 5.811245 4.762203 5.743954 1.054920 0.803990 3 3.024096 1.414214 0.728448

5 rows × 81 columns

Predict on test set

In [15]:
y_test = regressor.predict(test_data)

Save it in the correct format

In [17]:
df = pd.DataFrame(y_test,columns=['critical_temp'])
df.to_csv('submission.csv',index=False)

To participate in the challenge click here

DCRCL

Baseline for DCRCL

6 months ago

Baseline for the challenge DCRCL

Open In Colab

Import necessary packages

In [20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Download data

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dcrcl/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dcrcl/data/public/train.csv

Load Data

In [2]:
train_data = pd.read_csv('train.csv')

Analyse Data

In [3]:
train_data.head()
Out[3]:
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 30000 2 2 1 38 0 0 0 0 0 ... 22810 25772 26360 1650 1700 1400 3355 1146 0 0
1 170000 1 4 1 28 0 0 0 -1 -1 ... 11760 0 4902 14000 5695 11760 0 4902 6000 0
2 340000 1 1 2 38 0 0 0 -1 -1 ... 1680 1920 9151 5000 7785 1699 1920 9151 187000 0
3 140000 2 2 2 29 0 0 0 2 0 ... 65861 64848 64936 3000 8600 6 2500 2500 2500 0
4 130000 2 2 1 42 2 2 2 0 0 ... 126792 103497 96991 6400 0 4535 3900 4300 3700 1

5 rows × 24 columns

In [4]:
train_data.describe()
Out[4]:
LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
count 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000 ... 25500.000000 25500.000000 25500.000000 25500.000000 2.550000e+04 25500.000000 25500.000000 25500.000000 25500.000000 25500.000000
mean 167436.458039 1.604667 1.852824 1.551961 35.503333 -0.016275 -0.131882 -0.166706 -0.218667 -0.264157 ... 43139.224941 40252.920588 38846.415529 5690.801373 5.986709e+03 5246.605294 4829.790078 4810.296706 5187.016549 0.220902
std 129837.118639 0.488932 0.791803 0.522754 9.235048 1.126813 1.196710 1.192883 1.168375 1.132166 ... 64214.508636 60789.101393 59397.443604 17070.733348 2.402498e+04 18117.236738 16021.336645 15505.873498 17568.450557 0.414863
min 10000.000000 1.000000 0.000000 0.000000 21.000000 -2.000000 -2.000000 -2.000000 -2.000000 -2.000000 ... -170000.000000 -81334.000000 -209051.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 50000.000000 1.000000 1.000000 1.000000 28.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... 2360.000000 1779.250000 1280.000000 1000.000000 8.635000e+02 390.000000 292.750000 256.750000 113.750000 0.000000
50% 140000.000000 2.000000 2.000000 2.000000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 19033.000000 18085.000000 17129.000000 2100.000000 2.010000e+03 1800.000000 1500.000000 1500.000000 1500.000000 0.000000
75% 240000.000000 2.000000 2.000000 2.000000 42.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 54084.750000 50080.750000 49110.500000 5006.000000 5.000000e+03 4507.000000 4001.250000 4024.000000 4000.000000 0.000000
max 1000000.000000 2.000000 6.000000 3.000000 79.000000 8.000000 8.000000 8.000000 8.000000 8.000000 ... 891586.000000 927171.000000 961664.000000 873552.000000 1.684259e+06 896040.000000 621000.000000 426529.000000 527143.000000 1.000000

8 rows × 24 columns

Split Data into Train and Validation

In [5]:
X = train_data.drop('default payment next month',1)
y = train_data['default payment next month']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Define the Classifier and Train

In [6]:
classifier = LogisticRegression()
classifier.fit(X_train,y_train)
/home/gera/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Predict on Validation

In [7]:
y_pred = classifier.predict(X_val)
In [8]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)
df1
Out[8]:
Actual Predicted
6913 0 0
11124 0 0
25100 1 0
2764 0 0
23216 0 0
17269 0 0
3073 0 0
8184 0 0
2595 0 0
5483 0 0
6508 0 0
11776 0 0
5306 0 0
18846 0 0
19854 0 0
2463 0 0
5304 0 0
23739 0 0
20427 0 0
20263 0 0
9578 0 0
14164 0 0
5107 0 0
5160 0 0
8450 0 0

Evaluate the Performance

In [9]:
print('F1 Score:', metrics.f1_score(y_val, y_pred))
print('ROC AUC Score:', metrics.roc_auc_score(y_val, y_pred))
F1 Score: 0.0
ROC AUC Score: 0.49975062344139654
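An F1 of 0.0 here means the baseline predicts the majority class for every row. A minimal sketch of one way to get a non-trivial score, assuming the usual culprits (unscaled features and class imbalance):

# Minimal sketch (assumption: unscaled features + class imbalance cause the
# all-zero predictions above). Scale the inputs and balance the class weights.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000, class_weight='balanced'))
clf.fit(X_train, y_train)
print('F1 Score:', metrics.f1_score(y_val, clf.predict(X_val)))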

Load Test Set

In [10]:
test_data = pd.read_csv('test.csv')

Predict Test Set

In [11]:
y_test = classifier.predict(test_data)
In [12]:
df = pd.DataFrame(y_test,columns=['default payment next month'])
df.to_csv('submission.csv',index=False)

To participate in the challenge click here

DOTAW

Baseline for DOTAW

6 months ago

Baseline for the challenge DOTAW

Open In Colab

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Download data

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dotaw/data/public/test.zip
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dotaw/data/public/train.zip
!unzip train.zip
!unzip test.zip

Load Data

In [3]:
train_data = pd.read_csv('train.csv')

Analyse Data

In [4]:
train_data.head()
Out[4]:
winner cluster_id game_mode game_type hero_0 hero_1 hero_2 hero_3 hero_4 hero_5 ... hero_103 hero_104 hero_105 hero_106 hero_107 hero_108 hero_109 hero_110 hero_111 hero_112
0 -1 223 2 2 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 152 2 2 0 0 0 1 0 -1 ... 0 0 0 0 0 0 0 0 0 0
2 1 131 2 2 0 0 0 1 0 -1 ... 0 0 0 0 0 0 0 0 0 0
3 1 154 2 2 0 0 0 0 0 0 ... -1 0 0 0 0 0 0 0 0 0
4 -1 171 2 3 0 0 0 0 0 -1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 117 columns

In [5]:
train_data.describe()
Out[5]:
winner cluster_id game_mode game_type hero_0 hero_1 hero_2 hero_3 hero_4 hero_5 ... hero_103 hero_104 hero_105 hero_106 hero_107 hero_108 hero_109 hero_110 hero_111 hero_112
count 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000 ... 92650.000000 92650.000000 92650.000000 92650.000000 92650.0 92650.000000 92650.000000 92650.000000 92650.000000 92650.000000
mean 0.053038 175.864145 3.317572 2.384587 -0.001630 -0.000971 0.000691 -0.000799 -0.002008 0.003173 ... -0.001371 -0.000950 0.000885 0.000594 0.0 0.001025 0.000648 -0.000227 -0.000043 0.000896
std 0.998598 35.658214 2.633070 0.486833 0.402004 0.467672 0.165052 0.355393 0.329348 0.483950 ... 0.535024 0.206112 0.283985 0.155940 0.0 0.220703 0.204166 0.168707 0.189868 0.139033
min -1.000000 111.000000 1.000000 1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 0.0 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% -1.000000 152.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000
50% 1.000000 156.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1.000000 223.000000 2.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000
max 1.000000 261.000000 9.000000 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 0.0 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 117 columns

Split Data into Train and Validation

In [6]:
X = train_data.drop('winner',1)
y = train_data['winner']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Define the Classifier and Train

In [7]:
classifier = LogisticRegression()
classifier.fit(X_train,y_train)
/home/gera/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[7]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Predict on Validation

In [9]:
y_pred = classifier.predict(X_val)
In [10]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)
df1
Out[10]:
Actual Predicted
26389 1 -1
55196 -1 1
51250 -1 1
25508 1 -1
24128 1 -1
2442 -1 -1
5638 -1 -1
3714 -1 1
36579 -1 1
10399 -1 -1
13464 -1 -1
71600 -1 1
80162 1 -1
7077 1 1
63431 -1 1
78584 1 -1
31413 1 1
13393 1 1
90845 1 1
23339 -1 -1
13756 -1 1
63563 -1 -1
81880 -1 1
77591 -1 -1
23311 1 1

Evaluate the Performance

In [11]:
print('F1 Score:', metrics.f1_score(y_val, y_pred))
print('ROC AUC Score:', metrics.roc_auc_score(y_val, y_pred))
F1 Score: 0.638888888888889
ROC AUC Score: 0.5928579002999843
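Note that roc_auc_score is more informative when given continuous scores rather than hard -1/1 predictions. A minimal sketch using the classifier's decision function (same variables as above):

# Minimal sketch: compute ROC AUC from decision-function scores instead of hard labels.
scores = classifier.decision_function(X_val)
print('ROC AUC (scores):', metrics.roc_auc_score(y_val, scores))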

Load Test Set

In [12]:
test_data = pd.read_csv('test.csv')

Predict Test Set

In [13]:
y_test = classifier.predict(test_data)
In [15]:
df = pd.DataFrame(y_test,columns=['winner'])
df.to_csv('submission.csv',index=False)

To participate in the challenge click here

CRDSM

Baseline for CRDSM

6 months ago

Getting Started Code for CRDSM Educational Challenge

Author - Pulkit Gera

In [0]:
!pip install numpy
!pip install pandas
!pip install sklearn

Download data

The first step is to download our train and test data. We will train a classifier on the train data and make predictions on the test data, then submit those predictions.

In [0]:
!rm -rf data
!mkdir data
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_crdsm/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_crdsm/data/public/train.csv
!mv train.csv data/train.csv
!mv test.csv data/test.csv
--2020-05-16 21:33:33--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_crdsm/data/public/test.csv
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.12, 130.117.252.10, 130.117.252.13, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72142 (70K) [text/csv]
Saving to: ‘test.csv’

test.csv            100%[===================>]  70.45K   150KB/s    in 0.5s    

2020-05-16 21:33:34 (150 KB/s) - ‘test.csv’ saved [72142/72142]

--2020-05-16 21:33:36--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_crdsm/data/public/train.csv
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.12, 130.117.252.10, 130.117.252.13, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2543764 (2.4M) [text/csv]
Saving to: ‘train.csv’

train.csv           100%[===================>]   2.43M  1.47MB/s    in 1.6s    

2020-05-16 21:33:39 (1.47 MB/s) - ‘train.csv’ saved [2543764/2543764]

Import packages

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score

Load Data

  • We use the pandas 🐼 library to load our data.
  • Pandas loads the data into dataframes and makes it easy to analyse.
  • Learn more about it here 🤓
In [0]:
all_data = pd.read_csv('data/train.csv')

Analyse Data

In [0]:
all_data.head()
Out[0]:
max_ndvi 20150720_N 20150602_N 20150517_N 20150501_N 20150415_N 20150330_N 20150314_N 20150226_N 20150210_N 20150125_N 20150109_N 20141117_N 20141101_N 20141016_N 20140930_N 20140813_N 20140626_N 20140610_N 20140525_N 20140509_N 20140423_N 20140407_N 20140322_N 20140218_N 20140202_N 20140117_N 20140101_N class
0 997.904 637.5950 658.668 -1882.030 -1924.36 997.904 -1739.990 630.087 -1628.240 -1325.64 -944.084 277.107 -206.7990 536.441 749.348 -482.993 492.001 655.770 -921.193 -1043.160 -1942.490 267.138 366.608 452.238 211.328 -2203.02 -1180.190 433.906 4
1 914.198 634.2400 593.705 -1625.790 -1672.32 914.198 -692.386 707.626 -1670.590 -1408.64 -989.285 214.200 -75.5979 893.439 401.281 -389.933 394.053 666.603 -954.719 -933.934 -625.385 120.059 364.858 476.972 220.878 -2250.00 -1360.560 524.075 4
2 3800.810 1671.3400 1206.880 449.735 1071.21 546.371 1077.840 214.564 849.599 1283.63 1304.910 542.100 922.6190 889.774 836.292 1824.160 1670.270 2307.220 1562.210 1566.160 2208.440 1056.600 385.203 300.560 293.730 2762.57 150.931 3800.810 4
3 952.178 58.0174 -1599.160 210.714 -1052.63 578.807 -1564.630 -858.390 729.790 -3162.14 -1521.680 433.396 228.1530 555.359 530.936 952.178 -1074.760 545.761 -1025.880 368.622 -1786.950 -1227.800 304.621 291.336 369.214 -2202.12 600.359 -1343.550 4
4 1232.120 72.5180 -1220.880 380.436 -1256.93 515.805 -1413.180 -802.942 683.254 -2829.40 -1267.540 461.025 317.5210 404.898 563.716 1232.120 -117.779 682.559 -1813.950 155.624 -1189.710 -924.073 432.150 282.833 298.320 -2197.36 626.379 -826.727 4

Here we use the describe function to get an understanding of the data: it shows us the distribution of every column. You can also use functions like info() to get more details.

In [0]:
all_data.describe()
#all_data.info()
Out[0]:
max_ndvi 20150720_N 20150602_N 20150517_N 20150501_N 20150415_N 20150330_N 20150314_N 20150226_N 20150210_N 20150125_N 20150109_N 20141117_N 20141101_N 20141016_N 20140930_N 20140813_N 20140626_N 20140610_N 20140525_N 20140509_N 20140423_N 20140407_N 20140322_N 20140218_N 20140202_N 20140117_N 20140101_N class
count 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000 10545.000000
mean 7282.721268 5713.832981 4777.434284 4352.914883 5077.372030 2871.423540 4898.348680 3338.303406 4902.600296 4249.307925 5094.772928 2141.881486 3255.355465 2628.115168 2780.793602 2397.228981 1548.151856 3015.626776 4787.492858 3640.367446 3027.313647 3022.054677 2041.609136 2691.604363 2058.300423 6109.309315 2563.511596 2558.926018 0.550213
std 1603.782784 2283.945491 2735.244614 2870.619613 2512.162084 2675.074079 2578.318759 2421.309390 2691.397266 2777.809493 2777.504638 2149.931518 2596.151532 2256.234526 2446.439258 2387.652138 1034.798320 1670.965823 2745.333581 2298.281052 2054.223951 2176.307289 2020.499263 2408.279935 2212.018257 1944.613487 2336.052498 2413.851082 1.009424
min 563.444000 -433.735000 -1781.790000 -2939.740000 -3536.540000 -1815.630000 -5992.080000 -1677.600000 -2624.640000 -3403.050000 -3024.250000 -4505.720000 -1570.780000 -3305.070000 -1633.980000 -482.993000 -1137.170000 372.067000 -3765.860000 -1043.160000 -4869.010000 -1505.780000 -1445.370000 -4354.630000 -232.292000 -6807.550000 -2139.860000 -4145.250000 0.000000
25% 7285.310000 4027.570000 2060.600000 1446.940000 2984.370000 526.911000 2456.310000 1017.710000 2321.550000 1379.210000 2392.480000 559.867000 1068.940000 616.822000 947.793000 513.204000 718.068000 1582.530000 2003.930000 1392.390000 1405.020000 1010.180000 429.881000 766.451000 494.858000 5646.670000 689.922000 685.680000 0.000000
50% 7886.260000 6737.730000 5270.020000 4394.340000 5584.070000 1584.970000 5638.400000 2872.980000 5672.730000 4278.880000 6261.950000 1157.170000 2277.560000 1770.350000 1600.950000 1210.230000 1260.280000 2779.570000 5266.930000 3596.680000 2671.400000 2619.180000 1245.900000 1511.180000 931.713000 6862.060000 1506.570000 1458.870000 0.000000
75% 8121.780000 7589.020000 7484.110000 7317.950000 7440.210000 5460.080000 7245.040000 5516.610000 7395.610000 7144.480000 7545.880000 3006.960000 5290.800000 4513.960000 4066.930000 3963.590000 1994.910000 4255.580000 7549.430000 5817.750000 4174.010000 4837.610000 3016.520000 4508.510000 2950.880000 7378.020000 4208.730000 4112.550000 1.000000
max 8650.500000 8377.720000 8566.420000 8650.500000 8516.100000 8267.120000 8499.330000 8001.700000 8452.380000 8422.060000 8401.100000 8477.560000 8624.780000 7932.690000 8630.420000 8210.230000 5915.740000 7492.230000 8489.970000 7981.820000 8445.410000 7919.070000 8206.780000 8235.400000 8247.630000 8410.330000 8418.230000 8502.020000 5.000000

Split Data into Train and Validation 🔪

  • The next step is to think of a way to test how well our model is performing. We cannot use the given test data, as it does not contain the labels needed to verify our predictions.
  • The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unseen data: we hold back a chunk of data while training and then use it purely for testing. It is the standard way to fine-tune hyperparameters.
  • There are multiple ways to split a dataset into validation and training sets; two popular ones are k-fold and leave-one-out. 🧐 (A small k-fold sketch appears at the end of this section.)
  • Validation sets also help you detect when your model is overfitting the training data.
In [0]:
X = all_data.drop('class',1)
y = all_data['class']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  • We have decided to split the data with 20% as validation and 80% as training.
  • To learn more about the train_test_split function click here. 🧐
  • This is of course the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing your trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
  • Now that we have split our data into train and validation sets, the labels are separated from the features.
  • With this step we are all set to move on with a prepared dataset.
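As a pointer for the k-fold idea mentioned above, here is a minimal cross-validation sketch (purely illustrative; it reuses the X and y defined in the cell above):

# Minimal k-fold sketch: average the macro F1 over 5 folds instead of a single split.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    clf = SVC(gamma='auto')
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(f1_score(y.iloc[val_idx], clf.predict(X.iloc[val_idx]), average='macro'))
print('Mean F1 over folds:', np.mean(scores))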

TRAINING PHASE 🏋️

Define the Model

  • We have prepared our data and now we are ready to train our model.

  • There are a ton of classifiers to choose from, such as Logistic Regression, SVM, Random Forests, Decision Trees, etc. 🧐

  • Remember that there are no hard-laid rules here. You can mix and match classifiers; it is advisable to read up on the numerous techniques and choose the best fit for your solution. Experimentation is the key.

  • A good model does not depend solely on the classifier but also on the features you choose. So make sure to analyse and understand your data well and move forward with a clear view of the problem at hand. You can gain important insight from here. 🧐

In [0]:
# classifier = LogisticRegression()

classifier = SVC(gamma='auto')

# from sklearn import tree
# classifier = tree.DecisionTreeClassifier()
  • To start you off, we have used a basic Support Vector Machines classifier here.
  • But you can tune parameters and increase the performance. To see the list of parameters visit here.
  • Do keep in mind that there exist sophisticated techniques for everything; the key, as quoted earlier, is to search for them and experiment to fit your implementation.

To read more about other sklearn classifiers visit here 🧐. Try other classifiers to see how the performance of your model changes. For example, try Logistic Regression or an MLP and compare the results; a small MLP sketch follows.
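For example, a minimal sketch swapping in the MLPClassifier that is already imported above (the hidden-layer sizes are illustrative, not tuned):

# Minimal sketch (illustrative parameters): try the MLPClassifier instead of the SVC.
# Scaling the inputs usually helps neural networks converge.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42))
mlp.fit(X_train, y_train)
print('Macro F1:', f1_score(y_val, mlp.predict(X_val), average='macro'))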

Train the Model

In [0]:
classifier.fit(X_train, y_train)
Out[0]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Got a warning? Don't worry, it is just because the number of iterations is very small (defined in the classifier in the cell above). Increase the number of iterations and see if the warning vanishes. Do remember that increasing the iterations also increases the running time. (Hint: max_iter=500)

Validation Phase 🤔

Wondering how well your model learned? Let's check it.

Predict on Validation

Now we predict using our trained model on the validation set we created and evaluate our model on unforeseen data.

In [0]:
y_pred = classifier.predict(X_val)

Evaluate the Performance

  • We have used basic metrics to quantify the performance of our model.
  • This is a crucial step: you should reason about the metrics and take hints from them to improve aspects of your model.
  • Do read up on the meaning and use of different metrics. There exist more metrics and measures; learn to use them correctly with respect to the solution, dataset and other factors.
  • F1 score is the metric for this challenge.
In [0]:
precision = precision_score(y_val,y_pred,average='micro')
recall = recall_score(y_val,y_pred,average='micro')
accuracy = accuracy_score(y_val,y_pred)
f1 = f1_score(y_val,y_pred,average='macro')
In [0]:
print("Accuracy of the model is :" ,accuracy)
print("Recall of the model is :" ,recall)
print("Precision of the model is :" ,precision)
print("F1 score of the model is :" ,f1)
Accuracy of the model is : 0.7140825035561877
Recall of the model is : 0.7140825035561877
Precision of the model is : 0.7140825035561877
F1 score of the model is : 0.138865836791148

Testing Phase 😅

We are almost done. We trained and validated on the training data. Now it is time to predict on the test set and make a submission.

Load Test Set

Load the test data on which final submission is to be made.

In [0]:
test_data = pd.read_csv('data/test.csv')

Predict Test Set

Time for the moment of truth! Predict on the test set and make the submission.

In [0]:
y_test = classifier.predict(test_data)

Save the prediction to csv

In [0]:
df = pd.DataFrame(y_test,columns=['class'])
df.to_csv('submission.csv',index=False)

🚧 Note :

  • Do take a look at the submission format.
  • The submission file should contain a header.
  • Follow all submission guidelines strictly to avoid inconvenience.

To download the generated CSV in Colab, run the command below.

In [0]:
try:
  from google.colab import files
  files.download('submission.csv')
except ImportError:
  print("Only for Colab")

Well done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the challenge page and make one.

DBSRA

Baseline Submission for DBSRA

6 months ago

Baseline submission for the challenge DBSRA

Open In Colab

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

Download data

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dbsra/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dbsra/data/public/train.csv

Load Data

In [2]:
train_data = pd.read_csv('train.csv')

Clean and Analyse Data

In [3]:
train_data = train_data.drop('encounter_id',1)
train_data = train_data.drop('patient_nbr',1)
train_data.head()
Out[3]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 AfricanAmerican Female [70-80) ? 1 1 7 2 ? InternalMedicine ... No Steady No No No No No No Yes 1
1 Caucasian Female [90-100) ? 3 1 1 8 SP Pulmonology ... No Down No No No No No Ch Yes 1
2 Caucasian Female [80-90) ? 1 2 7 1 MC Osteopath ... No Steady No No No No No No Yes 0
3 Caucasian Male [60-70) ? 3 1 6 6 MC Radiologist ... No Steady No No No No No Ch Yes 0
4 ? Female [70-80) ? 1 3 6 3 UN InternalMedicine ... No No No No No No No No No 0

5 rows × 48 columns

Since most of the columns are categorical, we have to convert them into integers. The most basic way is to do an ordinal mapping. Note: here we have not replaced the question marks with other data; they are also accounted for in the ordinal mapping.
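One caveat with re-fitting the LabelEncoder separately on the test file (as the cells further below do) is that the integer assigned to a given category can differ between train and test. A minimal sketch of keeping the mapping consistent, assuming every test category also appears in train:

# Minimal sketch (assumption: no unseen categories in test): fit one encoder per
# column on the training data and reuse it for the test data.
from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in train_data.columns:
    if train_data[col].dtype == 'O':
        encoders[col] = LabelEncoder().fit(train_data[col])
        train_data[col] = encoders[col].transform(train_data[col])

# later, for the test file:
# for col, enc in encoders.items():
#     test_data[col] = enc.transform(test_data[col])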

In [5]:
labelencoder = LabelEncoder()
n_train_data = train_data
for col in train_data.columns:
    s = train_data[col]
    if s.dtype == 'O':
        s = labelencoder.fit_transform(s)
        n_train_data[col] = s
n_train_data.head()
Out[5]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 1 0 7 1 1 1 7 2 0 19 ... 0 2 1 0 0 0 0 1 1 1
1 3 0 9 1 3 1 1 8 15 51 ... 0 0 1 0 0 0 0 0 1 1
2 3 0 8 1 1 2 7 1 8 30 ... 0 2 1 0 0 0 0 1 1 0
3 3 1 6 1 3 1 6 6 8 52 ... 0 2 1 0 0 0 0 0 1 0
4 0 0 7 1 1 3 6 3 16 19 ... 0 1 1 0 0 0 0 1 0 0

5 rows × 48 columns

Split Data into Train and Validation

In [6]:
X = n_train_data.drop('readmitted',1)
y = n_train_data['readmitted']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Define the Classifier and Train

In [7]:
classifier = LogisticRegression()
classifier.fit(X_train,y_train)
/home/gera/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/home/gera/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
Out[7]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Predict on Validation

In [8]:
y_pred = classifier.predict(X_val)
In [9]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)
df1
Out[9]:
Actual Predicted
26342 1 0
59142 1 0
57537 1 0
58128 0 0
29821 1 0
62897 0 0
43572 0 0
62329 2 0
44309 0 0
20882 0 0
49075 0 0
20668 0 0
76856 1 0
32858 1 1
74292 1 0
80549 1 0
8588 1 0
57768 1 0
10658 1 0
51569 0 0
59914 1 0
32874 0 0
54656 1 0
77456 0 0
35300 0 0

Evaluate the Performance

In [10]:
print('F1 Score:', metrics.f1_score(y_val, y_pred, average='micro'))
F1 Score: 0.5688110513843131

Load Test Set

In [11]:
test_data = pd.read_csv('test.csv')
In [12]:
test_data = test_data.drop('encounter_id',1)
test_data = test_data.drop('patient_nbr',1)
n_test_data = test_data
for col in test_data.columns:
    s = test_data[col]
    if s.dtype == 'O':
        s = labelencoder.fit_transform(s)
        n_test_data[col] = s
n_test_data.head()
Out[12]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty ... examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed
0 3 0 7 1 1 1 6 11 15 16 ... 0 0 2 1 0 0 0 0 1 1
1 3 1 5 1 1 1 1 1 6 0 ... 0 0 1 1 0 0 0 0 1 1
2 3 0 6 1 3 6 1 4 6 0 ... 0 0 1 1 0 0 0 0 1 1
3 3 1 3 1 2 1 1 12 4 10 ... 0 0 1 1 0 0 0 0 1 1
4 1 0 6 1 1 2 7 1 0 0 ... 0 0 1 1 0 0 0 0 1 1

5 rows × 47 columns

Predict Test Set

In [13]:
y_test = classifier.predict(test_data)
In [14]:
df = pd.DataFrame(y_test,columns=['readmitted'])
df.to_csv('submission.csv',index=False)

To participate in the challenge click here


OLNWP

Baseline submission for OLNWP

6 months ago

Getting Started Code for OLNWP Educational Challenge

Author - Pulkit Gera

In [0]:
!pip install numpy
!pip install pandas
!pip install sklearn

Download data

The first step is to download our train and test data. We will train a model on the train data and make predictions on the test data, then submit those predictions.

In [1]:
!rm -rf data
!mkdir data
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/olnwp/v0.1/test.zip
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/olnwp/v0.1/train.zip
!unzip train.zip
!unzip test.zip
!mv train.csv data/train.csv
!mv test.csv data/test.csv
--2020-05-18 00:59:10--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/test.zip
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.16, 130.117.252.12, 130.117.252.13, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2628035 (2.5M) [application/zip]
Saving to: ‘test.zip’

test.zip            100%[===================>]   2.51M  --.-KB/s    in 0.05s   

2020-05-18 00:59:10 (53.6 MB/s) - ‘test.zip’ saved [2628035/2628035]

--2020-05-18 00:59:12--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/train.zip
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.16, 130.117.252.11, 130.117.252.12, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5406140 (5.2M) [application/zip]
Saving to: ‘train.zip’

train.zip           100%[===================>]   5.16M  27.1MB/s    in 0.2s    

2020-05-18 00:59:13 (27.1 MB/s) - ‘train.zip’ saved [5406140/5406140]

Archive:  train.zip
  inflating: train.csv               
Archive:  test.zip
  inflating: test.csv                

Import necessary packages

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Load Data

  • We use the pandas 🐼 library to load our data.
  • Pandas loads the data into dataframes and makes it easy to analyse.
  • Learn more about it here 🤓
In [0]:
all_data = pd.read_csv('data/train.csv')

Clean and analyse the data

In [0]:
all_data = all_data.drop('url',1)
all_data.head()
Out[0]:
timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares
0 525.0 10.0 238.0 0.658120 1.0 0.821918 7.0 5.0 1.0 0.0 4.516807 9.0 0.0 0.0 0.0 0.0 1.0 0.0 4.0 1100.0 344.625 0.0 843300.0 138888.888889 0.0 3276.068815 2119.142483 751.0 751.0 751.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.022273 0.336825 0.113219 0.171540 0.356143 0.333174 0.113796 0.050420 0.008403 0.857143 0.142857 0.188510 0.100000 0.4 -0.133333 -0.166667 -0.10 0.250000 0.000000 0.250000 0.000000 782
1 273.0 11.0 545.0 0.474170 1.0 0.587719 21.0 2.0 21.0 1.0 4.836697 6.0 0.0 1.0 0.0 0.0 0.0 0.0 -1.0 1100.0 364.000 0.0 843300.0 215050.000000 0.0 3983.687500 2833.025154 1500.0 27100.0 14300.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.033335 0.034267 0.033334 0.865730 0.033334 0.424611 0.101154 0.034862 0.025688 0.575758 0.424242 0.401356 0.100000 0.9 -0.248214 -0.300000 -0.05 0.000000 0.000000 0.500000 0.000000 6200
2 423.0 10.0 453.0 0.518265 1.0 0.669173 21.0 5.0 15.0 0.0 4.772627 8.0 0.0 1.0 0.0 0.0 0.0 0.0 4.0 1400.0 323.250 1200.0 843300.0 211887.500000 1031.0 16100.000000 4916.574383 5900.0 17300.0 11600.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.025538 0.025859 0.025003 0.898352 0.025248 0.459715 0.135561 0.050773 0.011038 0.821429 0.178571 0.309091 0.100000 0.5 -0.380000 -0.700000 -0.20 0.300000 0.200000 0.200000 0.200000 723
3 80.0 11.0 814.0 0.456885 1.0 0.608787 2.0 2.0 1.0 0.0 4.671990 7.0 0.0 0.0 1.0 0.0 0.0 0.0 -1.0 478.0 94.800 0.0 843300.0 337785.714286 0.0 4104.888889 2303.844586 1900.0 2700.0 2300.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.735399 0.028575 0.178492 0.028594 0.028939 0.442508 0.131205 0.039312 0.019656 0.666667 0.333333 0.376799 0.033333 1.0 -0.195312 -0.600000 -0.05 0.277273 0.218182 0.222727 0.218182 809
4 653.0 11.0 113.0 0.711712 1.0 0.878788 5.0 4.0 0.0 0.0 4.504425 8.0 0.0 0.0 1.0 0.0 0.0 0.0 217.0 640.0 395.000 0.0 617900.0 112062.500000 0.0 5678.750000 2438.866301 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.524560 0.152964 0.271398 0.025741 0.025337 0.421402 0.325379 0.070796 0.000000 1.000000 0.000000 0.366212 0.136364 0.8 0.000000 0.000000 0.00 0.375000 -0.125000 0.125000 0.125000 1600

Here we use the describe function to get an understanding of the data: it shows us the distribution of every column. You can also use functions like info() to get more details.

In [0]:
all_data.describe()
#all_data.info()
Out[0]:
timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares
count 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000 26561.000000
mean 354.110802 10.403449 552.377282 0.555933 1.009337 0.696678 10.898648 3.304733 4.588344 1.259177 4.548023 7.233538 0.052822 0.179398 0.157599 0.059900 0.185460 0.212643 25.764354 1153.234790 313.182115 13539.935582 753374.752457 259086.610210 1119.137347 5657.790085 3136.969229 3853.119811 10324.538299 6288.260675 0.170438 0.186552 0.186627 0.181017 0.144272 0.062422 0.068672 0.131094 0.184483 0.141560 0.216390 0.222987 0.234542 0.443474 0.119235 0.039607 0.016587 0.682175 0.287856 0.353623 0.094825 0.757686 -0.259757 -0.522776 -0.107678 0.282236 0.071113 0.342243 0.156345 3369.156094
std 213.485655 2.122533 472.605248 4.300199 6.389915 3.987187 11.254509 3.855560 8.377796 4.212860 0.844332 1.910501 0.223682 0.383693 0.364372 0.237306 0.388678 0.409185 69.226548 3412.864512 616.649425 57567.108728 213790.768136 134639.145918 1136.222417 5941.869520 1323.234647 18370.270892 41728.980107 23239.178344 0.376024 0.389559 0.389619 0.385040 0.351372 0.241926 0.252900 0.337510 0.262251 0.219365 0.282227 0.294359 0.289642 0.116349 0.096861 0.017340 0.010769 0.190151 0.156012 0.104505 0.070493 0.247909 0.128229 0.290208 0.096784 0.324309 0.266373 0.188296 0.227084 10971.259269
min 8.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.393750 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 -1.000000 -1.000000 0.000000 -1.000000 0.000000 0.000000 1.000000
25% 165.000000 9.000000 248.000000 0.470000 1.000000 0.625430 4.000000 1.000000 1.000000 0.000000 4.479927 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 448.000000 142.025714 0.000000 843300.000000 173010.000000 0.000000 3564.727273 2383.380145 642.000000 1100.000000 989.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.025046 0.025012 0.028571 0.028571 0.028574 0.396661 0.058033 0.028409 0.009615 0.600000 0.185185 0.306110 0.050000 0.600000 -0.327976 -0.700000 -0.125000 0.000000 0.000000 0.166667 0.000000 948.000000
50% 339.000000 10.000000 415.000000 0.538251 1.000000 0.690323 8.000000 3.000000 1.000000 0.000000 4.663265 7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 662.000000 235.333333 1400.000000 843300.000000 244244.444444 1034.062500 4354.292564 2870.427004 1200.000000 2900.000000 2207.689655 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.033389 0.033345 0.040004 0.040001 0.040830 0.453800 0.119019 0.039106 0.015306 0.710145 0.280000 0.358506 0.100000 0.800000 -0.253385 -0.500000 -0.100000 0.142857 0.000000 0.500000 0.000000 1400.000000
75% 540.000000 12.000000 724.000000 0.607735 1.000000 0.754011 14.000000 4.000000 4.000000 1.000000 4.854890 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 1000.000000 356.857143 7900.000000 843300.000000 330510.000000 2054.000000 6015.439290 3596.860531 2600.000000 7900.000000 5175.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.242231 0.151394 0.333707 0.371249 0.399841 0.507733 0.177513 0.050204 0.021739 0.800000 0.384615 0.410606 0.100000 1.000000 -0.187500 -0.300000 -0.050000 0.500000 0.146667 0.500000 0.250000 2800.000000
max 731.000000 23.000000 7185.000000 701.000000 1042.000000 650.000000 304.000000 116.000000 128.000000 91.000000 8.041534 10.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 377.000000 158900.000000 39979.000000 843300.000000 843300.000000 843300.000000 3613.039819 237966.666667 37607.521654 690400.000000 843300.000000 690400.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.926994 0.925947 0.919999 0.925542 0.927191 1.000000 0.655000 0.155488 0.184932 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.500000 1.000000 690400.000000

Split Data into Train and Validation 🔪

  • The next step is to think of a way to test how well our model is performing. We cannot use the given test data, as it does not contain the labels needed to verify our predictions.
  • The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unseen data: we hold back a chunk of data while training and then use it purely for testing. It is the standard way to fine-tune hyperparameters.
  • There are multiple ways to split a dataset into validation and training sets; two popular ones are k-fold and leave-one-out. 🧐 (A small cross-validation sketch appears at the end of this section.)
  • Validation sets also help you detect when your model is overfitting the training data.
In [0]:
X = all_data.drop(' shares',1)
y = all_data[' shares']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  • We have decided to split the data with 20% as validation and 80% as training.
  • To learn more about the train_test_split function click here. 🧐
  • This is of course the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing your trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
  • Now that we have split our data into train and validation sets, the labels are separated from the features.
  • With this step we are all set to move on with a prepared dataset.
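For reference, a minimal cross-validation sketch for this regression setup (illustrative only; it reuses the X and y defined in the cell above):

# Minimal sketch: 5-fold cross-validated RMSE for the linear model.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring='neg_mean_squared_error', cv=5)
print('CV RMSE:', np.sqrt(-neg_mse).mean())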

TRAINING PHASE 🏋️

Define the Model and Train

Define the Model

  • We have prepared our data and now we are ready to train our model.

  • There are a ton of regressors to choose from, such as Linear Regression, Random Forests, Decision Trees, etc. 🧐

  • Remember that there are no hard-laid rules here. You can mix and match regressors; it is advisable to read up on the numerous techniques and choose the best fit for your solution. Experimentation is the key.

  • A good model does not depend solely on the regressor but also on the features you choose. So make sure to analyse and understand your data well and move forward with a clear view of the problem at hand. You can gain important insight from here. 🧐

In [0]:
regressor = LinearRegression()  

# from sklearn import tree
# regressor = tree.DecisionTreeRegressor()
  • We have used Linear Regression as the model here and set a few of its parameters.
  • One can set more parameters and increase the performance. To see the list of parameters visit here.
  • Do keep in mind that there exist sophisticated techniques for everything; the key, as quoted earlier, is to search for them and experiment to fit your implementation.
  • A Decision Tree example is also given (commented out above). Check out the Decision Tree's parameters here.

Train the Model

In [0]:
regressor.fit(X_train, y_train)
Out[0]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Check which variables have the most impact

We now take this time to identify the columns that have the most impact. This can be used to remove the columns that have negligible impact and improve our model. A small sketch of ranking the coefficients follows the table below.

In [0]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()
Out[0]:
Coefficient
timedelta 1.371829
n_tokens_title 134.279025
n_tokens_content 0.321616
n_unique_tokens 4477.371557
n_non_stop_words -2579.368312
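A minimal sketch of ranking every coefficient by absolute value (illustrative; it reuses coeff_df from the cell above):

# Minimal sketch: list the columns the linear model weights most heavily.
top = coeff_df['Coefficient'].abs().sort_values(ascending=False)
print(top.head(10))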

Validation Phase 🤔

Wondering how well your model learned? Let's check it.

Predict on Validation

Now we predict using our trained model on the validation set we created and evaluate our model on unforeseen data.

In [0]:
y_pred = regressor.predict(X_val)

Evaluate the Performance

  • We have used basic metrics to quantify the performance of our model.
  • This is a crucial step: you should reason about the metrics and take hints from them to improve aspects of your model.
  • Do read up on the meaning and use of different metrics. There exist more metrics and measures; learn to use them correctly with respect to the solution, dataset and other factors.
  • MAE and RMSE are the metrics for this challenge.
In [0]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 3174.901688002103
Mean Squared Error: 168520453.62617025
Root Mean Squared Error: 12981.542806083191

Testing Phase 😅

We are almost done. We trained and validated on the training data. Now it is time to predict on the test set and make a submission.

Load Test Set

Load the test data on which final submission is to be made.

In [0]:
test_data = pd.read_csv('data/test.csv')
In [0]:
test_data = test_data.drop('url',1)
test_data.head()
Out[0]:
timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity
0 121.0 12.0 1015.0 0.422018 1.0 0.545031 10.0 6.0 33.0 1.0 4.656158 4.0 0.0 0.0 1.0 0.0 0.0 0.0 -1.0 263.0 110.500000 6500.0 843300.0 398350.000000 1809.075 3483.806797 2729.047648 1100.0 22100.0 6475.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.331582 0.050050 0.050035 0.050000 0.518333 0.471175 0.159889 0.041379 0.008867 0.823529 0.176471 0.333534 0.100000 0.8 -0.160714 -0.50 -0.071429 0.0 0.00 0.5 0.00
1 532.0 9.0 503.0 0.569697 1.0 0.737542 9.0 0.0 1.0 1.0 4.576541 10.0 0.0 0.0 0.0 0.0 1.0 0.0 4.0 3200.0 524.750000 0.0 843300.0 117960.000000 0.000 4228.114286 2387.526307 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.020007 0.020008 0.325602 0.020004 0.614379 0.477791 0.123520 0.033797 0.019881 0.629630 0.370370 0.419786 0.136364 1.0 -0.157500 -0.25 -0.100000 0.0 0.00 0.5 0.00
2 435.0 9.0 232.0 0.646018 1.0 0.748428 12.0 3.0 4.0 1.0 4.935345 6.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 939.0 198.666667 970.0 843300.0 573878.333333 954.500 6192.239067 4385.022237 1400.0 58800.0 30100.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.033334 0.033697 0.033333 0.866302 0.033333 0.522234 -0.163235 0.017241 0.043103 0.285714 0.714286 0.468750 0.375000 0.5 -0.427500 -1.00 -0.187500 0.0 0.00 0.5 0.00
3 134.0 12.0 171.0 0.722892 1.0 0.867925 9.0 5.0 0.0 1.0 4.970760 6.0 0.0 0.0 1.0 0.0 0.0 0.0 -1.0 2100.0 444.166667 5600.0 843300.0 311033.333333 2076.520 4529.427500 3269.856640 974.0 5600.0 2574.8 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.700107 0.033335 0.033334 0.199402 0.033822 0.405128 -0.006410 0.011696 0.029240 0.285714 0.714286 0.500000 0.500000 0.5 -0.216667 -0.25 -0.166667 0.4 -0.25 0.1 0.25
4 728.0 11.0 286.0 0.652632 1.0 0.800000 5.0 2.0 0.0 0.0 5.006993 8.0 0.0 0.0 0.0 0.0 1.0 0.0 217.0 552.0 356.200000 0.0 28000.0 6830.125000 0.000 2240.536313 976.913444 822.0 822.0 822.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.214708 0.025062 0.025016 0.025187 0.710028 0.418036 0.060089 0.034965 0.024476 0.588235 0.411765 0.303429 0.100000 0.6 -0.251786 -0.50 -0.100000 0.2 -0.10 0.3 0.10

Predict on test set

Time for the moment of truth! Predict on the test set and make the submission.

In [0]:
y_test = regressor.predict(test_data)

Since the target (shares) is an integer count, convert the predictions to integers

In [0]:
y_inttest = [int(i) for i in y_test]
y_inttest = np.asarray(y_inttest)

Save the prediction to csv

In [0]:
df = pd.DataFrame(y_inttest,columns=[' shares'])
df.to_csv('submission.csv',index=False)

🚧 Note :

  • Do take a look at the submission format.
  • The submission file should contain a header.
  • Follow all submission guidelines strictly to avoid inconvenience.

To download the generated CSV in Colab, run the command below.

In [0]:
try:
  from google.colab import files
  files.download('submission.csv')
except ImportError:
  print("Only for Colab")

Well done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the challenge page and make one.

YPMSD

Baseline Submission for the challenge

6 months ago

Baseline Submission for the Challenge YPMSD

Open In Colab

Import necessary packages