Data Purchasing Challenge 2022

[Explainer + Baseline] Get your Baseline right! (+0.84 LB)

This is a very simple notebook with the implementation of the run.py code snippet to achieve a Baseline Accuracy of +0.84. Enjoy and leave a like if it was useful for you!

This notebook is different!

If you are new to the challenge, I suggest you visit the following link for an amazing explanation of how to fork the repository, set up your SSH, set up your environment, and make your first submission:

Gaurav_Singhal's [Create your baseline with 0.4+ on LB (Git Repo and Video)](https://www.aicrowd.com/showcase/create-your-baseline-with-0-4-on-lb-git-repo-and-video)

This notebook uses parts of the snippet provided by this gentleman, so go ahead and leave him a like.

This notebook implements the ZEWDPCBaseRun class that you need to create in the run.py file to get your submission up to +0.84 on the LB.

Wait....WHAT?... Why would you even do that? -- is probably what you are asking...

Well, if you are like me, you probably just started your journey and your best way to learn is to look at someone else's code and slowly try to get your head around what is going on... but you also like to get good scoring models too!

So bear with me, and let me guide you through the code so you understand what is going on behind the scenes (or at least an approximation), and get your name up there, buddy!

And remember, the goal for this challenge is to address the buying stage, so I am still leaving you the fun part.

Disclaimer: this notebook is not intended for direct submissions; please follow the steps described on AIcrowd to make GitLab submissions.

a) Understanding the Datasets
The directory structure looks like this:

Quick preview of images and labels.csv is as follows:

This means that our images and labels are split by "type". We don't really care how these are obtained, since they are instantiated (declared) by the provided code when run.py imports local_evaluation.py.

b) Stages
As you may have already seen, there are different stages for this challenge:

Pre-Training Phase

Purchase Phase

Evaluation Phase

Each of these stages is implemented as a method of the ZEWDPCBaseRun class inside the run.py file. Our AIcrowd friends then run these stages on their end with different datasets.

So basically, we have to define those methods inside the ZEWDPCBaseRun class and save it in the run.py file.
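To make the idea of "one method per stage" concrete, here is a minimal sketch. `SketchRun` and `run_all_phases` are hypothetical names invented for illustration, and the call order is an assumption; the real driver is the local_evaluation.py provided with the challenge.

```python
class SketchRun:
    """Hypothetical stand-in for ZEWDPCBaseRun that only records which phase ran."""

    def __init__(self):
        self.calls = []

    def pre_training_phase(self, training_dataset):
        self.calls.append("pre_training")

    def purchase_phase(self, unlabelled_dataset, training_dataset, budget=1000):
        self.calls.append("purchase")

    def prediction_phase(self, test_dataset):
        self.calls.append("prediction")
        # One row of 4 labels per test sample, as the real method must return.
        return [[0, 0, 0, 0] for _ in test_dataset]


def run_all_phases(run, train, unlabelled, test, budget):
    """Assumed call order, for illustration only."""
    run.pre_training_phase(train)
    run.purchase_phase(unlabelled, train, budget=budget)
    return run.prediction_phase(test)


run = SketchRun()
predictions = run_all_phases(run, train=[1, 2], unlabelled=[3, 4], test=[5, 6, 7], budget=1)
print(run.calls)         # ['pre_training', 'purchase', 'prediction']
print(len(predictions))  # 3
```

The point is only that the evaluator owns the datasets and the order of the calls; your job is to fill in the body of each method.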

To sum up, at the end of the notebook you will find a cell that compiles everything we are going to discuss; by copying it into your run.py you are ready to get those juicy +0.84 on the LB. But beware... with great power comes great responsibility, so try to read what is going on.

1) Packages
The first part loads everything we need. Keep in mind that you may need to add packages to the requirements.txt file; I provide the text for this file at the end.

In [ ]:

#!/usr/bin/env python
import torch
from torch import nn
from torchvision import models
from torch.optim import Adam, SGD, lr_scheduler
from torchvision import transforms
from torch.utils.data import DataLoader

import numpy as np
import datetime
from tqdm import tqdm

from sklearn.metrics import accuracy_score
from sklearn.metrics import hamming_loss

from evaluator.dataset import ZEWDPCBaseDataset, ZEWDPCProtectedDataset

2) Define some parameters and our model
Here we define some parameters for the class that will be accessible throughout the run.
We also define our MVP for the challenge, the model!

In my case I am using a small version of EfficientNet with its pretrained weights.

Here is one of the most important steps: adding layers that map the network's original output (1000 ImageNet classes) to our specific case (4 labels)!

We also instantiate our model and move it to our CUDA friends.

Then we define other required parameters like our zero-searching buddy, the Optimizer.

Next comes our "Hey! you are stuck" parameter: Reduce on Plateau for our Learning Rate. Note that this baseline only creates the scheduler; the training loop in section 3 never calls its step().

And finally our "I'll tell you if you suck" parameter: our Criterion, BCEWithLogitsLoss, which scores each of the 4 labels independently.
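To see what the Criterion actually measures, here is a small pure-Python sketch of binary cross-entropy on logits, averaged over every (sample, label) pair, which is what `nn.BCEWithLogitsLoss` does with its default mean reduction. The helper name `bce_with_logits` and the example numbers are made up for illustration.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (sample, label) pair."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(x)
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


target = [[1, 0, 0, 1]]  # one sample, four labels
confident_right = bce_with_logits([[6.0, -6.0, -6.0, 6.0]], target)
undecided = bce_with_logits([[0.0, 0.0, 0.0, 0.0]], target)
confident_wrong = bce_with_logits([[-6.0, 6.0, 6.0, -6.0]], target)
print(confident_right < undecided < confident_wrong)  # True
```

An undecided model (all logits at zero) scores exactly ln 2 per label; the loss falls toward zero as the model becomes confidently right and grows quickly when it is confidently wrong.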

In [ ]:

class ZEWDPCBaseRun:
    def __init__(self):
        self.evaluation_state = {}

        # Model parameters
        self.BATCH_SIZE = 32
        self.NUM_WORKERS = 2
        self.LEARNING_RATE = 0.001
        self.NUM_CLASSES = 4
        self.TOPK = 3
        self.THRESHOLD = 0.5
        self.NUM_EPOCS = 50
        self.EVAL_FREQ = 5

        class Classifier(nn.Module):
            def __init__(self):
                super(Classifier, self).__init__()
                self.resnet = models.efficientnet_b0(pretrained=True)  # The pretrained network!
                self.l1 = nn.Linear(1000, 256)  # We make arrangements for these two to meet
                self.dropout = nn.Dropout(0.5)  # do whatever you want from here on
                self.l2 = nn.Linear(256, 4)
                self.relu = nn.ReLU()

            def forward(self, input):
                x = self.resnet(input)
                x = x.view(x.size(0), -1)
                x = self.dropout(self.relu(self.l1(x)))
                x = self.l2(x)
                return x

        self.model = Classifier()
        self.model.cuda()

        self.trainable_parameters = filter(lambda param: param.requires_grad, self.model.parameters())
        self.optimizer = Adam(self.trainable_parameters, lr=self.LEARNING_RATE)
        self.epoch = 0
        self.lr_scheduler_ = lr_scheduler.ReduceLROnPlateau(
            self.optimizer, mode='max', patience=2, verbose=True
        )
        self.criterion = nn.BCEWithLogitsLoss()
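The class above creates a ReduceLROnPlateau scheduler with mode='max' and patience=2. In PyTorch you drive it by calling its `step(metric)` once per epoch with a validation score. The sketch below is a simplified re-implementation of that rule, not PyTorch's code: `PlateauSketch` is a hypothetical name, it ignores the threshold and cooldown options, and it assumes the default reduction factor of 0.1.

```python
class PlateauSketch:
    """Simplified 'max'-mode plateau rule (no threshold, no cooldown)."""

    def __init__(self, lr, patience=2, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:
            # Improvement: remember it and reset the counter.
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # Stuck for more than `patience` epochs: shrink the learning rate.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauSketch(lr=0.001, patience=2)
history = [scheduler.step(m) for m in [0.80, 0.81, 0.81, 0.81, 0.81]]
print(history)
```

With patience=2 the first two stalled epochs are tolerated and the learning rate drops on the third, so only the last entry of `history` is reduced (to roughly 0.0001).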

3) Pre-training phase
This snippet follows Gaurav_Singhal's original notebook and AIcrowd's original snippet!

This function basically sets up the transform and DataLoader for our training images, defines the steps for training, and then runs through the number of epochs we defined.

Check the code to understand what is going on in the training stage!

In [ ]:

    def pre_training_phase(
        self, training_dataset: ZEWDPCBaseDataset, register_progress=lambda x: False
    ):
        print("\n================> Pre-Training Phase\n")

        # Creating transformations
        train_transform = transforms.Compose([
            transforms.ToTensor(),
            #transforms.RandomHorizontalFlip(p=0.5),  # no augmentation for you buddy
            #transforms.RandomVerticalFlip(p=0.5),
        ])
        training_dataset.set_transform(train_transform)  # We transform to Tensors basically

        train_loader = DataLoader(
            dataset=training_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )  # we create our PyTorch DataLoader

        def run_epoch():
            for _, batch in enumerate(train_loader):
                # Here we are telling the model which is the data, and which are the labels
                x, y = batch["image"].cuda(), batch["label"]
                pred_y = self.model(x)  # We make the model predict, whatever comes out!

                # Reshape the labels to (batch, NUM_CLASSES); the last batch can be smaller
                y = torch.cat(y, dim=0).reshape(
                    self.NUM_CLASSES, pred_y.shape[0]
                ).T.type(torch.FloatTensor)
                ## CHANGE CPU CUDA HERE. Comment for CPU
                y = y.cuda()

                loss = self.criterion(pred_y, y)  # Our Criterion (loss function) decides if it is good or bad
                self.optimizer.zero_grad()  # We make our gradients zero
                loss.backward()  # We compute our gradients (might have heard of that somewhere)
                self.optimizer.step()  # And the optimizer uses the gradients to update the weights!

                # 416 = BATCH_SIZE*13, so this prints every 13 batches
                if self.global_step % 416 == 0:
                    print("[{}] Training [epoch {}, step {}], loss: {:.4f}".format(
                        datetime.datetime.now(), self.epoch, self.global_step, loss.item()))
                self.global_step += self.BATCH_SIZE

        # Finally, we run our training schema for the number of epochs defined
        epoch_range = tqdm(range(self.epoch, self.NUM_EPOCS))
        for i in epoch_range:
            epoch_range.set_description(f"Epoch: {i}")
            self.global_step = 0
            run_epoch()
            register_progress(i)
            self.epoch += 1

        print("Execution Complete of Training Phase.")
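The least obvious line in the training loop is the one that reshapes `y`. The code implies that the labels arrive as a list with one entry per class, each holding that class's value for every sample in the batch. The NumPy sketch below reproduces the same concatenate, reshape, and transpose on a made-up batch of 3 samples; the layout of the incoming labels is an assumption inferred from the code, not something checked against the dataset class.

```python
import numpy as np

NUM_CLASSES = 4
batch_size = 3

# Assumed layout: one array per class, each of length batch_size.
y = [
    np.array([1, 0, 1]),  # class 0 for samples 0, 1, 2
    np.array([0, 0, 1]),  # class 1
    np.array([0, 1, 0]),  # class 2
    np.array([1, 1, 1]),  # class 3
]

# Same steps as torch.cat(y, dim=0).reshape(NUM_CLASSES, batch).T in the loop above.
targets = np.concatenate(y).reshape(NUM_CLASSES, batch_size).T.astype(np.float32)

print(targets.shape)  # (3, 4)
print(targets[0])     # [1. 0. 0. 1.]
```

After the transpose each row is one sample and each column one label, which matches the (batch, 4) shape of the model output that the loss expects.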

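4) Purchase Phase
Here you are on your own, buddy!

Remember, we want to select the best subset of images from the Purchase Dataset so as to make our model better! The placeholder below simply buys labels for the first images it sees until the budget runs out, and does nothing with them yet.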

In [ ]:

    def purchase_phase(
        self,
        unlabelled_dataset: ZEWDPCProtectedDataset,
        training_dataset: ZEWDPCBaseDataset,
        budget=1000,
        register_progress=lambda x: False,
    ):
        """
        # Purchase Phase
        -------------------------
        In this phase of the competition, you have access to
        the unlabelled_dataset (an instance of `ZEWDPCProtectedDataset`)
        and the training_dataset (an instance of `ZEWDPCBaseDataset`)
        {see datasets.py for more details}, and a purchase budget.

        You can iterate over both the datasets and access the images without restrictions.
        However, you can probe the labels of the unlabelled_dataset only until you
        run out of the label purchasing budget.

        PARTICIPANT_TODO: Add your code here
        """
        print("\n================> Purchase Phase | Budget = {}\n".format(budget))

        register_progress(0.0)  # Register Progress

        for sample in tqdm(unlabelled_dataset):
            idx = sample["idx"]
            # image = unlabelled_dataset.__getitem__(idx)
            # print(image)

            # Budgeting & Purchasing Labels
            if budget > 0:
                label = unlabelled_dataset.purchase_label(idx)
                budget -= 1

        register_progress(1.0)  # Register Progress
        print("Execution Complete of Purchase Phase.")
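If you want a starting point for the fun part, one common idea is uncertainty sampling: spend the budget on the images the current model is least sure about. The sketch below is not part of the baseline and is not claimed to improve the score; `pick_most_uncertain` is a hypothetical helper and the probabilities are made up. In the real method you would compute them with your model and then call `unlabelled_dataset.purchase_label(idx)` for each chosen index.

```python
import numpy as np


def pick_most_uncertain(probabilities, budget):
    """Return indices of the `budget` samples whose predictions sit closest to 0.5."""
    probabilities = np.asarray(probabilities, dtype=float)
    # Distance from 0.5, averaged over the labels: small means uncertain.
    confidence = np.abs(probabilities - 0.5).mean(axis=1)
    order = np.argsort(confidence, kind="stable")
    return order[:budget].tolist()


# Made-up sigmoid outputs for four unlabelled images, four labels each.
probs = [
    [0.99, 0.01, 0.02, 0.98],  # very confident
    [0.55, 0.45, 0.50, 0.60],  # unsure
    [0.90, 0.10, 0.40, 0.80],  # fairly confident
    [0.52, 0.49, 0.51, 0.50],  # most unsure
]
chosen = pick_most_uncertain(probs, budget=2)
print(chosen)  # [3, 1]
```

Whatever strategy you pick, remember that buying the labels is only half the job: the purchased labels still have to be fed back into training to have any effect.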

5) Prediction Phase
Here we take our lovely model, and predict over the Validation Dataset, let's see how it goes!

In [ ]:

    def prediction_phase(
        self,
        test_dataset: ZEWDPCBaseDataset,
        register_progress=lambda x: False,
    ):
        """
        # Prediction Phase
        -------------------------
        In this phase of the competition, you have access to the test dataset, and you are
        supposed to make predictions using your trained models.

        Returns:
            np.ndarray of shape (n, 4)
                where n is the number of samples in the test set
                and 4 refers to the 4 labels to be predicted for each sample
                for the multi-label classification problem.

        PARTICIPANT_TODO: Add your code here
        """
        print(
            "\n================> Prediction Phase : - on {} images\n".format(
                len(test_dataset)
            )
        )

        test_transform = transforms.Compose([
            transforms.ToTensor(),
        ])  # We only transform to Tensors
        test_dataset.set_transform(test_transform)

        test_loader = DataLoader(
            dataset=test_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )  # We load the data

        def convert_to_label(preds):
            # Convert predictions to labels: 1 if the estimated probability is over 0.5
            return np.array((torch.sigmoid(preds) > 0.5), dtype=int).tolist()

        predictions = []
        self.model.eval()  # We will start to predict!
        with torch.no_grad():
            for _, batch in enumerate(test_loader):
                X = batch['image'].cuda()
                pred_y = self.model(X)  # applying our model!

                # Convert to labels
                pred_y_labels = []
                for arr in pred_y:
                    ## CHANGE CPU CUDA HERE
                    pred_y_labels.append(convert_to_label(arr.cpu()))

                # Save the results
                predictions.extend(pred_y_labels)

        register_progress(1.0)
        predictions = np.array(predictions)  # shape (n, 4), one row of 0/1 labels per image
        print("Execution Complete of Prediction Phase.")
        return predictions
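The `convert_to_label` helper is worth a closer look, since it is where raw model outputs become 0/1 labels. Here is the same idea in plain NumPy on made-up logits; the `threshold` argument is an addition for illustration (the original hard-codes 0.5 even though the class defines `THRESHOLD`).

```python
import numpy as np


def convert_to_label(logits, threshold=0.5):
    """Squash logits to probabilities with a sigmoid, then threshold them."""
    probabilities = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probabilities > threshold).astype(int).tolist()


labels = convert_to_label([2.3, -1.2, 0.0, 0.4])
print(labels)  # [1, 0, 0, 1]
```

Because sigmoid(0) is exactly 0.5 and the comparison is strict, a threshold of 0.5 is the same as asking whether the logit is greater than zero.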

6) Evaluate and Checkpoints
This last stage implements the Evaluation function to get our metrics, and the Checkpoints, which are essential to get our model through the different stages.

In [ ]:

    def evaluation(self, predictions, val_dataset_gt: ZEWDPCBaseDataset):
        # These local imports shadow the sklearn metrics imported at the top of the file
        from evaluator.evaluation_metrics import accuracy_score, hamming_loss, exact_match_ratio

        y_true = val_dataset_gt._get_all_labels()
        y_pred = predictions

        accuracy = accuracy_score(y_true, y_pred)
        hamming_loss_score = hamming_loss(y_true, y_pred)
        exact_match_ratio_score = exact_match_ratio(y_true, y_pred)

        print("Accuracy Score : ", accuracy)
        print("Hamming Loss : ", hamming_loss_score)
        print("Exact Match Ratio : ", exact_match_ratio_score)

    def save_checkpoint(self, checkpoint_path):
        """
        Saves the checkpoint in the checkpoint_path directory.
        Each checkpoint will be saved for epoch_x
        """
        save_dict = {
            'epoch': self.epoch + 1,
            'model_state_dict': self.model.state_dict(),
            'optim_state_dict': self.optimizer.state_dict(),
        }
        torch.save(save_dict, checkpoint_path)
        print(f"Checkpoint epoch:{self.epoch} Model saved at {checkpoint_path}")

    def load_checkpoint(self, checkpoint_path):
        """
        Load the latest checkpoint from the experiment
        """
        ## CHANGE CPU CUDA HERE
        checkpoint_model = torch.load(checkpoint_path, map_location="cuda:0")
        # checkpoint_model = torch.load(checkpoint_path, map_location="cpu")
        self.latest_epoch = checkpoint_model['epoch']
        self.model.load_state_dict(checkpoint_model['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint_model['optim_state_dict'])
        print('loading checkpoint success (epoch {})'.format(self.latest_epoch))
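To get a feel for two of the printed metrics, here is a small NumPy sketch using the usual textbook definitions. These are assumptions: the challenge's own evaluator.evaluation_metrics module may define them differently, and the labels below are made up.

```python
import numpy as np


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose 4 labels are all correct."""
    return float(np.mean(np.all(np.asarray(y_true) == np.asarray(y_pred), axis=1)))


y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]

print(hamming_loss(y_true, y_pred))       # 0.125
print(exact_match_ratio(y_true, y_pred))  # 0.5
```

Two wrong slots out of sixteen give a low Hamming loss, yet only half of the samples are fully correct, so the two numbers can tell quite different stories about the same predictions.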

Finally, we add the entry point so the code runs properly: executing run.py directly imports local_evaluation.

In [ ]:

if __name__ == "__main__":
    ####################################################################################
    ## You need to implement `ZEWDPCBaseRun` class in this file for this challenge.
    ## Code for running all the phases locally is written in `main.py` for illustration
    ## purposes.
    ##
    ## Checkout the inline documentation of `ZEWDPCBaseRun` for more details.
    ####################################################################################
    import local_evaluation

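THE END
And there you go, that is the whole definition for the run.py file, with everything you need to get all this running!

Full code (note that this version uses efficientnet_b1 and enables the random flips, unlike the snippets above):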

In [ ]:

#!/usr/bin/env python
import torch
from torch import nn
from torchvision import models
from torch.optim import Adam, SGD, lr_scheduler
from torchvision import transforms
from torch.utils.data import DataLoader

import numpy as np
import datetime
from tqdm import tqdm

from sklearn.metrics import accuracy_score
from sklearn.metrics import hamming_loss

from evaluator.dataset import ZEWDPCBaseDataset, ZEWDPCProtectedDataset


class ZEWDPCBaseRun:
    def __init__(self):
        self.evaluation_state = {}

        # Model parameters
        self.BATCH_SIZE = 32
        self.NUM_WORKERS = 2
        self.LEARNING_RATE = 0.001
        self.NUM_CLASSES = 4
        self.TOPK = 3
        self.THRESHOLD = 0.5
        self.NUM_EPOCS = 50
        self.EVAL_FREQ = 5

        class Classifier(nn.Module):
            def __init__(self):
                super(Classifier, self).__init__()
                self.resnet = models.efficientnet_b1(pretrained=True)
                self.l1 = nn.Linear(1000, 256)
                self.dropout = nn.Dropout(0.5)
                self.l2 = nn.Linear(256, 4)
                self.relu = nn.ReLU()

            def forward(self, input):
                x = self.resnet(input)
                x = x.view(x.size(0), -1)
                x = self.dropout(self.relu(self.l1(x)))
                x = self.l2(x)
                return x

        #device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        #classifier = Classifier().to(device)
        self.model = Classifier()
        #self.model = models.efficientnet_b0(pretrained=True,num_classes = self.NUM_CLASSES)

        ## CHANGE CPU CUDA HERE
        self.model.cuda()
        # self.model.cpu()

        self.trainable_parameters = filter(lambda param: param.requires_grad, self.model.parameters())
        self.optimizer = Adam(self.trainable_parameters, lr=self.LEARNING_RATE)
        self.epoch = 0
        self.lr_scheduler_ = lr_scheduler.ReduceLROnPlateau(
            self.optimizer, mode='max', patience=2, verbose=True
        )
        self.criterion = nn.BCEWithLogitsLoss()

    def pre_training_phase(
        self, training_dataset: ZEWDPCBaseDataset, register_progress=lambda x: False
    ):
        print("\n================> Pre-Training Phase\n")

        # Creating transformations
        train_transform = transforms.Compose([
            #transforms.Grayscale(num_output_channels=3),
            transforms.ToTensor(),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomVerticalFlip(p=0.5),
        ])
        training_dataset.set_transform(train_transform)

        train_loader = DataLoader(
            dataset=training_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )

        def run_epoch():
            for _, batch in enumerate(train_loader):
                ## CHANGE CPU CUDA HERE
                x, y = batch["image"].cuda(), batch["label"]
                # x, y = batch["image"].cpu(), batch["label"]
                pred_y = self.model(x)

                # Change the shape of true labels here.
                # Because for last batch the no. of images can be less
                y = torch.cat(y, dim=0).reshape(
                    self.NUM_CLASSES, pred_y.shape[0]
                ).T.type(torch.FloatTensor)
                ## CHANGE CPU CUDA HERE. Comment for CPU
                y = y.cuda()

                loss = self.criterion(pred_y, y)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                # 416 = BATCH_SIZE*13
                if self.global_step % 416 == 0:
                    print("[{}] Training [epoch {}, step {}], loss: {:.4f}".format(
                        datetime.datetime.now(), self.epoch, self.global_step, loss.item()))
                self.global_step += self.BATCH_SIZE

        epoch_range = tqdm(range(self.epoch, self.NUM_EPOCS))
        for i in epoch_range:
            epoch_range.set_description(f"Epoch: {i}")
            self.global_step = 0
            run_epoch()
            register_progress(i)  # Epoch as progress
            #if (i+1)%self.EVAL_FREQ == 0:
            #    predictions = self.prediction_phase(val_dataset)
            #    self.evaluation(predictions)
            self.epoch += 1

        print("Execution Complete of Training Phase.")

    def purchase_phase(
        self,
        unlabelled_dataset: ZEWDPCProtectedDataset,
        training_dataset: ZEWDPCBaseDataset,
        budget=1000,
        register_progress=lambda x: False,
    ):
        """
        # Purchase Phase
        -------------------------
        In this phase of the competition, you have access to
        the unlabelled_dataset (an instance of `ZEWDPCProtectedDataset`)
        and the training_dataset (an instance of `ZEWDPCBaseDataset`)
        {see datasets.py for more details}, and a purchase budget.

        You can iterate over both the datasets and access the images without restrictions.
        However, you can probe the labels of the unlabelled_dataset only until you
        run out of the label purchasing budget.

        PARTICIPANT_TODO: Add your code here
        """
        print("\n================> Purchase Phase | Budget = {}\n".format(budget))

        register_progress(0.0)  # Register Progress

        for sample in tqdm(unlabelled_dataset):
            idx = sample["idx"]
            # image = unlabelled_dataset.__getitem__(idx)
            # print(image)

            # Budgeting & Purchasing Labels
            if budget > 0:
                label = unlabelled_dataset.purchase_label(idx)
                budget -= 1

        register_progress(1.0)  # Register Progress
        print("Execution Complete of Purchase Phase.")

    def prediction_phase(
        self,
        test_dataset: ZEWDPCBaseDataset,
        register_progress=lambda x: False,
    ):
        """
        # Prediction Phase
        -------------------------
        In this phase of the competition, you have access to the test dataset, and you are
        supposed to make predictions using your trained models.

        Returns:
            np.ndarray of shape (n, 4)
                where n is the number of samples in the test set
                and 4 refers to the 4 labels to be predicted for each sample
                for the multi-label classification problem.

        PARTICIPANT_TODO: Add your code here
        """
        print(
            "\n================> Prediction Phase : - on {} images\n".format(
                len(test_dataset)
            )
        )

        test_transform = transforms.Compose([
            transforms.ToTensor(),
        ])
        test_dataset.set_transform(test_transform)

        test_loader = DataLoader(
            dataset=test_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )

        def convert_to_label(preds):
            return np.array((torch.sigmoid(preds) > 0.5), dtype=int).tolist()

        predictions = []
        self.model.eval()
        with torch.no_grad():
            for _, batch in enumerate(test_loader):
                ## CHANGE CPU CUDA HERE
                # X = batch['image'].cpu()
                X = batch['image'].cuda()
                pred_y = self.model(X)

                # Convert to labels
                pred_y_labels = []
                for arr in pred_y:
                    ## CHANGE CPU CUDA HERE
                    pred_y_labels.append(convert_to_label(arr.cpu()))  # For CUDA
                    # pred_y_labels.append(convert_to_label(arr))  # For CPU

                # Save the results
                predictions.extend(pred_y_labels)

        register_progress(1.0)
        predictions = np.array(predictions)  # shape (n, 4), one row of 0/1 labels per image
        print("Execution Complete of Prediction Phase.")
        return predictions

    def evaluation(self, predictions, val_dataset_gt: ZEWDPCBaseDataset):
        # These local imports shadow the sklearn metrics imported at the top of the file
        from evaluator.evaluation_metrics import accuracy_score, hamming_loss, exact_match_ratio

        y_true = val_dataset_gt._get_all_labels()
        y_pred = predictions

        accuracy = accuracy_score(y_true, y_pred)
        hamming_loss_score = hamming_loss(y_true, y_pred)
        exact_match_ratio_score = exact_match_ratio(y_true, y_pred)

        print("Accuracy Score : ", accuracy)
        print("Hamming Loss : ", hamming_loss_score)
        print("Exact Match Ratio : ", exact_match_ratio_score)

    def save_checkpoint(self, checkpoint_path):
        """
        Saves the checkpoint in the checkpoint_path directory.
        Each checkpoint will be saved for epoch_x
        """
        save_dict = {
            'epoch': self.epoch + 1,
            'model_state_dict': self.model.state_dict(),
            'optim_state_dict': self.optimizer.state_dict(),
        }
        torch.save(save_dict, checkpoint_path)
        print(f"Checkpoint epoch:{self.epoch} Model saved at {checkpoint_path}")

    def load_checkpoint(self, checkpoint_path):
        """
        Load the latest checkpoint from the experiment
        """
        ## CHANGE CPU CUDA HERE
        checkpoint_model = torch.load(checkpoint_path, map_location="cuda:0")
        # checkpoint_model = torch.load(checkpoint_path, map_location="cpu")
        self.latest_epoch = checkpoint_model['epoch']
        self.model.load_state_dict(checkpoint_model['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint_model['optim_state_dict'])
        print('loading checkpoint success (epoch {})'.format(self.latest_epoch))


if __name__ == "__main__":
    ####################################################################################
    ## You need to implement `ZEWDPCBaseRun` class in this file for this challenge.
    ## Code for running all the phases locally is written in `main.py` for illustration
    ## purposes.
    ##
    ## Checkout the inline documentation of `ZEWDPCBaseRun` for more details.
    ####################################################################################
    import local_evaluation

Requirements.txt file!

In [ ]:

click==8.0.3
imageio==2.14.1
jinja2==3.0.3
pandas
scikit-image
scikit-learn
scipy
timeout-decorator==0.5.0
tqdm==4.60.0
torch==1.10.2
torchvision
torchaudio

Remember to leave a like!
This notebook wouldn't have been possible without the snippets that the user gaurav_singhal provided to the community.

If this notebook also helps you to move forward, please leave a LIKE 💗

Hope to see you up there!


Comments

gaurav_singhal
Over 3 years ago

Thanks for the credit.
