ESCI Challenge for Improving Product Search

Simple baseline

Simple baseline with simpletransformers

moto

The provided baseline is a little difficult to follow, so I decided to publish my first working version using simpletransformers.

I have not used all the data or fine-tuned the model, so the score is not as good as the provided baseline's, but this version makes the data and the task much easier to understand.

In [1]:
from IPython.display import clear_output
!pip install simpletransformers
clear_output()
print("DONE")
DONE
In [2]:
!mkdir -p data

original_input="../input/esci-challenge-data-20220416-080611"

!cp $original_input/product_catalogue-v0.2.csv/data/processed/public/task_3_product_substitute_identification/product_catalogue-v0.2.csv data
!cp $original_input/train-v0.2.csv/data/processed/public/task_3_product_substitute_identification/train-v0.2.csv data
!cp $original_input/test_public-v0.2.csv/data/processed/public/task_3_product_substitute_identification/test_public-v0.2.csv data

!ls -alh data
total 2.1G
drwxr-xr-x 2 root root 4.0K May 14 08:01 .
drwxr-xr-x 3 root root 4.0K May 14 08:01 ..
-rw-r--r-- 1 root root 2.0G May 14 08:01 product_catalogue-v0.2.csv
-rw-r--r-- 1 root root  18M May 14 08:01 test_public-v0.2.csv
-rw-r--r-- 1 root root 102M May 14 08:01 train-v0.2.csv
In [3]:
import numpy as np
import pandas as pd

input_dir = "data"
full_train_df = pd.read_csv(f"{input_dir}/train-v0.2.csv")
print(full_train_df.shape)
full_train_df.head(3)

product_df = pd.read_csv(f"{input_dir}/product_catalogue-v0.2.csv")
print(product_df.shape)
product_df.head(3)

full_train_df2 = pd.merge(full_train_df, 
                          product_df[["product_id", "product_locale", "product_title"]],
                          left_on=["product_id", "query_locale"], 
                          right_on=["product_id", "product_locale"]
)
print(full_train_df2.shape)
full_train_df2.head()
(1834744, 5)
(1815216, 7)
(1834744, 7)
Out[3]:
example_id query product_id query_locale substitute_label product_locale product_title
0 0 11 degrees B079VKKJN7 es no_substitute es 11 Degrees de los Hombres Playera con Logo, Ne...
1 1 11 degrees B079Y9VRKS es no_substitute es Camiseta Eleven Degrees Core TS White (M)
2 2 11 degrees B07D2DDCZH es no_substitute es 11 Degrees de los Hombres Camiseta Muscle Fit,...
3 20610 camiseta muscle fit B07D2DDCZH es no_substitute es 11 Degrees de los Hombres Camiseta Muscle Fit,...
4 3 11 degrees B07DP4LM9H es no_substitute es 11 Degrees de los Hombres Core Pull Over Hoodi...
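The shapes printed in cell 3 show that the merge kept all 1,834,744 training rows, so every (product_id, locale) pair found a catalogue entry. A minimal sketch of how that could be checked explicitly, using toy frames with made-up values; the `validate` and `indicator` options are not used in the notebook itself and are shown only as one possible safeguard:

```python
import pandas as pd

# Toy stand-ins for the training and catalogue frames (all values are made up).
train = pd.DataFrame({
    "example_id": [0, 1, 2],
    "query": ["11 degrees", "11 degrees", "sims 4"],
    "product_id": ["A1", "A2", "A3"],
    "query_locale": ["es", "es", "us"],
})
catalogue = pd.DataFrame({
    "product_id": ["A1", "A2", "A3"],
    "product_locale": ["es", "es", "us"],
    "product_title": ["Playera", "Camiseta", "The Sims 4"],
})

merged = pd.merge(
    train,
    catalogue,
    how="left",                      # keep every training row, matched or not
    left_on=["product_id", "query_locale"],
    right_on=["product_id", "product_locale"],
    validate="many_to_one",          # raises if a catalogue key is duplicated
    indicator=True,                  # adds a "_merge" column describing each row
)

# Rows marked "left_only" had no catalogue match and would get a NaN title.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(train), len(merged), unmatched)
```

With `how="left"` an unmatched row shows up as a missing title instead of silently disappearing, which is easier to spot than a change in row count.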
In [4]:
full_train_df2["substitute_label"].value_counts()
full_train_df2["substitute_label"] = (full_train_df2["substitute_label"] == "substitute").astype(int)
full_train_df2 = full_train_df2[["query", "product_title", "substitute_label"]]

from sklearn.model_selection import train_test_split
train_df, eval_df = train_test_split(full_train_df2, test_size=0.5, random_state=0, 
                               stratify=full_train_df2[['substitute_label']])

print(train_df.shape, eval_df.shape)
train_df.head()
(917372, 3) (917372, 3)
Out[4]:
query product_title substitute_label
667470 hyperx headset Razer Kraken X Ultralight Gaming Headset: 7.1 ... 0
332819 ceiling tv mount visio Amazon Basics Heavy-Duty Full Motion Articulat... 0
953261 keychain pepper spray SABRE Self Defense Kit With Pepper Spray And S... 0
433367 sims 4 The Sims 4 - Realm of Magic [Online Game Code] 0
1196946 rockford p3 12 inch subwoofer shallow Rockford Fosgate P3D2-12 Punch P3 DVC 2 Ohm 12... 1
In [5]:
import gc
del product_df
del full_train_df
del full_train_df2
gc.collect()
Out[5]:
21
In [6]:
original_train_df = train_df
original_eval_df = eval_df
original_train_df["substitute_label"].value_counts()
Out[6]:
0    716601
1    200771
Name: substitute_label, dtype: int64
In [7]:
def get_balance(df):
    df_pos = df[df["substitute_label"] == 1]
    df_neg = df[df["substitute_label"] == 0].sample(n=df_pos.shape[0])
    df2 = pd.concat([df_pos, df_neg]).sample(frac=1).reset_index(drop=True)
    return df2

nb_rows = int(2e5)
train_df = get_balance(original_train_df).head(nb_rows)
eval_df = get_balance(original_eval_df).head(nb_rows)

print(train_df["substitute_label"].value_counts())
train_df.tail()
0    100139
1     99861
Name: substitute_label, dtype: int64
Out[7]:
query product_title substitute_label
199995 制服 長袖 [サニーハグ] 男子 スクールシャツ 長袖 学生用 Yシャツ 形態安定 抗菌防臭 制服 標準... 0
199996 vino sin alcohol tinto Marqués de Riscal - Vino tinto Reserva Denomin... 1
199997 no im not on steroids Awkward Styles Men's No I`m Not On Steroids Gr... 0
199998 small milk bones Milk-Bone Gravy Bones Dog Treats, 4 Meat Flavo... 0
199999 fake septum vcmart Fake Nose Rings Hoop 4pcs Stainless Ste... 0
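Note that `get_balance` returns about 401,542 shuffled rows (2 × 200,771), and `.head(nb_rows)` then keeps only the first 200,000, which is why the two classes in the output above are close to, but not exactly, equal. The sampling is also unseeded, so the counts change from run to run. A minimal sketch of a seeded variant on toy data; the function name `get_balanced` and the `seed` parameter are illustrative and not part of the notebook:

```python
import pandas as pd

def get_balanced(df, label_col="substitute_label", seed=0):
    """Downsample negatives to the number of positives, then shuffle."""
    pos = df[df[label_col] == 1]
    # Fixed random_state makes the negative sample reproducible.
    neg = df[df[label_col] == 0].sample(n=len(pos), random_state=seed)
    both = pd.concat([pos, neg])
    return both.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame with 3 positives and 7 negatives (made-up data).
toy = pd.DataFrame({
    "query": [f"q{i}" for i in range(10)],
    "substitute_label": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

balanced = get_balanced(toy)
counts = balanced["substitute_label"].value_counts().to_dict()
print(counts)
```

Calling `get_balanced(toy)` twice with the same seed gives the same rows in the same order, which makes later scores comparable between runs.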
In [8]:
train_df.columns = ["text_a", "text_b", "labels"]
eval_df.columns = ["text_a", "text_b", "labels"]

from simpletransformers.classification import (
    ClassificationModel, ClassificationArgs
)
import pandas as pd
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = ClassificationArgs(
    num_train_epochs=2,
    train_batch_size = 16*10,
    eval_batch_size = 16*10
)
#model = ClassificationModel("roberta", "xlm-roberta-base")
model = ClassificationModel("bert", "bert-base-multilingual-uncased",
                           args=model_args)

!rm -rf outputs
model.train_model(train_df)
Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
Out[8]:
(2500, 0.601999045741558)
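The first value in Out[8] appears to be the number of optimizer steps and the second the average training loss. Assuming a single GPU and no gradient accumulation (assumptions about the defaults, not something the notebook states), the step count is consistent with the 200,000 training rows, the batch size of 16*10 and the 2 epochs configured above:

```python
import math

n_train_rows = 200_000      # nb_rows used for train_df in cell 7
train_batch_size = 16 * 10  # from ClassificationArgs in cell 8
num_train_epochs = 2        # from ClassificationArgs in cell 8

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(n_train_rows / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # prints: 1250 2500
```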
In [9]:
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
result
Out[9]:
{'mcc': 0.3816652549729989,
 'tp': 69464,
 'tn': 68703,
 'fp': 31091,
 'fn': 30742,
 'auroc': 0.7579317495091722,
 'auprc': 0.7487678610783843,
 'eval_loss': 0.5908846513748169}
In [10]:
TP = result["tp"]
FP = result["fp"]
FN = result["fn"]
TP/(TP+0.5*(FP+FN))
Out[10]:
0.6920069136933966
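The expression in cell 10 is the F1 score of the positive class written in terms of confusion-matrix counts, F1 = TP / (TP + (FP + FN) / 2). As a sanity check, the sketch below recomputes it as the harmonic mean of precision and recall, using the counts from Out[9]; the variable names are illustrative:

```python
# Confusion-matrix counts copied from Out[9].
tp, tn, fp, fn = 69464, 68703, 31091, 30742

precision = tp / (tp + fp)   # share of predicted substitutes that are correct
recall = tp / (tp + fn)      # share of true substitutes that were found

# Two equivalent ways of writing F1 for the positive class.
f1_from_counts = tp / (tp + 0.5 * (fp + fn))
f1_harmonic = 2 * precision * recall / (precision + recall)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(f1_from_counts, 4), round(f1_harmonic, 4))  # prints: 0.692 0.692
```

Keep in mind that this score is measured on the balanced evaluation sample from cell 7, not on the original label distribution, so it is not directly comparable to a score on the full data.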
In [11]:
del model
gc.collect()
!nvidia-smi
Sat May 14 09:30:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    40W / 250W |  14187MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
