Task 3: Next Product Title Generation

Task 3 - Getting Started

Make your first submission on Task 3

dipam

Amazon KDD Cup 2023 - Task 3 - Next Product Title Generation

This notebook contains instructions and an example submission with random predictions.

Installations 🤖

  1. aicrowd-cli for downloading challenge data and making submissions
  2. pyarrow for saving to parquet for submissions
In [1]:
!pip install aicrowd-cli pyarrow
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.15-py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 3.7 MB/s eta 0:00:00
Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)
Collecting click<8,>=7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.8/82.8 KB 10.0 MB/s eta 0:00:00
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 214.4/214.4 KB 20.0 MB/s eta 0:00:00
Requirement already satisfied: requests<3,>=2.25.1 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (2.27.1)
Collecting python-slugify<6,>=5.0.0
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.1/170.1 KB 18.7 MB/s eta 0:00:00
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (4.65.0)
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp39-cp39-manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 46.8 MB/s eta 0:00:00
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl (54 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.5/54.5 KB 6.5 MB/s eta 0:00:00
Collecting semver<3,>=2.13.0
  Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 KB 3.6 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.16.6 in /usr/local/lib/python3.9/dist-packages (from pyarrow) (1.22.4)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify<6,>=5.0.0->aicrowd-cli) (1.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.15)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (3.4)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.12)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.9/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 4.6 MB/s eta 0:00:00
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Installing collected packages: commonmark, smmap, semver, pyzmq, python-slugify, colorama, click, rich, requests-toolbelt, gitdb, GitPython, aicrowd-cli
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 23.2.1
    Uninstalling pyzmq-23.2.1:
      Successfully uninstalled pyzmq-23.2.1
  Attempting uninstall: python-slugify
    Found existing installation: python-slugify 8.0.1
    Uninstalling python-slugify-8.0.1:
      Successfully uninstalled python-slugify-8.0.1
  Attempting uninstall: click
    Found existing installation: click 8.1.3
    Uninstalling click-8.1.3:
      Successfully uninstalled click-8.1.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 2.2.3 requires click>=8.0, but you have click 7.1.2 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.15 click-7.1.2 colorama-0.4.6 commonmark-0.9.1 gitdb-4.0.10 python-slugify-5.0.2 pyzmq-22.1.0 requests-toolbelt-0.10.1 rich-10.16.2 semver-2.13.0 smmap-5.0.0

Log in to AIcrowd and download the data 📚

In [ ]:
!aicrowd login
In [3]:
!aicrowd dataset download --challenge task-3-next-product-title-generation
sessions_test_task1.csv: 100% 19.4M/19.4M [00:01<00:00, 16.1MB/s]
sessions_test_task2.csv: 100% 1.92M/1.92M [00:00<00:00, 4.36MB/s]
sessions_test_task3.csv: 100% 2.67M/2.67M [00:00<00:00, 5.55MB/s]
products_train.csv: 100% 589M/589M [01:07<00:00, 8.69MB/s]
sessions_train.csv: 100% 259M/259M [00:18<00:00, 13.9MB/s]

Setup data and task information

In [4]:
import os
import numpy as np
import pandas as pd
from functools import lru_cache
In [5]:
train_data_dir = '.'
test_data_dir = '.'
task = 'task3'
PREDS_PER_SESSION = 100
In [6]:
# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))
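
The loaders above are wrapped in `lru_cache`, so repeated calls return the same DataFrame object instead of re-reading the CSV. The sketch below illustrates what that implies, using a made-up toy loader (`read_toy_data` and `load_calls` are hypothetical names, not part of this notebook): because the cached object is shared, it is safer to take a `.copy()` before modifying it in place.

```python
from functools import lru_cache

import pandas as pd

load_calls = 0  # counts how often the (toy) loader body actually runs


@lru_cache(maxsize=1)
def read_toy_data():
    # Hypothetical stand-in for a loader such as read_train_data()
    global load_calls
    load_calls += 1
    return pd.DataFrame({'locale': ['DE', 'UK'],
                         'prev_items': ["['A1']", "['A2']"]})


first = read_toy_data()
second = read_toy_data()
same_object = first is second  # the cache hands back the very same DataFrame

# Work on a copy so the cached DataFrame keeps all of its columns
safe = read_toy_data().copy()
safe.drop('prev_items', axis=1, inplace=True)
```

This is likely why the submission loop further down calls `.copy()` on each locale slice before passing it to a function that drops columns in place.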

Data Description

The Multilingual Shopping Session Dataset is a collection of anonymized customer sessions containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: user sessions and product attributes. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.


Each product has the following associated information:

locale: the locale code of the product (e.g., DE)

id: a unique identifier for the product, also known as the Amazon Standard Identification Number (ASIN) (e.g., B07WSY3MG8)

title: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

price: price of the item in local currency (e.g., 24.99)

brand: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

color: color of the item (e.g., “Black”)

size: size of the item (e.g., “xxl”)

model: model of the item (e.g., “iphone 13”)

material: material of the item (e.g., “cotton”)

author: author of the item (e.g., “J. K. Rowling”)

desc: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)

EDA 💽

In [7]:
def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

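    # NOTE: prev_items is read from the CSV as a raw string such as
    # "['B07TBRJKX4' 'B003ZG7CMA']", so len() counts characters, not products
    # (a two-item session is 27 characters long).
    train_l = sess_train['prev_items'].apply(lambda sess: len(sess))
    test_l = sess_test['prev_items'].apply(lambda sess: len(sess))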

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
            f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")
In [8]:
products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)
Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 10000
Test session lengths - Mean: 39.92 | Median 27.00 | Min: 27.00 | Max 581.00 

======================================================================== 

Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 10000
Test session lengths - Mean: 40.23 | Median 27.00 | Min: 27.00 | Max 436.00 

======================================================================== 

Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 10000
Test session lengths - Mean: 48.85 | Median 40.00 | Min: 27.00 | Max 410.00 

======================================================================== 

Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 6421
Test session lengths - Mean: 44.70 | Median 40.00 | Min: 27.00 | Max 357.00 

======================================================================== 

Locale: FR 
Number of products: 44577 
Number of train sessions: 117561 
Train session lengths - Mean: 47.25 | Median 40.00 | Min: 27.00 | Max 687.00 
Number of test sessions: 10000
Test session lengths - Mean: 42.52 | Median 40.00 | Min: 27.00 | Max 304.00 

======================================================================== 

Locale: IT 
Number of products: 50461 
Number of train sessions: 126925 
Train session lengths - Mean: 48.80 | Median 40.00 | Min: 27.00 | Max 621.00 
Number of test sessions: 10000
Test session lengths - Mean: 43.35 | Median 40.00 | Min: 27.00 | Max 330.00 

======================================================================== 

In [9]:
products.sample(5)
Out[9]:
id locale title price brand color size model material author desc
72043 B09T66362W DE LED Kabellose Maus, Tragbar 2.4 G Wiederauflad... 11.99 Asnoty Schwarz NaN NaN NaN NaN 【Buntes LED-Licht】 7 weiche LED-Farben wechsel...
1372965 B093FB22WT UK Jefshon Baby Piano Musical Mats 35 Music Sound... 14.89 Jefshon Green NaN GP5922 Polyester NaN [Safe Material and Anti- Slip] : This musical ...
208601 B08HC8VWG4 DE Kalorik TKG MW 2500 DG, Mikrowelle, 25 Liter I... 184.89 Kalorik Cremefarben NaN TKG MW 2500 DG Kunststoff NaN Auch mit Grillfunktion und Auftaufunktion
933563 B09TVLRY41 UK Ultrasonic Toothbrush for Adults 5 Modes ,Soni... 34.99 OKMIMO Black NaN NaN NaN NaN 2 Minutes Smart Timer & Brushing Reminder - Ul...
656334 B08XW7MZNX JP レック マルチ 水切りかご (ワイド) SIAA抗菌、流れる/流れない選べるトレー、コップ・... 2909.00 レック(LEC) ワイド K00405 ステンレス鋼 NaN グラス、ボトルスタンドが外側にあるのでカゴの中を広々使えます。
In [10]:
train_sessions = read_train_data()
train_sessions.sample(5)
Out[10]:
         prev_items                                          next_item   locale
1646342  ['B017SFJ8WK' 'B07ZCW5KD9' 'B07ZCVZWP3' 'B08ZV...   B08GKJ21Y5  JP
343344   ['B08QF9F8BC' 'B098DQ87NT' 'B08QF9F8BC']            B08SGX86X7  DE
2627480  ['B0B3LRD151' 'B08DM2YN8G' 'B083HRZBDT' 'B08DM...   B0B7JZVCH4  UK
2742311  ['B07TBRJKX4' 'B003ZG7CMA']                         B00JQ2AJHC  UK
3138973  ['B0191BQXK4' 'B005UXMZK0']                         B018A6U5SW  UK
In [11]:
test_sessions = read_test_data(task)
test_sessions.sample(5)
Out[11]:
       prev_items                                          locale
44027  ['B09HXCNVQ9' 'B09NKQPFKX' 'B09NKQT9GY' 'B09NK...   JP
29760  ['B07QVLL68D' 'B08B3WW4SG' 'B093CC8N7X']            IT
41208  ['B083R1QYQD' 'B083R1RQLR' 'B083R1QYQD' 'B083R...   JP
6862   ['B09J95311C' 'B07PP343KJ']                         DE
9402   ['B075LFT858' 'B075LMPFB3']                         DE
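
The `prev_items` values shown above are plain strings, not Python lists, so calling `len()` on them counts characters rather than products. Here is one way to recover the item lists with a regular expression, assuming every id is wrapped in single quotes as in the samples above (`parse_prev_items` is a hypothetical helper, not part of the challenge starter code):

```python
import re


def parse_prev_items(raw):
    """Return the list of product ids found in a raw prev_items string."""
    # Each id appears between single quotes, e.g. "['B07TBRJKX4' 'B003ZG7CMA']"
    return re.findall(r"'([^']+)'", raw)


# Two raw strings copied from the sample output above
raw_sessions = [
    "['B08QF9F8BC' 'B098DQ87NT' 'B08QF9F8BC']",
    "['B07TBRJKX4' 'B003ZG7CMA']",
]

parsed = [parse_prev_items(raw) for raw in raw_sessions]
item_counts = [len(items) for items in parsed]  # number of products per session
char_counts = [len(raw) for raw in raw_sessions]  # what len() on the raw string gives
```

On a DataFrame, the same helper could be applied with `sess_train['prev_items'].apply(parse_prev_items)` to get session lengths in items rather than characters.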

Generate Submission 🏋️‍♀️

Submission format:

  1. The submission should be a parquet file with the sessions from all the locales.
  2. Predictions should be added in a new column named "next_item_prediction".
  3. Predictions should be a single string, the next product title for the session.
In [12]:
def random_predictions(locale, sess_test_locale):
    # Baseline: sample random product titles from the same locale (with replacement)
    random_state = np.random.RandomState(42)
    products = read_product_data().query(f'locale == "{locale}"')
    predictions = (products['title']
                   .sample(len(sess_test_locale), replace=True, random_state=random_state)
                   .values
    )
    # Note: this modifies the DataFrame passed in, so pass a copy
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale
In [13]:
test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
for locale in test_locale_names:
    # Copy the slice so the cached test DataFrame is left untouched
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predictions(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)
Out[13]:
       locale  next_item_prediction
45091  JP      オウルテック 超タフ ライトニングケーブル 耐屈曲50,000回 Apple認証 iPhon...
52522  UK      320ml Hot Water Bottle with Knited Cover, Mini...
2614   ES      kwmobile Carcasa Compatible con Samsung Galaxy...
33688  IT      Caffè Borbone Miscela Decaffeinata Cialda Comp...
16942  FR      JETech Coque Ultra Fine (0,35 mm) pour iPhone ...
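
Random titles only demonstrate the submission format. As a slightly more informed starting point, one could predict the title of the last product in each session. The sketch below is an illustration on made-up toy rows, not the organizers' baseline: `last_item_title_baseline` and the `fallback` parameter are hypothetical, and returning an empty string when a title cannot be found is an assumption chosen only so that every prediction stays a string.

```python
import re

import pandas as pd

# Toy stand-ins for products_train.csv and sessions_test_task3.csv (made-up rows)
products = pd.DataFrame({
    'id': ['A1', 'A2', 'A3'],
    'locale': ['UK', 'UK', 'DE'],
    'title': ['Red mug', 'Blue mug', 'Rote Tasse'],
})
sessions = pd.DataFrame({
    'prev_items': ["['A1' 'A2']", "['A3']", "['ZZ']"],
    'locale': ['UK', 'DE', 'DE'],
})


def last_item_title_baseline(sessions, products, fallback=''):
    # Build a (locale, id) -> title lookup from the product table
    title_lookup = products.set_index(['locale', 'id'])['title'].to_dict()

    def predict(row):
        # prev_items is a raw string; pull out the single-quoted ids
        items = re.findall(r"'([^']+)'", row['prev_items'])
        if not items:
            return fallback
        # Use the title of the most recent item, or the fallback if unknown
        return title_lookup.get((row['locale'], items[-1]), fallback)

    out = sessions[['locale']].copy()
    out['next_item_prediction'] = sessions.apply(predict, axis=1)
    return out


toy_predictions = last_item_title_baseline(sessions, products)
```

The returned frame keeps the index of `sessions` and has the same two columns as the random baseline, so it could be checked and saved in the same way.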

Validate predictions ✅

In [14]:
def check_predictions(predictions):
    """
    These tests need to pass, as they will also be applied by the evaluator.
    """
    # Relies on the global `test_sessions` loaded above
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale = predictions[predictions['locale'] == locale]
        # Each locale's predictions must cover exactly the same row indices as its test sessions
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"
    # Every prediction must be a string; this does not depend on the locale, so check it once
    assert predictions['next_item_prediction'].apply(lambda x: isinstance(x, str)).all(), "Predictions should all be strings"
In [15]:
check_predictions(predictions)
In [16]:
# It's important that the parquet file you submit is saved with the pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')

Submit to AIcrowd 🚀

In [ ]:
# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-3-next-product-title-generation -f "submission_task3.parquet"
