Task 1: Next Product Recommendation

Task 1 - Getting Started

Amazon KDD Cup 2023 - Task 1 - Next Product Recommendation

This notebook contains instructions and an example submission with random predictions.

Installations 🤖

aicrowd-cli for downloading challenge data and making submissions
pyarrow for saving to parquet for submissions

In [ ]:

!pip install aicrowd-cli pyarrow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.15-py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 1.7 MB/s eta 0:00:00
Requirement already satisfied: pyarrow in /usr/local/lib/python3.9/dist-packages (9.0.0)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (4.65.0)
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl (54 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.5/54.5 KB 4.3 MB/s eta 0:00:00
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (0.10.2)
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp39-cp39-manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 19.4 MB/s eta 0:00:00
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.1/170.1 KB 9.1 MB/s eta 0:00:00
Collecting click<8,>=7.1.2
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.8/82.8 KB 5.3 MB/s eta 0:00:00
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 214.4/214.4 KB 14.1 MB/s eta 0:00:00
Collecting python-slugify<6,>=5.0.0
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Requirement already satisfied: requests<3,>=2.25.1 in /usr/local/lib/python3.9/dist-packages (from aicrowd-cli) (2.25.1)
Collecting semver<3,>=2.13.0
  Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 KB 4.6 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.16.6 in /usr/local/lib/python3.9/dist-packages (from pyarrow) (1.22.4)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify<6,>=5.0.0->aicrowd-cli) (1.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2022.12.7)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.15)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.1/51.1 KB 4.1 MB/s eta 0:00:00
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.9/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Installing collected packages: commonmark, smmap, semver, pyzmq, python-slugify, colorama, click, rich, requests-toolbelt, gitdb, GitPython, aicrowd-cli
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 23.2.1
    Uninstalling pyzmq-23.2.1:
      Successfully uninstalled pyzmq-23.2.1
  Attempting uninstall: python-slugify
    Found existing installation: python-slugify 8.0.1
    Uninstalling python-slugify-8.0.1:
      Successfully uninstalled python-slugify-8.0.1
  Attempting uninstall: click
    Found existing installation: click 8.1.3
    Uninstalling click-8.1.3:
      Successfully uninstalled click-8.1.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 2.2.3 requires click>=8.0, but you have click 7.1.2 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.15 click-7.1.2 colorama-0.4.6 commonmark-0.9.1 gitdb-4.0.10 python-slugify-5.0.2 pyzmq-22.1.0 requests-toolbelt-0.10.1 rich-10.16.2 semver-2.13.0 smmap-5.0.0

Login to AIcrowd and download the data 📚

In [ ]:

!aicrowd login

In [ ]:

!aicrowd dataset download --challenge task-1-next-product-recommendation

sessions_test_task1.csv: 100% 19.4M/19.4M [00:01<00:00, 14.2MB/s]
sessions_test_task2.csv: 100% 1.92M/1.92M [00:00<00:00, 4.04MB/s]
sessions_test_task3.csv: 100% 3.15M/3.15M [00:00<00:00, 5.91MB/s]
products_train.csv: 100% 589M/589M [01:10<00:00, 8.32MB/s]
sessions_train.csv: 100% 259M/259M [00:38<00:00, 6.74MB/s]

Setup data and task information

In [ ]:

import os
import numpy as np
import pandas as pd
from functools import lru_cache

In [ ]:

train_data_dir = '.'
test_data_dir = '.'
task = 'task1'
PREDS_PER_SESSION = 100

In [ ]:

# Cache loading of data for multiple calls

@lru_cache(maxsize=1)
def read_product_data():
    return pd.read_csv(os.path.join(train_data_dir, 'products_train.csv'))

@lru_cache(maxsize=1)
def read_train_data():
    return pd.read_csv(os.path.join(train_data_dir, 'sessions_train.csv'))

@lru_cache(maxsize=3)
def read_test_data(task):
    return pd.read_csv(os.path.join(test_data_dir, f'sessions_test_{task}.csv'))
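
One consequence of caching the loaders with lru_cache is that every call returns the same object, so an in-place change made by one caller is visible to every later caller. A minimal sketch of that behavior, using a hypothetical `load_table` stand-in that returns a plain dict of lists instead of reading a CSV:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_table():
    # Hypothetical stand-in for pd.read_csv(...); returns a mutable object.
    return {"id": ["B000000001", "B000000002"]}

first = load_table()
second = load_table()

# Both names point at the very same cached object.
same_object = first is second

# Mutating the cached object leaks into later calls ...
first["id"].append("B000000003")
leaked_length = len(load_table()["id"])

# ... so take a copy before modifying anything.
safe = {key: list(values) for key, values in load_table().items()}
safe["id"].clear()
cached_length_after_copy_edit = len(load_table()["id"])
```

With a cached DataFrame, calling `.copy()` before any in-place change keeps the cached data intact.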

Data Description

The Multilingual Shopping Session Dataset is a collection of anonymized customer sessions containing products from six different locales, namely English, German, Japanese, French, Italian, and Spanish. It consists of two main components: user sessions and product attributes. User sessions are a list of products that a user has engaged with in chronological order, while product attributes include various details like product title, price in local currency, brand, color, and description.

Each product has its associated information:

locale: the locale code of the product (e.g., DE)

id: a unique identifier for the product, also known as the Amazon Standard Identification Number (ASIN) (e.g., B07WSY3MG8)

title: title of the item (e.g., “Japanese Aesthetic Sakura Flowers Vaporwave Soft Grunge Gift T-Shirt”)

price: price of the item in local currency (e.g., 24.99)

brand: item brand name (e.g., “Japanese Aesthetic Flowers & Vaporwave Clothing”)

color: color of the item (e.g., “Black”)

size: size of the item (e.g., “xxl”)

model: model of the item (e.g., “iphone 13”)

material: material of the item (e.g., “cotton”)

author: author of the item (e.g., “J. K. Rowling”)

desc: description of an item’s key features and benefits, called out via bullet points (e.g., “Solid colors: 100% Cotton; Heather Grey: 90% Cotton, 10% Polyester; All Other Heathers …”)
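
Many of these attributes are optional, and the sample rows shown in the EDA section contain several NaN values. A minimal sketch of measuring how often each attribute is filled in, run here on a tiny made-up frame rather than on products_train.csv:

```python
import pandas as pd

# Toy rows; IDs are borrowed from the sample output, other values are illustrative.
toy_products = pd.DataFrame({
    "id": ["B0B2KN1Q6M", "B099W7JSMT", "B007E9VUQS"],
    "locale": ["UK", "UK", "UK"],
    "color": [None, "Blue", "Brown (dark)"],
    "author": [None, None, None],
})

# Share of non-missing values per column, from 0.0 to 1.0.
coverage = toy_products.notna().mean()
print(coverage)
```

On the real data, `read_product_data().notna().mean()` gives the same kind of summary.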

EDA 💽

In [ ]:

def read_locale_data(locale, task):
    products = read_product_data().query(f'locale == "{locale}"')
    sess_train = read_train_data().query(f'locale == "{locale}"')
    sess_test = read_test_data(task).query(f'locale == "{locale}"')
    return products, sess_train, sess_test

def show_locale_info(locale, task):
    products, sess_train, sess_test = read_locale_data(locale, task)

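    # NOTE: `prev_items` is read from the CSV as a string (e.g. "['A' 'B']"),
    # so these lengths are character counts, not numbers of products.
    train_l = sess_train['prev_items'].apply(lambda sess: len(sess))
    test_l = sess_test['prev_items'].apply(lambda sess: len(sess))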

    print(f"Locale: {locale} \n"
          f"Number of products: {products['id'].nunique()} \n"
          f"Number of train sessions: {len(sess_train)} \n"
          f"Train session lengths - "
          f"Mean: {train_l.mean():.2f} | Median {train_l.median():.2f} | "
          f"Min: {train_l.min():.2f} | Max {train_l.max():.2f} \n"
          f"Number of test sessions: {len(sess_test)}"
        )
    if len(sess_test) > 0:
        print(
             f"Test session lengths - "
            f"Mean: {test_l.mean():.2f} | Median {test_l.median():.2f} | "
            f"Min: {test_l.min():.2f} | Max {test_l.max():.2f} \n"
        )
    print("======================================================================== \n")
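
Because `prev_items` is read from the CSV as a string such as "['B07GXQCFXK' 'B07GXQD5Y3']", calling `len` on it counts characters rather than products. A minimal sketch of counting products instead, assuming every ID is wrapped in single quotes as in the sample rows of this notebook; `parse_prev_items` is a hypothetical helper name. Evaluating the string as a Python literal is not a reliable shortcut here: the items are separated by whitespace rather than commas, so Python joins the adjacent string literals into one.

```python
import ast
import re

# Matches anything between a pair of single quotes.
ITEM_PATTERN = re.compile(r"'([^']+)'")

def parse_prev_items(raw):
    """Turn "['A' 'B' 'C']" into ['A', 'B', 'C']."""
    return ITEM_PATTERN.findall(raw)

raw_session = "['B07GXQCFXK' 'B07GXQD5Y3' 'B00E6722OK']"

char_count = len(raw_session)          # what len() on the raw string measures
items = parse_prev_items(raw_session)
product_count = len(items)             # number of products in the session

# Literal evaluation merges the adjacent strings into a single element.
joined = ast.literal_eval(raw_session)

print(char_count, product_count, len(joined))
```

With pandas, the helper can be applied as `sess_train['prev_items'].apply(lambda s: len(parse_prev_items(s)))` to obtain session lengths measured in products.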

In [ ]:

products = read_product_data()
locale_names = products['locale'].unique()
for locale in locale_names:
    show_locale_info(locale, task)

Locale: DE 
Number of products: 518327 
Number of train sessions: 1111416 
Train session lengths - Mean: 57.89 | Median 40.00 | Min: 27.00 | Max 2060.00 
Number of test sessions: 104568
Test session lengths - Mean: 57.23 | Median 40.00 | Min: 27.00 | Max 700.00 

======================================================================== 

Locale: JP 
Number of products: 395009 
Number of train sessions: 979119 
Train session lengths - Mean: 59.61 | Median 40.00 | Min: 27.00 | Max 6257.00 
Number of test sessions: 96467
Test session lengths - Mean: 59.90 | Median 40.00 | Min: 27.00 | Max 1479.00 

======================================================================== 

Locale: UK 
Number of products: 500180 
Number of train sessions: 1182181 
Train session lengths - Mean: 54.85 | Median 40.00 | Min: 27.00 | Max 2654.00 
Number of test sessions: 115936
Test session lengths - Mean: 53.51 | Median 40.00 | Min: 27.00 | Max 872.00 

======================================================================== 

Locale: ES 
Number of products: 42503 
Number of train sessions: 89047 
Train session lengths - Mean: 48.82 | Median 40.00 | Min: 27.00 | Max 792.00 
Number of test sessions: 0
======================================================================== 

Locale: FR 
Number of products: 44577 
Number of train sessions: 117561 
Train session lengths - Mean: 47.25 | Median 40.00 | Min: 27.00 | Max 687.00 
Number of test sessions: 0
======================================================================== 

Locale: IT 
Number of products: 50461 
Number of train sessions: 126925 
Train session lengths - Mean: 48.80 | Median 40.00 | Min: 27.00 | Max 621.00 
Number of test sessions: 0
========================================================================

In [ ]:

products.sample(5)

Out[ ]:

	id	locale	title	price	brand	color	size	model	material	author	desc
1535876	B08B3QTXJZ	IT	kwmobile Custodia Compatibile con Apple iPhone...	8.49	KW-Commerce	blu chiaro matt	NaN	49982.58_m000813	Silicone	NaN	ANTI URTO: i bordi rialzati della copertina pr...
1198060	B0B2KN1Q6M	UK	Me To You Bear Sister Just For You Birthday Card	2.99	Carte Blanche	NaN	NaN	NaN	NaN	NaN	NaN
1024050	B099W7JSMT	UK	Syhood 32.8 Feet Christmas Metallic Tinsel Twi...	8.99	Syhood	Blue	NaN	NaN	Metal	NaN	Christmas style decor: the Christmas metallic ...
895070	B0BCG44MBT	JP	ラップタオル大人用速乾大きいサイズ風呂用サウナ着るバスシャワー超吸水水泳温泉湯浴み着...	1589.00	OTTCFRN	ピンク	ワンサイズ	NaN	ポリエステル	NaN	3 Dトリミング設計、（非ベルクロデザイン）よりユーザーフレンドリーで、使用時に音が出ず、肌...
1084330	B007E9VUQS	UK	Smiffys Make-Up FX Face and Body Paint, 16 ml ...	2.94	Smiffy's	Brown (dark)	One Size	39184	NaN	NaN	Add colour to your dress-up costume!

In [ ]:

train_sessions = read_train_data()
train_sessions.sample(5)

Out[ ]:

	prev_items	next_item	locale
2994683	['B07T3GN2VH' 'B07T2FDFKZ' 'B07T3DJMT5' 'B098T...	B07ZJZNRMP	UK
190907	['B07ZRN33PQ' 'B07ZRMCRG7' 'B09C24TXP4' 'B09C2...	B091G94JDR	DE
3595388	['B09BJNQRNZ' 'B08XK8M5Z3' 'B09DPD5QJ8' 'B09KN...	B0B4RYT3ZS	IT
465436	['B07GXQCFXK' 'B07GXQD5Y3' 'B00E6722OK']	B00D3HZYGW	DE
3518477	['B09ZY6WYJX' 'B0B85CSNXW' 'B09ZY6WYJX' 'B0B85...	B083DRSWKR	IT

In [ ]:

test_sessions = read_test_data(task)
test_sessions.sample(5)

Out[ ]:

	prev_items	locale
202046	['B08H95Y452' 'B0BG3GRMF9' 'B0BG3GRMF9' 'B0BF5...	UK
98284	['B09RQ8T72D' 'B09998MBFM' 'B09RQ8T72D']	DE
191260	['B0871Z739B' 'B09N92NHGR' 'B0871Z739B']	JP
113547	['B0B56Q2VXW' 'B0B56NPJ4G' 'B0B56Q2VXW']	JP
102804	['B08G97TPH8' 'B08G91WFQR' 'B08G93D8LZ' 'B082P...	DE

Generate Submission 🏋️‍♀️

Submission format:

The submission should be a parquet file with the sessions from all the locales.
Predicted product IDs for each locale should only be valid product IDs of that locale.
Predictions should be added in a new column named "next_item_prediction".
Predictions should be a list of string ID values.
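
To make the format concrete, here is a minimal sketch of a submission frame for two made-up sessions. The product IDs are placeholders borrowed from the samples in this notebook, and a real submission needs one row per test session.

```python
import pandas as pd

# Toy stand-in for the test sessions; real data comes from sessions_test_task1.csv.
toy_sessions = pd.DataFrame({
    "prev_items": ["['B09RQ8T72D' 'B09998MBFM']", "['B0871Z739B' 'B09N92NHGR']"],
    "locale": ["DE", "JP"],
})

# One list of string product IDs per session, stored in the new column.
toy_submission = toy_sessions[["locale"]].copy()
toy_submission["next_item_prediction"] = [
    ["B07ZRN33PQ", "B07ZRMCRG7"],   # placeholder DE product IDs
    ["B0BCG44MBT"],                 # placeholder JP product ID
]

# Basic format checks: column present, every entry a list of strings.
has_column = "next_item_prediction" in toy_submission.columns
all_lists_of_str = all(
    isinstance(preds, list) and all(isinstance(p, str) for p in preds)
    for preds in toy_submission["next_item_prediction"]
)
print(has_column, all_lists_of_str)
```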

In [ ]:

def random_predicitons(locale, sess_test_locale):
    """Attach PREDS_PER_SESSION random product ids of `locale` to each session.

    Modifies `sess_test_locale` in place, so pass a copy.
    """
    random_state = np.random.RandomState(42)  # fixed seed for reproducibility
    products = read_product_data().query(f'locale == "{locale}"')
    predictions = []
    for _ in range(len(sess_test_locale)):
        predictions.append(
            list(products['id'].sample(PREDS_PER_SESSION, replace=True, random_state=random_state))
        )
    sess_test_locale['next_item_prediction'] = predictions
    sess_test_locale.drop('prev_items', inplace=True, axis=1)
    return sess_test_locale
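
Sampling with `replace=True` can return the same product more than once within one session's list. If distinct predictions are preferred, one option is to sample without replacement. A minimal sketch with NumPy's `Generator.choice`, using a made-up pool of IDs; `sample_distinct` is a hypothetical helper and not part of the starter kit:

```python
import numpy as np

def sample_distinct(product_ids, n_preds, rng):
    """Draw n_preds different IDs (or all of them if the pool is smaller)."""
    n_preds = min(n_preds, len(product_ids))
    # .tolist() converts NumPy strings to plain Python str values.
    return rng.choice(product_ids, size=n_preds, replace=False).tolist()

rng = np.random.default_rng(42)
pool = np.array([f"P{i:09d}" for i in range(1000)])  # made-up product IDs

preds = sample_distinct(pool, 100, rng)
print(len(preds), len(set(preds)))
```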

In [ ]:

test_sessions = read_test_data(task)
predictions = []
test_locale_names = test_sessions['locale'].unique()
for locale in test_locale_names:
    sess_test_locale = test_sessions.query(f'locale == "{locale}"').copy()
    predictions.append(
        random_predicitons(locale, sess_test_locale)
    )
predictions = pd.concat(predictions).reset_index(drop=True)
predictions.sample(5)

Out[ ]:

	locale	next_item_prediction
197622	JP	[B0B3JKGTBH, B07WFD1L1R, B0B1N2FMMG, B0BLJSMWJ...
108611	JP	[B07WV5GXPB, B0B7DS3HQL, B0866HDFTS, B009GQYDX...
284074	UK	[B09BB5SPR3, B0816CXMSZ, B08JV76967, B08MW68KC...
34652	DE	[B007H6POYW, B08M5GZGFT, B08JQZMFL7, B0BKP9BSL...
268639	UK	[B06XCGCKG7, B0B9NXKN54, B091YX63K7, B00Z65X1G...

Validate predictions ✅

In [ ]:

def check_predictions(predictions, check_products=False):
    """
    These tests need to pass as they will also be applied on the evaluator
    """
    test_locale_names = test_sessions['locale'].unique()
    for locale in test_locale_names:
        sess_test = test_sessions.query(f'locale == "{locale}"')
        preds_locale =  predictions[predictions['locale'] == sess_test['locale'].iloc[0]]
        assert sorted(preds_locale.index.values) == sorted(sess_test.index.values), f"Session ids of {locale} don't match"

        if check_products:
            # This check is not done on the evaluator
            # but you can run it to verify there is no mixing of products between locales
            # Since the ground truth next item will always belong to the same locale
            # Warning - This can be slow to run
            products = read_product_data().query(f'locale == "{locale}"')
            predicted_products = np.unique( np.array(list(preds_locale["next_item_prediction"].values)) )
            assert np.all( np.isin(predicted_products, products['id']) ), f"Invalid products in {locale} predictions"
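
Beyond the index check above, it can be useful to confirm locally that each prediction list has the expected shape. A minimal sketch, assuming the limit of 100 predictions per session set in `PREDS_PER_SESSION`; `check_prediction_lists` is a hypothetical helper, and whether the evaluator enforces the same rules is not stated in this notebook:

```python
def check_prediction_lists(pred_lists, max_preds=100):
    """Return a list of (row_position, problem) tuples; empty means all good."""
    problems = []
    for pos, preds in enumerate(pred_lists):
        if len(preds) == 0:
            problems.append((pos, "empty prediction list"))
        elif len(preds) > max_preds:
            problems.append((pos, f"{len(preds)} predictions, limit is {max_preds}"))
        if not all(isinstance(p, str) for p in preds):
            problems.append((pos, "non-string product id"))
    return problems

good = [["B07ZRN33PQ", "B07ZRMCRG7"], ["B0BCG44MBT"]]
bad = [[], ["B07ZRN33PQ", 12345]]

good_report = check_prediction_lists(good)
bad_report = check_prediction_lists(bad)
print(good_report, bad_report)
```

On the real frame this could be called as `check_prediction_lists(predictions['next_item_prediction'])`.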

In [ ]:

check_predictions(predictions)

In [ ]:

# It's important that the parquet file you submit is saved with the pyarrow backend
predictions.to_parquet(f'submission_{task}.parquet', engine='pyarrow')

Submit to AIcrowd 🚀

In [ ]:

# You can submit with aicrowd-cli, or upload manually on the challenge page.
!aicrowd submission create -c task-1-next-product-recommendation -f "submission_task1.parquet"

Comments

sanghyeon_lee

About 2 years ago

Thanks for sharing this notebook. It’s really helpful. In the EDA section, however, “sess_train[‘prev_items’].apply(lambda sess: len(sess))” returns the length of the string, not the number of products in the session. You need to use eval() first to make the string into an actual list.
