Getting Started with Speaker Identification

In this puzzle, we have to cluster the sentences so that those spoken by the same speaker end up together.

In this starter notebook:

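For vectorization: we will use TfidfVectorizer, which tokenizes each sentence and converts it into TF-IDF features.

For clustering: we will use K-Means, which is an unsupervised clustering algorithm rather than a classifier.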

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which lets you download the dataset and later make a submission directly from the notebook.

In [ ]:

!pip install aicrowd-cli
%load_ext aicrowd.magic

Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.10-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 2.2 MB/s 
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 9.9 MB/s 
Collecting requests<3,>=2.25.1
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 811 kB/s 
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 2.7 MB/s 
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     |████████████████████████████████| 170 kB 52.4 MB/s 
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.1-py3-none-any.whl (214 kB)
     |████████████████████████████████| 214 kB 45.7 MB/s 
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.5 MB/s 
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 6.8 MB/s 
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: smmap, requests, gitdb, commonmark, colorama, rich, requests-toolbelt, pyzmq, GitPython, aicrowd-cli
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 22.3.0
    Uninstalling pyzmq-22.3.0:
      Successfully uninstalled pyzmq-22.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.10 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.9 pyzmq-22.1.0 requests-2.26.0 requests-toolbelt-0.9.1 rich-10.16.1 smmap-5.0.0

In [ ]:

%aicrowd login

Please login here: https://api.aicrowd.com/auth/3FVg4wwyvurUqEL3VlY1JSlr9I5G1B7OcvbPyXojHQg
API Key valid
Saved API Key successfully!

Download Dataset

We will create a folder named data and download the files there.

In [ ]:

!rm -rf data
!mkdir data
%aicrowd ds dl -c speaker-identification -o data

In [ ]:

import os
import re
import pandas as pd

from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [ ]:

test_df = pd.read_csv("data/test.csv")

In [ ]:

test_df.head()

Out[ ]:

	id	sentence
0	19475	If you sit back and think about all that, that...
1	35980	oh my goodness i've run it again i wasn't mean...
2	12979	So I think that the whole world has moved towa...
3	40815	since I think it would be lame to not post any...
4	43475	And now, let’s use this new technique to\nappl...

In [ ]:

test_df.sentence[0]

Out[ ]:

"If you sit back and think about all that, that’s a lot of layers of complexity to wrap your mind around. So don't worry if it takes time for your mind to digest it all."

In [ ]:

sub_df = pd.read_csv("data/sample_sub.csv")

In [ ]:

sub_df.head()

Out[ ]:

	id	prediction
0	19475	NaN
1	35980	NaN
2	12979	NaN
3	40815	NaN
4	43475	NaN

In [ ]:

# Remove basic punctuation (, . ! ?), lower-case the text, and replace newlines with spaces
test_df.sentence = test_df.sentence.apply(lambda x: re.sub(r'[,.!?]', '', x))
test_df.sentence = test_df.sentence.apply(lambda x: x.lower())
test_df.sentence = test_df.sentence.apply(lambda x: x.replace("\n", " "))
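To see what these three cleaning steps do to a single string, here is a minimal sketch. The helper name clean_sentence and the sample text are made up for illustration and are not part of the dataset or the notebook.

```python
import re


def clean_sentence(text: str) -> str:
    """Apply the same three cleaning steps to one string (hypothetical helper)."""
    text = re.sub(r"[,.!?]", "", text)  # drop commas, periods, ! and ?
    text = text.lower()                 # lower-case everything
    return text.replace("\n", " ")      # replace newlines with spaces


# Made-up sample text, not taken from the dataset
example = "Hello, World!\nHow are you?"
cleaned = clean_sentence(example)
print(cleaned)
```

Note that only four punctuation characters are removed; apostrophes, quotes and other symbols are left in place.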

In [ ]:

test_df.head()

Out[ ]:

	id	sentence
0	19475	if you sit back and think about all that that’...
1	35980	oh my goodness i've run it again i wasn't mean...
2	12979	so i think that the whole world has moved towa...
3	40815	since i think it would be lame to not post any...
4	43475	and now let’s use this new technique to apply ...

In [ ]:

long_string = ','.join(list(test_df.sentence.values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="silver", max_words=1000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

Out[ ]:

[Image: word cloud generated from the test sentences]

In [ ]:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(test_df.sentence)

In [ ]:

print(type(X))

<class 'scipy.sparse.csr.csr_matrix'>
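The sparse matrix has one row per sentence and one column per vocabulary term. The sketch below shows this on three made-up sentences (an assumption for illustration, not dataset rows), using the same TfidfVectorizer settings as above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, not taken from the dataset
toy_sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks dropped sharply today",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy = toy_vectorizer.fit_transform(toy_sentences)

# One row per sentence, one column per term kept in the vocabulary
print(X_toy.shape)

# vocabulary_ maps each term to its column index; stop words such as "the" are dropped
print(sorted(toy_vectorizer.vocabulary_))

# With the default norm="l2", every non-empty row has unit length
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

Because rows are L2-normalised, the Euclidean distances that K-Means uses depend only on the angle between sentences, not on sentence length.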

Generating Predictions

Clustering using K-Means.

In [ ]:

true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)

In [ ]:

model.fit(X)

Out[ ]:

KMeans(max_iter=100, n_clusters=10)
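One way to get a feel for what the fitted clusters contain is to list the highest-weighted terms of each centroid. The following is a minimal sketch on made-up sentences with 2 clusters; the sentences, the variable names, and the fixed random_state are assumptions for illustration and are not used in the notebook itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, not taken from the dataset
toy_sentences = [
    "neural networks learn weights with gradient descent",
    "gradient descent updates the network weights",
    "bake the bread in a hot oven",
    "knead the dough before you bake the bread",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy = toy_vectorizer.fit_transform(toy_sentences)

# random_state is fixed only to make this sketch reproducible
toy_model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
                   n_init=10, random_state=0).fit(X_toy)

# Column index -> term, built from the vectorizer's vocabulary mapping
terms = np.array(sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get))

top_terms = {}
for cluster_id, centroid in enumerate(toy_model.cluster_centers_):
    top_idx = centroid.argsort()[::-1][:3]  # three largest centroid weights
    top_terms[cluster_id] = terms[top_idx].tolist()
    print(cluster_id, top_terms[cluster_id])
```

The same loop can be run with the notebook's own model and vectorizer to check whether the 10 clusters look meaningfully different.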

In [ ]:

# Work on a copy so that adding a column does not modify test_df
submission = test_df.copy()

In [ ]:

# X already holds the TF-IDF rows in the same order as test_df, so predict on the whole matrix at once
submission['prediction'] = model.predict(X)

In [ ]:

submission.head()

Out[ ]:

	id	sentence	prediction
0	19475	if you sit back and think about all that that’...	0
1	35980	oh my goodness i've run it again i wasn't mean...	0
2	12979	so i think that the whole world has moved towa...	0
3	40815	since i think it would be lame to not post any...	0
4	43475	and now let’s use this new technique to apply ...	1
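Since most of the first rows land in cluster 0, it can be worth checking how the sentences are spread across clusters. This is a minimal sketch on a made-up frame that only mirrors the five predictions shown above; the ids and the variable names are illustrative.

```python
import pandas as pd

# Made-up frame with the same column names as the submission
toy_submission = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "prediction": [0, 0, 0, 0, 1],
})

# Number of sentences assigned to each cluster
cluster_sizes = toy_submission["prediction"].value_counts().sort_index()
print(cluster_sizes)

# Share of sentences in the largest cluster; a value close to 1 means
# almost everything was put into a single cluster
largest_share = cluster_sizes.max() / len(toy_submission)
print(largest_share)
```

Running the same two lines on submission shows whether K-Means produced clusters of comparable size or one dominant cluster.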

In [ ]:

!rm -rf assets
!mkdir assets
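# Keep only the columns present in sample_sub.csv and skip the DataFrame index
submission[["id", "prediction"]].to_csv(os.path.join("assets", "submission.csv"), index=False)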
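Before submitting, it may help to compare the written file with the sample file. The sketch below uses a hypothetical helper, check_submission, and made-up files in a temporary folder; the checks are assumptions about what a well-formed file looks like, not the evaluator's actual rules.

```python
import os
import tempfile

import pandas as pd


def check_submission(sample_path, submission_path):
    """Hypothetical helper: list ways a submission differs from the sample file."""
    sample = pd.read_csv(sample_path)
    sub = pd.read_csv(submission_path)
    problems = []
    missing = [c for c in sample.columns if c not in sub.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "id" in sub.columns and set(sub["id"]) != set(sample["id"]):
        problems.append("ids differ from the sample file")
    if "prediction" in sub.columns and sub["prediction"].isna().any():
        problems.append("some predictions are NaN")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_sub.csv")
    good_path = os.path.join(tmp, "good.csv")
    bad_path = os.path.join(tmp, "bad.csv")

    # Made-up files that mimic the id/prediction layout of sample_sub.csv
    pd.DataFrame({"id": [1, 2, 3], "prediction": [float("nan")] * 3}).to_csv(sample_path, index=False)
    pd.DataFrame({"id": [1, 2, 3], "prediction": [0, 1, 0]}).to_csv(good_path, index=False)
    pd.DataFrame({"id": [1, 2], "prediction": [0, None]}).to_csv(bad_path, index=False)

    good_problems = check_submission(sample_path, good_path)
    bad_problems = check_submission(sample_path, bad_path)

print(good_problems)
print(bad_problems)
```

In the notebook, the same helper could be pointed at data/sample_sub.csv and assets/submission.csv.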

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [ ]:

%aicrowd notebook submit -c speaker-identification -a assets --no-verify

Using notebook: getting-started-notebook-for-speaker-identification.ipynb for submission...
Scrubbing API keys from the notebook...
Collecting notebook...

                                                     ╭─────────────────────────╮                                                      
                                                     │ Successfully submitted! │                                                      
                                                     ╰─────────────────────────╯

                                                           Important links                                                            
┌──────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification/submissions/169595              │
│                  │                                                                                                                 │
│  All submissions │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification/submissions?my_submissions=true │
│                  │                                                                                                                 │
│      Leaderboard │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification/leaderboards                    │
│                  │                                                                                                                 │
│ Discussion forum │ https://discourse.aicrowd.com/c/ai-blitz-xii                                                                    │
│                  │                                                                                                                 │
│   Challenge page │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification                                 │
└──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
