Speaker Identification

Getting Started Notebook for Speaker Identification

A getting started notebook with random submission for the challenge.

ashivani

Getting Started with Speaker Identification

In this puzzle, we have to cluster sentences so that those spoken by the same speaker end up together.

In this starter notebook:

For vectorization: we will use TfidfVectorizer, which tokenizes each sentence and converts it into a TF-IDF feature vector.

For clustering: we will use the K-Means algorithm (an unsupervised clustering method, not a classifier).
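As a rough illustration of how these two pieces fit together, here is a minimal sketch on four made-up sentences (they are not from the challenge data, and the `toy_*` names are hypothetical). It vectorizes the text with TF-IDF and groups the vectors with K-Means; the cluster labels themselves are arbitrary integers.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (made up for illustration, not from the challenge data).
toy_sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell sharply on monday",
    "stocks rallied after the report",
]

# Turn each sentence into a sparse TF-IDF vector.
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_sentences)

# Group the vectors into two clusters; the label values are arbitrary.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_X)
print(toy_labels)
```

Sentences that share distinctive words tend to land in the same cluster, which is the only signal this baseline uses.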

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which lets you download the dataset and later make a submission directly from the notebook.

In [ ]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.10-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 2.2 MB/s 
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 9.9 MB/s 
Collecting requests<3,>=2.25.1
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 811 kB/s 
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 2.7 MB/s 
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     |████████████████████████████████| 170 kB 52.4 MB/s 
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.1-py3-none-any.whl (214 kB)
     |████████████████████████████████| 214 kB 45.7 MB/s 
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.5 MB/s 
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 6.8 MB/s 
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: smmap, requests, gitdb, commonmark, colorama, rich, requests-toolbelt, pyzmq, GitPython, aicrowd-cli
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 22.3.0
    Uninstalling pyzmq-22.3.0:
      Successfully uninstalled pyzmq-22.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.10 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.9 pyzmq-22.1.0 requests-2.26.0 requests-toolbelt-0.9.1 rich-10.16.1 smmap-5.0.0

Login to AIcrowd ㊗

In [ ]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/3FVg4wwyvurUqEL3VlY1JSlr9I5G1B7OcvbPyXojHQg
API Key valid
Saved API Key successfully!

Download Dataset

We will create a folder named data and download the files there.

In [ ]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c speaker-identification -o data
In [ ]:
import os
import re
import pandas as pd

from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
In [ ]:
test_df = pd.read_csv("data/test.csv")
In [ ]:
test_df.head()
Out[ ]:
id sentence
0 19475 If you sit back and think about all that, that...
1 35980 oh my goodness i've run it again i wasn't mean...
2 12979 So I think that the whole world has moved towa...
3 40815 since I think it would be lame to not post any...
4 43475 And now, let’s use this new technique to\nappl...
In [ ]:
test_df.sentence[0]
Out[ ]:
"If you sit back and think about all that, that’s a lot of layers of complexity to wrap your mind around. So don't worry if it takes time for your mind to digest it all."
In [ ]:
sub_df = pd.read_csv("data/sample_sub.csv")
In [ ]:
sub_df.head()
Out[ ]:
id prediction
0 19475 NaN
1 35980 NaN
2 12979 NaN
3 40815 NaN
4 43475 NaN
In [ ]:
# Strip basic punctuation (, . ! ?), lower-case the text, and replace newlines with spaces
test_df.sentence = test_df.sentence.apply(lambda x: re.sub(r'[,.!?]', '', x))
test_df.sentence = test_df.sentence.apply(lambda x: x.lower())
test_df.sentence = test_df.sentence.apply(lambda x: x.replace("\n", " "))
In [ ]:
test_df.head()
Out[ ]:
id sentence
0 19475 if you sit back and think about all that that’...
1 35980 oh my goodness i've run it again i wasn't mean...
2 12979 so i think that the whole world has moved towa...
3 40815 since i think it would be lame to not post any...
4 43475 and now let’s use this new technique to apply ...
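The same three cleaning steps can also be wrapped in one function, which makes them easier to test on a single string. This is only a sketch: `clean_sentence` is a hypothetical helper name, not part of the original notebook.

```python
import re


def clean_sentence(text):
    """Strip , . ! ? then lower-case and replace newlines with spaces."""
    text = re.sub(r"[,.!?]", "", text)
    text = text.lower()
    return text.replace("\n", " ")


print(clean_sentence("And now, let's use this new technique to\napply it!"))
```

Note that only four punctuation characters are removed; apostrophes, quotes, and other symbols are left in place.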
In [ ]:
long_string = ','.join(list(test_df.sentence.values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="silver", max_words=1000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()
Out[ ]:
[Image: word cloud of the words in the test sentences]
In [ ]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(test_df.sentence)
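`fit_transform` returns a SciPy sparse matrix with one row per sentence and one column per vocabulary term. The sketch below uses three made-up documents (the `demo_*` names are hypothetical) to show how the shape, the number of stored non-zero values, and the learned vocabulary can be inspected.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only.
demo_docs = ["good morning everyone", "good night", "morning coffee"]

demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform(demo_docs)

# Rows are documents, columns are vocabulary terms.
print(demo_matrix.shape)
# Number of explicitly stored (non-zero) TF-IDF weights.
print(demo_matrix.nnz)
# vocabulary_ maps each term to its column index.
print(sorted(demo_vectorizer.vocabulary_))
```

Most entries are zero because each sentence uses only a small part of the vocabulary, which is why a sparse format is used.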
In [ ]:
print(type(X))
<class 'scipy.sparse.csr.csr_matrix'>

Generating Predictions

Clustering using K-Means.

In [ ]:
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
In [ ]:
model.fit(X)
Out[ ]:
KMeans(max_iter=100, n_clusters=10)
In [ ]:
# Copy so that adding the prediction column does not also modify test_df
submission = test_df.copy()
In [ ]:
# X already holds the TF-IDF rows in the same order as test_df, so predict on it in one call
submission['prediction'] = model.predict(X)
In [ ]:
submission.head()
Out[ ]:
id sentence prediction
0 19475 if you sit back and think about all that that’... 0
1 35980 oh my goodness i've run it again i wasn't mean... 0
2 12979 so i think that the whole world has moved towa... 0
3 40815 since i think it would be lame to not post any... 0
4 43475 and now let’s use this new technique to apply ... 1
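Since most of the rows shown fall into cluster 0, it can be worth checking how many sentences each cluster received. A minimal sketch, using a hypothetical `demo_predictions` series in place of the real prediction column:

```python
import pandas as pd

# Hypothetical predictions standing in for the notebook's prediction column.
demo_predictions = pd.Series([0, 0, 0, 0, 1], name="prediction")

# Count how many sentences fall into each cluster label.
cluster_sizes = demo_predictions.value_counts().sort_index()
print(cluster_sizes)
```

A very lopsided distribution may suggest that the TF-IDF features are not separating the speakers well.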
In [ ]:
!rm -rf assets
!mkdir assets
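# Keep only the columns present in sample_sub.csv and skip the DataFrame index
submission[['id', 'prediction']].to_csv(os.path.join("assets", "submission.csv"), index=False)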
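The sample submission shown earlier has two columns, id and prediction. Assuming that is the layout the evaluator expects, a saved file can be read back to confirm its columns and that no prediction is missing. The sketch below uses a hypothetical `demo_submission` frame and a temporary directory rather than the real assets folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the notebook's submission frame.
demo_submission = pd.DataFrame({"id": [19475, 35980], "prediction": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    # index=False keeps the row index out of the CSV.
    demo_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(int(reloaded["prediction"].isna().sum()))
```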

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [ ]:
%aicrowd notebook submit -c speaker-identification -a assets --no-verify
Using notebook: getting-started-notebook-for-speaker-identification.ipynb for submission...
Scrubbing API keys from the notebook...
Collecting notebook...


                                                     ╭─────────────────────────╮                                                      
                                                     │ Successfully submitted! │                                                      
                                                     ╰─────────────────────────╯                                                      
                                                           Important links                                                            
┌──────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification/submissions/169595              │
│                  │                                                                                                                 │
│  All submissions │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification/submissions?my_submissions=true │
│                  │                                                                                                                 │
│      Leaderboard │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification/leaderboards                    │
│                  │                                                                                                                 │
│ Discussion forum │ https://discourse.aicrowd.com/c/ai-blitz-xii                                                                    │
│                  │                                                                                                                 │
│   Challenge page │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/speaker-identification                                 │
└──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘