Speaker Identification

Solution for submission 170945

A detailed solution for submission 170945, submitted for the Speaker Identification challenge

mkeywood

Getting Started with Speaker Identification

In this puzzle, we have to cluster together the sentences spoken by the same speaker.

In this starter notebook:

For tokenization and vectorization: we will use TfidfVectorizer.

For clustering: we will use K-Means (a clustering algorithm, not a classifier), as sketched below.
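As a minimal sketch of that pipeline, on made-up toy sentences rather than the challenge data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

toy_sentences = [
    "so we train the model with gradient descent",
    "the model converges after a few epochs",
    "i watered the plants in the garden today",
]
toy_X = TfidfVectorizer().fit_transform(toy_sentences)   # sparse TF-IDF matrix, one row per sentence
toy_labels = KMeans(n_clusters=2, n_init=10).fit_predict(toy_X)
print(toy_labels)                                        # e.g. [0 0 1], one cluster id per sentence

Sentences with similar vocabulary should land in the same cluster, which is how the real sentences are grouped by speaker below.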

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which lets you download the dataset and later make a submission directly from the notebook.

In [1]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.10-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 952 kB/s 
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.2-py3-none-any.whl (214 kB)
     |████████████████████████████████| 214 kB 8.9 MB/s 
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     |████████████████████████████████| 170 kB 37.6 MB/s 
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 18.5 MB/s 
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Collecting requests<3,>=2.25.1
  Downloading requests-2.27.0-py2.py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 541 kB/s 
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 802 kB/s 
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.3 MB/s 
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 4.5 MB/s 
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Installing collected packages: smmap, requests, gitdb, commonmark, colorama, rich, requests-toolbelt, pyzmq, GitPython, aicrowd-cli
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 22.3.0
    Uninstalling pyzmq-22.3.0:
      Successfully uninstalled pyzmq-22.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.10 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.9 pyzmq-22.1.0 requests-2.27.0 requests-toolbelt-0.9.1 rich-10.16.2 smmap-5.0.0

Login to AIcrowd ㊗

In [2]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/QFFvWdfD4t9PBd8lTVlLsNY17etqW2i4pR2Q2SYY2Mo
API Key valid
Saved API Key successfully!

Download Dataset

We will create a folder named data and download the files there.

In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c speaker-identification -o data
In [4]:
import re,os
import pandas as pd

from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans #, AgglomerativeClustering
In [5]:
test_df = pd.read_csv("data/test.csv")
In [6]:
test_df.head()
Out[6]:
id sentence
0 19475 If you sit back and think about all that, that...
1 35980 oh my goodness i've run it again i wasn't mean...
2 12979 So I think that the whole world has moved towa...
3 40815 since I think it would be lame to not post any...
4 43475 And now, let’s use this new technique to\nappl...
In [7]:
test_df.sentence[0]
Out[7]:
"If you sit back and think about all that, that’s a lot of layers of complexity to wrap your mind around. So don't worry if it takes time for your mind to digest it all."
In [8]:
sub_df = pd.read_csv("data/sample_sub.csv")
In [9]:
sub_df.head()
Out[9]:
id prediction
0 19475 NaN
1 35980 NaN
2 12979 NaN
3 40815 NaN
4 43475 NaN
In [10]:
# Remove punctuation and newlines, and lower-case all text in the sentence column
test_df.sentence = test_df.sentence.apply(lambda x: re.sub('[,\.!?]', '', x))
test_df.sentence = test_df.sentence.apply(lambda x: x.lower())
test_df.sentence = test_df.sentence.apply(lambda x: x.replace("\n", " "))
In [11]:
test_df.sentence.values
Out[11]:
array(["if you sit back and think about all that that’s a lot of layers of complexity to wrap your mind around so don't worry if it takes time for your mind to digest it all",
       "oh my goodness i've run it again i wasn't meant to click that that's all right it should still work and then gradio look how easy it turns your python functions into an interface so if i type in my name daniel submit is this going to work oh no because the cell's not running that's all right ignore that we want to go for the big dog demo",
       'so i think that the whole world has moved toward using bigger than three datasets right digital civil society which is a lot of data and so for a lot of problems we have a lot of data i would probably use logistic regression',
       ...,
       "actually the video i did previously we talked about the relationship between the slope and the actual correlation okay so if you're interested in that look at the previous video does the confidence interval for the slope contain zero",
       'and now i wonder what happens if we add some ridges to the levels so it cannot only step through the gaps but has to climb let’s see…and we get those long long limbs that can indeed climb through the ridges excellent',
       'now this didn’t quite fit anywhere in this video but i really wanted to show you this heartwarming message from mark chen research scientist at openai this really showcases one of the best parts of my job and that is when the authors of the paper come in and enjoy the results with you fellow scholars loving it'],
      dtype=object)
In [12]:
test_df.head()
Out[12]:
id sentence
0 19475 if you sit back and think about all that that’...
1 35980 oh my goodness i've run it again i wasn't mean...
2 12979 so i think that the whole world has moved towa...
3 40815 since i think it would be lame to not post any...
4 43475 and now let’s use this new technique to apply ...
In [13]:
long_string = ','.join(list(test_df.sentence.values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="silver", max_words=1000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()
Out[13]:
[Word cloud image rendered from the test sentences]
In [14]:
vectorizer = TfidfVectorizer()  # alternative tried: TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(test_df.sentence)
In [15]:
X[0].todense().shape
Out[15]:
(1, 4829)
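So each sentence is represented as a 4,829-dimensional TF-IDF vector, with one dimension per term in the corpus vocabulary.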

Generating Predictions

Clustering using K-Means.

In [16]:
true_k = 10
#model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000)
#model = KMeans(n_clusters=true_k, init='k-means++', max_iter=5000)#, n_init=10)
model = KMeans(n_clusters=true_k, max_iter=2500, algorithm='full')
#model = MiniBatchKMeans(n_clusters=true_k)
#model = AgglomerativeClustering(n_clusters=true_k)
In [17]:
model.fit(X)
Out[17]:
KMeans(algorithm='full', max_iter=2500, n_clusters=10)
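The speaker count is fixed at true_k = 10 here. If it were unknown, one common heuristic (not used in this submission) is to compare silhouette scores across candidate values of k; a rough sketch, reusing the X fitted above:

from sklearn.metrics import silhouette_score

for k in range(5, 16):
    # algorithm='full' matches the notebook; newer scikit-learn calls this 'lloyd'
    candidate = KMeans(n_clusters=k, max_iter=2500, algorithm='full').fit(X)
    # Mean silhouette coefficient over all samples; higher suggests better-separated clusters
    print(k, silhouette_score(X, candidate.labels_))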
In [18]:
submission = test_df.copy()  # work on a copy so test_df itself is left untouched
In [19]:
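# Vectorize each sentence and assign it to the nearest cluster centroid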
submission['prediction'] = test_df.sentence.apply(lambda x: model.predict(vectorizer.transform([x])[0])[0])
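Since the model was fit on this same TF-IDF matrix X, model.labels_ should hold identical cluster assignments, without re-vectorizing each sentence one at a time:

submission['prediction'] = model.labels_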
In [20]:
submission.head()
Out[20]:
id sentence prediction
0 19475 if you sit back and think about all that that’... 0
1 35980 oh my goodness i've run it again i wasn't mean... 2
2 12979 so i think that the whole world has moved towa... 7
3 40815 since i think it would be lame to not post any... 2
4 43475 and now let’s use this new technique to apply ... 6
In [ ]:
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"))
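Note that to_csv writes the DataFrame index by default; if the grader expects only the id and prediction columns, as in sample_sub.csv, something like the following may be safer:

submission[["id", "prediction"]].to_csv(os.path.join("assets", "submission.csv"), index=False)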

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [ ]:
%aicrowd notebook submit -c speaker-identification -a assets --no-verify