Lingua Franca Translation

Solution for submission 172021

A detailed solution for submission 172021 submitted for challenge Lingua Franca Translation

youssef_nader3

Getting Started with Lingua Franca Translation

In this puzzle, we have to translate crowd-talk language into English. There are multiple ways to build the translator:

  • Using Dictionary and Mapping
  • Using LSTM
  • Using Transformers

In this starter notebook, we'll go with dictionary and mapping: we'll build a vocabulary of words for both English and the crowd-talk language and map between them.
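
As a rough illustration of the dictionary-and-mapping idea, the sketch below translates token by token through a lookup table. The three entries are an illustrative guess read off one training row, and the `translate` helper and the `<unk>` placeholder are hypothetical; the real mapping is learned from the training data later in the notebook.

```python
# Toy lookup table: crowd-talk token -> English word (illustrative guess only).
toy_dict = {
    "toirts_choolt": "but",
    "chiugy": "how",
    "knusm": "am",
}

def translate(tokens, mapping, unknown="<unk>"):
    """Translate token by token; tokens missing from the mapping become a placeholder."""
    return [mapping.get(tok, unknown) for tok in tokens]

toy_output = translate(["toirts_choolt", "chiugy", "knusm", "gheold"], toy_dict)
print(" ".join(toy_output))
```

Anything the table has never seen falls through to the placeholder, which is the main weakness of a pure lookup approach.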

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which will help you download the dataset and later make a submission directly from the notebook.

In [1]:
%%capture
!pip install aicrowd-cli
%load_ext aicrowd.magic

Login to AIcrowd ㊗

In [2]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/sZ7gLTekjIZZOfaLq4ddVltpashghpepc9YlzfZqbIU
API Key valid
Saved API Key successfully!

Download Dataset

We will create a folder named data and download the files there.

In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c lingua-franca-translation -o data

Importing Necessary Libraries

In [4]:
import os
import pandas as pd
import gensim
from sklearn.metrics.pairwise import cosine_similarity
In [21]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[21]:
True

Diving into the dataset:

In [6]:
train_df = pd.read_csv("data/train.csv")
In [7]:
test_df = pd.read_csv("data/test.csv")
In [8]:
from gensim.models import Phrases
from gensim.models import Word2Vec
# Train a bigram detector.

my_sents=[s.split(" ") for s in train_df.crowdtalk]

bigram_transformer = Phrases(my_sents,min_count=3)

# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(bigram_transformer[my_sents], min_count=1)
/usr/local/lib/python3.7/dist-packages/gensim/models/phrases.py:598: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
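
To make the bigram step less of a black box, here is a minimal pure-Python sketch of count-based phrase detection. It is a simplified stand-in, not gensim's implementation: the helper names are hypothetical, and the score formula is only loosely modelled on the kind of count-based scoring gensim describes, so treat it as an assumption. The threshold is lowered because the toy corpus is tiny.

```python
from collections import Counter

def find_bigrams(sentences, min_count=3, threshold=10.0):
    """Return adjacent word pairs that co-occur often enough to be treated as one token."""
    unigrams = Counter(w for s in sentences for w in s)
    pairs = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in pairs.items():
        if n_ab < min_count:
            continue
        # Assumed score: frequent pairs made of otherwise rare words score highest.
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            phrases.add((a, b))
    return phrases

def merge_bigrams(sentence, phrases, delimiter="_"):
    """Greedily join detected pairs, scanning left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            out.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

toy_corpus = [s.split() for s in [
    "toirts choolt chiugy",
    "toirts choolt knusm",
    "toirts choolt gheold",
    "toirts choolt chiugy knusm",
]]
toy_phrases = find_bigrams(toy_corpus, min_count=3, threshold=0.1)
merged = merge_bigrams(toy_corpus[0], toy_phrases)
print(merged)
```

Merging such pairs matters here because one English word often seems to correspond to two crowd-talk words.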
In [9]:
my_sents=[s.split(" ") for s in train_df.crowdtalk]
In [10]:
bi_words=set([s.decode('utf-8') for s in list(bigram_transformer.vocab.keys())])
In [85]:
len(bi_words)
Out[85]:
62686
In [96]:
crowd_words=[]
lengths=[]
for s in train_df.crowdtalk.values:
  words= s.split(" ")
  lengths.append(len(words))
  for w in words:
    crowd_words.append(w)
In [97]:
my_test=[s.split(" ") for s in test_df.crowdtalk]
test_set=[]
for s in bigram_transformer[my_test]:
  for w in s:
    test_set.append(w)
In [12]:
not_present=[]
present=[]
for word in test_set:
  if word not in bi_words:
    not_present.append(word)
  else:
    present.append(word)
In [19]:
len(not_present)
Out[19]:
1242
In [11]:
words=[[word.lower() for word in nltk.word_tokenize(s) if word.isalnum()] for s in train_df.english.values]
english_words=[]
for s in words:
  for w in s:
    english_words.append(w)
english_words=set(english_words)
In [83]:
len(english_words)
Out[83]:
8970
In [12]:
crowd_indices={v:k for k,v in enumerate(bi_words)}
english_indices={v:k for k,v in enumerate(english_words)}
In [13]:
reverse_english_idx={v:k for k,v in english_indices.items()}
In [29]:
import numpy as np
crowd_to_english_mat=np.zeros((len(english_words),len(bi_words)))
In [ ]:
words
In [30]:
my_sents=[s.split(" ") for s in train_df.crowdtalk]
for crowd_sentence,english_sentence in zip(my_sents,words):
  bi_sent=bigram_transformer[crowd_sentence]
  for wi,word in enumerate(bi_sent):
    try:
      word_idx=crowd_indices[word]
      ewords_idx=[english_indices[eword]for eword in english_sentence]
      for ei,eidx in enumerate(ewords_idx):
        if abs(wi-ei)<2:
          crowd_to_english_mat[eidx,word_idx]+=1
      # crowd_to_english_mat[ewords_idx,word_idx]+=1
    except KeyError:
      pass
In [102]:
crowd_to_english_mat.shape
Out[102]:
(8970, 62686)
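
The counting loop that fills this matrix can be sketched without NumPy. The version below keeps the same idea (credit an English word only when its position is within one step of the crowd-talk token's position) but stores counts in nested dictionaries. The function name and the placeholder tokens `w1`, `w2`, ... are hypothetical.

```python
from collections import Counter, defaultdict

def count_alignments(pairs, window=2):
    """pairs: iterable of (crowd_tokens, english_tokens).

    For every crowd token, count the English words whose position differs
    from the token's position by less than `window`.
    """
    counts = defaultdict(Counter)
    for crowd_tokens, english_tokens in pairs:
        for ci, cword in enumerate(crowd_tokens):
            for ei, eword in enumerate(english_tokens):
                if abs(ci - ei) < window:
                    counts[cword][eword] += 1
    return counts

toy_pairs = [
    (["w1", "w2", "w3"], ["upon", "this", "ladder"]),
    (["w1", "w4"], ["upon", "them"]),
]
toy_counts = count_alignments(toy_pairs)
print(toy_counts["w1"].most_common(1))
```

The positional window assumes the two languages have roughly the same word order, which is also why reordered pairs such as "the of" show up in the output further down.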
In [35]:
from sklearn.preprocessing import StandardScaler
scl=StandardScaler(copy=False)
In [16]:
for i in range(crowd_to_english_mat.shape[0]):
  crowd_to_english_mat[i,crowd_to_english_mat[i,:]!=crowd_to_english_mat[i,:].max()]=0
In [22]:
for w in stopwords.words('english'):
  if w in english_words:
    i=english_indices[w]
    crowd_to_english_mat[i,crowd_to_english_mat[i,:]!=np.max(crowd_to_english_mat[i,:])]=0
In [31]:
translation_dict={}
crowd_to_english_mat/=(np.mean(crowd_to_english_mat, axis=1).reshape(-1,1)+1)
for word in bi_words:
  word_idx=crowd_indices[word]
  max_trans_idx=np.argmax(crowd_to_english_mat[:,word_idx])
  translation=reverse_english_idx[max_trans_idx]
  translation_dict[word]=translation
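
The normalisation in this cell divides each English word's row by its mean count plus one before taking the best match per crowd-talk word, which keeps very frequent words such as "the" from winning every column. A small dictionary-based sketch of that step follows; the function name and the toy counts are made up for illustration.

```python
def build_translation_dict(counts, crowd_vocab):
    """counts: {english_word: {crowd_word: raw_count}}.

    Scale every English row by (row mean over the whole crowd vocabulary + 1),
    then keep, for each crowd word, the English word with the highest scaled count.
    """
    n = len(crowd_vocab)
    best = {}
    for eword, row in counts.items():
        scale = sum(row.values()) / n + 1.0
        for cword, raw in row.items():
            score = raw / scale
            if cword not in best or score > best[cword][1]:
                best[cword] = (eword, score)
    return {cword: eword for cword, (eword, _) in best.items()}

toy_matrix = {
    "the": {"c1": 6, "c2": 6, "c3": 6},   # frequent everywhere
    "ladder": {"c3": 4},                   # concentrated on one crowd word
}
toy_translation = build_translation_dict(toy_matrix, ["c1", "c2", "c3"])
print(toy_translation)
```

Here `c3` maps to "ladder" even though its raw count with "the" is higher. Unlike the matrix version, this sketch simply omits crowd words that never co-occurred with anything.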
In [143]:
len(stopwords.words('english'))
Out[143]:
179
In [32]:
import nltk.translate.bleu_score as bleu
bleues=[]
sentences=[]
for i in range(len(train_df.crowdtalk.values)):
  reference_trans=[train_df.english[i].lower().split(" ")]
  candidate=[translation_dict[w] for w in bigram_transformer[my_sents[i]]]
  sentences.append(" ".join(candidate))
  score=bleu.sentence_bleu(reference_trans,candidate)
  bleues.append(score)
  if score<.3:
    print(" ".join(reference_trans[0]),"||||"," ".join(candidate))
/usr/local/lib/python3.7/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: 
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)
/usr/local/lib/python3.7/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: 
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)
/usr/local/lib/python3.7/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: 
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)
at the centre of the island there is a chasm about fifty yards in diameter |||| at the the centre the the of island is is a fifty fifty yards in diameter
“that for the sake of my patron the king of luggnagg |||| that for the the sake my of patron the the king of luggnagg
“that if good fortune ever restored me to my native country |||| that if fortune ever restored me my to native
in order to stifle or divert the clamour of the subjects against their evil administration. |||| in order to theodorus or divert the the clamour the the of subjects against their evil administration
pronouncing with fervour the names of the most distinguished discoverers. |||| pronouncing with fervour the the names the the of most distinguished trample
and the hue both of that and the dug |||| the the and hue both of that the the and casually
“the volume of plutarch’s lives which i possessed contained the histories of the first founders of the ancient republics. |||| the the volume of plutarch lives which i possessed contained the the histories the the of first ancient the the of ancient trample
as soon as the younger of the two comes to be fourscore |||| as as as the the younger of the the two to be to fourscore
the weather became fine and the skies cloudless. |||| the the weather became fine the the and reflections cloudless
in such a manner that the brain may be equally divided. |||| in such a manner the the that brain be be equally divided
my benefactor—” the human frame could no longer support the agonies that i endured |||| my benefactor the the human frame could longer support the the agonies i that endured
and found we were in the latitude of 46 n. and longitude of 183. |||| and found we were the the in latitude of 46 longitude and 183 of trample
“the treasurer was of the same opinion: he showed to what straits his majesty’s revenue was reduced |||| the the treasurer was of same the the opinion he showed to what majesty his s revenue was reduced
which i had soon too much cause to repent: for i found afterwards |||| which had i soon too cause to repent for i found afterwards
being maintained all the time at the king’s charge. |||| being maintained all the the time the the at s charge
the queen’s joiner had contrived in one of glumdalclitch’s rooms |||| the the s joiner had contrived in of one glumdalclitch rooms
we travelled at the time of the vintage and heard the song of the labourers as we glided down the stream. |||| we travelled at the the time the the of heard and heard the the song the the of as as we stream down the the stream
let the stone be placed in the position c d |||| let the the stone be placed the the in position d d
the porter opened the gates of the court |||| the the porter opened the the gates the the of court
several officers of the army went to the door of the great council-chamber |||| several officers the the of army went to the the door the the of great trample
and i feared the effects of the dæmon’s disappointment. |||| and i feared the the effects the the of disappointment disappointment
for the girl was afraid of the queen’s anger |||| for the the girl was afraid the the of s anger
as well as the windows and the door of my bigger box |||| as as as the the windows and the the door my of bigger box
but soothing inasmuch as it supposed the safety of my friends. |||| but soothing command as it supposed the the safety my of friends
yet the stages of the discovery were distinct and probable. |||| yet the the stages the the of discovery were distinct and probable
that they were first introduced by the grandfather of the emperor now reigning |||| that they were first upon by the the grandfather the the of emperor now trample
while the lake reflected the scene of the busy heavens |||| while the the lake reflected the the scene the the of busy heavens
which took up the greatest part of both the length and breadth of the room |||| which took the the greatest of both the the length and breadth the the of room
their icy and glittering peaks shone in the sunlight over the clouds. |||| their icy and glittering glittering shone the the in sunlight over the the clouds
the other being appropriated to the advancers of speculative learning |||| the the other being appropriated the the to speculative of scotch learning
the old man walked before the cottage in the sun for a few minutes |||| the the man walked before the the cottage the the in sun for a minutes
in the right coat-pocket of the great man-mountain” (for so i interpret the words quinbus flestrin |||| in the the right great the the of great combs for so i interpret the the words flestrin
i inquired of the inhabitants concerning the fiend and gained accurate information. |||| i inquired the the of inhabitants concerning the the fiend and gained information information
i stored the boat with the carcases of a hundred oxen |||| i stored the the boat with the the hundred a of hundred trample
when the scene is on the point of closing before me for ever |||| when the the scene is the the on point of before before me for ever
and building by the swallow (for so i translate the word lyhannh |||| and building the the by swallow for so i translate the the word trample
until the eye recovered itself from the preceding flash. |||| until the the eye recovered itself the the from flash flash
unacquainted with the language of the country and utterly ignorant of the customs of the world. |||| unacquainted with the the language the the of country and utterly ignorant the the of customs the the of world
the dancing of the sunbeams in the sea |||| the the dancing of the the sunbeams the the in sea
“he had orders from the captain to set me ashore.” |||| he had orders the the from captain to set me ashore
to the great ease as well as health of the subject |||| to the the great ease as as as health the the of subject
i greedily devoured the remnants of the shepherd’s breakfast |||| i greedily devoured the the remnants the the of breakfast breakfast
and thirty-two thousand horse: if that may be called an army |||| and horse thousand horse if that be be called an army
and neither resembling the harmony of the old man’s instrument nor the songs of the birds |||| and neither resembling the the harmony the the of old s instrument nor the the of the the of birds
and i saw the grave-worms crawling in the folds of the flannel. |||| and i saw the the crawling folds in the the folds the the of trample
they boast that the king’s army consists of a hundred and seventy-six thousand foot |||| they boast that the the s army consists a of hundred and foot thousand foot
he represented to the emperor “the low condition of his treasury |||| he represented to the the emperor the the low condition his of treasury
the report of the pistol brought a crowd into the room. |||| the the report of the the a brought a crowd the the into room
had mingled a sleepy potion in the hogsheads of wine. |||| had mingled a sleepy sleepy the the in hogsheads of wine
cursed (although i curse myself) be the hands that formed you! |||| cursed although i curse myself be the the hands that formed you
with the utmost brevity and in the plainest words |||| with the the utmost brevity and the the in words words
my papa is a syndic—he is m. frankenstein—he will punish you. |||| my papa is a will is punish punish will punish you
the arrival of the arabian now infused new life into his soul. |||| the the arrival the the of arabian now into new life into his soul
the triumph of my enemy increased with the difficulty of my labours. |||| the the triumph my of enemy increased with the the difficulty my of labours
but terminated with the limits of the king’s dominions |||| but terminated with the the limits the the of s dominions
“he suffered not in the consummation of the deed. |||| he suffered not in the the consummation the the of deed
about an hour before she heard of the discovery of the body |||| about hour before she heard the the of discovery the the of body
who belonged to one of the clerks of the kitchen. |||| who belonged to one the the of clerks the the of kitchen
he loved with ardour:— ——the sounding cataract haunted him like a passion: the tall rock |||| he loved with ardour the the cataract haunted haunted him like a passion the the tall rock
i learned also the names of the cottagers themselves. |||| i learned also the the names the the of cottagers themselves
In [183]:
pd.DataFrame({"english":train_df.english,"translated":sentences}).to_csv('inspection.csv')
In [33]:
np.mean(bleues)
Out[33]:
0.7034426681786159
In [34]:
reference_trans=["i went to the park".lower().split(" ")]
candidate2='i went the to park'.split()
bleu.sentence_bleu(reference_trans,candidate2)
/usr/local/lib/python3.7/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: 
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  warnings.warn(_msg)
Out[34]:
0.7071067811865476
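
The 0.7071 above is worth unpacking, because the candidate has two words swapped and no matching 3-grams or 4-grams. The sketch below recomputes the modified n-gram precisions by hand. The score equals 0.25 ** 0.25, which is what results if only the 1-gram and 2-gram precisions enter the product, each with weight 0.25. That reading is an inference from the number and the warning text, not something stated in the notebook, and the helper names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram matches and total candidate n-grams."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap, sum(cand.values())

reference = "i went to the park".split()
candidate = "i went the to park".split()

precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]

# Assumption: orders from the first zero-overlap order onward are left out.
kept = []
for matched, total in precisions:
    if matched == 0:
        break
    kept.append(matched / total)

# Candidate and reference have the same length, so the brevity penalty is 1.
score = math.exp(sum(0.25 * math.log(p) for p in kept))
print(precisions, round(score, 4))
```

If that inference holds, sentence-level scores in this notebook, including the training average, are likely optimistic; the warning's suggestion to use `SmoothingFunction()` would give a stricter number.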
In [ ]:
train_df.crowdtalk[2]
Out[ ]:
'toirts choolt chiugy knusm squiend sriohl gheold'
In [ ]:
[translation_dict[w] for w in bigram_transformer[my_sents[1]]],[train_df.english[1].lower().split(" ")]
In [ ]:
[translation_dict[w] for w in bigram_transformer[my_sents[0]]]
Out[ ]:
['upon', 'this', 'ladder', 'of', 'one', 'them', 'mounted']
In [ ]:
for p in bigram_transformer.export_phrases(train_df.crowdtalk.values):
  print(p)
In [ ]:
bigram_transformer[my_sents[2]]
Out[ ]:
['toirts_choolt', 'chiugy', 'knusm', 'squiend_sriohl', 'gheold']
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='word',ngram_range=(1,2),min_df=3,max_features=100)
cv_fit=cv.fit_transform(train_df.crowdtalk)
In [ ]:
print()
[ 516 4372  774 2020 2020 2068 2068 3658  807  807  874  874  842  593
  593  537  537  600 1054 2020  419  419  921 3658 3658  465  465  855
  405  405 1512  690  405  670  670  555  555  874  882  882  595  600
  600 1477 1477  419  593 1054 1054  571  571  670 2486 4007 6554 6554
  882  690  690 4372 4372  928  928 1233 1233  855  855 1512 1512  516
  516  595  595  921  921 3534 3534 3534  807  571 1233  842  842  774
  774 1477 4007 4007 2068  465  928  847  847  847 2486 2486  537 6554
  695  555]
In [ ]:
print()
for k,v in zip(cv.get_feature_names(),cv_fit.toarray().sum(axis=0)):
  print(k,v)
In [ ]:
model.train(bigram_transformer[my_sents], total_examples=len(train_df.crowdtalk.values), epochs=3)
Out[ ]:
(243383, 354384)
In [ ]:
model.vocabulary.raw_vocab
Out[ ]:
defaultdict(int, {})
In [ ]:
train_df
Out[ ]:
id crowdtalk english
0 31989 wraov driourth wreury hyuirf schneiald chix lo... upon this ladder one of them mounted
1 29884 treuns schleangly kriaors draotz pfiews schlio... and solicited at the court of Augustus to be p...
2 26126 toirts choolt chiugy knusm squiend sriohl gheold but how am I sunk!
3 44183 schlioncy yoik yahoos dynuewn maery schlioncy ... the Yahoos draw home the sheaves in carriages
4 19108 treuns schleangly tsiens mcgaantz schmeecks tr... and placed his hated hands before my eyes
... ... ... ...
11950 50106 hydriaond cieurry mcdaabs swiings schlioncy yo... about five hundred leagues to the east
11951 14786 treuns schleangly criaody treuns schleangly wr... ) and two and a half in breadth
11952 16903 toirts choolt cycluierg triild schuony hypuids... “But my toils now drew near a close
11953 68451 toantz spluiey gheuck schoutch spluiey gheuck ... going as soon as I was dressed to pay my atten...
11954 30895 shriedy hyoirds splauetch sooc kniousts schlai... for there was no sign of any violence except t...

11955 rows × 3 columns

In [87]:
crowd_words=[]
lengths=[]
for s in train_df.crowdtalk.values:
  words= s.split(" ")
  lengths.append(len(words))
  for w in words:
    crowd_words.append(w)
In [88]:
len(set(crowd_words))
Out[88]:
9245
In [ ]:
import jellyfish
from tqdm.notebook import tqdm
mins=[]
crowd_vocab=list(set(crowd_words))
for word in tqdm(not_present):
  matches=[jellyfish.levenshtein_distance(word, b_token) for b_token in crowd_vocab]
  partial_idx=np.argmin(matches)  # index of the closest vocabulary word
  mins.append(partial_idx)
  if np.min(matches)==1:
    print(word,crowd_vocab[np.argmin(matches)])
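
For readers without `jellyfish` installed, the nearest-word lookup used for unseen test tokens can be sketched in plain Python. This is a standard dynamic-programming Levenshtein distance; the `closest` helper and the toy vocabulary are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest(word, vocab):
    """Return (nearest vocabulary word, distance); ties go to the earliest entry."""
    best = min(vocab, key=lambda v: levenshtein(word, v))
    return best, levenshtein(word, best)

toy_vocab = ["chiugy", "knusm", "gheold"]
print(closest("knusms", toy_vocab))
```

A distance of 1 usually means a single added or changed letter, which is the case the loop above prints.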
In [ ]:
mins
In [ ]:
import numpy as np
np.mean(lengths)
Out[ ]:
13.66624843161857
In [ ]:
english_words=[]
lengths=[]
for s in train_df.english.values:
  words= s.split(" ")
  lengths.append(len(words))
  for w in words:
    english_words.append(w.lower())
In [ ]:
len(english_words),len(crowd_words),len(set(english_words))
Out[ ]:
(112431, 163380, 11492)
In [ ]:
english = train_df.english.values
crowdtalk = train_df.crowdtalk.values
In [ ]:
english
Out[ ]:
array(['upon this ladder one of them mounted',
       'and solicited at the court of Augustus to be preferred to a greater ship',
       'but how am I sunk!', ..., '“But my toils now drew near a close',
       'going as soon as I was dressed to pay my attendance upon his honour',
       'for there was no sign of any violence except the black mark of fingers on his neck.'],
      dtype=object)
In [ ]:
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in english]
#eng_word_list = [word for words in processedLines for word in words]

eng_word_list = [word[0] for word in processedLines ]  # only the first word of each sentence (BLEU = 0.080)
In [ ]:
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in crowdtalk]
#crowdtalk_word_list = [word for words in processedLines for word in words]

crowdtalk_word_list = [word[0] for word in processedLines]  # only the first word of each sentence (BLEU = 0.080)
In [ ]:
dict1 = dict(zip(crowdtalk_word_list, eng_word_list))

Prediction Phase ✈

In [37]:
crowdtalk = test_df.crowdtalk.values
In [38]:
processedLines = [gensim.utils.simple_preprocess(sentence) for sentence in crowdtalk]
In [ ]:
!pip install jellyfish
In [35]:
!pip install gingerit
Collecting gingerit
  Downloading gingerit-0.8.2-py3-none-any.whl (3.3 kB)
Requirement already satisfied: requests<3.0.0,>=2.25.1 in /usr/local/lib/python3.7/dist-packages (from gingerit) (2.27.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.25.1->gingerit) (2.10)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.25.1->gingerit) (2.0.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.25.1->gingerit) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.25.1->gingerit) (2021.10.8)
Installing collected packages: gingerit
Successfully installed gingerit-0.8.2
In [36]:
from gingerit.gingerit import GingerIt

text = 'according the to license he had me'

parser = GingerIt()
parser.parse(text)
Out[36]:
{'corrections': [],
 'result': 'according the to license he had me',
 'text': 'according the to license he had me'}
In [ ]:
!pip install -U git+https://github.com/PrithivirajDamodaran/Gramformer.git
In [50]:
!pip install spacy
Requirement already satisfied: spacy in /usr/local/lib/python3.7/dist-packages (2.2.4)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (7.4.0)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.27.1)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (0.9.0)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.1.3)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (0.4.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy) (57.4.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.0.6)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (4.62.3)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (3.0.6)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.0.0)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy) (1.19.5)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy) (2.0.6)
Requirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy) (4.10.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy) (3.7.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy) (3.10.0.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.0.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2021.10.8)
In [ ]:
!python -m spacy download en_core_web_lg # Downloading the English model, which contains many pretrained preprocessing pipelines
In [55]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()
In [101]:
from tqdm.notebook import tqdm
sentences2=[]
bi_words_list=list(bi_words)
followups=[]
for i in tqdm(range(len(processedLines))):
  sentence=processedLines[i]
  translation_tokens=[]
  bi_sent=bigram_transformer[sentence]
  for token in bi_sent:
    if token in translation_dict:
      translation_tokens.append(translation_dict[token])
    # elif token[:-1] in translation_dict and (token[-1]=='s' or token[-1]=='z'):
    #   print("actually here")
    #   translation_tokens.append(translation_dict[token[:-1]]+'s')
    # elif token+'s' in translation_dict:
    #   print("wow also here")
    #   translation_tokens.append(translation_dict[token+'s'][:-1])

  sent_modified=[]
  sent_modified.append(translation_tokens[0])
  for i in range(1,len(translation_tokens)):
    if not translation_tokens[i] == translation_tokens[i-1]:
      sent_modified.append(translation_tokens[i])

  final_sent=' '.join(sent_modified)
  sent_modified=[]
  doc=nlp(final_sent)
  continue_flag=False
  for i,t in enumerate(doc):
    if continue_flag:
      continue_flag=False
      continue
#'PART','ADP','CCONJ'
    if t.text=='the' and i<len(doc)-1 and (doc[i+1].pos_ in['PART','ADP','CCONJ']or doc[i+1].text=='that'):
      sent_modified.append(doc[i+1].text)
      sent_modified.append(t.text)
      continue_flag=True
    else:
      sent_modified.append(t.text)

    # else:
    #   partial_idx=np.argmin([jellyfish.levenshtein_distance(token, b_token) for b_token in bi_words_list])
    #   closest_word=bi_words_list[partial_idx]
    #   translation_tokens.append(translation_dict[closest_word])
  sentences2.append(parser.parse(' '.join(sent_modified))['result'].replace("  ",' '))
In [107]:
len(sentences2[:test_df.shape[0]])
Out[107]:
3985
In [106]:
test_df.shape[0]
Out[106]:
3985
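
The reordering rule inside the prediction loop swaps "the" with a following particle, preposition or conjunction, since the positional alignment tends to emit "the of" where English wants "of the". The sketch below imitates that rule without spaCy: a small hand-written word set stands in for the POS check, so both `FUNCTION_WORDS` and `fix_the_order` are assumptions for illustration.

```python
# Hypothetical stand-in for the spaCy POS test (PART, ADP, CCONJ, or the word "that").
FUNCTION_WORDS = {"of", "to", "in", "at", "on", "from", "into", "with", "by", "and", "that"}

def fix_the_order(tokens):
    """Swap 'the X' into 'X the' whenever X is a function word."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "the" and i + 1 < len(tokens) and tokens[i + 1] in FUNCTION_WORDS:
            out.extend([tokens[i + 1], "the"])
            i += 2  # the swapped word has already been emitted
        else:
            out.append(tokens[i])
            i += 1
    return out

fixed = fix_the_order("the porter opened the gates the of court".split())
print(" ".join(fixed))
```

A fixed word list is cruder than POS tags, but it shows the mechanics: one pass, one token of lookahead, and a skip after each swap.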
In [62]:
from collections import Counter
Counter(followups)
Out[62]:
Counter({'ADJ': 270,
         'ADP': 705,
         'ADV': 25,
         'AUX': 9,
         'CCONJ': 67,
         'DET': 30,
         'NOUN': 1032,
         'NUM': 6,
         'PART': 14,
         'PRON': 2,
         'PROPN': 22,
         'SCONJ': 2,
         'VERB': 28,
         'X': 2})
In [ ]:
import jellyfish
Out[ ]:
2

Creating sentences by replacing each crowd-talk word in the sentence with its corresponding English word from the dictionary mapping created above.

In [ ]:
sentences3=[]
for sent in sentences2:
  sentence_split=sent.split()
  sent_modified=[]
  sent_modified.append(sentence_split[0])
  for i in range(1,len(sentence_split)):
    if not sentence_split[i] == sentence_split[i-1]:
      sent_modified.append(sentence_split[i])
    else:
      print("here")
  sentences3.append(" ".join(sent_modified))
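
The consecutive-duplicate filter above can be written more compactly with `itertools.groupby`, which groups runs of equal neighbouring items. This is an equivalent sketch rather than the notebook's code, and `collapse_repeats` is a hypothetical name.

```python
from itertools import groupby

def collapse_repeats(tokens):
    """Keep one token from every run of identical neighbours."""
    return [tok for tok, _ in groupby(tokens)]

collapsed = collapse_repeats("the the porter opened the the gates the the of court".split())
print(" ".join(collapsed))
```

Unlike the loop, this version also handles an empty sentence without an index error. Either way, legitimate doubles such as "had had" are collapsed too.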
In [ ]:
sentence = []

for i in processedLines:
  sentence_part = []
  word = ''
  for k, j in enumerate(i):
    if j in dict1:
      word = ''.join(dict1[j])
    else:
      word = ''.join(' ')
    sentence_part.append(word)
    temp = ' '.join(sentence_part)
  sentence.append(temp)
In [108]:
test_df['prediction'] = sentences2[-test_df.shape[0]:]  # keep the predictions from the most recent run
In [ ]:
from gingerit.gingerit import GingerIt

parser = GingerIt()
res=parser.parse('and of strange things my of beauty')['result'].replace("  ",' ')
In [76]:
reverse_trans_dict={v:k for k,v in translation_dict.items()}
In [ ]:
reverse_trans_dict
In [ ]:
for word in english_words:
  # look for English words whose singular form (word without the final 's') has a translation
  if word.endswith('s') and word[:-1] in reverse_trans_dict:
    print(word,reverse_trans_dict.get(word),reverse_trans_dict[word[:-1]])
In [109]:
test_df.prediction
Out[109]:
0                   and reported strange things my beauty
1                             scared and crimes, as was I
2                                 when I found on my feet
3                      according to the license he had me
4                     the very worst effects that avarice
                              ...                        
3980                                 when it did not rain
3981                               but she did not answer
3982    when it was found could I neither understand n...
3983                 by which they distinguish themselves
3984                              till could I reach them
Name: prediction, Length: 3985, dtype: object
In [ ]:
for s in sentences2:
  if 'I and ' in s:
    print(True)
In [92]:
test_df.to_csv('./translated.csv')
In [111]:
test_df
Out[111]:
id crowdtalk prediction
0 27226 treuns schleangly throuys praests qeipp cyclui... and reported strange things my beauty
1 31034 feosch treuns schleangly gliath spluiey gheuck... scared and crimes, as was I
2 35270 scraocs knaedly squiend sriohl clield whaioght... when I found on my feet
3 23380 sqaups schlioncy yoik gnoirk cziourk schnaunk ... according to the license he had me
4 92117 schlioncy yoik psycheiancy mcountz pously mcna... the very worst effects that avarice
... ... ... ...
3980 22854 scraocs knaedly daioc mceab spriaonn schmeips ... when it did not rain
3981 24201 toirts choolt blointly spriaonn schmeips krous... but she did not answer
3982 33494 scraocs knaedly daioc mceab sooc kniousts clie... when it was found could I neither understand n...
3983 28988 czogy stoorty wheians veurg mcmoorth dwiountz ... by which they distinguish themselves
3984 25337 zoetz treiahl typeauty squiend sriohl daonts s... till could I reach them

3985 rows × 3 columns

Saving the predictions in the assets directory under the name submission.csv.

In [112]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"), index=False)

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S)

In [ ]:
%aicrowd notebook submit -c lingua-franca-translation -a assets --no-verify
