AI, Crowd-Sourced: Natural Language Processing for Beginners
AI, Crowd-Sourced is a series where we at AIcrowd guide beginners in Machine Learning through the different aspects of the field via our Challenges. The purpose of this article is to uncover datasets and ML algorithms that will jumpstart you in the field!🚀
In this chapter of AI, Crowd-Sourced, we will go through some beginner-friendly Natural Language Processing challenges that are hosted by AIcrowd. This article will give you an insight into various resources provided both by the platform and previous participants.🤓
In his 1950 paper Computing Machinery and Intelligence, Alan Turing proposed that a computer can be said to possess artificial intelligence if it can mimic human responses under specific conditions. Even today, the Turing Test reminds us how essential milestones in Natural Language Processing are: they form the very foundation of convincing AI.🤖
Over the last few years, we have seen NLP models like GPT-3 take the ML industry by storm. These models have inspired many current state-of-the-art architectures and are achieving impressive feats, such as generating and comprehending code written by humans (as in Code Oracle).
Such models are often called Transformers, a term coined in the paper “Attention is all you need”, which describes the Transformer as the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.
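To make the self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation the paper describes. This is a simplified single-head version for illustration only; the full Transformer adds learned projections, multiple heads, and positional encodings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, the heart of self-attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key similarities
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of the values

# Three 4-dimensional token representations attending to each other.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (3, 4): one attended vector per token
```

Because every token attends to every other token in one matrix multiplication, no recurrence or convolution is needed to mix information across the sequence.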
How can you learn this valuable skill set and launch your NLP projects? We have curated some problems and challenges on our platform that will provide you with free datasets and baselines to kick off your NLP journey!
A study showed that companies lose over 70% of the clicks on their Google Ads simply because of typos. Such mistakes have become far less common with the introduction of automatic spell-checking software like Grammarly. Among the many ways such software classifies words as correctly or incorrectly spelled, it often first checks whether a word is scrambled, which streamlines the entire process.
Through SCRBL, AIcrowd has introduced its very own dataset consisting of over 1 million pieces of text sourced from the largest encyclopedia in the world: Wikipedia! In this challenge, participants are required to classify whether a word is scrambled or unscrambled.
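Purely as an illustration of what a scrambled sample might look like (this is not the dataset's actual generation code), a word can be scrambled by shuffling its letters with Python's standard library:

```python
import random

def scramble(word: str, seed: int = 0) -> str:
    """Return the word with its letters shuffled (illustrative only)."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)

scrambled = scramble("encyclopedia")
print(scrambled)
```

The classifier's job is then the inverse: given a string like this, decide whether the original letter order has been disturbed.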
The starter kit from the AIcrowd team provides a baseline that uses Multinomial Naive Bayes, one of the most popular supervised learning classifiers for categorical text data. This method calculates the probability of each of the two tags (scrambled or unscrambled) for a given sample and outputs the tag with the highest probability. After going through the starter kit, you may wonder how one can approach an F1 score of 1 when the baseline only achieves 0.55.
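The baseline idea can be sketched as follows. Note this is an illustrative toy example, not the actual starter-kit code: the training words and labels below are made up, and it assumes scikit-learn is installed. Character n-grams are used because they capture letter-order patterns that separate real words from shuffled ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (real challenge data has 1M+ samples).
train_words = ["wikipedia", "language", "processing", "challenge",
               "iwkipedai", "nagaugel", "sorpcesing", "hcnellage"]
train_labels = ["unscrambled"] * 4 + ["scrambled"] * 4

# Character bigrams/trigrams as count features, then Multinomial NB,
# which picks the label with the highest posterior probability.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),
    MultinomialNB(),
)
model.fit(train_words, train_labels)

preds = model.predict(["encyclopedia", "cyclonepedia"])
print(preds)
```

With so little data the predictions are not meaningful; the point is the shape of the pipeline, which is easy to swap out for stronger models later.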
Team TODO, which secured second place during the first run of the Challenge, achieved an F1 score of 0.999 using transfer learning. With state-of-the-art language models growing ever larger, “TODO” achieved this result in their submission using DistilBERT, a distilled version of BERT that is smaller and faster than the original BERT architecture while sacrificing only 3% of its language-understanding capability.
Let us know in the comments below which transformer you used in your solutions and how well it performed!👇🏼
Text can be a very rich source of information, but extracting insights from it can be difficult and time-consuming. Advances in natural language processing and machine learning are making it easier to sort through text data. Text categorization is used in many applications, such as spam detection and document classification.
The Tiring-text challenge, hosted by AIcrowd for Felicity Threads, IIIT Hyderabad, helped participants come up with intuitive solutions to this prompt of categorizing sentences. We provide a dataset comprising 79,376 text data points, each labeled with a specific category. Participants are expected to train a model capable of sorting these data points into 8 different categories.
AIcrowd presents the participants with a starter kit. The notebook includes a quick implementation of a Decision Tree Classifier to sort the text points into categories. A decision tree repeatedly splits the data into sub-nodes, at each step selecting the split that yields the most homogeneous sub-nodes, until samples can be assigned to one of the available classes.
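A minimal sketch of that approach might look like the following. This is not the starter-kit notebook itself: the texts and category labels below are invented for illustration, and scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy two-category corpus (the real dataset has 79,376 points and 8 classes).
texts = [
    "the team won the match in overtime",
    "the striker scored a late goal",
    "the senate passed the new bill",
    "the president signed the law today",
]
labels = ["sports", "sports", "politics", "politics"]

# TF-IDF turns text into numeric features; the tree then learns
# splits on those features that produce the most homogeneous sub-nodes.
clf = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
clf.fit(texts, labels)

preds = clf.predict(["a goal in the final match", "a bill in the senate"])
print(preds)
```

The same pipeline shape scales to the full dataset; only the vectorizer settings and tree depth need tuning.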
Feeling stuck and unable to bump up your score? Check out Team Defcon’s submission for the challenge. In their submission, the team uses TF Hub’s Universal Sentence Encoder, a model that encodes text into high-dimensional vectors and is trained specifically for greater-than-word-length text such as sentences and paragraphs.
Let us know which approach helped you maximize your score in the comments section!⌨
Today, we produce more information than ever before, but not all of it is true.
Some of it is malicious and dangerous, and it is vital to distinguish true, verified news from fake, malicious news! The problem is further muddled by modern transformers like GPT-3, which can generate text on their own at an alarming rate.
AIcrowd, in collaboration with AI for Good - ITU, presents FNEWS: a dataset of over 387,000 lines of text sourced from news articles across the web as well as text generated by OpenAI's GPT-2 language model.
Credit: Gleb Garnich/Reuters
AIcrowd provides participants with a starter kit that follows the example of the SCRBL starter kit, tackling the prompt with Multinomial Naive Bayes. It also shows users how to fetch the dataset from the AIcrowd platform and make submissions.
The starter kit solution may come off as a bit lackluster in terms of F1 score. We recommend checking out Team TODO’s challenge-winning solution for inspiration. In contrast to their SCRBL submission, the team turned to Facebook’s RoBERTa for feature extraction. RoBERTa stands apart from other BERT variants because of how it was trained: on a larger dataset (including the CC-News corpus), over a longer time, and in bigger batches.
What method will you use?
What do you think is a field that needs serious automation in terms of language processing? Comment below or tweet us @AIcrowdHQ to let us know!
Want to learn more? ⬇️