HTREC 2022

Ancient Greek BERT

A subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Experimental results (read their work) showed good perplexity and state-of-the-art performance for fine-grained POS tagging on treebanks with Classical and Medieval Greek and on a Byzantine Greek dataset.

This notebook loads this pre-trained model and shows how to use it to mask/unmask words. It needs to be fine-tuned on in-domain data to properly work.

Ancient Greek BERT¶

This is a subword-based BERT masked language model, trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Read the work of Singh et al (2021) or employ their code using this notebook.

In [4]:

%%capture
!pip install transformers
!pip install unicodedata
!pip install flair

In [5]:

%%capture
from transformers import AutoTokenizer, AutoModel
tokeniser = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")

Employing the seek-and-find code from GreekBERT, but using the AncientGreekBERT instead.

In [24]:

import torch
input_ids = tokeniser.encode('τοῦ βίου τοῦ καθ ΄ εαυτοὺς πολλὰ γίνεσθαι συγχωροῦν [MASK]')
tokens = tokeniser.convert_ids_to_tokens(input_ids)
idx = tokens.index("[MASK]")
print(idx, tokens)
outputs = model(torch.tensor([input_ids]))[0]
print(tokeniser.convert_ids_to_tokens(outputs[0, idx].max(0)[1].item()))

13 ['[CLS]', 'του', 'βιου', 'του', 'καθ', '΄', 'εαυτους', 'πολλα', 'γινε', '##σθαι', 'συγχ', '##ωρου', '##ν', '[MASK]', '[SEP]']
##τικα

Suggested next steps¶

Fine-tune on a dataset (e.g., on HTREC data).
Mask likely mistaken (e.g., HTRed) words, then use the model to unmask.
Use some word lexicon (e.g., based on this resource) to detect likely errors.

Content

1702

Show Comments

Comments

You must login before you can post a comment.