AMLD 2020 - D'Avatar Challenge
Reincarnation of personal data entities in unstructured data sets
Privacy is important to each one of us. Whether we share information about ourselves with the government or with commercial entities such as social media websites, it is in our interest to:
- Share minimal data
- Expect that our data is stored securely
- Ensure our data is used only for purposes that we have agreed to, and
- Ensure our data is not shared with 3rd parties without our consent.
With the advent of data protection regulations such as the GDPR in the European Union, the California Consumer Privacy Act (CCPA) in the US, and the Personal Data Protection Bill 2018 in India, there is increasing regulatory backing for our privacy and the protection of our personal data.
At the other end of the spectrum, protecting the personal data of customers is a huge challenge for companies. Even identifying all the personal data a company holds is non-trivial. Identifying personal data entities in customer data, protecting and anonymizing personal data, and serving customer requests about how their data is used are all part of a company's data analytics and regulatory compliance systems.
At the government level, the problem of protecting sensitive personal information assumes gargantuan proportions. Governments collect information about citizens for a number of reasons, such as welfare, identification, and security. All this information could be linked to a unique identification number such as the SSN (US) or the Aadhaar number (India), which further increases the data protection requirements.
What is the D’Avatar Challenge?
We will provide a corpus of English texts taken from customer complaints to financial companies. The personal data entities in these texts have already been redacted and replaced with placeholders such as xxxx. As part of this challenge, you have to impute (create new) values for the redacted personal data entities.
Why does this matter?
While intellectual curiosity is likely to be your main motivation for participating in this challenge, we hope this exercise will also be of use to the academic community, government, and industry in India. The datasets produced during this challenge can be made available to researchers, to help them build better models for privacy and regulatory compliance. Depending on the number of submissions, we might be able to produce a large combined dataset with more diverse personal data entities than any single team could produce on its own.
Given a dataset with unstructured text containing one or more redacted spans, where the spans are known to have contained Personal Data Entities (PDE), participants have to impute unrelated PDEs of the same types, in place of the redacted spans.
Consider the following example:
“My credit card number is xxxx and I wish to raise a complaint.”
In the above text, xxxx marks the redacted span. We might guess that a 16-digit credit card number was originally present in this text.
The simplest output we are looking for is a rewritten text with the redacted span replaced by a personal data entity of the expected type. In this example, the redacted portion should be replaced with some variant of a 16-digit number.
“My credit card number is 1234-5678-9012-3456 and I wish to raise a complaint.”
However, a better output would be a credit card number that is not completely random but passes the Luhn checksum.
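To make the Luhn requirement concrete, here is a minimal sketch that draws 15 random digits and appends the check digit that makes the whole number pass the Luhn test. The function names and the dash-separated output format are assumptions made for illustration; they are not part of the challenge specification.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Return the digit that, appended to payload, makes it pass the Luhn test."""
    total = 0
    # Walk the payload right to left. The digit next to the future
    # check digit is doubled, then every second digit after it.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def is_luhn_valid(number: str) -> bool:
    """Check a card number with the Luhn algorithm, ignoring separators."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def random_card_number(rng: random.Random, length: int = 16) -> str:
    """Generate a random number of the given length that passes the Luhn test."""
    payload = "".join(str(rng.randint(0, 9)) for _ in range(length - 1))
    full = payload + str(luhn_check_digit(payload))
    # Group into blocks of four digits, as in the example above.
    return "-".join(full[i:i + 4] for i in range(0, length, 4))


if __name__ == "__main__":
    print(random_card_number(random.Random()))
```

Note that 1234-5678-9012-3456 from the example fails this check, which is why it counts only as the simplest acceptable output.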
Personal Data Entity Types
The personal data entities you impute must have one or more of the types below. You can provide other, finer types if you wish, but we will ignore them for the purpose of this evaluation.
As bonus credit, can you impute entities of the above entity types, without bias in any protected variable? For example, can you ensure the /location/country has reasonably diverse country names?
Refer to the AI Fairness 360 (AIF360) toolkit for detecting bias in datasets.
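As a rough first check before reaching for a full fairness toolkit, the diversity of imputed /location/country values can be summarized with normalized Shannon entropy. The sketch below uses only the standard library and does not use AIF360. The country list, function names, and the choice of entropy as the diversity measure are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical vocabulary; a real solution would use a full country list.
COUNTRIES = ["India", "Switzerland", "Brazil", "Kenya",
             "Japan", "Canada", "Egypt", "Australia"]


def sample_countries(n: int, rng: random.Random) -> list:
    """Impute n country names by sampling uniformly from the vocabulary."""
    return [rng.choice(COUNTRIES) for _ in range(n)]


def normalized_entropy(values: list, vocabulary_size: int) -> float:
    """Shannon entropy of values divided by the maximum possible entropy.

    Close to 0 means one value dominates; close to 1 means the values
    are spread evenly over the whole vocabulary.
    """
    if not values or vocabulary_size < 2:
        return 0.0
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(vocabulary_size)


if __name__ == "__main__":
    imputed = sample_countries(1000, random.Random())
    print(round(normalized_entropy(imputed, len(COUNTRIES)), 3))
```

A low score on the imputed values suggests the imputation is skewed toward a few countries and is worth inspecting more closely.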
To get you started, we are providing some pointers for your solutions. We, however, encourage you to come up with your own innovative solutions to the problem.
The BRAT annotation tool can be used to crowdsource the problem, letting human annotators guess the masked entities and optionally impute values too. A more feasible approach is to have human annotators provide only the entity types for the masked entities, and then use a dictionary to impute values of each type.
Rule-based annotations
A rule-based system that uses dictionaries (of names, places, credit card numbers, etc.) can be used to find patterns in sentences and replace the masked portions. IBM's SystemT (or any other solution, or perhaps just regular expressions) can be used to find such patterns in sentences.
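The regular-expression variant of this idea can be sketched as follows. Each rule looks at the text immediately to the left of a placeholder and, if it matches, picks a value of the corresponding type from a dictionary. The rules, the type names, and the tiny dictionaries here are hypothetical and far too small for real use; they only illustrate the mechanism.

```python
import random
import re

# Hypothetical mini-dictionaries, keyed by an illustrative type name.
DICTIONARIES = {
    "credit_card": ["4111-1111-1111-1111", "5500-0000-0000-0004"],
    "person_name": ["Asha Rao", "John Smith", "Mei Chen"],
    "date": ["12/03/2019", "07/21/2018"],
}

# Each rule is (pattern over the left context, entity type).
# Patterns are anchored with $ so they must end right before the placeholder.
RULES = [
    (re.compile(r"credit card (?:number )?(?:is |was )?$", re.I), "credit_card"),
    (re.compile(r"(?:my name is|spoke (?:to|with)) $", re.I), "person_name"),
    (re.compile(r"\b(?:on|dated|since) $", re.I), "date"),
]

PLACEHOLDER = re.compile(r"\bxxxx\b", re.I)


def impute(text: str, rng: random.Random) -> str:
    """Replace each placeholder whose left context matches a rule."""

    def replace(match: re.Match) -> str:
        left_context = text[:match.start()]
        for pattern, entity_type in RULES:
            if pattern.search(left_context):
                return rng.choice(DICTIONARIES[entity_type])
        # No rule fired: leave the placeholder untouched.
        return match.group(0)

    return PLACEHOLDER.sub(replace, text)


if __name__ == "__main__":
    sentence = "My credit card number is xxxx and I wish to raise a complaint."
    print(impute(sentence, random.Random()))
```

Leaving unmatched placeholders in place, rather than guessing, makes it easy to see which spans still need a rule or a human annotator.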
Snorkel is a system for generating large amounts of noisy training data by writing labeling functions. After creating a gold set using manual methods, this system could be used to annotate more data.
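The labeling-function idea can be illustrated without the library itself. In the sketch below, each function votes for an entity type or abstains, and the votes are combined by simple majority. This is plain Python written in the style of labeling functions, not the Snorkel API, and majority voting is a simplification chosen for this example. The label constants and heuristics are hypothetical.

```python
import re
from collections import Counter

# Illustrative label ids; ABSTAIN means "this function has no opinion".
ABSTAIN, CREDIT_CARD, PERSON, DATE = -1, 0, 1, 2


def lf_card_keyword(left: str, right: str) -> int:
    """Vote CREDIT_CARD if 'card number' appears before the span."""
    return CREDIT_CARD if re.search(r"\bcard (?:number|no)\b", left, re.I) else ABSTAIN


def lf_name_intro(left: str, right: str) -> int:
    """Vote PERSON if the span follows a name introduction or a title."""
    return PERSON if re.search(r"(?:my name is|mr\.?|mrs\.?|ms\.?)\s*$", left, re.I) else ABSTAIN


def lf_date_preposition(left: str, right: str) -> int:
    """Vote DATE if the span directly follows 'on' or 'dated'."""
    return DATE if re.search(r"\b(?:on|dated)\s*$", left, re.I) else ABSTAIN


def lf_card_expiry(left: str, right: str) -> int:
    """Vote CREDIT_CARD if an expiry is mentioned right after the span."""
    return CREDIT_CARD if re.match(r"\s*(?:expires|expired|expiring)\b", right, re.I) else ABSTAIN


LABELING_FUNCTIONS = [lf_card_keyword, lf_name_intro, lf_date_preposition, lf_card_expiry]


def label_span(text: str, placeholder: str = "xxxx") -> int:
    """Label the first placeholder in text by majority vote over the functions."""
    left, found, right = text.partition(placeholder)
    if not found:
        return ABSTAIN
    votes = [lf(left, right) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    # On a tie, the label that was voted for first wins.
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    print(label_span("My credit card number is xxxx and I wish to raise a complaint."))
```

The noisy labels produced this way can then be compared against the manually created gold set to decide which heuristics are worth keeping.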
Natural Language Generation
A machine learning model can also be used to generate words/numbers to replace the redacted portions in a sentence. This problem can perhaps be solved using Natural Language Generation models which typically tend to be sequence-to-sequence models.
Masked Language Models
Language models such as BERT, ELMo, and XLNet could potentially be used to predict the masked entities. See this page.
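A minimal sketch of this approach is shown below. It turns the first xxxx placeholder into a mask token and asks a fill-mask model for the most likely replacement. The helper names are hypothetical, and the model name bert-base-uncased is an assumption. Loading the model requires the third-party Hugging Face transformers package, so the import is kept inside load_bert_filler and the core function accepts any callable with the same output shape.

```python
from typing import Callable, Dict, List

# A filler takes text containing one mask token and returns candidate
# dicts with at least "score" and "token_str" keys.
Filler = Callable[[str], List[Dict]]


def load_bert_filler(model_name: str = "bert-base-uncased") -> Filler:
    """Build a fill-mask pipeline (assumes `transformers` is installed)."""
    from transformers import pipeline  # third-party, imported lazily
    return pipeline("fill-mask", model=model_name)


def impute_first_span(text: str, filler: Filler,
                      placeholder: str = "xxxx",
                      mask_token: str = "[MASK]") -> str:
    """Replace the first placeholder with the filler's top-scoring token."""
    if placeholder not in text:
        return text
    # Mask only the first placeholder so the model sees a single mask.
    masked = text.replace(placeholder, mask_token, 1)
    candidates = filler(masked)
    best = max(candidates, key=lambda c: c["score"])
    return text.replace(placeholder, best["token_str"].strip(), 1)


if __name__ == "__main__":
    filler = load_bert_filler()
    print(impute_first_span("I spoke to xxxx at the bank about my loan.", filler))
```

One limitation to keep in mind: a model like BERT fills each mask with a single vocabulary token, so this sketch suits short entities such as a first name or a city better than a 16-digit card number, which would still need a generator of its own.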
Generative Adversarial Networks
Avino et al. (2018) tried using GANs to generate synthetic healthcare datasets.