@InProceedings{abdikhojasteh:2020:LREC,
author = {Abdi Khojasteh, Hadi and Ansari, Ebrahim and Bohlouli, Mahdi},
title = {LSCP: Enhanced Large Scale Colloquial Persian Language Understanding},
booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {6323--6327},
url = {https://www.aclweb.org/anthology/2020.lrec-1.776}
}
First, to prepare the corpus, download the files with cURL or Wget, or download them manually from the LINDAT repository. Then uncompress the files.
For later use, the most common approach is to read each data source separately, e.g. the monolingual text, the derivation trees, or the Persian-Czech bilingual pairs. Loading can be done with the built-in functions for reading files and splitting lines on the newline character, as in the sketch below.
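A minimal loading sketch in Python, assuming the decompressed files are plain UTF-8 text with one record per line; the file names below are inferred from the archive names and may differ after extraction.

```python
# Minimal loading sketch; file names are assumptions based on the archive names.
from pathlib import Path

def load_lines(path):
    """Read a corpus file and split it into lines (one record each)."""
    return Path(path).read_text(encoding="utf-8").splitlines()

monolingual = load_lines("lscp-0.5-fa.txt")                   # raw colloquial Persian
trees = load_lines("lscp-0.5-fa-derivation-tree.txt")         # derivation trees
fa_cs = load_lines("lscp-0.5-fa-cs.txt")                      # Persian-Czech bilingual pairs

print(len(monolingual), "monolingual sentences")
print(monolingual[0])
```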
The simple pipeline is also available as a Colab notebook:
This corpus is licensed under CC BY-NC-ND 4.0 and publicly available in the LINDAT/CLARIAH-CZ repository.
The following script downloads the full corpus (~2.5 GB) from the repository and decompresses all files (~20 GB):
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3195{/README.md,/lscp-0.5-fa.7z,/lscp-0.5-fa-normalized.7z,/lscp-0.5-fa-derivation-tree.7z,/lscp-0.5-fa-cs.7z,/lscp-0.5-fa-en.7z,/lscp-0.5-fa-de.7z,/lscp-0.5-fa-it.7z,/lscp-0.5-fa-hi.7z}
7z e -y "lscp-0.5-*.7z"
Alternatively, you can manually download the files from http://hdl.handle.net/11234/1-3195.
The oral form of a language is typically much more dynamic than its written form. Written language usually involves a higher level of formality, whereas the spoken form is characterized by many contractions and abbreviations. Formal written texts tend to use longer and more complex sentences, since the reader can re-read difficult parts if they lose track.
The size of the vocabulary in use is one of the most noticeable differences between oral and written forms of discourse. Written language uses synonyms instead of repeating the same word over and over again. This is not the case in oral language, which typically relies on a more restricted vocabulary. The difficulty of pronunciation may also affect the choice of words: oral language tends to use words with fewer syllables.
In addition to these general differences between the spoken and written varieties of a language, Persian, also known by its endonym Farsi (فارسی), introduces a range of variations that further widen this gap. Besides the many informal words that are not appropriate in formal language, there are notable differences in the pronunciation of words. For example, the word "نان [nān]" (meaning bread) becomes "نون [no͡on]" in spoken language. This alteration is common but follows no rule, so for many other words the speaker cannot make such a substitution in colloquial language. Persian tweets naturally exhibit all of the mentioned features of the spoken form of the language.
Building a large-scale language understanding dataset is time-consuming. In practice, two tasks dominate the effort of creating a large-scale corpus: (a) data collection and (b) data annotation.
We collected tweets with a crawler based on the Twitter API. To make sure the collection is diverse, we first created a list of seed users. Then, the followers of the seed accounts were added to the list. Next, the latest tweets of the users in our list were extracted and saved in the dataset. To make the data appropriate for our task, we removed duplicate tweets, non-Persian tweets, and tweets with too few words. This resulted in a dataset of relatively long sentences covering diverse concepts.
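A purely illustrative sketch of the filtering step described above (not the crawler itself): drop duplicates, non-Persian tweets, and very short tweets. The Persian-script check and the minimum length of 5 words are assumptions, not values from the paper.

```python
# Illustrative filtering sketch; the script check and MIN_WORDS are assumptions.
import re

PERSIAN_CHARS = re.compile(r"[\u0600-\u06FF]")   # Arabic/Persian Unicode block
MIN_WORDS = 5

def filter_tweets(tweets):
    seen, kept = set(), []
    for text in tweets:
        if text in seen:                          # remove duplicate tweets
            continue
        if not PERSIAN_CHARS.search(text):        # keep only Persian-script tweets
            continue
        if len(text.split()) < MIN_WORDS:         # drop tweets with too few words
            continue
        seen.add(text)
        kept.append(text)
    return kept
```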
For the annotation of the dataset, we adopt a semi-automatic crowdsourcing strategy in which a human manually verifies the crawled sentences, reducing the cost of data collection and annotation. Fully manual annotation has two drawbacks: (a) manual annotations are error-prone, because a human cannot attend to every detail in the context, which leads to mislabeling that is difficult to eradicate; and (b) large-scale sentence annotation is particularly time-consuming due to the amount and variety of tags. To overcome these issues, we employ a two-stage framework for the LSCP annotation. In the first stage, we use StanfordNLP to obtain rough annotations of the sentences. The model predicts tags for each tweet. Annotations consist of lists of sentences and words, the base forms of those words, their parts of speech, and a derivation tree. For translation, we used Google Cloud Translation. Then, the generated sentences are compared to the output of a combination of various models trained on XNLI and fine-tuned on an English-Persian parallel corpus.
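A minimal sketch of the kind of first-stage automatic annotation described above, using the stanfordnlp package named in the text; the exact pipeline configuration used for LSCP is an assumption, and the example sentence is illustrative only.

```python
# Sketch of automatic annotation with stanfordnlp; configuration is assumed.
import stanfordnlp

stanfordnlp.download("fa")             # fetch the Persian models (one-time)
nlp = stanfordnlp.Pipeline(lang="fa")  # tokenization, POS, lemma, dependency parse

doc = nlp("نون تازه خریدم")            # an illustrative colloquial sentence
for sentence in doc.sentences:
    for word in sentence.words:
        # surface form, base form, part of speech, head index, and relation
        print(word.text, word.lemma, word.upos,
              word.governor, word.dependency_relation)
```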
In the second stage, we apply human verification to remove any mislabeled or noisy tags and to add tags the model may have missed, chosen from a recommended list. Annotators were also able to edit words or phrases in the English translations to improve the overall quality of the Persian parallel sentences.