@InProceedings{abdikhojasteh:2020:LREC,
author = {Abdi Khojasteh, Hadi and Ansari, Ebrahim and Bohlouli, Mahdi},
title = {LSCP: Enhanced Large Scale Colloquial Persian Language Understanding},
booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {6323--6327},
url = {https://www.aclweb.org/anthology/2020.lrec-1.776}
}
First, to prepare the corpus, download the files with cURL or Wget, or download them manually from the LINDAT repository. Then uncompress the files.
For later use, the most common approach is to read each data source separately, e.g. the monolingual text, the derivation trees, or the Persian-Czech bilingual pairs. Loading can be done with the built-in functions for reading files and splitting lines on the newline character, as in the sketch below.
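A minimal loading sketch in Python, assuming the decompressed files are plain UTF-8 text with one record per line; the file names below are inferred from the archive names and may differ after extraction.

```python
# Minimal loading sketch; file names are assumptions based on the archive names.
from pathlib import Path

def load_lines(path):
    """Read a corpus file and split it into lines (one record each)."""
    return Path(path).read_text(encoding="utf-8").splitlines()

monolingual = load_lines("lscp-0.5-fa.txt")                   # raw colloquial Persian
trees = load_lines("lscp-0.5-fa-derivation-tree.txt")         # derivation trees
fa_cs = load_lines("lscp-0.5-fa-cs.txt")                      # Persian-Czech bilingual pairs

print(len(monolingual), "monolingual sentences")
print(monolingual[0])
```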
The simple pipeline is also available as a Colab notebook:
This corpus is licensed under CC BY-NC-ND 4.0 and publicly available in the LINDAT/CLARIAH-CZ repository.
The following script downloads the full corpus (~2.5 GB) from the repository and decompresses all files (~20 GB):
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3195{/README.md,/lscp-0.5-fa.7z,/lscp-0.5-fa-normalized.7z,/lscp-0.5-fa-derivation-tree.7z,/lscp-0.5-fa-cs.7z,/lscp-0.5-fa-en.7z,/lscp-0.5-fa-de.7z,/lscp-0.5-fa-it.7z,/lscp-0.5-fa-hi.7z}
7z e -y "lscp-0.5-*.7z"
Alternatively, you can manually download the files from http://hdl.handle.net/11234/1-3195.
The oral form of a language is typically much more dynamic than its written form. Written language usually involves a higher level of formality, whereas the spoken form is characterized by many contractions and abbreviations. Formal written texts tend to use longer and more complex sentences, since the reader can re-read difficult parts if they lose track.
The size of the vocabulary in use is one of the most noticeable differences between oral and written forms of discourse. Written language uses synonyms instead of repeating the same word over and over again. This is not the case in oral language, which typically relies on a more restricted vocabulary. The difficulty of pronunciation may also affect the choice of words: oral language tends to use words with fewer syllables.
In addition to these general differences between the spoken and written varieties of a language, Persian, also known by its endonym Farsi (فارسی), introduces a range of variations that further widen this gap. Besides the many informal words that are not appropriate in formal language, there are notable differences in the pronunciation of words. For example, the word "نان [nān]" (meaning bread) becomes "نون [no͡on]" in spoken language. This alteration is common but follows no rule, so for many other words the speaker cannot make such a substitution in colloquial language. Persian tweets naturally exhibit all of the mentioned features of the spoken form of the language.
Building a large-scale language understanding dataset is time-consuming. In practice, two tasks dominate the effort of creating a large-scale corpus: (a) data collection and (b) data annotation.
We collected tweets with a crawler based on the Twitter API. To make sure the collection is diverse, we first created a list of seed users. Then, the followers of the seed accounts were added to the list. Next, the latest tweets of the users in our list were extracted and saved in the dataset. To make the data appropriate for our task, we removed duplicate tweets, non-Persian tweets, and tweets with too few words. This resulted in a dataset of relatively long sentences covering diverse concepts.
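A purely illustrative sketch of the filtering step described above (not the crawler itself): drop duplicates, non-Persian tweets, and very short tweets. The Persian-script check and the minimum length of 5 words are assumptions, not values from the paper.

```python
# Illustrative filtering sketch; the script check and MIN_WORDS are assumptions.
import re

PERSIAN_CHARS = re.compile(r"[\u0600-\u06FF]")   # Arabic/Persian Unicode block
MIN_WORDS = 5

def filter_tweets(tweets):
    seen, kept = set(), []
    for text in tweets:
        if text in seen:                          # remove duplicate tweets
            continue
        if not PERSIAN_CHARS.search(text):        # keep only Persian-script tweets
            continue
        if len(text.split()) < MIN_WORDS:         # drop tweets with too few words
            continue
        seen.add(text)
        kept.append(text)
    return kept
```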
For the annotation of the dataset, we adopt a semi-automatic crowdsourcing strategy in which a human manually verifies the crawled sentences, reducing the cost of data collection and annotation. Fully manual annotation has two drawbacks: (a) manual annotations are error-prone, because a human cannot attend to every detail in the context, which leads to mislabeling that is difficult to eradicate; and (b) large-scale sentence annotation is particularly time-consuming due to the amount and variety of tags. To overcome these issues, we employ a two-stage framework for the LSCP annotation. In the first stage, we use StanfordNLP to obtain rough annotations of the sentences. The model predicts tags for each tweet. Annotations consist of lists of sentences and words, the base forms of those words, their parts of speech, and a derivation tree. For translation, we used Google Cloud Translation. Then, the generated sentences are compared to the output of a combination of various models trained on XNLI and fine-tuned on an English-Persian parallel corpus.
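A minimal sketch of the kind of first-stage automatic annotation described above, using the stanfordnlp package named in the text; the exact pipeline configuration used for LSCP is an assumption, and the example sentence is illustrative only.

```python
# Sketch of automatic annotation with stanfordnlp; configuration is assumed.
import stanfordnlp

stanfordnlp.download("fa")             # fetch the Persian models (one-time)
nlp = stanfordnlp.Pipeline(lang="fa")  # tokenization, POS, lemma, dependency parse

doc = nlp("نون تازه خریدم")            # an illustrative colloquial sentence
for sentence in doc.sentences:
    for word in sentence.words:
        # surface form, base form, part of speech, head index, and relation
        print(word.text, word.lemma, word.upos,
              word.governor, word.dependency_relation)
```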
In the second stage, we apply human verification to remove any mislabeled or noisy tags and to add tags the model may have missed, chosen from a recommended list. Annotators were also able to edit words or phrases in the English translations to improve the overall quality of the Persian parallel sentences.