PEPC: Parallel English-Persian Corpus Extracted from Wikipedia
PEPC is a collection of parallel sentences in English and Persian languages extracted from Wikipedia documents using a bidirectional translation method.
The Corpus:
- Click here to download the corpus extracted by our bidirectional method. It consists of 199,936 sentence pairs, ranked by the score calculated by the Lucene IR system and our bisimilarity formula.
- Click here to download the corpus extracted by the one-directional method. It consists of 158,339 sentence pairs, ranked by the score calculated by the Lucene IR system.
- Click here to get the 200 sentences we used for tuning the translation systems implemented to test the extracted corpora. These sentences were taken from paper abstracts available online.
- Click here to download the 1000 sentences we used for testing our translation systems. Some were taken from paper abstracts, and the rest were translated by professional translators.
Thesis and Paper:
- Click here to get the thesis written on parallel sentence extraction from comparable corpora.
- Click here to get the paper describing the bidirectional method and the results of the experiments carried out using the extracted corpus.
The Bidirectional Method (Step by Step):
- The first step is to obtain the Wikipedia documents in English and Persian. We received this file from the owners of the Linguatools website [1].
- Download the Wikipedia documents in XML format
- Next, the markup, tables, links, etc. should be stripped from this XML file.
- Download the Python script for removing extra characters
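The actual cleanup script is available from the link above; as an illustration only, the core idea (regex-based stripping of tags and wiki markup, with the exact patterns here being assumptions rather than the script's real rules) can be sketched as:

```python
import re

def strip_markup(text):
    """Simplified sketch of markup removal; the real script handles more cases."""
    text = re.sub(r"<[^>]+>", " ", text)                            # XML/HTML tags
    text = re.sub(r"\{\|.*?\|\}", " ", text, flags=re.S)            # wiki tables
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # wiki links, keep the label
    text = re.sub(r"\[https?://\S+\s?([^\]]*)\]", r"\1", text)      # external links
    text = re.sub(r"\s+", " ", text)                                # collapse whitespace
    return text.strip()
```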
- While saving the documents, document pairs in which one document contained fewer than 0.3 times as many sentences as the other were discarded.
- Download the plain texts with the above limitation
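The length-ratio filter described above amounts to a one-line check on the sentence counts of a document pair:

```python
def keep_pair(n_en, n_fa, ratio=0.3):
    """Return True if the pair passes the length-ratio filter:
    neither document has fewer than `ratio` times the other's
    sentence count (ratio=0.3 is the threshold used above)."""
    return min(n_en, n_fa) >= ratio * max(n_en, n_fa)
```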
- We also did away with the limitation above and saved all the documents.
- Download all the documents in plain text
- Then it is time to translate these texts. To do this, two translation systems, one translating from Persian to English and the other from English to Persian, were built using the Moses toolkit [2].
To train these two systems, the OpenSubtitles2016 corpus from OPUS [3], a collection of movie and TV subtitles, was used.
- After that, the translated documents are split into sentences and given to the Lucene IR system [4] as queries to calculate the similarity between the translated sentences and the original ones.
Lucene's source code had to be modified to suit our purpose. We wanted Lucene to read the queries from the translated documents and break the original documents into one-sentence documents so that the similarity between sentences could be calculated.
- The output of Lucene is two text files recording which lines of which English documents are similar to which lines of which Persian documents, and the reverse, along with the sentence lengths and the similarity scores calculated by Lucene.
For each English line there are several Persian candidates, and vice versa.
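Lucene's actual scoring is more involved (TF-IDF with length normalization, or BM25 in recent versions), but the idea of scoring each translated sentence as a query against the one-sentence "documents" can be sketched with a plain TF-IDF cosine in Python. This is a simplified stand-in, not the modified Lucene code used in the project:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build simple TF-IDF vectors for a list of whitespace-tokenized sentences."""
    toks = [s.lower().split() for s in sentences]
    df = Counter()
    for t in toks:
        df.update(set(t))
    n = len(sentences)
    return [{w: tf[w] * math.log((n + 1) / (df[w] + 1)) for w in tf}
            for tf in (Counter(t) for t in toks)]

def cosine(a, b):
    num = sum(a[w] * b.get(w, 0.0) for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank_candidates(query, originals):
    """Score a translated sentence (query) against each original sentence,
    mimicking the one-sentence-per-document Lucene setup; returns
    (score, index) pairs, best first."""
    vecs = tfidf_vectors([query] + originals)
    q, docs = vecs[0], vecs[1:]
    return sorted(((cosine(q, d), i) for i, d in enumerate(docs)), reverse=True)
```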
- For the two output files, a Python script was written to choose the final equivalent for each sentence. In doing so, we applied our bisimilarity formula described in the paper.
- Download the index of the equivalent sentences with their bisimilarity scores
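The exact bisimilarity formula is given in the paper. Purely as an illustration of the selection step, the sketch below combines the two directional Lucene scores with a harmonic mean (an assumed stand-in, not the paper's formula) and keeps, for each English line, the Persian candidate with the highest combined score:

```python
def bisimilarity(fwd, bwd):
    """Combine the EN->FA and FA->EN scores. The harmonic mean is only
    an illustrative stand-in for the formula described in the paper."""
    return 2 * fwd * bwd / (fwd + bwd) if fwd + bwd else 0.0

def pick_equivalents(fwd_scores, bwd_scores):
    """fwd_scores and bwd_scores map (en_line, fa_line) -> Lucene score in
    each direction. Return, for each English line, its best Persian
    candidate as (fa_line, combined_score)."""
    best = {}
    for (en, fa), f in fwd_scores.items():
        s = bisimilarity(f, bwd_scores.get((en, fa), 0.0))
        if en not in best or s > best[en][1]:
            best[en] = (fa, s)
    return best
```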
- The main output of the last step is the index of the equivalent sentences. Examining these equivalent sentences reveals that some English sentences from the original Wikipedia file have made their way into the Persian documents. They were removed with another Python script.
- Download the post-processing Python script
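The post-processing script itself is linked above; one simple way to detect such stray English sentences on the Persian side is a script-based heuristic like the following (the 0.5 threshold is an assumption for illustration, not a value taken from the actual script):

```python
import re

PERSIAN = re.compile(r"[\u0600-\u06FF]")  # Arabic/Persian Unicode block
LATIN = re.compile(r"[A-Za-z]")

def is_stray_english(sentence, threshold=0.5):
    """Flag a sentence on the Persian side whose letters are mostly
    Latin as a stray English sentence. Heuristic sketch only."""
    latin = len(LATIN.findall(sentence))
    persian = len(PERSIAN.findall(sentence))
    total = latin + persian
    return total > 0 and latin / total > threshold
```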
- Download the extracted corpus along with its index and bisimilarity scores
References
[1] http://linguatools.org/tools/corpora/wikipedia-comparable-corpora
[2] Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007. (PDF)
[3] Lison, Pierre, and Jörg Tiedemann. "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles." LREC. 2016. (PDF)
[4] https://lucene.apache.org/
Acknowledgements
We'd like to thank our colleagues, Zahra Sepehri and Ailar Qaraie, at Iranzamin Language School for providing us with 500 sentences used in our test set.