PEPC: Parallel English-Persian Corpus Extracted from Wikipedia

PEPC is a collection of parallel sentences in English and Persian extracted from Wikipedia documents using a bidirectional translation method.

The Corpus:

Thesis and Paper:

The Bidirectional Method (Step by Step):

  1. The first step is to obtain the Wikipedia documents in English and Persian. We received this file from the owners of the Linguatools website [1].
  2. Next, the wiki markup, tables, links, etc. are stripped from this XML file (a minimal cleanup sketch appears after this list).
  3. While saving the documents, document pairs in which one document had fewer than 0.3 times as many sentences as the other were discarded (see the ratio-filter sketch after this list).
  4. We also produced a second version without the restriction above, in which all document pairs were saved.
  5. Then the texts are translated. To do this, two translation systems, one translating from Persian to English and the other from English to Persian, were built with the Moses toolkit [2]. Both systems were trained on the OpenSubtitles2016 corpus [3] from the OPUS collection, which consists of movie and TV subtitles.
  6. After that, the translated documents are split into sentences and given to the Lucene IR system [4] as queries to calculate the similarity between the translated sentences and the original ones. Lucene's source code had to be modified to suit our purpose: we wanted Lucene to read the queries from the translated documents and to break the original documents into one-sentence documents, so that similarity could be calculated at the sentence level (an illustrative similarity sketch appears after this list).
  7. The output of Lucene is two text files recording which lines of which English documents are similar to which lines of which Persian documents, and the reverse, along with the sentence lengths and the similarity scores calculated by Lucene. For each English line there are several Persian candidates, and vice versa.
  8. For the two output files, a Python script was written to choose the final equivalent for each sentence. In doing so, we applied our bisimilarity formula described in the paper (a simplified selection sketch appears after this list).
  9. The main output of the previous step contains the indices of the equivalent sentences. Examining these equivalent sentences reveals that some English sentences from the original Wikipedia file made their way into the Persian documents. These were deleted using another Python script (see the script-based filter sketch after this list).
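
For step 2, the exact schema of the Linguatools dump is not reproduced here; the following is a minimal cleanup sketch assuming a simple XML layout with one `article` element per document. The element and attribute names, and the regex patterns, are illustrative assumptions rather than the project's actual script.

```python
import re
import xml.etree.ElementTree as ET

# NOTE: the patterns and tag names below are illustrative assumptions;
# the real Linguatools XML schema and wiki markup may differ.
WIKI_LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")  # [[target|text]] -> text
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                   # {{...}} templates (non-nested)
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)              # {| ... |} tables
TAG = re.compile(r"<[^>]+>")                               # leftover HTML tags

def clean_wikitext(text: str) -> str:
    """Strip tables, templates, links, and tags from raw wiki text."""
    text = TABLE.sub(" ", text)
    text = TEMPLATE.sub(" ", text)
    text = WIKI_LINK.sub(r"\1", text)
    text = TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def extract_articles(xml_path: str):
    """Yield (article_name, cleaned_text) pairs from the dump."""
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "article":              # assumed tag name
            yield elem.get("name"), clean_wikitext(elem.text or "")
            elem.clear()                       # keep memory bounded
```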
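The 0.3 sentence-count ratio from step 3 can be expressed directly. This is a minimal sketch of that filter, assuming sentence lists have already been produced by an upstream splitter:

```python
def keep_pair(en_sents: list, fa_sents: list, ratio: float = 0.3) -> bool:
    """Keep a document pair only if neither side has fewer than
    `ratio` times as many sentences as the other (step 3)."""
    n_en, n_fa = len(en_sents), len(fa_sents)
    if n_en == 0 or n_fa == 0:
        return False
    return min(n_en, n_fa) >= ratio * max(n_en, n_fa)
```

Dropping this restriction, as in step 4, amounts to accepting every nonempty document pair.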
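Step 6 scores translated sentences against original ones with a modified Lucene, which is a Java system. As a language-neutral illustration of the same idea, the sketch below ranks original one-sentence "documents" against a translated query using plain TF-IDF cosine similarity; this is a stand-in for Lucene's scorer, not the project's actual modification.

```python
import math
from collections import Counter

def rank_candidates(query_tokens: list, original_sentences: list):
    """Rank tokenized original sentences against a tokenized translated
    query, mimicking the role of the modified Lucene in step 6.
    TF-IDF cosine here is an illustrative stand-in for Lucene scoring."""
    n = len(original_sentences)
    df = Counter(t for s in original_sentences for t in set(s))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens):
        return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(query_tokens)
    scored = [(i, cosine(q, vec(s))) for i, s in enumerate(original_sentences)]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```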
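For step 8, the actual bisimilarity formula is defined in the paper and is not reproduced here. The sketch below only illustrates the selection logic, combining the two directional Lucene scores symmetrically; the harmonic mean is an assumption standing in for the paper's formula.

```python
def select_equivalents(en2fa_scores: dict, fa2en_scores: dict) -> dict:
    """Choose one Persian equivalent per English sentence (step 8).

    en2fa_scores / fa2en_scores map (en_idx, fa_idx) -> Lucene score in
    each translation direction. The harmonic mean below is an
    illustrative stand-in for the paper's bisimilarity formula.
    """
    best = {}
    for (en, fa), s1 in en2fa_scores.items():
        s2 = fa2en_scores.get((en, fa), 0.0)
        bisim = 2 * s1 * s2 / (s1 + s2) if (s1 + s2) > 0 else 0.0
        if bisim > best.get(en, (None, 0.0))[1]:
            best[en] = (fa, bisim)
    return {en: fa for en, (fa, _) in best.items()}
```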
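Step 9 removes English sentences that leaked into the Persian documents. One simple heuristic, sketched below under the assumption that script detection is sufficient, is to drop pairs whose "Persian" side is mostly Latin-script; the threshold is an illustrative choice:

```python
def latin_ratio(sentence: str) -> float:
    """Fraction of alphabetic characters that are basic Latin letters."""
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return 0.0
    return sum("a" <= c.lower() <= "z" for c in letters) / len(letters)

def drop_leaked_english(pairs: list, threshold: float = 0.5) -> list:
    """Keep only (en, fa) pairs whose Persian side is not mostly Latin
    script, i.e. not an English sentence that leaked through (step 9)."""
    return [(en, fa) for en, fa in pairs if latin_ratio(fa) < threshold]
```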


References

[1] http://linguatools.org/tools/corpora/wikipedia-comparable-corpora

[2] Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007.

[3] Lison, Pierre, and Jörg Tiedemann. "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles." LREC. 2016.

[4] https://lucene.apache.org/

Acknowledgements

We would like to thank our colleagues Zahra Sepehri and Ailar Qaraie at the Iranzamin Language School for providing the 500 sentences used in our test set.