A new unsupervised word sense disambiguation for Persian words using comparable corpora

Comparable corpus


The dictionary we use is a bilingual, word-by-word Persian-English dictionary. It contains 83505 Persian entries with their translations.

Test data and Goldtext

For each ambiguous word, 200 text samples (paragraphs or simple sentences) were extracted from Persian articles of Wikipedia 2016. The ambiguous words were then manually tagged with their senses.

You can download test data for 4 words:

There are also some related data for disambiguating Persian words, including sentences extracted from the Hamshahri corpus with annotated ambiguous words, provided by E. Ansari and H. Mousavi.

After preprocessing the Hamshahri corpus, sentences containing 8 ambiguous Persian words were extracted. Stopwords were then removed, and all sentences were saved in XML files according to their intended ambiguous words.
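The extraction step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the XML element names (`sentences`, `sentence`), the toy corpus, and the tiny stopword set are all hypothetical and are not taken from the released files, whose actual schema may differ. Sentences are assumed to be already tokenized.

```python
import xml.etree.ElementTree as ET


def extract_sentences(sentences, target, stopwords):
    """Keep the tokenized sentences that contain `target`, dropping stopwords."""
    kept = []
    for tokens in sentences:
        if target in tokens:
            kept.append([t for t in tokens if t not in stopwords])
    return kept


def to_xml(target, sentences):
    """Build a hypothetical XML layout: one <sentence> element per sentence."""
    root = ET.Element("sentences", attrib={"word": target})
    for tokens in sentences:
        ET.SubElement(root, "sentence").text = " ".join(tokens)
    return root


# Toy data invented for illustration only (not from the Hamshahri corpus).
corpus = [
    ["شیر", "در", "جنگل", "است"],
    ["هوا", "سرد", "است"],
]
stopwords = {"در", "است"}

kept = extract_sentences(corpus, "شیر", stopwords)
root = to_xml("شیر", kept)
xml_text = ET.tostring(root, encoding="unicode")
```

Only the first toy sentence contains the target word, so a single `sentence` element is produced. Writing one such file per ambiguous word would give the per-word layout described above.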

Extracted sentences (stopwords removed):

Word   Number of sentences   Download
اشکال (Aškâl_Eškâl) 7504 download
جو (Jo_Jav) 10021 download
سبک (Sabok_Sabk) 14471 download
شکر (Šekar_Šokr) 4801 download
شیر (Šir) 10181 download
مهر (Mehr_Mohr) 20238 download
نفس (Nafas_Nafs) 13733 download
سیر (Sir_Seir) 8151 download

Annotated data (sense-tagged words in the sentences in which they occur; 100 sentences are provided for each word):

Word embedding

In this work we used the word2vec implementation in the gensim library for embedding.

Click here to install gensim.

Click here or here to use word2vec embedded in gensim.

Paper & Source Code


We varied the word2vec model parameters across experiments. Three main parameters were changed while all others kept their default values. Each assignment of these three parameters yields one configuration; in total we considered 8 configurations, as follows:

d: The number of dimensions into which words are mapped.
w: Window size (n words before plus n words after the target word).
m: min_count (words with a frequency below this threshold are removed from the model dictionary).

Configuration  Dimensions (d)  Window size (w)  min_count (m)  Results
1              200             5                5              download
2              200             5                10             download
3              200             10               5              download
4              200             10               10             download
5              400             5                5              download
6              400             5                10             download
7              400             10               5              download
8              400             10               10             download
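The eight configurations are the Cartesian product of two values for each of d, w and m. The sketch below enumerates them in the same order as the table and shows how one configuration might be passed to gensim. It is not the authors' training script: the helper names are hypothetical, and the keyword `vector_size` assumes gensim 4.x (older releases call this parameter `size`). The gensim import is deferred so that the enumeration runs without gensim installed.

```python
from itertools import product

DIMENSIONS = (200, 400)   # d: embedding dimensions
WINDOWS = (5, 10)         # w: context window size
MIN_COUNTS = (5, 10)      # m: minimum word frequency


def build_configurations():
    """Enumerate all (d, w, m) combinations, numbered from 1."""
    grid = product(DIMENSIONS, WINDOWS, MIN_COUNTS)
    return [
        {"id": number, "vector_size": d, "window": w, "min_count": m}
        for number, (d, w, m) in enumerate(grid, start=1)
    ]


def train(sentences, config):
    """Train word2vec on tokenized sentences; other parameters keep gensim defaults."""
    from gensim.models import Word2Vec  # assumes gensim 4.x is installed

    return Word2Vec(
        sentences=sentences,
        vector_size=config["vector_size"],
        window=config["window"],
        min_count=config["min_count"],
    )


configurations = build_configurations()
```

Calling `train(tokenized_corpus, configurations[0])`, where `tokenized_corpus` is a placeholder for a list of token lists, would correspond to configuration 1 (d=200, w=5, m=5).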