The dictionary we use is a bilingual, word-by-word Persian-English dictionary. It contains 83505 Persian entries with their translations.
You can download test data for 4 words:
There is also related data for disambiguating Persian words, including sentences extracted from the Hamshahri corpus with their ambiguous words annotated, provided by E. Ansari and H. Mousavi.
After preprocessing the Hamshahri corpus, sentences containing 8 ambiguous Persian words were extracted, stopwords were removed, and all sentences were saved in XML files according to their intended ambiguous words.

Word | Number of sentences | Download |
---|---|---|
اشکال (Aškâl_Eškâl) | 7504 | download |
جو (Jo_Jav) | 10021 | download |
سبک (Sabok_Sabk) | 14471 | download |
شکر (Šekar_Šokr) | 4801 | download |
شیر (Šir) | 10181 | download |
مهر (Mehr_Mohr) | 20238 | download |
نفس (Nafas_Nafs) | 13733 | download |
سیر (Sir_Seir) | 8151 | download |
In this work we used the word2vec implementation in the gensim library for embedding.
Click here to install gensim.
Click here or here to use word2vec embedded in gensim.
Configuration | Number of dimensions | Window size | min_count | Download results |
---|---|---|---|---|
1 | 200 | 5 | 5 | download |
2 | 200 | 5 | 10 | download |
3 | 200 | 10 | 5 | download |
4 | 200 | 10 | 10 | download |
5 | 400 | 5 | 5 | download |
6 | 400 | 5 | 10 | download |
7 | 400 | 10 | 5 | download |
8 | 400 | 10 | 10 | download |
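
The eight configurations in the table are the full grid of 2 dimension values, 2 window sizes, and 2 `min_count` values. The sketch below shows one way such a grid could be generated and passed to gensim. It is an illustration under assumptions, not the exact script used in this work: the helper names `build_configs` and `train` are hypothetical, the parameter name `vector_size` assumes gensim 4.x (older releases call it `size`), and all other Word2Vec settings are left at their gensim defaults because the table does not specify them.

```python
from itertools import product

# Values taken from the configuration table above.
DIMENSIONS = (200, 400)
WINDOWS = (5, 10)
MIN_COUNTS = (5, 10)


def build_configs():
    """Return the 8 configurations keyed by configuration number, in table order."""
    configs = {}
    grid = product(DIMENSIONS, WINDOWS, MIN_COUNTS)
    for number, (dim, window, min_count) in enumerate(grid, start=1):
        configs[number] = {
            "vector_size": dim,      # assumption: gensim 4.x name ("size" in gensim 3.x)
            "window": window,
            "min_count": min_count,
        }
    return configs


def train(sentences, config):
    """Train a Word2Vec model on an iterable of token lists (hypothetical helper)."""
    # Imported here so the grid can be inspected without gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **config)


if __name__ == "__main__":
    for number, config in build_configs().items():
        print(number, config)
```

To train configuration 3, for example, one would call `train(sentences, build_configs()[3])`, where `sentences` is the preprocessed corpus as a list of token lists.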