Word sense disambiguation

A new unsupervised word sense disambiguation for Persian words using comparable corpora

Comparable corpus

Click here to download 2016 English Wikipedia articles
Click here to download 2016 Persian Wikipedia articles

Dictionary

The dictionary we use is a bilingual, word-by-word Persian-English dictionary. It contains 83505 Persian entries with their translations.

You can download the Persian-English dictionary here

Test data and Goldtext

For each ambiguous word, 200 text samples (paragraphs or simple sentences) were extracted from the Persian articles of Wikipedia 2016. The ambiguous words were then manually tagged with their senses.

You can download test data for 4 words:

There is also related data for disambiguating Persian words, including sentences with annotated ambiguous words extracted from the Hamshahri corpus, provided by E. Ansari and H. Mousavi.

After preprocessing the Hamshahri corpus, sentences containing 8 ambiguous Persian words are extracted. Stopwords are then removed, and all sentences are saved in XML files according to their intended ambiguous words.

Extracted sentences (stopwords removed):
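The extraction step described above can be sketched roughly as follows. This is a minimal illustration only, not the code used to build the released files: the stopword list, the whitespace tokenization, the function names, and the XML layout (`sentences` and `sentence` elements) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, very small stopword list (assumption for this sketch).
STOPWORDS = {"و", "در", "به", "از", "که", "را"}


def extract_sentences(sentences, target, stopwords=STOPWORDS):
    """Keep sentences that contain `target` and drop stopwords from them."""
    kept = []
    for sentence in sentences:
        tokens = sentence.split()  # naive whitespace tokenization
        if target in tokens:
            kept.append([t for t in tokens if t not in stopwords])
    return kept


def to_xml(target, token_lists):
    """Build an XML tree with one <sentence> element per kept sentence."""
    root = ET.Element("sentences", {"word": target})
    for i, tokens in enumerate(token_lists, start=1):
        node = ET.SubElement(root, "sentence", {"id": str(i)})
        node.text = " ".join(tokens)
    return root


corpus = [
    "شیر در جنگل است",
    "او به مدرسه رفت",
    "من شیر و نان خریدم",
]
filtered = extract_sentences(corpus, "شیر")
root = to_xml("شیر", filtered)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To write one file per ambiguous word, the tree could be saved with `ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)`, where `path` is whatever file name you choose for that word.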

words	number of sentences	download
اشکال (Aškâl_Eškâl)	7504	download
جو (Jo_Jav)	10021	download
سبک (Sabok_Sabk)	14471	download
شکر (Šekar_Šokr)	4801	download
شیر (Šir)	10181	download
مهر (Mehr_Mohr)	20238	download
نفس (Nafas_Nafs)	13733	download
سیر (Sir_Seir)	8151	download

Annotated data (sense-tagged words in the sentences in which they occur; 100 sentences are provided for each word):

Word embedding

In this work we used the word2vec implementation in the gensim library for embedding.

Click here to install gensim.

Click here or here to learn how to use the word2vec implementation in gensim.

Paper & Source Code

Click here to download paper.
Click here to download source code written in Python.

Results

We varied the word2vec model parameters during the experiments. Three main parameters were manipulated, and the others were left at their default values. Each assignment of these three parameters yields a configuration; in total we considered 8 configurations, as follows:

d: The number of dimensions into which words are mapped.
w: Window size (n words before plus n words after the target word).
m: min_count (words with a frequency below this threshold are removed from the model dictionary).

Configuration	Number of dimensions	Window size	min_count	Download results
1	200	5	5	download
2	200	5	10	download
3	200	10	5	download
4	200	10	10	download
5	400	5	5	download
6	400	5	10	download
7	400	10	5	download
8	400	10	10	download
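
The 8 configurations in the table are the full grid over two values each of d, w and m. A minimal sketch of how such a grid could be enumerated and passed to gensim is shown below. The function names are hypothetical, and the keyword `vector_size` assumes gensim 4.x (older gensim releases call this parameter `size`); this is not necessarily how the released source code is organized.

```python
from itertools import product

DIMENSIONS = (200, 400)  # d
WINDOWS = (5, 10)        # w
MIN_COUNTS = (5, 10)     # m


def build_configurations():
    """Enumerate the 2 x 2 x 2 grid in the same order as the table."""
    configs = []
    grid = product(DIMENSIONS, WINDOWS, MIN_COUNTS)
    for number, (d, w, m) in enumerate(grid, start=1):
        configs.append(
            {"id": number, "vector_size": d, "window": w, "min_count": m}
        )
    return configs


def train(sentences, config):
    """Train one word2vec model; `sentences` is an iterable of token lists.

    gensim is imported here so the grid can be inspected without it installed.
    All other parameters keep their gensim default values.
    """
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=sentences,
        vector_size=config["vector_size"],
        window=config["window"],
        min_count=config["min_count"],
    )


configs = build_configurations()
for c in configs:
    print(c["id"], c["vector_size"], c["window"], c["min_count"])
```

Calling `train(tokenized_sentences, configs[0])` would then train a model for configuration 1, where `tokenized_sentences` stands for your own preprocessed corpus.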