Hamid Haghdoost | Publication

Hamid Haghdoost, Ebrahim Ansari, Zdeněk Žabokrtský, Mahshid Nikravesh, Mohammad Mahmoudi

Morphological Networks for Persian and Turkish: What Can Be Induced from Morpheme Segmentation? Published

In this work, we propose an algorithm that induces morphological networks for Persian and Turkish. The algorithm uses morpheme-segmented lexicons for the two languages. The resulting networks capture both derivational and inflectional relations. The network induction algorithm can use either manually annotated lists of roots and affixes, or simple heuristics to distinguish roots from affixes. We evaluate both variants empirically. For Persian, the experiments use our large hand-segmented set of word forms; for Turkish, by contrast, they rely on a very limited, previously existing manually segmented lexicon.

The Prague Bulletin of Mathematical Linguistics

Hamid Haghdoost, Ebrahim Ansari, Zdeněk Žabokrtský, Mahshid Nikravesh

Building a Morphological Network for Persian on Top of a Morpheme-Segmented Lexicon Published

In this work, we introduce a new large hand-annotated morpheme-segmentation lexicon of Persian words and present an algorithm that builds a morphological network using this segmented lexicon. The resulting network captures both derivational and inflectional relations. The algorithm for inducing the network approximates the distinction between root morphemes and affixes using the number of morpheme occurrences in the lexicon. We evaluate the quality (in the sense of linguistic correctness) of the resulting network empirically and compare it to the quality of a network generated in a setup based on manually distinguished non-root morphemes.

Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

Ebrahim Ansari, Zdeněk Žabokrtský, Mohammad Mahmoudi, Hamid Haghdoost, Jonáš Vidra

Supervised Morphological Segmentation Using Rich Annotated Lexicon Published

Morphological segmentation of words is the process of dividing a word into smaller units called morphemes; it is especially difficult for morphologically rich or polysynthetic languages. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models, as well as various other machine-learning-based approaches, for the morphological segmentation task. We trained our models using annotated segmentation lexicons. To evaluate the effect of training data size on our models, we created a large hand-annotated morphologically segmented corpus of Persian words, which is, to the best of our knowledge, the first and only segmentation lexicon for the Persian language. In the experimental phase, using the hand-annotated Persian lexicon and two smaller, similar lexicons for Czech and Finnish, we evaluated the effect of training data size, different hyperparameter settings, and different RNN-based models.

Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)