Homograph disambiguation through selective diacritic restoration

Sawsan Alqahtani, Hanan Aldarmaki, Mona Diab

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent in languages whose diacritics tend to be omitted in writing, such as Arabic. Omitting diacritics increases the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results, where our devised strategies for selective diacritization lead to more balanced and consistent performance in downstream applications.
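
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the selection rule used here (keep diacritics only on words whose undiacritized spelling matches more than one diacritized form in a small vocabulary) is an assumption made for illustration, and the function names `strip_diacritics`, `ambiguous_forms`, and `selectively_diacritize` are hypothetical. The example words are written as Unicode escapes so the combining marks are unambiguous.

```python
from collections import defaultdict

# Arabic diacritic marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the word with all Arabic diacritic marks removed."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def ambiguous_forms(diacritized_vocab):
    """Return undiacritized spellings shared by more than one diacritized word.

    Illustrative selection rule only; the paper's own strategies are not
    reproduced here.
    """
    groups = defaultdict(set)
    for word in diacritized_vocab:
        groups[strip_diacritics(word)].add(word)
    return {bare for bare, forms in groups.items() if len(forms) > 1}


def selectively_diacritize(tokens, ambiguous):
    """Keep diacritics on homographs; strip them from every other token."""
    return [
        tok if strip_diacritics(tok) in ambiguous else strip_diacritics(tok)
        for tok in tokens
    ]


ILM = "\u0639\u0650\u0644\u0652\u0645"   # 'ilm, "knowledge"
ALAM = "\u0639\u064E\u0644\u064E\u0645"  # 'alam, "flag"
FI = "\u0641\u0650\u064A"                # fi, "in"

vocab = [ILM, ALAM, FI]
ambiguous = ambiguous_forms(vocab)
result = selectively_diacritize([ILM, FI, ALAM], ambiguous)
```

In this toy vocabulary, the two words for "knowledge" and "flag" collapse to the same spelling once diacritics are removed, so they keep their marks, while the unambiguous preposition is stripped. Full restoration would instead keep every mark and enlarge the vocabulary, which is the sparsity cost the abstract refers to.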

Original language: English
Title of host publication: ACL 2019 - 4th Arabic Natural Language Processing Workshop, WANLP 2019 - Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 49-59
Number of pages: 11
ISBN (Electronic): 9781950737321
Publication status: Published - 2019
Externally published: Yes
Event: 4th Arabic Natural Language Processing Workshop, WANLP 2019, held at ACL 2019 - Florence, Italy
Duration: Aug 1 2019 → …

Conference

Conference: 4th Arabic Natural Language Processing Workshop, WANLP 2019, held at ACL 2019
Country/Territory: Italy
City: Florence
Period: 8/1/19 → …

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language