Robust part-of-speech tagging of Arabic text

Hanan Aldarmaki, Mona Diab

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

We present a new and improved part-of-speech tagger for Arabic text that incorporates a set of novel features and constraints. This framework is presented within the MADAMIRA software suite, a state-of-the-art toolkit for Arabic language processing. Starting from a linear SVM model with basic lexical features, we add a range of features derived from morphological analysis and clustering methods. We show that using these features significantly improves part-of-speech tagging accuracy, especially for unseen words, which results in better generalization across genres. The final model, embedded in a sequential tagging framework, achieved 97.15% accuracy on the main test set of newswire data, which is higher than the current MADAMIRA accuracy of 96.91%, while being 30% faster.
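
To put the two accuracy figures in perspective, the gain can also be read as a relative reduction in tagging errors. The short Python sketch below is only an illustration of that arithmetic: the helper names are hypothetical, and the only inputs are the two percentages quoted in the abstract.

```python
def error_rate(accuracy_pct: float) -> float:
    """Error rate (in percent) implied by an accuracy percentage."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors that the new model removes."""
    baseline_err = error_rate(baseline_acc)
    new_err = error_rate(new_acc)
    return (baseline_err - new_err) / baseline_err


# Accuracies quoted in the abstract (main newswire test set).
MADAMIRA_ACC = 96.91
NEW_TAGGER_ACC = 97.15

absolute_gain = NEW_TAGGER_ACC - MADAMIRA_ACC
reduction = relative_error_reduction(MADAMIRA_ACC, NEW_TAGGER_ACC)

print(f"Absolute gain: {absolute_gain:.2f} points")
print(f"Relative error reduction: {reduction:.1%}")
```

Under these figures the 0.24-point absolute gain corresponds to removing roughly 7.8% of the baseline's errors. This says nothing about statistical significance, which the abstract asserts separately.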

Original language: English
Title of host publication: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015 - held at 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015 - Proceedings
Editors: Nizar Habash, Stephan Vogel, Kareem Darwish
Publisher: Association for Computational Linguistics (ACL)
Pages: 173-182
Number of pages: 10
ISBN (Electronic): 9781941643587
Publication status: Published - 2015
Externally published: Yes
Event: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015 - Beijing, China
Duration: Jul 30 2015 → …

Conference

Conference: 2nd Workshop on Arabic Natural Language Processing, ANLP 2015
Country/Territory: China
City: Beijing
Period: 7/30/15 → …

ASJC Scopus subject areas

  • Computer Science Applications
  • Computational Theory and Mathematics
  • Software
