SUBTLEX-AR: Arabic word distributional characteristics based on movie subtitles

Sami Boudelaa, Manuel Carreiras, Nazrin Jariya, Manuel Perea

Research output: Contribution to journal, Article, peer-review

Abstract

This article presents SUBTLEX-AR, a digital database providing an extensive collection of attributes related to Modern Standard Arabic words (Arabic for short). SUBTLEX-AR combines a novel dataset of 120 million word tokens from movie subtitles with 40 million tokens from newspaper articles originally collected in ARALEX (Boudelaa & Marslen-Wilson, Behavior Research Methods, 42, 481–487, 2010), ensuring comprehensive coverage. SUBTLEX-AR provides information about the statistical properties of Arabic words at the orthographic, phonological, morphological, and semantic levels. The database also includes information on sub-word structural properties such as bigram and trigram frequencies, as well as lemmas and part-of-speech information along with their corresponding frequencies. The online interface of SUBTLEX-AR allows users either to upload a set of words and retrieve their properties, or to retrieve a set of words matching constraints on predefined properties. The properties themselves are easily extensible and will be expanded over time. SUBTLEX-AR is freely accessible at https://subtlexar.uaeu.ac.ae/
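
As a rough illustration of the sub-word measures mentioned in the abstract, the following Python sketch computes token-weighted bigram and trigram counts from a word-frequency table and converts a raw count to a per-million value. It is a minimal sketch under stated assumptions, not the procedure used to build SUBTLEX-AR: the abstract does not say whether n-gram frequencies are type-based or token-weighted, the function names `ngram_frequencies` and `per_million` are hypothetical, and the word counts are toy values rather than data from the database.

```python
from collections import Counter


def ngram_frequencies(word_counts, n):
    """Token-weighted n-gram counts (an assumed definition).

    Each letter n-gram found in a word is credited with that word's
    corpus count, so frequent words contribute more than rare ones.
    """
    totals = Counter()
    for word, count in word_counts.items():
        # Slide a window of n characters (Unicode code points) over the word.
        for i in range(len(word) - n + 1):
            totals[word[i:i + n]] += count
    return totals


def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


# Toy counts for three undiacritized forms; NOT taken from SUBTLEX-AR.
toy_counts = {"كتب": 50, "كتاب": 30, "كاتب": 20}

bigrams = ngram_frequencies(toy_counts, 2)
trigrams = ngram_frequencies(toy_counts, 3)
```

With these toy counts, the bigram "كت" occurs in the first two words, so its token-weighted count is the sum of their counts. Windows are taken over code points, so diacritized input would need to be normalized or stripped first, depending on the definition of "letter" one adopts.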

Original language: English
Article number: 104
Journal: Behavior Research Methods
Volume: 57
Issue number: 4
Publication status: Published - Apr 2025

Keywords

  • Arabic
  • Morpheme frequency
  • Semantic similarity
  • Subtitles
  • Word frequency

ASJC Scopus subject areas

  • Experimental and Cognitive Psychology
  • Developmental and Educational Psychology
  • Arts and Humanities (miscellaneous)
  • Psychology (miscellaneous)
  • General Psychology
