Candidate document retrieval for Arabic-based text reuse detection on the web

Leena Lulu, Boumediene Belkhouche, Saad Harous

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Given an input document d, the problem of local text reuse detection is to find, within a given document collection, all passages that are reused between d and the other documents. Comparing the passages of d with the passages of every other document in the collection is infeasible, especially for large collections such as the Web. Selecting a subset of documents that potentially share reused text with d therefore becomes a major step in the detection problem. This paper describes a new, efficient query formulation approach for retrieving Arabic-based candidate source documents from the Web. We evaluated the approach on a collection of documents constructed specifically for this work. The experiments show that, on average, 79.97% of the Web documents used in the reuse cases were successfully retrieved.
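
The reported figure reads as an average recall over reuse cases. The following is a minimal sketch of how such a figure could be computed, assuming recall is measured per case and then averaged; the abstract does not spell out the exact procedure. The function names and the toy URLs are hypothetical and are not taken from the paper or its collection.

```python
def retrieval_recall(source_urls, retrieved_urls):
    """Fraction of a case's known source documents found among the retrieved candidates."""
    sources = set(source_urls)
    if not sources:
        raise ValueError("a reuse case needs at least one source document")
    return len(sources & set(retrieved_urls)) / len(sources)


def mean_recall(cases):
    """Average per-case recall over (source_urls, retrieved_urls) pairs (assumed averaging scheme)."""
    recalls = [retrieval_recall(sources, retrieved) for sources, retrieved in cases]
    return sum(recalls) / len(recalls)


# Hypothetical toy data for illustration only.
cases = [
    (["a.example/1", "a.example/2"], ["a.example/1", "b.example/9"]),  # 1 of 2 sources found
    (["c.example/3"], ["c.example/3"]),                                # 1 of 1 source found
]
print(f"{100 * mean_recall(cases):.2f}%")  # prints 75.00%
```

Averaging per case weights every reuse case equally; pooling all source documents across cases before dividing would be a different measure and could give a different number.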

Original language: English
Title of host publication: Proceedings of the 2016 12th International Conference on Innovations in Information Technology, IIT 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509053438
Publication status: Published - Mar 16 2017
Event: 12th International Conference on Innovations in Information Technology, IIT 2016 - Al Ain, United Arab Emirates
Duration: Nov 28 2016 → Nov 29 2016

Publication series

Name: Proceedings of the 2016 12th International Conference on Innovations in Information Technology, IIT 2016

Other

Other: 12th International Conference on Innovations in Information Technology, IIT 2016
Country/Territory: United Arab Emirates
City: Al Ain
Period: 11/28/16 → 11/29/16

Keywords

  • Fingerprinting
  • Query Generation
  • Text Reuse Detection
  • Web Document Retrieval

ASJC Scopus subject areas

  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems
  • Computer Networks and Communications
  • Instrumentation
