TY - JOUR
T1 - Heuristics-based query processing for large RDF graphs using cloud computing
AU - Husain, Mohammad Farhan
AU - McGlothlin, James
AU - Masud, Mohammad Mehedy
AU - Khan, Latifur R.
AU - Thuraisingham, Bhavani
N1 - Funding Information:
Bangladesh University of Engineering and Technology with BS and MS degrees in computer science and engineering in 2001 and 2004, respectively. He received the PhD degree from the University of Texas at Dallas (UTD) in December 2009. He is a postdoctoral research associate at UTD. His research interests are in data stream mining, machine learning, and intrusion detection using data mining. His recent research focuses on developing data mining techniques to classify data streams. He has published more than 20 research papers in journals including IEEE Transactions on Knowledge and Data Engineering, and peer-reviewed conferences including ICDM, ECML/PKDD, and PAKDD. He is also the lead author of the book titled Data Mining Tools for Malware Detection, and the principal inventor of the US Patent Application titled “Systems and Methods for Detecting a Novel Data Class.” Latifur R. Khan received the BSc degree in computer science and engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in November 1993. He received the MS and PhD degrees in computer science from the University of Southern California in December 1996 and August 2000, respectively. He is currently an associate professor in the Computer Science Department at the University of Texas at Dallas (UTD), where he has been teaching and conducting research since September 2000. His research work is supported by grants from NASA, the Air Force Office of Scientific Research (AFOSR), the US National Science Foundation (NSF), the Nokia Research Center, Raytheon, CISCO, and Tektronix. In addition, he is the director of the state-of-the-art DBL@UTD, UTD Data Mining/Database Laboratory, which is the primary center of research related to data mining, semantic web, and image/video annotation at University of Texas-Dallas. His research areas cover data mining, multimedia information management, semantic web, and database systems, with the primary focus on the first three research disciplines.
He has served as a committee member in numerous prestigious conferences, symposiums, and workshops. He has published more than 150 papers in prestigious journals and conferences.
Funding Information:
This material is based upon work supported by the AFOSR under Award No. FA9550-08-1-0260 and NASA under Award No. 2008-00867-01.
PY - 2011
Y1 - 2011
AB - Semantic web is an emerging area to augment human reasoning. Various technologies are being developed in this arena which have been standardized by the World Wide Web Consortium (W3C). One such standard is the Resource Description Framework (RDF). Semantic web technologies can be utilized to build efficient and scalable systems for Cloud Computing. With the explosion of semantic web technologies, large RDF graphs are commonplace. This poses significant challenges for the storage and retrieval of RDF graphs. Current frameworks do not scale for large RDF graphs and as a result do not address these challenges. In this paper, we describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples by exploiting the cloud computing paradigm. We describe a scheme to store RDF data in the Hadoop Distributed File System. More than one Hadoop job (the smallest unit of execution in Hadoop) may be needed to answer a query because a single triple pattern in a query cannot simultaneously take part in more than one join in a single Hadoop job. To determine the jobs, we present an algorithm to generate a query plan, whose worst-case cost is bounded, based on a greedy approach to answer a SPARQL Protocol and RDF Query Language (SPARQL) query. We use Hadoop's MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity-class hardware. Furthermore, we show that our framework is scalable and efficient and can handle large amounts of RDF data, unlike traditional approaches.
KW - Hadoop
KW - MapReduce
KW - RDF
KW - SPARQL
UR - http://www.scopus.com/inward/record.url?scp=79960927153&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79960927153&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2011.103
DO - 10.1109/TKDE.2011.103
M3 - Article
AN - SCOPUS:79960927153
SN - 1041-4347
VL - 23
SP - 1312
EP - 1327
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 9
M1 - 5765957
ER -