TY - JOUR
T1 - Review and Implementation of Topic Modeling in Hindi
AU - Ray, Santosh Kumar
AU - Ahmad, Amir
AU - Kumar, Ch Aswani
N1 - Publisher Copyright: © 2019 Taylor & Francis.
PY - 2019/9/19
Y1 - 2019/9/19
N2 - Due to the widespread use of electronic devices and the growing popularity of social media, text data is being generated at a rate never seen before. It is not possible for humans to read all of the data generated and find what is being discussed in their fields of interest. Topic modeling is a technique for identifying the topics present in a large set of text documents. In this paper, we discuss the widely used techniques and tools for topic modeling. There has been a lot of research on topic modeling in English, but there has been little progress in resource-scarce languages like Hindi, despite Hindi being spoken by millions of people across the world. We discuss the challenges faced in developing topic models for Hindi and apply the Latent Semantic Indexing (LSI), Non-negative Matrix Factorization (NMF), and Latent Dirichlet Allocation (LDA) algorithms to topic modeling in Hindi. The outcomes of topic modeling algorithms are usually difficult for the common user to interpret, so we use various visualization techniques to represent them in a meaningful way. We then use metrics such as perplexity and coherence to evaluate the topic models. The results of topic modeling in Hindi seem promising and comparable to some results reported in the literature on English datasets.
AB - Due to the widespread use of electronic devices and the growing popularity of social media, text data is being generated at a rate never seen before. It is not possible for humans to read all of the data generated and find what is being discussed in their fields of interest. Topic modeling is a technique for identifying the topics present in a large set of text documents. In this paper, we discuss the widely used techniques and tools for topic modeling. There has been a lot of research on topic modeling in English, but there has been little progress in resource-scarce languages like Hindi, despite Hindi being spoken by millions of people across the world. We discuss the challenges faced in developing topic models for Hindi and apply the Latent Semantic Indexing (LSI), Non-negative Matrix Factorization (NMF), and Latent Dirichlet Allocation (LDA) algorithms to topic modeling in Hindi. The outcomes of topic modeling algorithms are usually difficult for the common user to interpret, so we use various visualization techniques to represent them in a meaningful way. We then use metrics such as perplexity and coherence to evaluate the topic models. The results of topic modeling in Hindi seem promising and comparable to some results reported in the literature on English datasets.
UR - http://www.scopus.com/inward/record.url?scp=85071990470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071990470&partnerID=8YFLogxK
U2 - 10.1080/08839514.2019.1661576
DO - 10.1080/08839514.2019.1661576
M3 - Article
AN - SCOPUS:85071990470
SN - 0883-9514
VL - 33
SP - 979
EP - 1007
JO - Applied Artificial Intelligence
JF - Applied Artificial Intelligence
IS - 11
ER -