TY - GEN
T1 - Machine Learning Approaches for Sentiment Analysis on Balanced and Unbalanced Datasets
AU - Elmassry, Ahmed M.
AU - Alshamsi, Abdulla
AU - Abdulhameed, Ahmed F.
AU - Zaki, Nazar
AU - Belkacem, Abdelkader Nasreddine
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Sentiment analysis, sometimes referred to as opinion mining, is essential for understanding public opinion and attitudes toward various social topics and trends. This study explores the effectiveness of three machine learning (ML) models, namely support vector machine (SVM), long short-term memory (LSTM), and bidirectional encoder representations from transformers (BERT), in analyzing a Kaggle dataset of 37,000 user reviews of the Instagram Threads app. After initial data cleaning and preprocessing, the dataset was partitioned into 70% for training and 30% for testing. The training set was then used to create three datasets: a balanced dataset and two unbalanced datasets, one with 90% positive instances and the other with 90% negative instances. Each of the three models was trained on each of these datasets, resulting in nine different models. Evaluation metrics, including accuracy, precision, recall, and F1 score, were applied to assess model performance. The BERT model fine-tuned on the balanced dataset outperformed all other models, with an accuracy of 86%, precision of 85%, recall of 87%, and F1 score of 86%. These findings underscore the effectiveness of diverse ML techniques, particularly transformers, and the crucial role of data balancing in optimizing sentiment analysis tasks.
AB - Sentiment analysis, sometimes referred to as opinion mining, is essential for understanding public opinion and attitudes toward various social topics and trends. This study explores the effectiveness of three machine learning (ML) models, namely support vector machine (SVM), long short-term memory (LSTM), and bidirectional encoder representations from transformers (BERT), in analyzing a Kaggle dataset of 37,000 user reviews of the Instagram Threads app. After initial data cleaning and preprocessing, the dataset was partitioned into 70% for training and 30% for testing. The training set was then used to create three datasets: a balanced dataset and two unbalanced datasets, one with 90% positive instances and the other with 90% negative instances. Each of the three models was trained on each of these datasets, resulting in nine different models. Evaluation metrics, including accuracy, precision, recall, and F1 score, were applied to assess model performance. The BERT model fine-tuned on the balanced dataset outperformed all other models, with an accuracy of 86%, precision of 85%, recall of 87%, and F1 score of 86%. These findings underscore the effectiveness of diverse ML techniques, particularly transformers, and the crucial role of data balancing in optimizing sentiment analysis tasks.
KW - Balanced
KW - BERT
KW - LSTM
KW - Machine Learning
KW - Sentiment Analysis
KW - SVM
KW - Unbalanced Dataset
UR - http://www.scopus.com/inward/record.url?scp=85207076419&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85207076419&partnerID=8YFLogxK
U2 - 10.1109/ICCSCE61582.2024.10695972
DO - 10.1109/ICCSCE61582.2024.10695972
M3 - Conference contribution
AN - SCOPUS:85207076419
T3 - 14th IEEE International Conference on Control System, Computing and Engineering, ICCSCE 2024 - Proceedings
SP - 18
EP - 23
BT - 14th IEEE International Conference on Control System, Computing and Engineering, ICCSCE 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IEEE International Conference on Control System, Computing and Engineering, ICCSCE 2024
Y2 - 23 August 2024 through 24 August 2024
ER -