TY - JOUR
T1 - Double weighted k nearest neighbours for binary classification of high dimensional genomic data
AU - Ali, Amjad
AU - Khan, Zardad
AU - Du, Hailiang
AU - Aldahmani, Saeed
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
AB - High dimensional gene expression datasets consist of a large number of genes, many of which do not play a significant role in classifying tissue samples. Because the number of gene features substantially exceeds the sample size, existing methods struggle to work efficiently in terms of prediction accuracy and execution time. To address this issue, a new classification procedure called double weighted k nearest neighbours is proposed. The procedure is specifically designed for gene expression data and incorporates feature weights derived from genes' ability to express differentially between classes. Feature weights are derived in a manner that automatically increases the impact of informative features while decreasing it for less informative or non-informative features. To achieve this goal, the estimated weighted distances from the observations in the k nearest neighbourhood to the test point are used in an exponential function. The outputs of the function are summed separately for each of the two classes, and the test point is assigned the class label with the largest sum. By utilizing the proposed weighting method based on the differential capability of genes, the method aims to achieve robust and efficient classification by allowing only the most informative features/genes to contribute to the classification task. Experimental evaluations against several methods, i.e., standard k nearest neighbours, the weighted k nearest neighbours classifier, random k nearest neighbour, extended neighbourhood rule ensemble (ExNRule), k conditional nearest neighbour, ensemble and support vector machines (SVM), demonstrate the effectiveness of the proposed method in accurately classifying gene expression datasets. Overall, the proposed method presents a promising approach for gene expression data analysis through its two fold weighted distance calculation strategy, as assessed using classification accuracy, Cohen's kappa, sensitivity and score as performance metrics.
UR - http://www.scopus.com/inward/record.url?scp=105003219009&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105003219009&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-97505-2
DO - 10.1038/s41598-025-97505-2
M3 - Article
C2 - 40221543
AN - SCOPUS:105003219009
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 12681
ER -
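
The two-fold weighting summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption rather than the authors' method: the feature weight is taken to be a normalized absolute standardized mean difference between the two classes, the distance is a weighted Euclidean distance, and the exponential function is taken to be exp(-d). The names `feature_weights` and `predict_one` are hypothetical.

```python
import numpy as np


def feature_weights(X, y):
    """Per-gene weights (ASSUMED form, not taken from the paper).

    Uses the absolute difference of class means divided by a pooled
    spread, normalized so the weights sum to 1. Genes that differ more
    between the two classes receive larger weights.
    """
    X0, X1 = X[y == 0], X[y == 1]
    diff = np.abs(X0.mean(axis=0) - X1.mean(axis=0))
    spread = np.sqrt(X0.var(axis=0) + X1.var(axis=0)) + 1e-12  # avoid /0
    w = diff / spread
    return w / w.sum()


def predict_one(X, y, w, x_new, k=5):
    """Classify one test point with feature- and distance-weighted kNN.

    1. Compute feature-weighted Euclidean distances to all training points.
    2. Keep the k nearest neighbours.
    3. Pass each neighbour's distance through exp(-d) (ASSUMED form of the
       exponential function) and sum the outputs separately per class.
    4. Return the class with the larger sum (ties go to class 0).
    """
    d = np.sqrt((((X - x_new) ** 2) * w).sum(axis=1))
    nn = np.argsort(d)[:k]
    votes = np.exp(-d[nn])
    sum0 = votes[y[nn] == 0].sum()
    sum1 = votes[y[nn] == 1].sum()
    return int(sum1 > sum0)
```

Here `X` is assumed to be a samples-by-genes NumPy array and `y` a NumPy array of 0/1 class labels. Calling `w = feature_weights(X, y)` once and then `predict_one(X, y, w, x_new, k=5)` for each test sample follows the order of steps described in the abstract; the choice of `k` and of the weight statistic would need to follow the paper itself.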