This study examines how data scaling techniques affect machine learning algorithms for predicting heart disease, and it highlights the role of preprocessing in model performance. Scaling matters when a dataset's attributes span very different ranges, because unscaled features can strongly influence how well many models perform. Eleven widely used algorithms, including K-Nearest Neighbors (KNN) and Logistic Regression, were evaluated with three scaling methods: Min-Max scaling, Z-score standardization, and MaxAbs scaling. Performance was assessed with precision, recall, and F1 score across multiple experiments.
The findings indicate that several algorithms performed better with MaxAbs scaling, particularly those sensitive to data distribution, such as KNN and Logistic Regression. This suggests that the choice of scaling technique is crucial for accurate and consistent heart disease predictions, and that scaling methods should be selected carefully when optimizing machine learning models for medical diagnostics.
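For readers unfamiliar with the three scaling methods named above, the following is a minimal sketch of their standard textbook definitions applied to a single feature column. It is not the study's own implementation: the function names and the sample readings are hypothetical, and the use of the population standard deviation for Z-score standardization is an assumption, since the abstract does not say which variant was used.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Min-Max scaling: (x - min) / (max - min), mapping values onto [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no spread; return zeros instead of dividing by zero.
    return [0.0 if span == 0 else (x - lo) / span for x in values]


def z_score_scale(values):
    """Z-score standardization: (x - mean) / std, giving zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in values]


def max_abs_scale(values):
    """MaxAbs scaling: x / max(|x|), mapping values onto [-1, 1] and keeping zeros at zero."""
    peak = max(abs(x) for x in values)
    return [0.0 if peak == 0 else x / peak for x in values]


# Hypothetical readings, invented purely for illustration.
readings = [150.0, 200.0, 250.0, 300.0]

print("min-max:", min_max_scale(readings))
print("z-score:", z_score_scale(readings))
print("max-abs:", max_abs_scale(readings))
```

Unlike the other two methods, MaxAbs scaling does not shift the data, so the sign of each value and any zero entries are preserved. In a real experiment the scaling parameters would normally be computed on the training split only and then reused on the test split.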
Keywords
Data Scaling; Machine Learning Algorithms; Prediction of Heart Disease; Model Performance Metrics; Preprocessing Strategies