Main Article Content

Abstract

In the data mining community, data sets with imbalanced class distributions have received growing attention. The evolving field of data mining and knowledge discovery seeks accurate and efficient computational tools for analyzing such data sets and extracting new knowledge from data. Sampling methods re-balance imbalanced data sets and thereby improve classifier performance. Over-fitting and under-fitting are two prominent problems in the classification of imbalanced data sets. In this study, a novel weighted ensemble method is proposed to reduce the influence of over-fitting and under-fitting when classifying such data sets. Forty imbalanced data sets with varying imbalance ratios are used to conduct a comparative study. The performance of the proposed method is compared with that of four conventional classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machine (SVM), and neural network (NN). The comparison is carried out with two over-sampling procedures, the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme proves effective in reducing the impact of over-fitting and under-fitting on the classification of these data sets.
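
The pipeline the abstract describes can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the authors' exact method: it re-balances the training data with SMOTE or ADASYN from the imbalanced-learn toolbox (Lemaître et al., 2017, reference 24), then combines the four baseline classifiers in a weighted soft-voting ensemble. The voting weights shown are arbitrary placeholders; the paper derives its own weighting scheme.

```python
# Minimal sketch (not the paper's method): over-sample, then weighted soft voting.
from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a roughly 9:1 class ratio stands in for the 40 benchmark sets.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Over-sample only the training split so the test set keeps its natural imbalance.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
# ADASYN(random_state=0).fit_resample(X_train, y_train) is the drop-in alternative.

# Weighted ensemble over the four baseline classifiers compared in the paper;
# probability=True lets the SVM contribute to the averaged class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("svm", SVC(probability=True, random_state=0)),
        ("nn", MLPClassifier(max_iter=1000, random_state=0)),
    ],
    voting="soft",
    weights=[1, 1, 2, 2],  # illustrative weights only, not the paper's scheme
)
ensemble.fit(X_res, y_res)
print(classification_report(y_test, ensemble.predict(X_test)))
```

Resampling only the training partition is the standard precaution against the optimistic bias that leaks in when synthetic minority examples reach the evaluation set, which is one source of the over-fitting the abstract warns about.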

Keywords

Imbalanced data sets; Under-fitting; Over-fitting techniques; Ensemble method; Weighted method

Article Details

How to Cite
Ghulam Fatima, & Saeed, S. (2021). A Novel Weighted Ensemble Method to Overcome the Impact of Under-fitting and Over-fitting on the Classification Accuracy of the Imbalanced Data Sets. Pakistan Journal of Statistics and Operation Research, 17(2), 483-496. https://doi.org/10.18187/pjsor.v17i2.3640

References

    1. Abdi, H. (1994). A neural network primer. Journal of Biological Systems, 2(03):247–281.
    2. Akbani, R., Kwek, S., and Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets.
    In European conference on machine learning, pages 39–50. Springer.
    3. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2011). KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17.
    4. Anyanwu, M. N. and Shiva, S. G. (2009). Comparative analysis of serial decision tree classification algorithms.
    International Journal of Computer Science and Security, 3(3):230–240.
    5. Barandela, R., Sánchez, J. S., García, V., and Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3):849–851.
    6. Bennett, K. P. and Blue, J. (1998). A support vector machine approach to decision trees. In 1998 IEEE
    International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational
    Intelligence (Cat. No. 98CH36227), volume 3, pages 2396–2401. IEEE.
    7. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
    8. Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Special issue on learning from imbalanced data sets.
    ACM SIGKDD explorations newsletter, 6(1):1–6.
    9. Dasarathy, B. V. and Sheela, B. V. (1979). A composite classifier system design: Concepts and methodology.
    Proceedings of the IEEE, 67(5):708–713.
    10. Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced
    data sets. Computational intelligence, 20(1):18–36.
    11. Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In ICML, volume 99, pages 124–133.
    12. Galar, M., Fernández, A., Barrenechea, E., and Herrera, F. (2013). EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition, 46(12):3460–3471.
    13. Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer.
    14. Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE transactions on pattern analysis
    and machine intelligence, 12(10):993–1001.
    15. He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE.
    16. He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data
    engineering, 21(9):1263–1284.
    17. Hsu, K.-W. (2017). A theoretical analysis of why hybrid ensembles work. Computational intelligence and
    neuroscience, 2017.
    18. Hu, S., Liang, Y., Ma, L., and He, Y. (2009). MSMOTE: Improving classification performance when training data is imbalanced. In 2009 Second International Workshop on Computer Science and Engineering, volume 2, pages 13–17. IEEE.
    19. Kaur, P. and Gosain, A. (2018). Comparing the behavior of oversampling and undersampling approach of
    class imbalance learning by combining class imbalance problem with noise. In ICT Based Innovations, pages
    23–30. Springer.
    20. Kong, J., Rios, T., Kowalczyk, W., Menzel, S., and Bäck, T. (2020). On the performance of oversampling techniques for class imbalance problems. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 84–96. Springer.
    21. Kubat, M., Holte, R. C., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite
    radar images. Machine learning, 30(2-3):195–215.
    22. Laurikkala, J. (2001). Improving identification of difficult small classes by balancing class distribution. In
    Conference on Artificial Intelligence in Medicine in Europe, pages 63–66. Springer.
    23. Leevy, J. L., Khoshgoftaar, T. M., Bauder, R. A., and Seliya, N. (2018). A survey on addressing high-class
    imbalance in big data. Journal of Big Data, 5(1):42.
    24. Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563.
    25. Lewis, D. D. and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine
    learning proceedings 1994, pages 148–156. Elsevier.
    26. Li, Y. and Zhang, X. (2011). Improving k nearest neighbor with exemplar generalization for imbalanced classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 321–332. Springer.
    27. Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE
    Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550.
    28. Liu, X.-Y. and Zhou, Z.-H. (2013). Ensemble methods for class imbalance learning. Imbalanced Learning:
    Foundations, Algorithms and Applications, pages 61–82.
    29. Mathew, J., Pang, C. K., Luo, M., and Leong, W. H. (2017). Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE transactions on neural networks and learning systems,
    29(9):4065–4076.
    30. Paing, M. P., Pintavirooj, C., Tungjitkusolmun, S., Choomchuay, S., and Hamamoto, K. (2018). Comparison of sampling methods for imbalanced data classification in random forest. In 2018 11th Biomedical Engineering International Conference (BMEiCON), pages 1–5. IEEE.
    31. Panchal, G., Ganatra, A., Shah, P., and Panchal, D. (2011). Determination of over-learning and over-fitting
    problem in back propagation neural network. International Journal on Soft Computing, 2(2):40–51.
    32. Pattanayak, S. S. and Rout, M. (2018). Experimental comparison of sampling techniques for imbalanced
    datasets using various classification models. In Progress in Advanced Computing and Intelligent Engineering,
    pages 13–22. Springer.
    33. Pedersen, R. and Schoeberl, M. (2006). An embedded support vector machine. In 2006 International Workshop on Intelligent Solutions in Embedded Systems, pages 1–11. IEEE.
    34. Piotrowski, A. P. and Napiorkowski, J. J. (2013). A comparison of methods to avoid overfitting in neural
    networks training in the case of catchment runoff modelling. Journal of Hydrology, 476:97–111.
    35. Polikar, R. (2006). Ensemble based systems in decision making. IEEE Circuits and systems magazine,
    6(3):21–45.
    36. Rokach, L. (2010). Ensemble-based classifiers. Artificial intelligence review, 33(1-2):1–39.
    37. Sáez, J., Luengo, J., Stefanowski, J., and Herrera, F. (2015). Addressing the noisy and borderline examples problem in classification with imbalanced datasets via a class noise filtering method-based re-sampling technique. Information Sciences, 291:184–203.
    38. Saxena, R. (2017). How decision tree algorithm works. URL: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/ (accessed: 2019-01-28).
    39. Schclar, A., Tsikinovsky, A., Rokach, L., Meisels, A., and Antwarg, L. (2009). Ensemble methods for
    improving the performance of neighborhood-based collaborative filtering. In Proceedings of the third ACM
    conference on Recommender systems, pages 261–264.
    40. Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and Napolitano, A. (2009). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40(1):185–197.
    41. Tan, P.-N., Steinbach, M., and Kumar, V. (2016). Introduction to data mining. Pearson Education India.
    42. Tavallaee, M., Stakhanova, N., and Ghorbani, A. A. (2010). Toward credible evaluation of anomaly-based
    intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
    Reviews), 40(5):516–524.
    43. Tay, B., Hyun, J. K., and Oh, S. (2014). A machine learning approach for specification of spinal cord injuries
    using fractional anisotropy values obtained from diffusion tensor images. Computational and mathematical
    methods in medicine, 2014.
    44. Yang, Z., Tang, W., Shintemirov, A., and Wu, Q. (2009). Association rule mining-based dissolved gas
    analysis for fault diagnosis of power transformers. IEEE Transactions on Systems, Man, and Cybernetics,
    Part C (Applications and Reviews), 39(6):597–610.
    45. Ying, X. (2019). An overview of overfitting and its solutions. In Journal of Physics: Conference Series,
    volume 1168, page 022022. IOP Publishing.
    46. Zhang, J. and Chen, L. (2019). Clustering-based undersampling with random over sampling examples and
    support vector machine for imbalanced classification of breast cancer diagnosis. Computer Assisted Surgery,
    24(sup2):62–72.
    47. Zhang, Y. and Wang, D. (2013). A cost-sensitive ensemble method for class-imbalanced datasets. In Abstract and applied analysis, volume 2013. Hindawi.
    48. Zhou, Z.-H. (2012). Ensemble methods: foundations and algorithms. CRC press.