A Novel Weighted Ensemble Method to Overcome the Impact of Under-fitting and Over-fitting on the Classification Accuracy of Imbalanced Data Sets

In the data mining community, imbalanced class distribution data sets have received growing attention. The evolving field of data mining and knowledge discovery seeks to establish precise and efficient computational tools for analyzing such data sets and extracting new knowledge from data. Sampling methods re-balance imbalanced data sets and consequently improve the performance of classifiers. In the classification of imbalanced data sets, over-fitting and under-fitting are two prominent problems. In this study, a novel weighted ensemble method is proposed to diminish the influence of over-fitting and under-fitting while classifying these kinds of data sets. Forty imbalanced data sets with varying imbalance ratios are employed in a comparative study. The performance of the proposed method is compared with four conventional classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machine (SVM), and neural network (NN). The evaluation is carried out with two over-sampling procedures, the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme proved effective in diminishing the impact of over-fitting and under-fitting on the classification of these data sets.


Background
In the data mining community, imbalanced class distribution data sets have received mounting attention (He and Garcia, 2009). These data sets play a prominent role in every field of research (Lewis and Catlett, 1994; Chawla et al., 2002; Yang et al., 2009; Tavallaee et al., 2010). An imbalanced data set has unequal numbers of majority and minority class examples (Chawla et al., 2004). While learning from these data sets, investigators face two significant issues, over-fitting and under-fitting, which are the foremost reasons for the poor performance of machine learning (ML) algorithms. Over-fitting denotes a modeling error that occurs when a function agrees too closely with a particular data set, while under-fitting refers to a model that can neither capture the training data nor generalize to new data (Ying, 2019). Classifier learning from such imbalanced data sets is a challenging task (Leevy et al., 2018). Classifiers show poor performance because they cannot be adequately trained for the minority classes (Mathew et al., 2017; Zhang and Chen, 2019; Paing et al., 2018). Numerous sampling procedures have been proposed that pre-process the data and balance its class distribution, hence yielding improved classification results (Pattanayak and Rout, 2018; Kong et al., 2020). Over-sampling methods are preferred over under-sampling methods because they overcome the limitation of data loss and deliver better classification performance (Barandela et al., 2003; Kaur and Gosain, 2018). SMOTE and ADASYN are two noteworthy over-sampling methods. SMOTE (Chawla et al., 2002) balances the data distribution by synthetically generating plausible minority class instances. Its modification MSMOTE (Modified SMOTE), proposed by Hu et al. (2009), additionally takes into account the latent noise and the distribution of the minority data and works on eliminating noisy examples.
ADASYN generates more synthetic examples for difficult-to-learn minority class examples and consequently provides a more representative form of the data (He and Garcia, 2009). It reduces the class imbalance bias and concentrates on hard-to-learn minority class examples by adaptively shifting the classification decision boundary toward them. A combination of under-sampling and over-sampling techniques that can better handle the imbalanced nature of data was also suggested (Barandela et al., 2003). In addition to sampling methods, ensemble methods have gained much popularity for handling imbalanced data sets (Schclar et al., 2009; Zhou, 2012; Hsu, 2017). An ensemble method combines numerous classifiers to achieve robust classification and generalized results. Several investigations have shown that balanced data sets enhance the performance of base classifiers compared with imbalanced data sets (Laurikkala, 2001; Estabrooks et al., 2004). Sampling is one of the most broadly implemented techniques for handling imbalanced data sets. Kubat et al. (1998) introduced a one-sided selection process that discards borderline or noisy majority class examples. Another study suggested the Edited Nearest Neighbor Rule (ENN), which erases the examples whose class label differs from the class of at least two of its three nearest neighbors (Li and

The rest of the paper is organized as follows: the material and methods related to this research work are discussed in Section 2, the proposed weighted ensemble method is presented in Section 3, and the results of a simulation study are discussed in Section 4. Thereafter, results and conclusions are summarized in Section 5. References are provided at the end of the paper.

Material and Methods
Four well-established classifiers are involved in this analysis. A brief introduction to each of them is provided in the following subsections.

Decision Tree
DT learning is the most widely used predictive modeling technique for classification (Anyanwu and Shiva, 2009). DTs partition the input space into cells, where each cell belongs to one class. The partitioning is represented as a sequence of tests. Every interior node in the DT corresponds to a test on the value of some input variable, and the branches of a node are labeled with the possible test outcomes. The leaf nodes represent the cells and define the class to be returned if that leaf node is reached. A particular instance is classified by starting at the root node and, based on the outcomes of the tests, following the appropriate branches until a leaf node is reached (Freund and Mason, 1999). The complete steps taken for the implementation of a DT are shown in pseudo code 1.

K-Nearest Neighbor
Given an example x to be classified, the k nearest points to x (based on Euclidean distance) are found, and x is assigned to the class held by the majority of these neighbors. Any ties in voting are broken at random. KNN is known as a lazy learner because it postpones the modeling procedure until an example needs to be classified. It makes no assumption about the distribution of the data it works on, and is therefore also referred to as a non-parametric procedure. The steps adopted for KNN are shown in pseudo code 2.

Support Vector Machine
SVM is a supervised learning algorithm that linearly classifies two-class imbalanced data problems. For multi-class imbalanced data problems, the data is first converted into two classes and then classified by SVM. The algorithm builds a hyperplane between two sets of data. Since the data adjoining the borderline is tricky to separate, it is essential to pick the right hyperplane for increased accuracy. The best hyperplane is the one with the maximum distance to the nearest data point (Bennett and Blue, 1998).
Given training data (x_i, y_i) for i = 1, ..., N, with x_i ∈ R^d and y_i ∈ {−1, 1}, the task is to learn a classifier f(x) such that y_i f(x_i) > 0 for a correct classification. The simplest model of SVM begins with a maximal margin classifier. Its use is restricted to linearly separable instances in feature space, and it presumes that there is no training error. The pseudo code of SVM is provided in 3.
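As a concrete illustration of one of these base learners, the KNN voting rule described above can be sketched in a few lines. This is a minimal Python/NumPy sketch, not the paper's Matlab implementation, and for simplicity it breaks voting ties deterministically (toward the lower class label) rather than at random as the text describes:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote of its k nearest training points
    (Euclidean distance). Ties are broken by argmax over the vote
    counts, i.e. toward the lower class label (a simplification)."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(d)[:k]               # indices of the k closest points
    votes = np.bincount(y_train[nearest])     # count the labels among them
    return int(np.argmax(votes))              # majority class

# Two tight clusters: class 0 near the origin, class 1 near (5, 5).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # → 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # → 1
```

Because all the work happens at query time, this also makes the "lazy learner" character of KNN visible: there is no training step at all.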

Neural Network

[Figure 1: Flow of Neural Network — input layer, hidden layer, output layer]

A neuron computes a weighted sum of its inputs; if this sum exceeds the threshold, it generates "1" as an output, or else "0". In general, a network has three kinds of layers: an input layer, hidden layers, and an output layer. The input layer merely passes the recorded values on to the following layer. There may be numerous hidden layers in a single neural network. In the output layer, each node represents a specific class. This sequence of layers results in the assignment of values to each output node, and the record is classified to the class node with the highest value (Hansen and Salamon, 1990; Abdi, 1994). The activation function used for the NN is the sigmoid (see Eq 2). The complete working of the NN is shown in the flow chart in Figure 1.

Synthetic Minority Over Sampling Technique
SMOTE is an over-sampling technique for imbalanced data sets proposed by Chawla et al. (2002). In contrast to other techniques, SMOTE over-samples the minority class by creating synthetic examples. Depending on the amount of over-sampling required, neighbors among the k nearest points are chosen. For example, if the amount of required over-sampling is 200 percent, then only two of the five nearest neighbors are chosen, and one sample is generated in the direction of each. The steps taken to produce a synthetic sample using SMOTE are provided in pseudo code 4.

Algorithm 4 Procedure of SMOTE
Begin
1. Compute the difference between the nearest neighbor and the feature vector (sample) under consideration.
2. Multiply the difference by a random number generated between 0 and 1.
3. Add the result to the feature vector under consideration. This yields a random point along the line segment between the two feature vectors.
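The three steps above can be sketched directly in code. This is a minimal Python/NumPy illustration of generating one synthetic point (the function name and toy vectors are ours, not from the paper):

```python
import numpy as np

def smote_sample(x, neighbor, rng):
    """Generate one synthetic minority example on the line segment
    between a minority point x and one of its nearest neighbors
    (steps 1-3 of the SMOTE pseudo code)."""
    gap = rng.random()                  # step 2: random number in [0, 1)
    diff = neighbor - x                 # step 1: difference vector
    return x + gap * diff               # step 3: add the scaled difference

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])
nb = np.array([3.0, 4.0])
s = smote_sample(x, nb, rng)
# s lies between x and nb on every axis, i.e. on the connecting segment.
```

Repeating this for the required number of neighbors produces the desired over-sampling percentage.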

Adaptive Synthetic Sampling Approach
ADASYN produces samples of the minority class based on their density distribution. It determines the k nearest neighbors of each minority instance and then uses the minority-majority class ratio among them to decide how many new samples to generate. By repeating this process, it adaptively shifts the decision boundary to concentrate on samples that are hard to learn.
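The key adaptive step, deciding how many synthetic samples each minority point receives, can be sketched as follows. This is a simplified Python/NumPy illustration of the idea (not the full ADASYN algorithm, and not the paper's implementation): minority points surrounded by more majority neighbors are treated as harder and receive proportionally more synthetic samples.

```python
import numpy as np

def adasyn_weights(X, y, k=5, n_new=100):
    """Distribute n_new synthetic samples over the minority points (label 1)
    in proportion to the share of majority neighbors around each point,
    so harder-to-learn points receive more samples (the ADASYN idea)."""
    minority = np.flatnonzero(y == 1)
    r = np.empty(len(minority))
    for i, idx in enumerate(minority):
        d = np.linalg.norm(X - X[idx], axis=1)
        neighbors = np.argsort(d)[1:k + 1]     # k nearest, skipping the point itself
        r[i] = np.mean(y[neighbors] == 0)      # fraction of majority neighbors
    if r.sum() == 0:                           # no borderline minority points
        return np.zeros(len(minority), dtype=int)
    return np.round(n_new * r / r.sum()).astype(int)

# Toy set: a majority cluster near the origin, one minority point inside it
# (hard to learn) and a pure minority cluster far away (easy to learn).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [0.6, 0.6], [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
w = adasyn_weights(X, y, k=3, n_new=10)
# All 10 synthetic samples are assigned to the hard point at (0.6, 0.6).
```

The actual sample generation for each selected point then proceeds exactly as in the SMOTE step shown earlier.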

Proposed Weighted Ensemble Method
The proposed method, the weighted ensemble method (WEM), is designed to enhance the accuracy of classifiers. The novelty lies in examining various classifier combinations and techniques to carve out a design that compensates for the issues of under-fitting and over-fitting and provides the most accurate predictive output. Predictive accuracy can be achieved by developing a simple but effective method, based on the theory of diversity, to direct the process of generating an optimal combination of classifiers that does not suffer from under-fitting, over-fitting, and over-generalization. Consequently, using a combination of various classifiers increases prediction accuracy by excluding the limitations of the individual classifiers while incorporating their respective strengths.

Moreover, the combination method used by an ensemble has a huge impact on the overall outcome, in addition to the generalization capability of the design itself. Combination approaches are used to combine classifiers of the same type, thereby producing better results and promoting different applications in both ML and pattern recognition. This motivates us to explore various aspects of combination methods, such as output types, level of operation, combination and algebraic rules, as well as different methods for classifier combination. Sophisticated combination methods are currently utilized in science and technology; the more complicated the problem, the more advanced the model should be to deliver the best prediction and decision-making outcomes. A further motivation behind this study is to become acquainted with some of the most contemporary combination methods. To offer a superior algorithm, two essential things should be taken care of: collective accuracy and a decline in prediction error. Ensemble methods are without doubt essential for the classification task.
However, every ensemble design and every combination approach has its flaws and constraints when it comes to real applications. The underlying assumption is that individual classifiers typically do not strike a balance between fitting and generalization when they infer a model from a data set. Thus, the models they infer can suffer from under-fitting, over-fitting, and over-generalization, which triggers their poor performance. For that reason, a novel weighted ensemble method is presented to counter the under-fitting and over-fitting of individual classifiers and to obtain more comprehensive classification results. Over-fitting occurs when a model can accurately classify data points that are very closely related to the training data but does not work well on data that are not closely connected to the training data. On the other hand, under-fitting happens when the model is incapable of capturing the underlying data pattern or, intuitively, does not fit the data well enough. Both over-fitting and under-fitting result in bad estimates on new data sets. Over-generalization happens when a model erroneously claims to be able to accurately classify large amounts of data that are not closely related to the training data. In general, the proposed ensemble method intends to minimize the false-negative and false-positive rates and thus maximize accuracy in a weighted fashion. The intended framework consists of the following phases: data processing, over-sampling, clustering of data, model building, application of weights, and model evaluation. All processing is done using Matlab software.
Exploratory data analysis is carried out before any structured modeling commences. The first task in every predictive modeling exercise is to understand the purpose of the modeling process and the data involved. The problem addressed by the proposed ensemble method is thus identified, along with the anticipated results. In this stage, an assessment of the currently available data is made, and the features of the data and all the tools necessary to build the ensembles are explored. This phase has a major impact on the quality of the ensembles (following the popular "garbage in, garbage out" rule), and the preparation of the data has a substantial impact on the predictive capability of the model. In this work, we recorded the number of examples, the imbalance ratio (IR), the total number of attributes, etc. of each data set.
Imbalanced data sets have an inequitable number of positive and negative class examples. Without over-sampling, when such data sets are passed to the classifiers, they show poor performance, i.e. the classifiers cannot be trained for the minority class. It is therefore necessary to re-balance imbalanced data sets before classification. In this paper, SMOTE and ADASYN are applied to over-sample the minority class. SMOTE balances the data distribution by synthetically generating minority class examples. This methodology is efficient because it creates synthetic minority class instances that are plausible: the instances are relatively close in feature space to existing minority instances. Neighbors are chosen arbitrarily from the k nearest neighbors; the default implementation uses the five closest. This approach effectively forces the decision region of the minority class to become more general.
Data partitioning has a decisive effect on the quality of the final models. For example, if the training data set is small, the learning algorithms used to create the models will not be able to capture the underlying patterns. On the other hand, a small test data set will have limited ability to evaluate the performance of the models. In this study, for each data set, clusters of positive and negative instances are formed, and the data from both clusters is then partitioned in a 40:60 ratio for training and testing purposes. Thus, both the positive and the negative class have equal representation in the training and testing data sets. The generated data partitions are used as inputs to the next stage, model building.
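The per-class 40:60 partitioning described above can be sketched as follows. This is a minimal Python/NumPy sketch (the paper's processing was done in Matlab); the function name and toy data are ours:

```python
import numpy as np

def stratified_split(X, y, train_frac=0.4, seed=0):
    """Partition each class cluster separately in a train_frac : (1 - train_frac)
    ratio, so both classes keep equal representation in train and test sets."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))   # shuffle this cluster
        cut = int(round(train_frac * len(idx)))           # 40% boundary
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# Toy data: 10 examples of each class.
y = np.array([0] * 10 + [1] * 10)
X = np.arange(20).reshape(-1, 1)
tr, te = stratified_split(X, y)
# 4 examples of each class land in training, 6 of each in testing.
```

The returned index arrays can then be used to slice both the feature matrix and the labels for the model building stage.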
The model building process is an extremely essential step of the proposed WEM. A combination of multiple classifiers will likely outperform a single classifier. However, in some cases the fusion of different models only adds to the complexity of the system without any change in performance. Thus, for an ensemble to be efficient, the errors made by its base models should not be highly correlated, and the output of each base model should be adequate on its own. Another main design task is the selection of the learning algorithms leading to the best performance of the individual models/classifiers. Here, the combination of four classifiers (DT, KNN, SVM, and NN) forms the ensemble for classification. Model evaluation is key in any predictive modeling exercise; in ensemble predictive modeling, where the relative performance and diversity of the classifiers must be thoroughly evaluated, it becomes even more important. Usually, some accuracy test is used to determine classifier efficacy. Nevertheless, accuracy can be measured in different ways, each with minor discrepancies.
The evaluation measures are computed for each classifier after the data sets are modeled. The results generated by each classifier are given random weights drawn from the uniform distribution U(0,1). In the end, the weighted average of all evaluation measures is obtained, and optimal accuracy is thus attained by diminishing errors and retrieving maximum G, F, and Acc values. The performance of the intended method is compared with the individual classifiers' results. It is seen that the proposed WEM works considerably better than the individual classifiers and effectively tackles the issues of over-fitting, under-fitting, and over-generalization. The pseudo code of the proposed method is given.
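The weighting step can be sketched as follows. This is a minimal Python/NumPy sketch under our reading of the description above: each classifier contributes one row of (G, F, Acc) scores, the U(0,1) weights are normalized to sum to one, and the combined result is the weighted average per measure. The score values are illustrative, not taken from the paper's tables:

```python
import numpy as np

def weighted_ensemble(measures, seed=0):
    """Combine per-classifier evaluation measures with random U(0,1)
    weights, normalized to sum to one (a sketch of the WEM
    combination step)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(measures, dtype=float)   # rows: classifiers, cols: G, F, Acc
    w = rng.uniform(0.0, 1.0, size=len(m))  # one random weight per classifier
    w = w / w.sum()                         # normalize so weights sum to 1
    return w @ m                            # weighted average of each measure

# Illustrative (G, F, Acc) rows for DT, KNN, SVM, NN.
scores = [[0.90, 0.88, 0.92],
          [0.70, 0.65, 0.75],
          [0.85, 0.80, 0.83],
          [0.88, 0.86, 0.90]]
combined = weighted_ensemble(scores)
# Each combined measure lies between the worst and best individual score.
```

Because the normalized weights form a convex combination, each combined measure is guaranteed to fall between the minimum and maximum of the individual classifiers' values.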

Simulation Study
The performance of the intended technique is assessed through a comparative evaluation against the traditional classifiers DT, KNN, SVM, and NN. The comparisons are made using two over-sampling methods, ADASYN and SMOTE. Forty imbalanced data sets with varying imbalance ratio (IR) are taken from the well-known data set repository KEEL (Alcalá-Fdez et al., 2011). The comprehensive details of these data sets are provided in Table 1. All the data sets are partitioned using a 40:60 ratio for training and test sets, respectively. For the NN, ten hidden layers are applied.

Performance Evaluation Measures
Three commonly used performance evaluation measures, G-mean (G), F-measure (F), and accuracy (Acc), are computed to assess the performance of all the classifiers (see Eq 3). G is the square root of the product of the sensitivity and the specificity of a classifier, and the measures are computed with the help of the following formulas:

G = √(Sensitivity × Specificity)

F = (2 × Precision × Sensitivity) / (Precision + Sensitivity)

Acc = (total number of correct predictions) / (total number of predictions)

Acc demonstrates the exact identification of minority and majority class instances. All the data sets are over-sampled using ADASYN and SMOTE. Subsequently, the over-sampled data sets are classified by the above-mentioned four classifiers (DT, KNN, SVM, and NN). The G, F, and Acc values for these data sets are recorded and shown in Table 2. The over-fitting and under-fitting behaviors of the conventional classifiers DT, SVM, KNN, and NN can be observed in Tables 2 and 3. For most of the data sets with ADASYN, over-fitting is exhibited by DT. Among all the classifiers, KNN showed under-fitted classification results for most of the data sets. Table 2 shows that SVM and NN demonstrated a mixed behavior of over-fitting and under-fitting. However, the proposed method (WEM) performed exceptionally, generating well-balanced classification results without under-fitting or over-fitting for 35 of the 40 imbalanced data sets with varying IRs. Table 3 reports the classification results of all the classifiers, including their G, F, and Acc, with SMOTE. After carefully examining the results of Table 3, it can be said that the results generated with SMOTE differ substantially from those with ADASYN (see Table 2). This time, over-fitted results were observed for the majority of the conventional classifiers (DT and NN). SVM and KNN produced more diverse results, although they too mostly tended toward the over-fitting issue, particularly in comparison with NN.
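The three measures can be computed directly from the entries of a confusion matrix, as the following small Python sketch of the formulas above shows (the counts are illustrative):

```python
import math

def g_f_acc(tp, tn, fp, fn):
    """Compute G-mean, F-measure and accuracy from confusion-matrix counts
    (tp/tn/fp/fn: true/false positives and negatives)."""
    sensitivity = tp / (tp + fn)            # recall on the positive class
    specificity = tn / (tn + fp)            # recall on the negative class
    precision = tp / (tp + fp)
    g = math.sqrt(sensitivity * specificity)
    f = 2 * precision * sensitivity / (precision + sensitivity)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return g, f, acc

# Example: 40 positives all found, 10 of 60 negatives misclassified.
g, f, acc = g_f_acc(tp=40, tn=50, fp=10, fn=0)
```

Unlike plain accuracy, G penalizes a classifier that does well on only one of the two classes, which is why it is preferred for imbalanced data.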
Once again, the proposed method (WEM) performed well on 29 of the 40 imbalanced data sets, overcoming the impact of the under-fitting and over-fitting issues in the classification results. These WEM results are highlighted in Table 3.

Performance of the proposed method with Noisy and Borderline Imbalanced data sets
The performance of the proposed method (WEM) is also explored on noisy borderline data sets. Borderline examples are those located in the area surrounding the class boundaries, where the minority and majority classes overlap (Saez et al., 2015). Six noisy borderline data sets are taken from KEEL (reference), and their complete details, including their names, the abbreviations used, and their IRs, are provided in Table 4. The problem of under-fitting and over-fitting is observed for the above-mentioned traditional classifiers after applying the two over-sampling techniques ADASYN and SMOTE. The results of all the classifiers, including the proposed method WEM, are shown in Table 5. It is obvious from the table that, except for SVM, the remaining three traditional classifiers (DT, KNN, and NN) produced over-fitted results on these data sets. SVM showed poor performance in the form of under-fitted results. However, WEM showed improved results by overcoming the over-fitting and under-fitting problems of these classifiers. Almost the same classifier behavior can also be seen with SMOTE (see Table 6). The classification results obtained by the conventional classifiers are unsatisfactory: DT, KNN, and NN constantly scored 1 for the G and F measures and thus reported maximum accuracy. These classifiers delivered over-fitted results, which are also not well-generalized.

Concluding Remarks
Imbalanced data sets have continuously attracted the attention of investigators, as these data sets frequently occur in real life, predominantly in the medical sciences. Handling these types of data sets with conventional ML algorithms remains challenging: a few problems still accompany these algorithms and cannot be ignored when carrying out classification tasks on such data sets. Among these, over-fitting and under-fitting are two substantial issues that investigators have to face in the classification of imbalanced data sets. Thus, the motivation of this work is to handle this problem by implementing some sophisticated methods. A novel WEM is therefore proposed in this study to diminish the effect of these two substantial problems and to obtain improved accuracy on these types of data sets. The proposed method ensembles the measures obtained by the conventional classifiers with the help of weights generated from the uniform distribution. The performance of the novel method is assessed on imbalanced data sets and noisy borderline data sets in a comparative study against the performance measures obtained from the existing classifiers. It is noticed that DT, KNN, and NN are highly over-fitted, whereas SVM showed poor, i.e. under-fitted, results for the forty imbalanced data sets with both ADASYN and SMOTE. Overall, our proposed method effectively diminished the effect of this over-fitting and under-fitting on imbalanced data sets and also showed enhanced accuracy for various types of imbalanced data sets.