Black hole algorithm as a heuristic approach for rare event classification problem

Logistic regression is generally preferred when the two possible outcomes of an event occur with similar frequencies. However, for events that occur rarely, such as wars, economic crises and natural disasters, whose occurrence frequency is small relative to ordinary events, logistic regression gives biased parameter estimates and therefore underestimates the occurrence probability of the rare event. In this study, a modification of the black hole algorithm (BHA) is proposed as an alternative to classical logistic regression in order to obtain more reliable and unbiased rare event parameter estimates. To examine the performance of the proposed approach, we calculate the bias and root mean square error based on Monte Carlo (MC) simulations. We used the logistic regression model to generate the rare event data in the simulations and varied the $\beta_0$ parameter to obtain different rarity levels. The performance of the methods was examined in comprehensive MC simulations under different scenarios for the rarity level and the number of subjects. In addition, real-life data were used to examine the classification performance of the proposed approach, and the precision, sensitivity and specificity values of the two methods were compared. We found that the proposed BHA gives less biased predictions than logistic regression on both the simulated and the real-life data and has higher classification performance. Additionally, the rareness level has a significant impact on the parameter estimates of the methods.


Background
Logistic regression determines the effect of the independent variables on the explained (dependent) variable by constructing a model between them. Unlike standard regression, in which the dependent variable takes continuous values, here the dependent variable is categorical, with two levels (0 and 1) or more than two levels (Hosmer and Lemeshow, 2000). In large samples, the maximum likelihood estimators of the model parameters are asymptotically unbiased and efficient, and logistic regression is therefore used as an efficient model for classification problems in which the groups are homogeneously distributed (balanced). However, for some situations which emerge depending on the internal structures [...]
In the literature, there are meta-heuristic approaches as well as various classical approaches for solving rare event classification problems. Li et al. (2016) proposed a new optimization model using different swarm strategies (the bat-inspired algorithm and particle swarm optimization, PSO) to increase classification performance on imbalanced data sets. They applied their approach to five different imbalanced data sets consisting of lung surgery and bioanalysis scanning data and found that it gave better results than other class balancing methods. Vergé et al. (2016) used an island particle algorithm called Monte Carlo square to accurately predict rare event probabilities and discussed their results on different aviation test scenarios. Li et al. (2017) created two different methods for the imbalanced data classification problem, combining the synthetic minority over-sampling technique (SMOTE) with a meta-heuristic method. Since the Swarm Balancing Algorithms they first used were not effective on relatively small and imbalanced data sets, they proposed the Adaptive Swarm Balancing Algorithms, created by updating the parameter estimates, and achieved more effective results. Ling and Zhenzhou (2021) proposed a novel two-stage meta-model importance sampling method based on the support vector machine for multiple failure regions and rare events, and tested the applicability of this two-stage method on different examples.
In this article, we propose a new meta-heuristic algorithm, a modification of the black hole algorithm (BHA), as an alternative for the classification and prediction problems encountered in rare event data sets, which are imbalanced. Unlike other studies, we test our proposed approach with a simulation study consisting of different scenarios. In the simulation study, we compare the parameter estimates, bias values, root mean square error (RMSE), classification tables and accuracy rates of the proposed BHA and the logistic regression model at different sample sizes and rareness degrees. We find that, in small samples and when the imbalance between the groups is large, the proposed BHA gives less biased and more reliable results than logistic regression. Moreover, we compare the classification performance of the methods on the Militarized Interstate Disputes (MID) data, a rare events data set from the literature. We observe that the proposed BHA gives better results than logistic regression in terms of specificity.

Rare Events
Rare events are defined as events that occur infrequently but have wide-ranging effects and may cause great damage to society (King and Zeng, 2001a). Because rare events are encountered infrequently, they are statistically unexpected and non-continuous events. Therefore, calculating the probability that such events will occur in the future, and hence predicting them directly, is a difficult problem. Credit card fraud, improper pronunciation of words, hurricanes, telecommunication equipment failures, militarized interstate disputes (King and Zeng, 2001b), landslides, train derailments, earthquakes, forest fires and the diagnosis of rare diseases in medicine are examples of rare events (Paal, 2014). In rare event applications, one of the groups typically has a very large number of observations while the other is represented by only a few. Rareness takes two forms. The first is relative rareness, or imbalanced data: a data set in which one of the groups is much smaller than the other. Classification algorithms run into problems when the data are skewed toward one of the groups. The second is absolute rareness, which is essentially the small sample problem: because the sample size is very small, maximum likelihood estimation gives poorer estimates. Accordingly, Allison (2012) claims that only absolute rareness is problematic in the framework of logistic regression, while Agresti (2002) states that relative rareness must be considered in comparison with the number of estimators (Paal, 2014).

Binary logistic regression
Regression analysis is commonly used to estimate the relation between two or more variables that have a cause-effect relation, to summarize the data, to estimate coefficients, and to identify and investigate the important variables affecting the dependent variable (Alpar, 2011). In linear regression, one of the most important forms of regression analysis, the dependent variable takes continuous values while the independent variables can take continuous or discrete values. If $k$ is the number of independent variables and $N$ is the number of observations, the general form of the linear regression model for the $i$-th observation is given by
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, \dots, N.$$
However, if the dependent variable takes two categorical values such as 0 and 1, binary logistic regression is used instead of linear regression. In binary logistic regression, the dependent variable is coded $y = 1$ for the event of interest and $y = 0$ otherwise. Accordingly, the general form of the logistic regression is given as
$$\ln\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}.$$
The logistic regression model for the probability that the event of interest occurs is given by
$$p_i = P(Y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}}.$$
Similarly, the probability that the event does not occur is
$$1 - p_i = P(Y_i = 0) = \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}}}.$$
Because the dependent variable takes categorical values in the logistic model, the constant-variance assumption is violated and the least squares method used for parameter estimation in linear regression does not give unbiased and consistent estimates. For this reason, the maximum likelihood method is generally used to estimate the regression coefficients of the logistic model (Pampel, 2000). This method estimates the parameters that maximize the probability of obtaining the observed data set. To use the maximum likelihood method, it is necessary to construct the likelihood function, which expresses the probability of the observed data as a function of the unknown parameters (Santner and Duffy, 1986). Other methods used in parameter estimation are the weighted least squares method and the minimum logit chi-squared method (Şahin, 1999). When the parameters are estimated by maximum likelihood, iterative methods such as the Newton-Raphson method are used to obtain the estimates, because the likelihood equations are not linear in the parameters.
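For reference, the log-likelihood and the Newton-Raphson update mentioned above can be written out as follows. This is the standard textbook form rather than a formula taken from the original text; the covariate vector $x_i$ (with a leading 1 for the intercept), the design matrix $X$ and the weight matrix $W$ are notation assumed here.

```latex
\begin{align*}
\ell(\beta) &= \sum_{i=1}^{N}\Bigl[\, y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \Bigr],
  \qquad p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}},\\
\frac{\partial \ell}{\partial \beta} &= X^{\top}(y - p),
  \qquad
  \frac{\partial^2 \ell}{\partial \beta\,\partial \beta^{\top}} = -X^{\top} W X,
  \quad W = \operatorname{diag}\{p_i(1 - p_i)\},\\
\beta^{(t+1)} &= \beta^{(t)} + \bigl(X^{\top} W X\bigr)^{-1} X^{\top}(y - p).
\end{align*}
```

The update is repeated, with $p$ and $W$ re-evaluated at the current $\beta^{(t)}$, until the change in $\beta$ is negligible.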

Cases that logistic regression estimations give bad results
Logistic regression commonly runs into difficulty when quasi or complete separation occurs (Elizabeth and Thomas, 2002). For instance, in a binary logistic regression model with a single independent variable, suppose that 10 ≤ X ≤ 20 when Y = 1 and 30 ≤ X ≤ 50 when Y = 0. This case is called complete separation because the values of the independent variable are fully separated by the value of the dependent variable. If the independent variable is categorical, only the diagonal (or only the off-diagonal) cells of the 2 × 2 table contain data (X = 0 corresponds to Y = 0 and X = 1 to Y = 1, or vice versa) (Boyle, 1996). If instead 10 ≤ X ≤ 30 when Y = 1, so that the two ranges meet at X = 30, the case is called quasi separation. When the independent variable is categorical, this corresponds to exactly one empty cell in the 2 × 2 table (Zorn, 2005). When complete or quasi separation occurs, the maximum likelihood method cannot estimate the parameters (Elizabeth and Thomas, 2002).
When the dependent variable is heterogeneous (imbalanced), the category of interest occurs less frequently than the other category. When the parameters of the logistic regression model are estimated by maximum likelihood for such data sets, the estimates may not be unbiased, consistent, efficient, sufficient and minimum-variance (Croux and Haesbroeck, 2003).
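The complete separation example above can be illustrated with a minimal sketch. The helper names `is_completely_separated` and `log_likelihood`, the sample sizes and the boundary at X = 25 are assumptions made for this illustration only. The loop shows why maximum likelihood fails here: with the decision boundary held fixed, the log-likelihood keeps increasing toward zero as the slope grows in magnitude, so no finite maximizer exists.

```python
import numpy as np

def is_completely_separated(x, y):
    """Return True if one threshold on x splits y = 1 from y = 0 perfectly."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    x1, x0 = x[y == 1], x[y == 0]
    return bool(x1.max() < x0.min() or x0.max() < x1.min())

def log_likelihood(beta0, beta1, x, y):
    """Logistic log-likelihood, written with logaddexp for numerical stability."""
    eta = beta0 + beta1 * x
    # log p = -log(1 + e^{-eta}),  log(1 - p) = -log(1 + e^{eta})
    return float(np.sum(-y * np.logaddexp(0.0, -eta)
                        - (1 - y) * np.logaddexp(0.0, eta)))

# Data mimicking the example: Y = 1 for 10 <= X <= 20, Y = 0 for 30 <= X <= 50
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(10, 20, 20), rng.uniform(30, 50, 20)])
y = np.concatenate([np.ones(20), np.zeros(20)])

print("complete separation:", is_completely_separated(x, y))
for s in (0.1, 1.0, 10.0):
    # Boundary fixed at x = 25 (hypothetical choice); only the steepness s changes
    print(s, log_likelihood(25.0 * s, -s, x, y))
```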

Black Hole Event
John Michell and Pierre Laplace were the first scientists to propose the black hole concept, in the 18th century. Although they formulated the theory of unseen stars using Newton's law of gravitation, the phenomenon was not yet called a black hole. The American physicist John Wheeler (1967) was the first to call the mass collapse phenomenon a black hole. A black hole is formed by the collapse of a giant star in space. Because the mass of the collapsed star is concentrated in a very small region, the gravitational field of a black hole is so strong that even light cannot escape from it. Everything that crosses the effective boundary of a black hole is swallowed by it, and nothing can escape its extreme gravitational force. The spherical boundary of a black hole is called the event horizon, and the radius of the event horizon is called the Schwarzschild radius.
The escape velocity inside this radius exceeds the speed of light, the maximum speed in the universe, and therefore even light cannot escape. Consequently, because nothing can exceed the speed of light, nothing can leave the event horizon once it has crossed it. The Schwarzschild radius is given by
$$R = \frac{2GM}{c^2},$$
where $G$ is the gravitational constant, with value $6.67 \times 10^{-11}\ \mathrm{N\,(m/kg)^2}$, $M$ is the mass of the black hole and $c$ is the speed of light.
If an object approaches the event horizon, it is drawn inside, can no longer escape, and vanishes permanently from the point of view of an outside observer.
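As a quick worked example of the Schwarzschild radius formula $R = 2GM/c^2$, consider an object with the mass of the Sun. The values $M_\odot \approx 1.989 \times 10^{30}$ kg and $c \approx 2.998 \times 10^{8}$ m/s are standard approximate constants assumed here; they do not appear in the original text.

```latex
\[
R = \frac{2GM_\odot}{c^{2}}
  \approx \frac{2\,(6.67 \times 10^{-11})\,(1.989 \times 10^{30})}{(2.998 \times 10^{8})^{2}}\ \mathrm{m}
  \approx \frac{2.653 \times 10^{20}}{8.988 \times 10^{16}}\ \mathrm{m}
  \approx 2.95 \times 10^{3}\ \mathrm{m}.
\]
```

In other words, the Sun would have to be compressed to a radius of roughly 3 km for its surface to coincide with its event horizon.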

Black Hole Algorithm
The black hole concept was first introduced by Zhang et al. (2008) to bring a new mechanism to PSO. Their method was developed as an extension of PSO in which a newly generated particle, called the black hole, attracts all other particles under certain conditions. These conditions are used to accelerate the convergence of PSO and to avoid local optima. The event horizon of the black hole and the termination of candidates (stars) are not used in this method; the better positions found by the candidates are simply assigned as the best positions. The BHA later proposed by Hatamlou (2013) introduced these concepts: the best candidate is assigned as the black hole, and the other candidates become stars that move toward it. Moreover, candidate stars that fall inside the event horizon vanish, and new candidate stars are generated in the search space to replace them. In this way the BHA improves the whole population. Consequently, this BHA imitates the astrophysical black hole phenomenon more closely than the first version and differs completely from the earlier PSO-based algorithm (Hatamlou, 2013).
BHA is an optimization algorithm that shares some properties with other population-based algorithms, such as the genetic algorithm (GA) and PSO. In population-based algorithms, a population of candidate solutions is randomly generated and distributed in the search space of a given problem. The population is then improved toward the optimum solution by certain mechanisms. For example, in GA the improvement is carried out by mutation and crossover operations on genes, while in PSO it is achieved by using the best positions found in the search space and moving the candidate solutions toward the best position.
As the core part of the study, we propose a modified BHA, a novel version of the standard BHA, in which the population (stars) of randomly generated candidate solutions is placed in the search space of the logistic regression model. The fitness values of the population are then evaluated, and the candidate with the best fitness value is designated the black hole. The remaining candidates are considered stars.
After the black hole and the stars have been assigned, the black hole starts to attract the stars around it and all the stars move toward the black hole. The attraction of the stars by the black hole is formulated as
$$x_i(t+1) = x_i(t) + \mathrm{rand} \times \left(x_{KD} - x_i(t)\right), \qquad i = 1, \dots, N,$$
where $x_i(t)$ and $x_i(t+1)$ are the positions of the $i$-th star at iterations $t$ and $t+1$, respectively, $x_{KD}$ is the position of the black hole in the search space (written $x_{BH}$ in the pseudocode), rand is a random number in the interval $[0, 1]$, and $N$ is the number of candidate solutions (stars) (Hatamlou, 2013).
When a star moves toward the black hole, it may reach a position with a lower cost than that of the black hole. In that case, the two exchange roles and the star becomes the new black hole. The algorithm then continues with the black hole at its new position, and the remaining stars move toward it.
In addition, stars may cross the event horizon of the black hole as they move toward it. Each star (candidate solution) that crosses the event horizon is swallowed by the black hole. In our BHA, the radius of the event horizon is obtained by
$$R = \frac{f_{KD}}{\sum_{i=1}^{N} f_i},$$
where $f_{KD}$ and $f_i$ are the fitness values of the black hole and the $i$-th star, respectively. We set the termination criterion to 5000 iterations because of limits on computing capacity.
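The steps described above (random stars, attraction toward the black hole, role exchange, and regeneration of stars inside the event horizon) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the cost is taken to be the negative logistic log-likelihood (lower is better), the distance is Euclidean, and the search bounds, number of stars, iteration count and the names `neg_log_likelihood` and `black_hole_fit` are hypothetical choices made here.

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """Cost of a candidate: negative logistic log-likelihood (assumed fitness)."""
    eta = X @ beta
    # -log p = log(1 + e^{-eta}),  -log(1 - p) = log(1 + e^{eta})
    return float(np.sum(y * np.logaddexp(0.0, -eta)
                        + (1.0 - y) * np.logaddexp(0.0, eta)))

def black_hole_fit(X, y, n_stars=30, max_iter=500, bounds=(-10.0, 10.0), seed=0):
    """Black-hole search over logistic regression coefficients (illustrative)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    dim = X.shape[1]
    stars = rng.uniform(lo, hi, size=(n_stars, dim))   # random initial population
    cost = np.array([neg_log_likelihood(s, X, y) for s in stars])
    bh = int(np.argmin(cost))                          # best star is the black hole
    for _ in range(max_iter):
        for i in range(n_stars):
            if i == bh:
                continue
            # x_i(t+1) = x_i(t) + rand * (x_BH - x_i(t))
            stars[i] += rng.random() * (stars[bh] - stars[i])
            cost[i] = neg_log_likelihood(stars[i], X, y)
            if cost[i] < cost[bh]:                     # star found a better position
                bh = i
        # Event horizon radius R = f_BH / sum_i f_i
        radius = cost[bh] / cost.sum()
        dist = np.linalg.norm(stars - stars[bh], axis=1)
        swallowed = (dist < radius) & (np.arange(n_stars) != bh)
        for i in np.flatnonzero(swallowed):            # regenerate swallowed stars
            stars[i] = rng.uniform(lo, hi, size=dim)
            cost[i] = neg_log_likelihood(stars[i], X, y)
        bh = int(np.argmin(cost))
    return stars[bh].copy(), float(cost[bh])

# Illustrative data: intercept -2, slope 1, n = 500 (values chosen for the demo only)
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
X = np.column_stack([np.ones_like(x), x])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-(-2.0 + x)))).astype(float)

beta_hat, cost_hat = black_hole_fit(X, y)
print("estimate:", beta_hat, "negative log-likelihood:", cost_hat)
```

Using a cost that is always positive keeps the radius $R$ between 0 and 1, so only stars very close to the black hole are regenerated; a different fitness definition would change how aggressively the population is refreshed.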

Simulation study
Logistic regression implicitly relies on the two groups being reasonably balanced. If there is a group imbalance in the sample, this condition is not satisfied and the maximum likelihood estimates become biased. We therefore compare the performance of the modified BHA and logistic regression in imbalanced cases. For this purpose, we obtain the mean bias and RMSE values at different rareness degrees of the group of interest (group "1") from 1000 MC replications, and we investigate three different cases.
Simulation Algorithm
1. To determine the effect of sample size on the methods in rare events, n = 100, n = 500 and n = 1000 are used.
2. The independent variable is generated from $X \sim N(0, 1)$ for Case 1 and Case 2, and as a binary categorical variable for Case 3, which is used to assess the performance of BHA in the complete separation case.
3. The intercept $\beta_0$ produces different rareness degrees depending on its value, as given in Table 1. The slope is set to $\beta_1 = 0$ for Case 1 and to $\beta_1 = 1$ for Case 2 and Case 3.
4. To generate the binary dependent variable $Y$, the probabilities $p_i = P(Y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}}$ are computed and compared with a value $u_i$ drawn from the uniform distribution on $[0, 1]$:
$$Y_i = \begin{cases} 1, & u_i < p_i, \\ 0, & \text{otherwise.} \end{cases}$$
5. After the dependent and independent variables have been obtained, BHA and logistic regression are applied to the same data. The mean bias and the RMSE, $\sqrt{E(\hat{\theta} - \theta)^2}$, of the two methods are then examined and their performance is compared.
In Case 1, the mean estimates of both methods are very close to the true parameter value for $\beta_0 = 0$, where there is no group imbalance. When the percentage of the rare event decreases, that is, when the rareness degree increases as $\beta_0$ takes the values -2, -3 and -4, the bias of logistic regression increases. In BHA, however, the mean bias remains close to zero despite the rareness. The bias is found in the intercept parameter rather than in the slope parameter. As the rareness increases, the RMSE increases for both methods. However, the RMSE values of logistic regression are higher than those of BHA.
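The data-generating steps and the bias and RMSE calculation of the simulation algorithm can be sketched as follows for the logistic regression side of the comparison. This is an illustration under assumptions: the function names, the Bernoulli(0.5) choice for the binary covariate, the plain Newton-Raphson fit without any handling of separation, and the use of 200 replications (instead of 1000) are choices made here to keep the example short.

```python
import numpy as np

def simulate(n, beta0, beta1, rng, categorical=False):
    """Draw X, compute p_i, and set Y_i = 1 when u_i < p_i."""
    if categorical:
        x = rng.integers(0, 2, n).astype(float)   # binary covariate (assumed Bernoulli(0.5))
    else:
        x = rng.standard_normal(n)                # X ~ N(0, 1)
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    y = (rng.uniform(size=n) < p).astype(float)
    return x, y

def fit_logistic(x, y, n_iter=25):
    """Plain Newton-Raphson maximum likelihood fit (no separation handling)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return beta

def bias_and_rmse(n, beta0, beta1, n_rep=200, seed=0):
    """Monte Carlo mean bias and RMSE of (beta0_hat, beta1_hat)."""
    rng = np.random.default_rng(seed)
    true = np.array([beta0, beta1])
    est = np.array([fit_logistic(*simulate(n, beta0, beta1, rng))
                    for _ in range(n_rep)])
    bias = est.mean(axis=0) - true
    rmse = np.sqrt(((est - true) ** 2).mean(axis=0))
    return bias, rmse

bias, rmse = bias_and_rmse(n=1000, beta0=-2.0, beta1=1.0)
print("bias:", bias, "RMSE:", rmse)
```

For much rarer settings or smaller samples, some replications may contain no events or be separated, in which case this plain Newton-Raphson fit fails; such replications would need explicit handling.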
Table 2: Different rareness percentages with respect to the values of $\beta_0$ ($\beta_1 = 0$ and $x \sim$ normal) (n = 1000)
In Case 2, when the two groups are balanced with $\beta_0 = 0$, the mean bias values of both methods are close to zero, as in Case 1. When the percentage of the rare event decreases from 30% to 0.16%, the mean bias of logistic regression increases more than that of BHA.
Table 3: Different rareness percentages with respect to the values of $\beta_0$ ($\beta_1 = 1$ and $x \sim$ normal) (n = 1000)
When the independent variable is categorical (or there is complete separation), logistic regression fails at parameter estimation and we obtain heavily biased values (Table 3). Correspondingly, BHA has much better mean bias and RMSE values in the complete separation case.
In Figure 4, we investigate the mean bias values of both methods with respect to the given sample sizes and conclude that BHA gives better results than logistic regression. Moreover, we observe that the bias of both methods decreases as the sample size increases.

Data Analysis
In this section, we present a real-life application on the MID data, which have a rareness degree of 0.3%, by selecting a sample of 1000 units with a rareness degree of 5% (King and Zeng, 2001b). The dependent variable "conflict" represents international disputes and takes the value 1 if a dispute exists and 0 otherwise. The independent variables are coded as major, contig, power, maxdem, mindem and years. The descriptive variables consist of the variables used in this region. Major indicates whether a country is a superpower; contig indicates whether a country has a foreign policy portfolio and allies; power represents military power; maxdem and mindem represent the maximum and minimum degrees of business dependency, respectively; and years represents the time elapsed since the last dispute (King and Zeng, 2001b). To compare the correct classification percentages of the two methods for the rare group, we use the specificity percentages. According to the classification performance of both methods in Table 6, logistic regression correctly classifies only 7 of the cases in which a conflict exists. The precision (positive predictive value) of logistic regression is found to be 23.33%. However, the specificity of BHA is found to be 83.33%, much better than logistic regression.
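The three classification measures used in this comparison follow directly from the cells of a 2 × 2 classification table. A minimal sketch is given below; the function name and the counts passed to it are hypothetical values chosen for illustration (1000 units with 50 events) and are not the entries of Table 6.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, sensitivity and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),      # share of predicted events that are real
        "sensitivity": tp / (tp + fn),    # share of real events that are detected
        "specificity": tn / (tn + fp),    # share of non-events correctly rejected
    }

# Hypothetical counts for illustration only
metrics = classification_metrics(tp=7, fp=23, fn=43, tn=927)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With rare events the specificity is usually high even for a poor classifier, because non-events dominate the sample, so precision and sensitivity are the more informative measures for the rare group.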

Conclusions
It is very important to describe, with correct prediction models, events that are rare but of vital importance and whose consequences may be fatal or very dangerous (earthquakes, deadly pandemics, wars, etc.). The logistic regression model, which is generally used for classification, cannot correctly predict the probabilities of rare events, mostly because of bias in the intercept estimate. Various classical statistical methods are available in the literature to reduce the bias caused by rare events. However, meta-heuristic algorithms have not been used to address the bias in rare event parameter estimates. In this article, a modification of the BHA, which has previously been applied to clustering and variable selection, was developed for rare event classification and proposed as an alternative for estimating logistic regression parameters. To compare the parameter estimates obtained from logistic regression and BHA, we conducted a simulation study with three scenarios based on the intercept parameter and different rareness percentages. The simulation results show that BHA gives less biased parameter estimates than logistic regression in all scenarios. In the simulation study examining the effect of sample size on the bias in the parameter estimates, BHA likewise gave better results than the logistic regression model. We also observed that BHA was better than the logistic regression model according to the AIC value, which was used to assess the relative quality of the models.
We examined the precision, sensitivity and specificity values obtained on the MID data set to assess the classification performance of BHA and logistic regression. The precision of BHA is higher than that of the logistic regression model; that is, in the rare event setting, BHA predicts actual positives more accurately than logistic regression. Likewise, BHA predicted negative values more accurately than logistic regression.
Overall, we observed that the proposed modification of the BHA gives better results than logistic regression in rare events across different rareness levels, different sample sizes and the classification percentages examined on real-life data. Accordingly, we recommend the use of meta-heuristic algorithms such as the black hole algorithm instead of logistic regression in order to achieve correct classification percentages for rare events such as earthquakes, deadly epidemics and wars, to obtain less biased parameter estimates, and to make correct interpretations.
The approach proposed in this study can be extended with different meta-heuristic algorithms and mathematical approaches to obtain a more reliable and accurate prediction of rare events.The proposed BHA approach for rare events can be applied on models simulated from different distributions (Almuqrin et

Figure 1: Comparison of bias values for both methods in Case 1

Figure 2: Comparison of bias values for both methods in Case 2

Figure 3: Comparison of bias values for both methods in Case 3

Figure 4: Box-plot graph of mean bias values for both methods in Case 2
In the event horizon radius $R$, $f_{KD}$ and $f_i$ are the logistic regression fitness values of the black hole and the $i$-th star, respectively. In our logistic regression model, if the distance between a candidate solution and the black hole is less than $R$, the candidate collapses and a new candidate is generated and randomly placed in the search space.
    Next j
    Assign the star with the best fitness value as the black hole
    While a < max_iter or conv criteria
        For i = 1 : number of stars
            x_i(t+1) = x_i(t) + rand × (x_BH − x_i(t))
            Evaluate fitness value of the star x_i
            If fitness of x_i > fitness of x_BH
                x_BH = x_i
            End if
            Replace the new fitness value of the x_i with the previous value
            Update fitness array f and calculate R = f_KD / Σ f_i
            If (x_BH − x_i)^2 < R
                Replace x_i with a new star in the search scope
            End if
        End for
        set a = a + 1
    End while
    Return best solution
    End

Table 1: Different rareness percentages with respect to the values of $\beta_0$

Table 5: Coefficient estimations of both methods for MID Data
When we investigate the coefficient estimations for the MID Data in Table 5, we observe that BHA gives a much smaller AIC value than logistic regression. Consequently, we infer that the better model is the BHA method.
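For reference, the AIC used in this comparison has the standard definition below, where $k$ is the number of estimated parameters and $\hat{L}$ is the likelihood evaluated at the estimates. Taking $k = 7$ for the MID model (an intercept plus the six independent variables) is an assumption based on the variable list, not a value reported in the text.

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad\text{so with } k = 7:\quad \mathrm{AIC} = 14 - 2\ln\hat{L}.
\]
```

Since both methods estimate the same number of coefficients, a smaller AIC here corresponds to a larger value of the likelihood at the reported estimates.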

Table 6: Classification performance of both methods
al., 2022, Rasekhi et al., 2022, Althubyani et al., 2022, Alghamdi et al., 2023, Atchadé et al., 2023) and the performance of the model can be examined. Additionally, by combining the BHA with a different meta-heuristic algorithm, a new hybrid algorithm can be proposed, its performance on rare events can be compared, and new approaches can be proposed for the rare event prediction and classification problem.