The Effect of Different Similarity Distance Measures in Detecting Outliers Using Single-Linkage Clustering Algorithm for Univariate Circular Biological Data

Clustering algorithms can be used to create an outlier detection procedure for univariate circular data. The circular distance between each pair of angular observations is used to calculate the similarity measure that groups the observations appropriately. In this paper, we present a clustering-based procedure for detecting outliers in univariate circular biological data using various similarity distance measures. Three circular similarity distance measures, namely the Satari distance, the Di distance and the Chang-chien distance, were used to detect outliers with a single-linkage clustering algorithm. The Satari and Di distances reduce to the same formula for univariate circular data. This study aims to develop the proposed clustering-based procedure and demonstrate its effectiveness in detecting outliers with various similarity distance measures. The SL-Satari/Di procedure was compared with the SL-Chang procedure at various dendrogram cutting points. We found that a clustering-based procedure using a single-linkage algorithm with various similarity distances is a practical and promising approach to detecting outliers in univariate circular data, particularly biological data. According to the results, the SL-Satari/Di distance outperformed the SL-Chang distance under certain data conditions. All clustering-based procedures using the SL-Satari/Di and SL-Chang distances successfully detected all outliers in the Eye, Sea star, Pigeon and Frog datasets with no masking or swamping effects at 95% and 90% confidence intervals. For the Turtle dataset, the SL-Satari/Di distance detected all outliers with no masking effect but showed a swamping effect at 95% and 90% confidence intervals. In contrast, SL-Chang did not detect all outliers in the Turtle dataset at either the 95% or the 90% confidence interval, and both masking and swamping effects occurred. We propose that this procedure be extended using other agglomerative clustering algorithms under different data conditions. A simulation study could help confirm the findings across various scenarios and conditions.


Introduction
Over the last few decades, outlier detection procedures have been extended from linear data to circular data, that is, data recorded as angular measurements. Outliers are objects or observations in a dataset that are inconsistent with the rest of the data (Collett, 1980). In other words, outliers are observations that deviate significantly from, or are dissimilar to, the others. Many researchers have proposed outlier detection procedures for univariate circular data using various methods such as the row deletion method (Abuzaid et al., 2009; Abuzaid et al., 2012; Collett, 1980; Rambli, 2015), the robust method (Mahmood et al., 2017) and the clustering-based method (Abuzaid, 2012; Ahmed et al., 2019).
A clustering-based procedure can be used to detect outliers by separating the outliers from the inliers. Clustering classifies data points into distinct clusters or groups, whereas outlier detection identifies data points that do not appear to fit naturally into these clusters. Clustering and outlier detection therefore have a symbiotic relationship. The clustering-based outlier detection method has the advantage that it can find clusters while also identifying outliers. Combining clustering and outlier detection has the following benefits: (i) the clusters become more compact and semantically coherent, (ii) the clusters become more robust against data perturbations and (iii) the outliers are contextualised by the clusters and are more interpretable (Ott et al., 2014).
Clustering-based outlier detection in circular data has previously been proposed using a single-linkage clustering algorithm for bivariate circular data (Di et al., 2017; Satari, 2015; Satari et al., 2021). There is no single-linkage clustering-based outlier detection method for univariate circular data, hence we developed a procedure for detecting outliers in univariate circular data using an agglomerative hierarchical clustering method based on a single-linkage algorithm with various similarity distance measures. The agglomerative hierarchical clustering method produces a dendrogram, which is used to represent the clusters. The single-linkage algorithm is the simplest agglomerative hierarchical clustering method; it merges observations by taking the shortest distance.
Circular distance in a circular dataset is defined as the distance between two points on the circumference, i.e., between any two angles (Jammalamadaka and Sengupta, 2001). Meanwhile, circular similarity distance can be constructed from the circular distance in order to determine the similarity measure. Chang-chien et al. (2012) and Hung et al. (2012) developed the mean shift-based clustering algorithm for circular data using the formulation of circular distance introduced by Jammalamadaka and Sengupta (2001). Satari (2015) used the single-linkage clustering algorithm and introduced the circular city-block distance, also known as the Satari distance, to calculate the similarity measure in detecting outliers for bivariate circular data. Di et al. (2017) incorporated the circular Euclidean distance into a single-linkage bivariate circular outlier detection procedure. Outliers are identified when a cluster exceeds a certain height of the dendrogram, which is then cut using the cutting rule proposed by Satari (2015).
In this study, we present a clustering-based procedure for detecting outliers in univariate circular data using a single-linkage clustering algorithm and various similarity distance measures. Univariate circular data is common in biological studies such as ecological and biomedical research. The performance of the proposed procedure was compared to that of its counterparts using five real univariate circular biological datasets: Turtle data (the directions of turtles after laying eggs) (Chang-chien et al., 2012; Hung et al., 2012), Eye data (eye angles obtained from glaucoma patients) (Abuzaid, 2020; Alkasadi et al., 2018; Rambli, 2015), Sea star data (the direction of sea stars after removal from the natural habitat) (Mahmood et al., 2017), Pigeon data (vanishing direction of homing pigeons) (Jammalamadaka and Sengupta, 2001), and Frog data (the homing ability of the frog) (Abuzaid et al., 2009; Abuzaid et al., 2012; Collett, 1980; Zulkipli et al., 2020). These biological data have been used in various studies of outlier detection procedures in univariate circular data. Therefore, the datasets were used in this study to demonstrate the applicability of the proposed clustering-based procedure with various similarity distance measures to detect outliers.

Similarity Distance Measures for Circular Data
Jammalamadaka and Sengupta (2001) define the circular distance between any two points as the lesser of the two arc lengths between the points along the circumference. For any two angles $\theta_i$ and $\theta_j$, the circular distance $d_0$ can be defined as follows (Jammalamadaka and Sengupta, 2001):

$$d_0(\theta_i, \theta_j) = \min\left(|\theta_i - \theta_j|,\ 2\pi - |\theta_i - \theta_j|\right) = \pi - \left|\pi - |\theta_i - \theta_j|\right|, \quad i, j = 1, 2, \ldots, n. \tag{1}$$

Another closely related definition of the circular distance between angles $\theta_i$ and $\theta_j$ is (Jammalamadaka and Sengupta, 2001):

$$d_0(\theta_i, \theta_j) = 1 - \cos(\theta_i - \theta_j). \tag{2}$$

In a clustering-based procedure, circular distance has been used to identify outliers in circular data (Di et al., 2017; Satari, 2015; Satari et al., 2021). The similarity measure is used in clustering to appropriately group observations, and it can be calculated using the circular distance between each pair of observations. In this study, three circular similarity measures were used to develop the proposed clustering-based procedure for detecting outliers in univariate circular biological data. A circular city-block distance (Satari distance) and two circular Euclidean distances (Di distance and Chang-chien distance) were used to calculate the similarity distance in the single-linkage algorithm.
The formulas for the Satari, Di and Chang-chien similarity distances are given in equations (3)-(5). The Satari and Di distances yield different similarity distances only when $p > 1$, where $p$ is the number of circular variables, that is, for bivariate or multivariate data. Since the Satari and Di distances are the same for univariate circular data, they are referred to in this study as the Satari/Di distance.
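To make the distance definitions concrete, the following Python sketch computes the arc-length circular distance and the closely related cosine form, then checks that a city-block and a Euclidean combination of arc distances coincide for a single circular variable. The function names are hypothetical, and the multivariate city-block and Euclidean forms are assumptions made for illustration (a sum of arc distances and the square root of the sum of squared arc distances over the $p$ variables); they are not taken from equations (3)-(5).

```python
import math

def circular_distance(a, b):
    """Shorter arc between two angles in radians: pi - |pi - |a - b||."""
    diff = abs(a - b) % (2 * math.pi)
    return math.pi - abs(math.pi - diff)

def cosine_distance(a, b):
    """Closely related circular distance: 1 - cos(a - b)."""
    return 1.0 - math.cos(a - b)

def city_block(x, y):
    """ASSUMED city-block form: sum of arc distances over the p variables."""
    return sum(circular_distance(a, b) for a, b in zip(x, y))

def euclidean(x, y):
    """ASSUMED Euclidean form: root of the summed squared arc distances."""
    return math.sqrt(sum(circular_distance(a, b) ** 2 for a, b in zip(x, y)))

# 350 degrees and 10 degrees are 20 degrees apart on the circle, not 340.
d = circular_distance(math.radians(350), math.radians(10))
print(math.degrees(d))

# Univariate case (p = 1): both forms give the same value.
print(city_block([0.3], [5.9]), euclidean([0.3], [5.9]))

# Bivariate case (p = 2): the two forms generally differ.
print(city_block([0.0, 0.0], [1.0, 1.0]), euclidean([0.0, 0.0], [1.0, 1.0]))
```

With one variable the square root of a single squared term is the term itself, which is why the two forms cannot be told apart in the univariate setting.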

Clustering-based Procedure for Outlier Detection in Univariate Circular Data
The single-linkage algorithm, also known as the minimum method, is one of the most basic agglomerative clustering methods; at each iteration it merges the two clusters with the smallest pairwise distance. This study focuses on the single-linkage algorithm, which is sensitive to the presence of outliers (Klutchnikoff et al., 2021). The single-linkage algorithm is as follows (Johnson and Wichern, 2014):
Step 1: Begin with n clusters, each containing a single observation $\theta_1, \theta_2, \ldots, \theta_n$.
Step 2: Calculate the distance matrix between all possible pairs of clusters using the circular similarity distances in equations (3)-(5).
Step 3: Merge the two clusters with the smallest distance in the distance matrix into a new cluster.
Step 4: Delete the rows and columns of the distance matrix corresponding to the merged clusters, then update the distance matrix by recalculating the distances between the new cluster formed in Step 3 and the remaining clusters.
Step 5: If more than one cluster remains, repeat Steps 3 and 4 until all clusters have been merged into a single cluster.
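The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: it assumes the arc-length circular distance as the similarity distance, uses hypothetical function names, and recomputes cluster-to-cluster distances on each pass instead of maintaining an explicit distance matrix, which is simpler to read but slower for large n.

```python
import math

def circular_distance(a, b):
    """Shorter arc between two angles in radians."""
    diff = abs(a - b) % (2 * math.pi)
    return math.pi - abs(math.pi - diff)

def single_linkage(angles):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    # Step 1: every observation starts as its own cluster of indices.
    clusters = [(i,) for i in range(len(angles))]
    merges = []

    def linkage(c1, c2):
        # Single linkage: smallest pairwise distance between members.
        return min(circular_distance(angles[i], angles[j])
                   for i in c1 for j in c2)

    # Step 5: keep going until a single cluster remains.
    while len(clusters) > 1:
        # Steps 2-3: find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        # Step 4: drop the two merged clusters and add their union.
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges

# Hypothetical toy data (degrees): three nearby angles and one distant angle.
toy = [math.radians(a) for a in (10, 14, 20, 200)]
for a, b, h in single_linkage(toy):
    print(a, b, round(math.degrees(h), 3))
```

In the toy data the distant observation joins last, at a merge height of about 170 degrees, while the other merges happen at about 4 and 6 degrees. That gap in merge heights is what a dendrogram cut exploits.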
The single-linkage algorithm then generates a dendrogram or cluster tree as an output. The clusters are represented as branches of a tree. A dendrogram is used to find the optimal number of clusters, and it must be "cut" at a certain height to separate outliers from inliers. In this study, the dendrogram was cut using the cutting rule proposed by Satari (2015), as shown in equation 8. Another cutting rule is shown in equation 9.

where $\bar{h}$ is the average height of the cluster tree over all $N-1$ clusters and $s_h$ is the circular standard deviation of the heights, which is computed from $R_h$, the mean resultant length of the heights for the $N-1$ clusters. A cluster that exceeds the cutting height is classified as a candidate outlier at the 95% and 90% confidence intervals.
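Once a cutting height is available, separating candidate outliers from inliers is mechanical. The sketch below relies on a standard property of single linkage: cutting the dendrogram at height c gives the same clusters as the connected components of the graph that links every pair of observations whose distance is at most c. The cutting height is passed in as a plain argument and the cutting rule itself is not reproduced here. The function names are hypothetical, and treating the largest cluster as the inlier group is an assumption that mirrors the datasets discussed in this paper, where the majority cluster holds the inliers.

```python
import math

def circular_distance(a, b):
    """Shorter arc between two angles in radians."""
    diff = abs(a - b) % (2 * math.pi)
    return math.pi - abs(math.pi - diff)

def clusters_at_height(angles, cut):
    """Single-linkage clusters obtained by cutting the dendrogram at `cut`."""
    n = len(angles)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose circular distance does not exceed the cut.
    for i in range(n):
        for j in range(i + 1, n):
            if circular_distance(angles[i], angles[j]) <= cut:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Largest cluster first.
    return sorted(groups.values(), key=len, reverse=True)

def outlier_candidates(angles, cut):
    """Indices outside the largest cluster (ASSUMED to hold the inliers)."""
    clusters = clusters_at_height(angles, cut)
    return sorted(i for c in clusters[1:] for i in c)

# Hypothetical toy data (degrees) and a hypothetical cut of 30 degrees.
toy = [math.radians(a) for a in (10, 14, 20, 200)]
print(outlier_candidates(toy, math.radians(30)))
```

Indices here are zero-based, so the flagged index 3 corresponds to the fourth observation.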

Illustrative Examples and Results
Five univariate circular biological datasets with outliers were used to demonstrate the applicability of the proposed outlier identification procedure. The datasets are the Turtle, Eye, Sea star, Pigeon and Frog data; the source of each dataset is cited in its subsection below. These datasets have been widely used in studies of univariate circular data outlier identification. Table 1 details the datasets used in this study.

Turtle data
The Turtle dataset contains 76 observations of turtle directions after laying eggs (Fisher, 1993) and has also been used by Chang-chien et al. (2012) and Hung et al. (2012).
As shown in Figure 2(a), the dendrogram of the SL-Satari/Di distance forms three clusters after it is cut at the 95% and 90% confidence intervals. Cluster 1 consists of observations 57 to 74, while Cluster 3 consists of observations 75 and 76; both clusters exceed the cutting height at the 95% and 90% confidence intervals. Therefore, all observations in Cluster 1 and Cluster 3 were identified as outliers. Cluster 2 consists of the remaining majority of observations and does not exceed the cutting height, indicating that the cluster consists of inliers.
In Figure 2(b), the dendrogram of the SL-Chang distance forms two clusters after its height was cut at 95% and 90% confidence intervals. Cluster 1 which consists of observations 57 to 74 exceeds the cutting height at 95% and 90% confidence intervals, indicating that these observations are outliers. Cluster 2 consists of the remaining majority of observations and does not exceed the cutting height, indicating that the cluster consists of inliers.

Eye data
The Eye dataset contains 23 observations of the eye angle of posterior corneal curvature of glaucoma patients (Abuzaid, 2020; Alkasadi et al., 2018; Rambli, 2015). Figure 3 shows that there are two suspected outliers in the circular plot of the dataset, namely observations 10 and 17. Rambli (2015) successfully identified these two observations as outliers using the row deletion method. Figure 4 shows the results of applying the proposed clustering-based procedure to the Eye dataset to confirm the suspected outliers. Figure 4 shows two clusters were formed after the dendrograms were cut at 95% and 90% confidence intervals for both the SL-Satari/Di and SL-Chang distances. Cluster 1 consists of observations 10 and 17, while Cluster 2 consists of the remaining observations. Observations 10 and 17 were identified as outliers since Cluster 1 exceeds the cutting height at 95% and 90% confidence intervals for both the SL-Satari/Di and SL-Chang distances. Cluster 2 consists of the remaining majority of the observations and does not exceed the cutting height, indicating that the cluster consists of inliers. Both SL-Satari/Di and SL-Chang distances successfully identified observations 10 and 17 as outliers at 95% and 90% confidence intervals.

Sea star data
The Sea star dataset contains 22 observations of the direction of sea stars after removal from their natural habitat (Fisher, 1993). Mahmood et al. (2017) used the robust method to identify observation 13 as an outlier in this dataset. Figure 5 shows that there is one suspected outlier in the circular plot of the dataset, namely observation 13. Figure 6 shows the results of applying the proposed clustering-based procedure to the Sea star dataset to confirm the suspected outlier.
Figure 6: The dendrogram with corresponding cutting height for the Sea star dataset using a) SL-Satari/Di and b) SL-Chang similarity distances.
Figure 6 shows two clusters were formed after the dendrograms were cut at 95% and 90% confidence intervals for both the SL-Satari/Di and SL-Chang distances. Cluster 1 consists of observation 13, while Cluster 2 consists of the remaining observations. Observation 13 was identified as an outlier since Cluster 1 exceeds the cutting height at 95% and 90% confidence intervals for both SL-Satari/Di and SL-Chang distances. Cluster 2 consists of the remaining majority of the observations and does not exceed the cutting height, indicating that the cluster consists of inliers. Both SL-Satari/Di and SL-Chang distances successfully identified observation 13 as an outlier at 95% and 90% confidence intervals.

Pigeon data
The Pigeon dataset contains 15 observations of the vanishing direction of homing pigeons (Jammalamadaka and Sengupta, 2001). Figure 7 shows that there are two suspected outliers in the circular plot of the dataset, namely observations 1 and 15. Figure 8 shows the results of applying the proposed clustering-based procedure to the Pigeon dataset to confirm the suspected outliers.  Figure 8 shows two clusters were formed after the dendrograms were cut at 95% and 90% confidence intervals for both the SL-Satari/Di and SL-Chang distances. Cluster 1 consists of observations 1 and 15, while Cluster 2 consists of the remaining observations. Observations 1 and 15 were identified as outliers since Cluster 1 exceeds the cutting height at 95% and 90% confidence intervals for both the SL-Satari/Di and SL-Chang distances. Cluster 2 consists of the remaining majority of the observations and does not exceed the cutting height, indicating that the cluster consists of inliers. Both SL-Satari/Di and SL-Chang distances successfully identified observations 1 and 15 as outliers at 95% and 90% confidence intervals.

Frog data
The Frog dataset contains 14 observations of frog homing ability (Collett, 1980). Various methods have been used to detect outliers in the dataset (Collett, 1980; Abuzaid et al., 2009; Abuzaid et al., 2012; Zulkipli et al., 2020). Figure 9 shows that there is one suspected outlier in the circular plot of the dataset, namely observation 14. Figure 10 shows the results of applying the proposed clustering-based procedure to the Frog dataset to confirm the suspected outlier.

Performance Measure and Discussion
This section compares and discusses the performance of the SL-Satari/Di and SL-Chang similarity distances in the single-linkage clustering algorithm for each dataset. The performance of the proposed clustering-based algorithm was measured using three measurements introduced by Sebert:
i. Probability that all outliers are successfully detected, pout: pout = 'success' / out, where 'success' is the number of outliers that the clustering-based procedure successfully identified and out is the number of outliers. The closer the pout value is to 1, the better the proposed outlier detection procedure detects all the outliers.
ii. Probability that outliers are falsely detected as inliers (masking effect), pmask: pmask = 'failure' / out, where 'failure' is the number of outliers in the dataset that were detected as inliers and out is the total number of outliers. The pmask value ranges from 0 to 1, with a value close to 0 indicating that no outliers were detected as inliers.
iii. Probability that inliers are detected as outliers (swamping effect), pswamp: pswamp = 'false' / (n - out), where 'false' is the number of inliers in the dataset that were detected as outliers, n is the total number of observations and out is the total number of outliers. The pswamp value ranges from 0 to 1, with a value close to 0 indicating that no inliers were detected as outliers.
Table 2 shows performance measures for the proposed clustering-based procedures on the five biological datasets.
The SL-Satari/Di and SL-Chang distances successfully identified observations 10 and 17 as outliers (pout = 1) in the Eye dataset at 95% and 90% confidence intervals. There is no masking effect (pmask = 0), hence neither algorithm misclassifies any outlier as an inlier. There is also no swamping effect (pswamp = 0), implying that the procedures do not misclassify any inlier as an outlier.
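The three performance measures can be computed directly from the set of true outliers and the set of observations a procedure flags. The sketch below is illustrative only: the function name and arguments are hypothetical, it reads 'success' as the number of true outliers that were flagged, and it assumes at least one outlier and at least one inlier so that neither denominator is zero.

```python
def performance(n, true_outliers, flagged):
    """Return pout, pmask and pswamp for a single dataset."""
    true_outliers, flagged = set(true_outliers), set(flagged)
    out = len(true_outliers)
    success = len(true_outliers & flagged)  # outliers correctly flagged
    failure = len(true_outliers - flagged)  # outliers missed (masking)
    false = len(flagged - true_outliers)    # inliers flagged (swamping)
    return {
        "pout": success / out,
        "pmask": failure / out,
        "pswamp": false / (n - out),
    }

# Eye dataset as described in the text: 23 observations, outliers 10 and 17,
# both flagged and nothing else flagged.
print(performance(n=23, true_outliers=[10, 17], flagged=[10, 17]))
# {'pout': 1.0, 'pmask': 0.0, 'pswamp': 0.0}

# Hypothetical case with one missed outlier and two swamped inliers.
print(performance(n=20, true_outliers=[1, 2], flagged=[2, 3, 4]))
```

Under this reading, pout and pmask always sum to 1 for a single dataset, so pswamp is the measure that distinguishes two procedures that both find every outlier.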
Similarly, the SL-Satari/Di and SL-Chang distances successfully identified observation 13 as an outlier (pout = 1) in the Sea star dataset at 95% and 90% confidence intervals. There is no masking effect (pmask = 0), hence neither algorithm misclassifies any outlier as an inlier. There is also no swamping effect (pswamp = 0), implying that the procedures do not misclassify any inlier as an outlier.
Likewise, the SL-Satari/Di and SL-Chang distances successfully identified observations 1 and 15 as outliers (pout = 1) in the Pigeon dataset at 95% and 90% confidence intervals. There is no masking effect (pmask = 0), hence neither algorithm misclassifies any outlier as an inlier. There is also no swamping effect (pswamp = 0), implying that the procedures do not misclassify any inlier as an outlier.
Lastly, the performance results for the Frog dataset in Table 3 show that the SL-Satari/Di and SL-Chang distances successfully identified observation 14 as an outlier (pout = 1) at 95% and 90% confidence intervals. There is no masking effect (pmask = 0), hence neither algorithm misclassifies any outlier as an inlier. There is also no swamping effect (pswamp = 0), implying that the procedures do not misclassify any inlier as an outlier. Table 4 shows a summary of the best single-linkage algorithms for detecting outliers for each univariate circular biological dataset.

Conclusion
This study found that, for univariate circular data, the Di distance, which is based on the Euclidean distance, has the same effect as the Satari distance, which is based on the city-block distance. In particular, the proposed clustering-based procedures using the single-linkage algorithm with the Satari/Di and Chang distances are both practical for detecting outliers in univariate circular biological datasets at various cutting heights. The performance of each clustering-based outlier detection procedure was evaluated by calculating the probability of successfully detecting all outliers, the probability of misclassifying outliers as inliers (masking effect), and the probability of misclassifying inliers as outliers (swamping effect). The ideal clustering-based outlier detection procedure would detect all outliers with the least masking and swamping.
This study also found that all clustering-based procedures using the SL-Satari/Di and SL-Chang distances successfully detected all outliers in the Eye, Sea star, Pigeon and Frog datasets with no masking or swamping effects at 95% and 90% confidence intervals. For the Turtle dataset, the SL-Satari/Di distance detected all outliers with no masking effect but showed a swamping effect at 95% and 90% confidence intervals. In contrast, SL-Chang did not detect all outliers in the Turtle dataset at either the 95% or the 90% confidence interval, and both masking and swamping effects occurred. Under certain data conditions, the SL-Satari/Di distance appears to outperform the SL-Chang distance.
In conclusion, the proposed clustering-based procedure based on single-linkage algorithms with Satari/Di and Chang distances is a practical and promising approach for detecting outliers in univariate circular data, particularly biological data. We propose that this procedure be extended using other agglomerative clustering algorithms with different data conditions. A simulation study can aid in the confirmation of findings in various scenarios and conditions.