Testing the Semi Markov Model Using Monte Carlo Simulation Method for Predicting the Network Traffic

Semi-Markov processes can be regarded as a generalization of both Markov and renewal processes. One of their principal characteristics is that, in contrast to Markov models, they represent systems whose evolution depends not only on the last visited state but also on the time elapsed since that state was entered. Semi-Markov processes replace the exponential distribution of the time intervals with an arbitrary distribution. In this paper, we give a statistical approach to testing the semi-Markov hypothesis. Moreover, we describe a Monte Carlo algorithm that simulates the trajectories of the semi-Markov chain. This simulation method is used to test the semi-Markov model by comparing the simulated results with empirical data. We introduce the network traffic data set to which the Monte Carlo algorithm is applied. The statistical characteristics of the real data and of the synthetic data generated by the models are compared. The semi-Markov and Markov models are compared by computing the autocorrelation functions and the probability density functions of the real and simulated network traffic data. All the comparisons indicate that the Markovian hypothesis is rejected in favor of the more general semi-Markov one. Finally, the interval transition probabilities, which give the future predictions of the network traffic, are presented.


Introduction
Semi-Markov processes (SMPs) are a broad class of stochastic processes that generalize Markov and renewal processes at the same time. They were introduced independently by (Levy, 1954) and (Smith, 1955). SMPs are an important and powerful tool for capturing the main features of the data and describing their long-run behavior. The waiting time distribution functions in the states of these processes are unrestricted and can be of any type. This contrasts with the Markov chain assumption, under which the waiting time in a state before a transition to another state is geometrically distributed. SMPs have therefore become progressively more important in statistical modeling and have been applied successfully in a wide range of fields such as reliability (Malefaki, 2014), geophysics (Pertsinidou, 2017), queuing theory (Peschansky, 2015), finance (D'Amico, 2018) and wireless systems (Vishnevski, 2017). See (Georgiadis, 2015) and (Devolder, P., Janssen, J. and Manca, R., 2015) for a detailed treatment of the semi-Markov process (SMP). In this paper, the advantage of the SMP over the Markov model is additionally confirmed by a statistical hypothesis test.
As an application, we apply the SMP to a network traffic time series. Traffic analysis is a fruitful and wide research area. Understanding the characteristics of the traffic generated by new applications and flowing in current networks allows us to design new architectures and to improve their performance effectively. Predicting network traffic plays an important role in many fields such as adaptive applications, network management, congestion control and traffic engineering. Network traffic has been forecast with several mathematical models. (Dainotti, A., Pescape, A., Rossi, P.S., Palmieri, F., Ventre, G., 2008) suggested a Hidden Markov Model for Internet traffic sources at the packet level, jointly analyzing inter-packet time and packet size. (Marnerides, A.K., Pezaros, D.P., Hutchison, D., 2018) focused on validating the statistical characteristics of stationarity, Gaussianity and linearity of the packet interarrival time processes by third-order statistics. (Katris, 2015) modeled network traffic with time series and neural network approaches. (Adeyemi, O.J., Popoola, S.I., Atayero, A.A., Afolayan, D.G., Ariyo, M., Adetiba, E., 2018) investigated Internet traffic data using regression analysis and the analysis of variance (ANOVA).
The availability of synthetic network traffic data is a strong instrument for traffic engineering tasks such as traffic classification, link capacity planning and anomaly detection. There are many models for generating synthetic data, such as autoregressive models (Kavasseri, 2009) and neural network models (Guoa, Z., Zhaob, W., Luc, H., Wang, J., 2012). Markov chains have frequently been applied to generate synthetic data (Nfaoui, H., Essiarab, H., Sayigh, A.A.M., 2004), (Hocaoglua, F., Gerekb, O., Kurbanb, M., 2010). We apply the semi-Markov chain, which can perform better than Markov models by generating synthetic time series more precisely. Moreover, the Monte Carlo method is a stochastic simulation technique based on pseudo-random numbers that allows us to forecast the behavior of the system. Combining this method with the semi-Markov model can yield precise results. This semi-Markov simulation method was applied by (Masala, 2014) and (D'Amico, G., Manca, R., Corini, C., Petroni, F., Prattico, F., 2016) to wind speed forecasting and by (D'Amico, G., Petroni, F., 2011) to stock price changes.
This paper is organized as follows. In section 2 we present the theoretical aspects of the semi-Markov model. Section 3 describes the application of the model to a real data set: testing the semi-Markov hypothesis, simulating with the Monte Carlo method, and calculating the autocorrelation functions and probability density functions of the network traffic. The network traffic forecast obtained with the semi-Markov model is also given there. Finally, in section 4 we give some concluding remarks.

Semi-Markov process
We define a discrete-time semi-Markov process in this section. Let $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n \in \mathbb{N}}, \mathbb{P})$ be a complete filtered probability space and let $E = \{1, 2, \dots, m\}$ be a finite state space. A sequence of random variables $J = (J_n)_{n \in \mathbb{N}}$, where $J_n : \Omega \to E$, is called a time-homogeneous Markov chain if
$$\mathbb{P}(J_{n+1} = j \mid J_0 = i_0, \dots, J_n = i) = \mathbb{P}(J_{n+1} = j \mid J_n = i) = p_{ij}, \qquad \forall i, j \in E,\ \forall n \in \mathbb{N}. \tag{1}$$
The stochastic matrix $\mathbf{p} = (p_{ij})_{i,j \in E}$ is called a Markov transition probability. Let $X = (X_n)_{n \in \mathbb{N}^*}$ be a sequence of positive random variables with values in $\mathbb{N}$. Define the arrival times $T = (T_n)_{n \in \mathbb{N}}$ as
$$T_n = T_0 + X_1 + \dots + X_n, \qquad n \ge 1,$$
which give the successive instants at which a specific event occurs. Here $T_0 \in \mathbb{N}$ is the initial time. The $(X_n)_{n \in \mathbb{N}^*}$ are called waiting times and give the time between two consecutive events. A random sequence $T$ such that the waiting times $X = (X_n)_{n \in \mathbb{N}^*}$ form an i.i.d. sequence and $T_0 = 0$ a.s. is called a renewal chain, and the random variables $T_n$, $n \ge 0$, are called renewal times. The increasing random sequence $N = (N(k))_{k \in \mathbb{N}}$ defined by
$$N(k) = \max\{n \in \mathbb{N} : T_n \le k\}$$
is the counting process associated with the renewal chain $T$; it gives the number of events that have occurred up to time $k$. A matrix-valued function $Q = (Q_{ij}(k);\ i, j \in E,\ k \in \mathbb{N})$ is a discrete-time cumulated semi-Markov kernel if:
• $Q_{ij}(k) \ge 0$, $\forall i, j \in E$, $k \in \mathbb{N}$;
• $\left(\lim_{k \to \infty} Q_{ij}(k)\right)_{i,j \in E}$ is a Markov transition probability.
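We say that $(J, T) = (J_n, T_n)_{n \in \mathbb{N}}$ is a Markov renewal chain if it satisfies
$$\mathbb{P}\big(J_{n+1} = j,\ T_{n+1} - T_n = k \mid \sigma((J_h, T_h),\ h \le n)\big) = \mathbb{P}\big(J_{n+1} = j,\ T_{n+1} - T_n = k \mid J_n\big),$$
where $\sigma((J_h, T_h),\ h \le n)$ denotes the set of past values of the Markov renewal chain $(J, T)$. The associated semi-Markov kernel $q$ is defined by
$$q_{ij}(k) = \mathbb{P}(J_{n+1} = j,\ T_{n+1} - T_n = k \mid J_n = i).$$
The cumulated semi-Markov kernel associated with the Markov renewal chain $(J, T)$ is
$$Q_{ij}(k) = \mathbb{P}(J_{n+1} = j,\ T_{n+1} - T_n \le k \mid J_n = i) = \sum_{l=0}^{k} q_{ij}(l).$$
It follows that
$$p_{ij} = \lim_{k \to \infty} Q_{ij}(k).$$
The cumulative conditional and unconditional distributions of the waiting times are defined, respectively, as
$$G_{ij}(k) = \mathbb{P}(T_{n+1} - T_n \le k \mid J_n = i,\ J_{n+1} = j) = \frac{Q_{ij}(k)}{p_{ij}} \quad (p_{ij} \ne 0),$$
$$H_i(k) = \mathbb{P}(T_{n+1} - T_n \le k \mid J_n = i) = \sum_{j \in E} Q_{ij}(k),$$
$\forall i, j \in E$, $k \in \mathbb{N}$. The function $G_{ij}(\cdot)$ is the waiting time distribution function in state $i$ given that, with the next transition, the process will be in state $j$. The sojourn time distribution $G_{ij}(\cdot)$ may be any distribution function, which is an advantage over a Markov model. It is now possible to define the homogeneous semi-Markov chain $Z(t)$ associated with the Markov renewal chain $(J, T)$ as
$$Z(t) = J_{N(t)}, \qquad t \in \mathbb{N}.$$
Thus $Z(t)$ represents the position of the embedded Markov chain at time $t$. The transition probability of the semi-Markov chain $Z$ is the matrix-valued function defined as
$$P_{ij}(k) = \mathbb{P}(Z(k) = j \mid Z(0) = i).$$
Moreover, the transition probability can be expressed recursively as a function of the semi-Markov kernel:
$$P_{ij}(k) = \delta_{ij}\,(1 - H_i(k)) + \sum_{r \in E} \sum_{l=1}^{k} q_{ir}(l)\, P_{rj}(k - l), \tag{12}$$
where $\delta_{ij}$ is the Kronecker symbol. The transition probability is thus a sum of two terms. The first is the probability of having no transition up to time $k$, while the second is the probability of having at least one transition. This result completely defines the evolution of the semi-Markov chain, and once the semi-Markov kernel is known it allows the process to be solved numerically.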
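The recursive expression for the transition probability lends itself to a direct numerical evaluation. The following is a minimal sketch in Python, not the implementation used for the results in this paper. It assumes that the kernel is stored as an array `q[l, i, r]` holding $q_{ir}(l)$ with `q[0]` equal to zero (waiting times are positive), and that states are indexed from 0; the function name and the small two-state kernel are purely illustrative.

```python
import numpy as np

def interval_transition_probabilities(q, horizon):
    """Evaluate P_ij(k) for k = 0..horizon from a semi-Markov kernel.

    q[l, i, r] is assumed to hold q_ir(l); q[0] must be all zeros.
    Returns an array P with P[k, i, j] = P_ij(k).
    """
    q = np.asarray(q, dtype=float)
    n_lags, m, _ = q.shape
    # H[k, i] = sum over r and l <= k of q_ir(l): sojourn-time CDF in state i
    H = np.cumsum(q.sum(axis=2), axis=0)
    P = np.zeros((horizon + 1, m, m))
    for k in range(horizon + 1):
        last = min(k, n_lags - 1)           # kernel is zero beyond its last lag
        P[k] = np.diag(1.0 - H[last])       # no transition up to time k
        for l in range(1, last + 1):        # first transition at time l
            P[k] += q[l] @ P[k - l]
    return P

# Hypothetical two-state kernel, for illustration only.
q = np.zeros((4, 2, 2))
q[1, 0, 1] = 0.5   # state 0 -> 1 after 1 time unit
q[3, 0, 1] = 0.5   # state 0 -> 1 after 3 time units
q[2, 1, 0] = 1.0   # state 1 -> 0 after exactly 2 time units
P = interval_transition_probabilities(q, horizon=10)
```

Each matrix `P[k]` is row-stochastic, and when all the kernel mass sits at lag 1 the recursion reduces to the ordinary matrix powers of a Markov chain, which gives a convenient sanity check.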
SMPs are well suited to describing phenomena that exhibit a duration effect, in which the time the system has spent in a state influences its transition probabilities. An example of a semi-Markov trajectory, with its sojourn and transition times, is shown in Fig. 1. Moreover, the evolution of a semi-Markov chain from an initial state can be examined through the corresponding transition probability.
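To connect these definitions to data, the visited states and sojourn times can be read off a discretized state sequence by grouping consecutive equal values. The sketch below is an illustration under stated assumptions rather than the procedure used in this paper: it treats only changes of state as transitions, drops the final run because its sojourn time is censored by the end of the sample, and ignores the fact that the first run may be censored as well. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby

def extract_sojourns(z):
    """Map each observed transition (i, j) to its list of sojourn times in i."""
    # Run-length encode the state sequence: [(state, run length), ...]
    runs = [(state, sum(1 for _ in grp)) for state, grp in groupby(z)]
    waits = defaultdict(list)
    # Pair each run with the next one; the last run has no successor and is dropped.
    for (i, w), (j, _) in zip(runs, runs[1:]):
        waits[(i, j)].append(w)
    return dict(waits)

def embedded_transition_estimate(waits):
    """Estimate p_ij as N(i, j) divided by the number of transitions out of i."""
    totals = defaultdict(int)
    for (i, _), ws in waits.items():
        totals[i] += len(ws)
    return {(i, j): len(ws) / totals[i] for (i, j), ws in waits.items()}

# Toy state sequence, for illustration only.
z = [1, 1, 2, 2, 2, 1, 3, 3, 1, 1, 2]
waits = extract_sojourns(z)
p_hat = embedded_transition_estimate(waits)
```

The lists in `waits` are the samples from which empirical waiting time distributions for each pair of states can be built.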

TEST
The semi-Markov hypothesis is tested with a statistical test suggested by (Silvestrov, 2004) and developed in this paper. In a Markov model the waiting times follow the geometric distribution, so any other distribution of the sojourn times shows that Markov modeling is inappropriate. In this case, the model can be regarded as semi-Markovian. The probability distribution function of the sojourn time in state $i$ before a transition to state $j$ has been denoted by $G_{ij}(\cdot)$. Define the corresponding probability mass function by
$$g_{ij}(x) = G_{ij}(x) - G_{ij}(x - 1).$$
Under the geometrical hypothesis, the equality
$$g_{ij}(1)\,(1 - g_{ij}(1)) - g_{ij}(2) = 0$$
must hold, so a sufficiently large deviation from it is evidence against geometrically distributed waiting times. The test statistic is
$$\hat{S}_{ij} = \frac{\sqrt{N(i,j)}\,\big(\hat{g}_{ij}(1)\,(1 - \hat{g}_{ij}(1)) - \hat{g}_{ij}(2)\big)}{\sqrt{\hat{g}_{ij}(1)\,(1 - \hat{g}_{ij}(1))^{2}\,(2 - \hat{g}_{ij}(1))}}, \tag{14}$$
where $N(i,j)$ is the number of transitions from state $i$ to state $j$ observed in the sample and $\hat{g}_{ij}(x)$ is the empirical estimator of the probability $g_{ij}(x)$, given by the ratio between the number of transitions from $i$ to $j$ occurring exactly after $x$ units of time and $N(i,j)$. Under the "geometrical" hypothesis $H_0$ (or Markovian hypothesis), this statistic has approximately the standard normal distribution. The corresponding proof can be obtained by applying the results given in (Stenberg, F., Manca, R. and Silvestrov, D., 2006). When $\hat{S}$ is the value of this statistic obtained from sample data, one can compute the quantity $2(1 - F(|\hat{S}|))$, where $F(s)$ is the standard normal distribution function. If this value is small, it is evidence for rejecting the hypothesis that the distribution is geometrical, i.e., evidence in favor of the "semi-Markov" hypothesis. We applied this method to our data and performed the tests at the 10% significance level (90% confidence level). The data used in this analysis come from records at the University of Napoli "Federico II", Italy, and give the packet rate of traffic traces of the inbound traffic of the UNINA network. The trace was collected in 2004 from the real network and captured using Plab.
The data were collected by a traffic research group that focuses on network monitoring and measurement, with the aim of measuring and evaluating both operational and experimental systems and networks in order to obtain knowledge about, and models of, network behavior. The packet rate was sampled with a period of 2 seconds and the trace lasts 2 hours. These data have been used in studies on volume-based anomaly detection and correspond to time intervals during which the NOC operators observed no anomalies on the UNINA network. Because the packet rate is described by three states, we estimated 3(3-1) = 6 waiting time distribution functions, and for each of them we calculated the value of the test statistic (14). The geometric hypothesis is rejected for 3 of the 6 distributions. The quantity $2(1 - F(|\hat{S}_{ij}|))$ takes the values 0, 0 and 0.1 for the statistics $\hat{S}_{12}$, $\hat{S}_{21}$ and $\hat{S}_{13}$, respectively. According to the remarks above, these values can be interpreted as strong evidence in favor of the semi-Markov hypothesis. Table 1 reports the results of the test for the waiting time distribution functions that support the semi-Markov hypothesis.
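The test described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used to produce the reported values. It assumes the statistic has the form $\sqrt{N}\,(\hat{g}(1)(1 - \hat{g}(1)) - \hat{g}(2)) / \sqrt{\hat{g}(1)(1 - \hat{g}(1))^2 (2 - \hat{g}(1))}$, whose denominator is the delta-method standard deviation of the numerator under the geometric hypothesis; the function name and the sample of waiting times are hypothetical.

```python
import math
from collections import Counter

def geometric_test(waiting_times):
    """Return (statistic, two-sided p-value) for one pair of states (i, j).

    waiting_times: observed sojourn times (positive integers) before i -> j.
    """
    n = len(waiting_times)
    counts = Counter(waiting_times)
    g1 = counts[1] / n   # empirical g(1): share of transitions after 1 time unit
    g2 = counts[2] / n   # empirical g(2): share of transitions after 2 time units
    denom = math.sqrt(g1 * (1 - g1) ** 2 * (2 - g1))
    if denom == 0:
        raise ValueError("statistic undefined when the empirical g(1) is 0 or 1")
    s = math.sqrt(n) * (g1 * (1 - g1) - g2) / denom
    # F is the standard normal distribution function, written with math.erf.
    F = 0.5 * (1 + math.erf(abs(s) / math.sqrt(2)))
    return s, 2 * (1 - F)

# Hypothetical sample: no waiting time of length 2, clearly non-geometric.
s, p_value = geometric_test([1] * 50 + [3] * 50)
```

A small `p_value` is evidence against the geometric (Markovian) hypothesis for that pair of states; the test would be repeated for every pair with enough observed transitions.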

MONTE CARLO SIMULATION
In this section, a synthetic time series of the same length as the real one is produced by Monte Carlo simulation. The Monte Carlo algorithm uses repeated random sampling to generate the successively visited states $\{J_0, J_1, \dots\}$ up to the horizon time $T$. The main distinction between the Markov and semi-Markov models is that in the latter the jump times $\{T_0, T_1, \dots\}$ of the successive transitions are regarded as random variables. The Monte Carlo algorithm used to simulate a trajectory of the semi-Markov model in the time interval $[0, T]$ is as follows:
1. Set $n = 0$, $J_0 = i$, $T_0 = 0$ and the horizon time $T$;
2. Sample $J$ from $\hat{p}_{J_n, \cdot}$ and set $J_{n+1} = J(\omega)$;
3. Sample $W$ from $\hat{G}_{J_n, J_{n+1}}$ and set $T_{n+1} = T_n + W(\omega)$;
4. If $T_{n+1} \ge T$, stop; otherwise set $X_{n+1} = T_{n+1} - T_n$ and $n = n + 1$, and go to step 2.
Figure 2 shows the real and simulated trajectories of the packet rate data.
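The four steps of the algorithm can be sketched as follows. This is a minimal illustration, not the code used to produce Figure 2. It assumes states indexed from 0, a row-stochastic matrix `p` for the embedded chain, and a user-supplied function `sample_wait(i, j, rng)` that draws a positive integer waiting time for the transition from `i` to `j`; all names and the example inputs are hypothetical.

```python
import numpy as np

def simulate_semi_markov(p, sample_wait, i0, horizon, rng):
    """Simulate visited states and jump times on [0, horizon)."""
    states, times = [i0], [0]                        # step 1: J_0 = i0, T_0 = 0
    while True:
        i = states[-1]
        j = int(rng.choice(len(p), p=p[i]))          # step 2: next state from row i of p
        t_next = times[-1] + sample_wait(i, j, rng)  # step 3: add the sampled waiting time
        if t_next >= horizon:                        # step 4: stop at the horizon
            return states, times
        states.append(j)
        times.append(t_next)

def to_time_series(states, times, horizon):
    """Expand (states, jump times) into the state occupied at each time step."""
    z = np.empty(horizon, dtype=int)
    ends = times[1:] + [horizon]
    for s, start, end in zip(states, times, ends):
        z[start:end] = s
    return z

# Hypothetical inputs: three states and made-up empirical waiting times.
p = np.array([[0.0, 0.7, 0.3], [0.5, 0.0, 0.5], [0.4, 0.6, 0.0]])
empirical_waits = {(i, j): [1, 1, 2, 5] for i in range(3) for j in range(3) if i != j}

def sample_wait(i, j, rng):
    return int(rng.choice(empirical_waits[(i, j)]))

rng = np.random.default_rng(0)
states, times = simulate_semi_markov(p, sample_wait, i0=0, horizon=3600, rng=rng)
z = to_time_series(states, times, horizon=3600)
```

Drawing the waiting time from a list of observed sojourn times is one simple way to sample from an empirical distribution; replacing `sample_wait` with a geometric sampler would give the Markov counterpart for comparison.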

MODEL VALIDATION
To evaluate the ability of the model to reproduce the statistical properties of the real packet rate data, we compute the autocorrelation function (ACF) of the real data and of the synthetic data generated by Monte Carlo simulation with both the semi-Markov and the Markov models. If $Z$ denotes the packet rate, the autocorrelation of the packet rate at time lag $\tau$ is defined as
$$\Sigma(\tau) = \frac{\mathrm{Cov}(Z(t + \tau), Z(t))}{\mathrm{Var}(Z(t))}.$$
The time lag was varied from 1 second up to 100 seconds. To make the results for $\Sigma(\tau)$ comparable, each simulated time series was generated with the same length as the real data. From figure 3 it can be concluded that the semi-Markov model reproduces the autocorrelation present in the real data better than the Markov model. Moreover, this figure confirms the test results of section 3.1, in which we concluded that the Markov hypothesis is rejected in favor of the semi-Markov model for the packet rate of the network traffic.
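The autocorrelation comparison can be reproduced with a simple sample estimator. The sketch below is one possible choice and is not necessarily the estimator used for figure 3: it centers the series with its overall mean, averages the lagged products over the available pairs, and divides by the sample variance. Lags are expressed in samples, so with a sampling period of 2 seconds a lag of $\tau$ seconds corresponds to $\tau / 2$ samples. The function name is hypothetical.

```python
import numpy as np

def autocorrelation(z, max_lag):
    """Sample autocorrelation of z at lags 1..max_lag (lags in samples)."""
    z = np.asarray(z, dtype=float)
    zc = z - z.mean()                 # centered series
    var = np.mean(zc ** 2)            # sample variance
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return np.array([np.mean(zc[lag:] * zc[:-lag]) / var
                     for lag in range(1, max_lag + 1)])

# Toy check on a strictly alternating series.
acf = autocorrelation([0, 1] * 50, max_lag=4)
```

Applying the same function to the real series and to the semi-Markov and Markov synthetic series yields curves that can be compared lag by lag.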

PROBABILITY DENSITY FUNCTION
From the hypothesis test and the autocorrelation function we concluded that the semi-Markov model should be selected to generate synthetic trajectories of the packet rate of the traffic trace. In this subsection, we compare the probability density function (pdf) of the real data with those of the semi-Markov and Markov synthetic data over the 3 states to find the best model. Figure 4 compares the pdfs of the three trajectories. From this figure, it can be inferred that the semi-Markov model matches the real data better than the Markov one. The semi-Markov fit is very good and leaves little room for improvement.

SEMI-MARKOV PREDICTIONS
We found that the semi-Markov model should be chosen to generate synthetic packet rate traffic data. Therefore, we can use this model together with the Monte Carlo simulation to forecast the future behavior of the network traffic with a period of 2 seconds. From the estimated transition probability matrix it can be concluded that a packet rate in state 1 or 3 tends to move to state 2, and a packet rate in state 2 tends to move to state 1. In figure 5, we show the interval transition probability matrices of the semi-Markov model, which were calculated with equation (12). The various matrices are plotted by varying the time t. It can be inferred that after one time unit, which is 2 seconds, a packet rate in state 1 or 3 tends to move to state 2 with probability 0.1324 and 0.6667, respectively, and a packet rate in state 2 moves to state 3 with probability 1. Supposing that the last packet rate is in state 1, the probabilities that the network traffic after 10 time units has a packet rate in state 2 and in state 3 are 0.47 and 0.75, respectively. These interval transition probabilities give the probabilities that high network traffic is followed by high network traffic again, with packet rates in states 1, 2 and 3, over the next 1 to 50 time units. All in all, the semi-Markov model gives realistic forecast values. By estimating the interval transition probabilities, we can describe the transitions between the states of the system as a function of time.

4. Conclusions
Semi-Markov processes retain the Markovian hypothesis, but in a more flexible form. Their main feature is the ability to reproduce the duration effect of the random event under study. This is achieved by allowing the sojourn times in the states of the process to be arbitrarily distributed, including non-memoryless distributions. In this paper, we introduced a statistic for testing the semi-Markov hypothesis and used network traffic data to show the capability of this test. Simple interpretation and mathematical tractability are the main advantages of the semi-Markov model. The results gave strong evidence in favor of the semi-Markov hypothesis against the Markov hypothesis. Moreover, a synthetic time series of the same length as the real one was produced by means of Monte Carlo simulation. This stochastic simulation technique allows us to forecast the behavior of the system. The autocorrelation functions confirmed the results of the hypothesis test, with the semi-Markov model preferred to the Markov model. We also compared the probability density function of the real data with those of the semi-Markov and Markov synthetic data over the 3 states to find the best model. This comparison confirmed the semi-Markov model as well. Finally, the transition probability matrix and the interval transition probabilities at different times were calculated; they give the predictions of the future network traffic. Monitoring Internet traffic provides input for developing new protocols, algorithms and systems for the current and future Internet, with a particular focus on network management and security.