A zero truncated discrete distribution: Theory and applications to count data

The analysis and modeling of zero truncated count data is of primary interest in many fields such as engineering, public health, sociology, psychology and epidemiology. Therefore, in this article we propose a new model with a simple structure, named the zero truncated discrete Lindley distribution. The distribution contains some submodels and represents a two-component mixture of a zero truncated geometric distribution and a zero truncated negative binomial distribution with certain parameters. Several properties of the distribution are obtained, such as the mean residual life function, probability generating function, moments of the residual life function, raw moments, estimation of parameters, Shannon and Rényi entropies, a characterization, and the stress-strength parameter. Moreover, the collective risk model is discussed by considering the proposed distribution as the primary distribution and the exponential and Erlang distributions as secondary ones. Test and evaluation statistics as well as three real-life data applications are considered to assess the performance of the distribution against the most frequently used zero truncated discrete probability models.

of stroke count data, Phange and Loh (2013) and the references therein discussed the application of the ZTNB in the analysis of abundance of rare species and hospital stay. Grogger and Carson (1991) applied the truncated model to the number of recreational fishing trips taken by a sample of Alaskan fishermen. Moreover, Ghitany et al. (2008), Borah and Saikia (2017), Shanker (2017) and Shanker and Shukla (2017) discussed the ZTPL, ZTDS, ZTPA and ZTPGr distributions, respectively, and applied them to real count data. The zero truncated models discussed above neither have high peakedness nor possess heavy tails. In addition, these models do not work very well under both over- and under-dispersion. Moreover, none of the above mentioned models exhibits skewness and kurtosis in the intervals (2, ∞) and (6, ∞), respectively, and these models do not possess closed-form, algebraically manipulable survival functions. These drawbacks were the motivation to propose a zero truncated model with a simple structure and two parameters, named the zero truncated discrete Lindley distribution (ZTDL). The proposed ZTDL has closed-form expressions for its various statistical properties, along with a heavier right tail than the above competing zero truncated models. Moreover, the peakedness and skewness of the proposed ZTDL model approach 6 and 2, respectively, which is not observed in the above mentioned models except the ZTPL distribution. Further, the proposed model may be helpful in developing hurdle models, by acting as a baseline distribution, and in regression modeling. Also, the distribution can accommodate both over- and under-dispersion. The flexibility of the two parameter ZTDL over the one parameter ZTDL and ZTG (zero truncated geometric) distributions shall be investigated later by using stochastic ordering.
Further, the flexibility of the model is investigated by its consonance with some real data sets, as revealed through techniques such as empirical distribution function plots, empirical hazard function plots, and the Anderson-Darling, Kolmogorov-Smirnov (KS), Akaike information criterion (AIC), Bayesian information criterion (BIC) and corrected Akaike information criterion (AICc) statistics. For this purpose three real data sets are analyzed, and it is noted that the proposed model is the most suitable distribution among the competitors, exhibiting minimum values for those statistics. In addition, the proposed model exhibits only an increasing failure rate, which is a desirable property in life-testing experiments. Other motivations of the model in terms of its hazard rate and mean residual life functions are discussed in the coming sections.
A Lindley distribution is used extensively to describe the lifetime of a process or a device in a wide variety of fields, including reliability, engineering, biology, and medicine. The first version of the Lindley distribution was proposed by Lindley (1958). After that, a number of forms of the distribution have been described in the academic literature, such as a generalized Lindley distribution.
The discrete distribution proposed here is the zero truncated discrete version of the distribution specified by (1.1), and its probability mass function is obtained via the discretization approach by incorporating (1.1) in the following identity, where f(x) is the probability density function of a continuous random variable. Therefore, in the light of the equations above, we define the zero truncated discrete Lindley distribution, denoted by ZTDL(p, β), with parameters p and β as
$P(Y = x) = \dfrac{(1-p)^2 (1+\beta x)\, p^{x-1}}{1+\beta-p}, \quad x = 1, 2, 3, \ldots$ (1.2)
where 0 < p < 1 and β ≥ 0. For β = 0 the distribution reduces to the zero truncated geometric distribution, and for β = 1 it becomes the one parameter zero truncated discrete Lindley distribution; the case β = 1 has not been discussed before.
Also, when = 2, the distribution reduces to the zero truncated negative binomial distribution with parameters (2, p). Moreover, the probability mass function in (1.2) shows that the ZTDL(p, β) distribution is a two-component mixture of a zero truncated geometric distribution with parameter p and a zero truncated negative binomial distribution with parameters (2, p), with mixing proportions $\frac{1-p}{1+\beta-p}$ and $\frac{\beta}{1+\beta-p}$, respectively. Figure 1 indicates how the parameters p and β affect the ZTDL probability mass function and shows right skewness and unimodality. Moreover, the ZTDL distribution can be obtained by using the expression $P(Y = x) = \dfrac{P(X = x)}{1 - P(X = 0)}$, x = 1, 2, …, where P(X = x) is the two parameter discrete Lindley distribution introduced by Hussain et al. (2016). The recursive relation between ZTDL probabilities is given by $P(Y = x+1) = p\,\dfrac{1+\beta(x+1)}{1+\beta x}\,P(Y = x)$.
Since the ZTDL is based on two parameters, namely the shape parameter p and the scale parameter β, Figure 1 indicates that as p → 0 the mode of the function moves towards the left, that is, the probability graph of the ZTDL takes a reverse J shape for all values of β. However, as p → 1 the mode of the distribution moves towards the right for all values of β. Moreover, the distribution does not achieve perfect symmetry for any values of p and β. In addition, we have also observed theoretically that $P(Y = x) \to 0$ as p → 1 and $P(Y = x) \to (1-p)^2\, x\, p^{x-1}$ as β → ∞.
The survival and hazard functions of the ZTDL distribution are
$S_x = P(Y \ge x) = \dfrac{p^{x-1}\,[(1-p)(1+\beta x) + \beta p]}{1+\beta-p}$ (1.3)
and
$h_x = \dfrac{P(Y = x)}{S_x} = \dfrac{(1-p)^2 (1+\beta x)}{(1-p)(1+\beta x) + \beta p}$.
Here $S_x \to 0$ as x → ∞. However, $h_x \to 1 - p$ as x → ∞; therefore, the hazard rate function of the ZTDL distribution is bounded above, which is an important property for lifetime models. Figure 2 gives some plots of the hazard rate function for selected values of the parameters. It is observed that the ZTDL distribution exhibits an increasing failure rate for all values of x, p and β. However, if p is interpreted as the probability that the product works successfully, then as p → 1 the intensity of the failure rate decreases for all values of β.
Moreover, if p → 0 the distribution has a constant failure rate. The mean residual life (MRL) function of the ZTDL distribution is expressed in equation (1.5). We may interpret the second term of equation (1.5) as a correction term relative to the MRL of the zero truncated geometric distribution, for which β = 0. Also, it is noted that the MRL of the ZTDL distribution is greater than or equal to the MRL of the zero truncated geometric and one parameter zero truncated discrete Lindley distributions, depending on the value of the parameter β. Further discussion of the MRL will be given later.
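The mixture representation and the bounded, increasing hazard rate described above can be checked numerically. The following is a minimal sketch, assuming the pmf has the form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p) on x = 1, 2, …, which is the form implied by the mixing proportions and limits stated in the text, and assuming the hazard is taken as P(Y = x) / P(Y ≥ x). The function names are illustrative and do not come from the paper.

```python
def ztdl_pmf(x, p, beta):
    """Assumed ZTDL pmf on x = 1, 2, ... (illustrative helper, not from the paper)."""
    if x < 1:
        return 0.0
    return (1 - p) ** 2 * (1 + beta * x) * p ** (x - 1) / (1 + beta - p)


def mixture_pmf(x, p, beta):
    """The same pmf written as the two-component mixture described in the text."""
    w_geo = (1 - p) / (1 + beta - p)         # weight of the zero truncated geometric part
    geo = (1 - p) * p ** (x - 1)             # zero truncated geometric(p)
    nb2 = x * (1 - p) ** 2 * p ** (x - 1)    # second component (assumed form)
    return w_geo * geo + (1 - w_geo) * nb2


def hazard(x, p, beta, terms=2000):
    """Hazard rate P(Y = x) / P(Y >= x), with the tail summed numerically."""
    tail = sum(ztdl_pmf(k, p, beta) for k in range(x, x + terms))
    return ztdl_pmf(x, p, beta) / tail


p, beta = 0.5, 1.0
total = sum(ztdl_pmf(x, p, beta) for x in range(1, 500))
print("total probability:", total)
print("hazard at x = 1, 2, 5, 50:", [hazard(x, p, beta) for x in (1, 2, 5, 50)])
```

Under these assumptions the hazard values increase with x and stay below 1 − p, in line with the boundedness property noted above.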
The rest of the article is organized as follows. Section 2 gives several properties of the ZTDL distribution, such as the reliability function, mean residual life function, probability generating function, moments, moments of the residual life function, estimation of parameters via the maximum likelihood method with the associated Fisher information matrix, and the Shannon and Rényi entropies. A characterization, convolution and the stress-strength parameter are investigated in Section 3, together with the collective risk model in which the proposed distribution is considered as the primary distribution and the exponential and Erlang distributions as secondary ones. Test and evaluation statistics as well as three real data applications for the ZTDL distribution are discussed in Section 4. Finally, Section 5 deals with conclusions.

Properties of the ZTDL distribution
In this section, some statistical and reliability properties of the ZTDL distribution are investigated by the following theorems, propositions and corollaries.

Log-concavity and its implications:
A probability distribution is said to be log-concave if its probability mass function (pmf) $p_x$ satisfies the inequality $p_x^2 \ge p_{x-1}\,p_{x+1}$, ∀ x. The log-concavity of a distribution has important implications for the characteristics of its reliability function, failure rate function, tail probabilities and moments. In particular, log-concavity produces an increasing failure rate and a monotonically decreasing mean residual life function (see Chakraborty and Ong, 2016). The next proposition shows the log-concavity of the ZTDL distribution.
Proposition 2.1: If Y ~ ZTDL(p, β) then the probability mass function (pmf) of the random variable Y is log-concave for all choices of β, independently of p.
Proof: In order to show that the ZTDL distribution defined by (1.2) is log-concave, it is sufficient to show that $p_x^2 \ge p_{x-1}\,p_{x+1}$ for β ≥ 0 and x = 2, 3, 4, …, which follows by noting that $(1+\beta x)^2 \ge (1+\beta x)^2 - \beta^2$.
As consequences of log-concavity:
i) ZTDL has an increasing failure rate function.
ii) The convolution of ZTDL with any other log-concave discrete distribution also results in a log-concave distribution.
iii) For any integer k ≥ 1, $p_{x+k}/p_x \ge p_{y+k}/p_y$ for x < y, that is, $p_{x+k}/p_x$ is non-increasing in x.
iv) ZTDL has at most an exponential tail.
Proof: According to Abouammoh and Mashhour (1981) and Nekoukhou et al. (2012), the unimodality condition is satisfied by log-concave pmfs. This completes the proof.
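The log-concavity inequality can also be verified exactly for particular parameter values. The sketch below assumes the pmf form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p) and uses rational arithmetic so that the boundary case β = 0, where the inequality holds with equality, is not disturbed by rounding. The helper names are illustrative.

```python
from fractions import Fraction


def ztdl_pmf(x, p, beta):
    """Assumed ZTDL pmf on x = 1, 2, ...; exact when p and beta are Fractions."""
    return (1 - p) ** 2 * (1 + beta * x) * p ** (x - 1) / (1 + beta - p)


def is_log_concave(p, beta, x_max=60):
    """Check p_x^2 >= p_{x-1} p_{x+1} for x = 2, ..., x_max."""
    return all(
        ztdl_pmf(x, p, beta) ** 2
        >= ztdl_pmf(x - 1, p, beta) * ztdl_pmf(x + 1, p, beta)
        for x in range(2, x_max + 1)
    )


def reduced_gap(x, beta):
    """(1 + beta x)^2 - (1 + beta (x - 1)) (1 + beta (x + 1)), free of p."""
    return (1 + beta * x) ** 2 - (1 + beta * (x - 1)) * (1 + beta * (x + 1))


print(is_log_concave(Fraction(1, 2), Fraction(3, 2)))
print(reduced_gap(5, Fraction(3, 2)))  # equals beta squared for every x
```

The gap returned by `reduced_gap` does not depend on x or p, which is why the proposition holds independently of p.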
The following proposition confirms again the increasing failure rate of the ZTDL distribution.
Regarding the infinite divisibility, self-decomposability and stability of ZTDL(p, β), it is noted that, according to Steutel and van Harn (1979) and Nekoukhou et al. (2012), a necessary condition for infinite divisibility of a discrete distribution is that $p_0 > 0$, which is not fulfilled in the ZTDL case since $p_0 = 0$. Moreover, the classes of self-decomposable and stable distributions are subclasses of the infinitely divisible distributions, so a distribution that is not infinitely divisible is neither self-decomposable nor stable; see Nekoukhou et al. (2012).
The recognition of any discrete probability distribution is usually based on its probability generating function (pgf). The following theorem gives the pgf of ZTDL distribution.

Proof:
The proof can be obtained by routine calculations.

Corollary 2.3:
If t is replaced by $e^t$, we get the moment generating function (mgf) of the ZTDL distribution, where $0 < p\,e^t < 1$.
Corollary 2.4: The rth derivative of equation (2.1) with respect to t at t = 0 yields the rth moment. Hence, from the above corollary we obtain the first four moments about the origin, expressed through the polylogarithm function (see Gradshteyn and Ryzhik, 2008). By using these expressions, the mean and variance of the distribution are obtained.
In statistics, over- and under-dispersion relative to the Poisson distribution is commonly observed in count data. There are various causes of such phenomena, like heterogeneity and aggregation for over-dispersion, and repulsion for under-dispersion, although the latter is less frequent; see Kokonendji and Mizère (2005). The index of dispersion (ID), defined as the variance to mean ratio, is a measure of such phenomena. For the ZTDL distribution, as p → 0 the ID becomes equal to zero, i.e., for smaller values of p the distribution is under-dispersed. Similarly, as p → 1 the ID approaches infinity.
Moreover, for β → 0 the ID becomes p/(1 − p), the value for the zero truncated geometric distribution. It is noted that, for β < 8 and p < 0.4145, the distribution is under-dispersed, and as p → 1 the distribution is over-dispersed even at smaller values of β, i.e., for β → 0 and p → 1, ID > 1.
Another important statistical property is the skewness, which refers to the departure from symmetry and is defined as the ratio of the square of the third central moment to the cube of the second central moment. Kurtosis measures the degree of peakedness or flatness of a unimodal frequency curve and is defined as the ratio of the fourth central moment to the square of the second central moment. Both quantities have closed-form expressions in p and β for the ZTDL distribution. Figure 4 gives plots of the skewness and kurtosis for different values of the distribution parameters. It is observed that the distribution is positively skewed and leptokurtic in nature. The skewness approaches 2 as p and β increase, and the peakedness of the distribution takes higher values at smaller p and β; however, as p and β increase it settles down to 6. From the discussion above it is clear that the ordinary moments are generally helpful in estimating the unknown parameters and in characterizing the shape of the distribution.
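The dispersion behaviour described above can be explored by direct summation rather than through the closed-form moments. This is a minimal sketch, assuming the pmf form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p); the truncation point `x_max` and the function names are illustrative choices.

```python
def ztdl_pmf(x, p, beta):
    """Assumed ZTDL pmf on x = 1, 2, ... (illustrative helper)."""
    return (1 - p) ** 2 * (1 + beta * x) * p ** (x - 1) / (1 + beta - p)


def raw_moment(r, p, beta, x_max=5000):
    """r-th moment about the origin, by truncated summation."""
    return sum(x ** r * ztdl_pmf(x, p, beta) for x in range(1, x_max + 1))


def index_of_dispersion(p, beta):
    """Variance-to-mean ratio of the assumed ZTDL pmf."""
    mean = raw_moment(1, p, beta)
    variance = raw_moment(2, p, beta) - mean ** 2
    return variance / mean


for p in (0.05, 0.3, 0.6, 0.9):
    print(p, index_of_dispersion(p, 1.0))
```

Values below 1 indicate under-dispersion and values above 1 indicate over-dispersion, so scanning p for a fixed β locates the changeover point numerically.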

Residual life function with some properties
The residual life random variable is widely used in risk analysis, so we investigate some of its properties, such as the survival function, mean, variance and rth moment, for the ZTDL distribution. The residual life is defined by the conditional random variable R(t) = X − t | X > t, t ≥ 0, and is described as the period from time t until the time of failure. The survival function $S_{R(t)}$ of the residual lifetime R(t) follows from (1.2) and (1.3), and the corresponding probability mass function can be obtained by $P(R(t) = x) = S_{R(t)}(x) - S_{R(t)}(x + 1)$. The mean residual life is defined through $p_x$ and $S_t$, which are given in (1.2) and (1.3), respectively. Since the failure rate function $h_x$ of the ZTDL distribution is increasing, we conclude that the mean residual life is decreasing. The variance residual life has attracted considerable interest in recent years, and for the ZTDL distribution it is also available.
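The mean of the residual life R(t) = X − t | X > t can be computed by direct summation, which gives a simple way to see the decreasing behaviour mentioned above. The sketch assumes the pmf form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p) and integer t ≥ 0; the function names and truncation point are illustrative.

```python
def ztdl_pmf(x, p, beta):
    """Assumed ZTDL pmf on x = 1, 2, ... (illustrative helper)."""
    return (1 - p) ** 2 * (1 + beta * x) * p ** (x - 1) / (1 + beta - p)


def mean_residual_life(t, p, beta, x_max=5000):
    """E[X - t | X > t] for integer t >= 0, by truncated summation."""
    support = range(t + 1, x_max + 1)
    numerator = sum((x - t) * ztdl_pmf(x, p, beta) for x in support)
    denominator = sum(ztdl_pmf(x, p, beta) for x in support)
    return numerator / denominator


p = 0.4
print([mean_residual_life(t, p, 1.0) for t in range(6)])  # beta = 1
print([mean_residual_life(t, p, 0.0) for t in range(6)])  # geometric case
```

For β = 0 the values are constant at 1/(1 − p), the memoryless geometric case, while for β > 0 they decrease towards that constant.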

Entropies
In statistical mechanics, the entropy may be interpreted as a measure of disorder in the distribution. The most frequently discussed entropies are the Shannon and Rényi entropies. The derived expressions for these entropies when Y is a discrete ZTDL random variable are given below. The Shannon entropy is defined by E[−log P(Y = x)] and is expressed by using the Taylor series expansion of ln(1 + z), where $\mu'_r$ and μ are the rth moment about the origin and the mean of the distribution, respectively. The Rényi entropy is defined as $I_R(\lambda) = \dfrac{1}{1-\lambda}\,\log \sum_{x} P(Y = x)^{\lambda}$, where λ > 0 and λ ≠ 1.
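Both entropies can be approximated by truncated sums, which is a convenient check on any series expansion. The sketch below assumes the pmf form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p) and uses the standard definition of the Rényi entropy, (1/(1 − λ)) log Σ P(Y = x)^λ; natural logarithms and the truncation point are illustrative choices.

```python
import math


def ztdl_pmf(x, p, beta):
    """Assumed ZTDL pmf on x = 1, 2, ... (illustrative helper)."""
    return (1 - p) ** 2 * (1 + beta * x) * p ** (x - 1) / (1 + beta - p)


def shannon_entropy(p, beta, x_max=2000):
    """E[-log P(Y = x)] by truncated summation; zero terms are skipped."""
    total = 0.0
    for x in range(1, x_max + 1):
        q = ztdl_pmf(x, p, beta)
        if q > 0.0:
            total -= q * math.log(q)
    return total


def renyi_entropy(lam, p, beta, x_max=2000):
    """(1 / (1 - lam)) * log(sum_x P(Y = x)^lam) for lam > 0, lam != 1."""
    if lam <= 0 or lam == 1:
        raise ValueError("lam must be positive and different from 1")
    power_sum = sum(ztdl_pmf(x, p, beta) ** lam for x in range(1, x_max + 1))
    return math.log(power_sum) / (1 - lam)


print(shannon_entropy(0.5, 1.0))
print(renyi_entropy(0.5, 0.5, 1.0), renyi_entropy(2.0, 0.5, 1.0))
```

As λ approaches 1 the Rényi value approaches the Shannon value, which gives a quick consistency check between the two functions.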

Parameter Estimation with Inference
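Let $X_1, X_2, \ldots, X_n$ be a random sample drawn independently and identically from the ZTDL distribution with observed values $x_1, x_2, \ldots, x_n$; then the log-likelihood function is
$\ell(p, \beta) = 2n \log(1-p) - n \log(1+\beta-p) + \sum_{i=1}^{n} \log(1+\beta x_i) + \left(\sum_{i=1}^{n} x_i - n\right) \log p$. (2.3)
On partially differentiating both sides of equation (2.3) with respect to p and β and equating the derivatives to zero, we obtain the likelihood equations
$-\dfrac{2n}{1-p} + \dfrac{n}{1+\beta-p} + \dfrac{\sum_{i=1}^{n} x_i - n}{p} = 0$ and $\sum_{i=1}^{n} \dfrac{x_i}{1+\beta x_i} - \dfrac{n}{1+\beta-p} = 0$,
whose solution gives the MLEs of p and β. The proof of Proposition 2.4 is given through the next corollaries; on using the above expression and then simplifying it, we get (2.5).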
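Because the likelihood equations generally have to be solved numerically, a simple way to obtain starting values is a grid search over the log-likelihood. The following is a minimal sketch, assuming the pmf form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p); the grid bounds, the grid resolution and the sample are hypothetical and only illustrate the idea. A proper optimizer would normally refine the result.

```python
import math


def log_likelihood(data, p, beta):
    """Log-likelihood of the assumed ZTDL pmf for observations >= 1."""
    n = len(data)
    return (
        2 * n * math.log(1 - p)
        - n * math.log(1 + beta - p)
        + sum(math.log(1 + beta * x) for x in data)
        + (sum(data) - n) * math.log(p)
    )


def grid_mle(data, beta_max=10):
    """Coarse grid search: p in {0.01, ..., 0.99}, beta in {0, 0.1, ..., beta_max}."""
    best = None
    for i in range(1, 100):
        p = i / 100
        for j in range(0, 10 * beta_max + 1):
            beta = j / 10
            value = log_likelihood(data, p, beta)
            if best is None or value > best[0]:
                best = (value, p, beta)
    return best[1], best[2], best[0]


# Hypothetical zero truncated counts, for illustration only.
sample = [1, 1, 2, 1, 3, 2, 5, 1, 2, 4]
p_hat, beta_hat, ll_hat = grid_mle(sample)
print(p_hat, beta_hat, ll_hat)
```

The grid estimate is only as fine as the grid itself, and an estimate on the boundary of the β range suggests widening `beta_max`.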

Proof:
The proof is simple.

Corollary 2.8 :
The minimum variance bound estimator for the ZTDL(p, β) distribution when β is known is the sample mean $\bar{X}$, whose variance attains the bound.

Convolution
Convolution is used in the mathematics of many fields, such as probability and statistics. In engineering, convolution is used to describe the relationship between three quantities of interest: the input signal, the impulse response, and the output signal. The following theorem deals with the convolution of the ZTDL distribution. Its proof reduces, on simplification, to equation (3.1). In the next corollaries, the abbreviations $ZTDL_1(\cdot)$, $G_1(\cdot)$, $G_0(\cdot)$ and $DL_0(\cdot)$ denote the ZTDL, zero truncated geometric, geometric and one parameter zero truncated discrete Lindley distributions, respectively.

Collective risk model
Random sum models are widely used in risk theory, queueing systems, reliability theory, economics, communications, and medicine. In insurance contexts, the compound distribution of random sums arises naturally as follows (see Bakouch and Severini (2009) and Bhati et al. (2015)). Let the random variable N be the number of claims in a certain period, and let $X_i$, the size of the ith claim, be the claim severity random variable, which is independent of N. The aggregate loss is then defined as $S = \sum_{i=1}^{N} X_i$. It is well known that the pdf of S is given by $f_S(x) = \sum_{n=1}^{\infty} p_n\, f^{*n}(x)$, where $p_n$ denotes the probability of n claims (primary distribution) and $f^{*n}(x)$ is the n-fold convolution of $f(x)$, the claim amount density (secondary distribution). Moreover, the distribution of the random variable S is known as a compound distribution of the random variables N and X. We shall consider two such situations. In the first, the primary distribution is as defined in equation (1.2) and the claim severity distribution is exponential with parameter λ. In the second, we consider the Erlang distribution with parameters r and λ as the secondary distribution, which usually arises in insurance settings when the individual claim amounts follow a gamma distribution.

Theorem 3.2:
Suppose that ZTDL(p, β) acts as the primary distribution and an exponential distribution with parameter λ acts as the secondary distribution; then the pdf of the aggregate loss random variable $S = \sum_{i=1}^{N} X_i$, where the claim sizes $X_i$ are exponential with parameter λ, is given by

Proof:
If the claim severity follows an exponential distribution with parameter λ > 0, then, since the n-fold convolution of the exponential distribution is the gamma distribution with parameters n and λ, the n-fold convolution is given by $f^{*n}(x) = \dfrac{\lambda^{n}}{(n-1)!}\, x^{n-1} e^{-\lambda x}$, n ∈ ℕ.
Hence, the pdf of the random variable S is given by $f_S(x) = \sum_{n=1}^{\infty} p_n\, f^{*n}(x)$, which after simplification yields the stated result.

In the actuarial literature the mean of the aggregate loss random variable S can be obtained as E(S) = E(X)E(N).
Hence, if X ~ Exp(λ) then E(S) = E(N)/λ.
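The compound density with exponential claims can be evaluated term by term and checked against the identity E(S) = E(X)E(N). This is a minimal sketch, assuming the primary pmf has the form (1 − p)^2 (1 + βn) p^(n−1) / (1 + β − p) and using the gamma density for the n-fold convolution; the truncation point `n_max`, the integration range and the midpoint rule are illustrative numerical choices rather than part of the paper.

```python
import math


def ztdl_pmf(n, p, beta):
    """Assumed ZTDL pmf on n = 1, 2, ... (illustrative helper)."""
    return (1 - p) ** 2 * (1 + beta * n) * p ** (n - 1) / (1 + beta - p)


def aggregate_pdf(s, p, beta, lam, n_max=80):
    """f_S(s) = sum_n P(N = n) * Gamma(n, lam) density at s, for s > 0."""
    if s <= 0:
        return 0.0
    total = 0.0
    for n in range(1, n_max + 1):
        # log of lam^n s^(n-1) exp(-lam s) / (n-1)!, computed stably
        log_density = n * math.log(lam) + (n - 1) * math.log(s) - lam * s - math.lgamma(n)
        total += ztdl_pmf(n, p, beta) * math.exp(log_density)
    return total


def midpoint_integral(g, upper, step):
    """Midpoint-rule approximation of the integral of g over (0, upper)."""
    cells = int(round(upper / step))
    return step * sum(g((i + 0.5) * step) for i in range(cells))


p, beta, lam = 0.3, 1.0, 2.0
mass = midpoint_integral(lambda s: aggregate_pdf(s, p, beta, lam), 30.0, 0.01)
mean_s = midpoint_integral(lambda s: s * aggregate_pdf(s, p, beta, lam), 30.0, 0.01)
mean_n = sum(n * ztdl_pmf(n, p, beta) for n in range(1, 2000))
print(mass, mean_s, mean_n / lam)
```

The integral of the density should be close to 1 and the numerical mean of S close to E(N)/λ, which is a useful sanity check before comparing with a closed-form expression.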

Proof:
Assuming that the claim severity follows an Erlang distribution with parameters 2 and λ > 0, the n-fold convolution is given by $f^{*n}(x) = \dfrac{\lambda^{2n}}{(2n-1)!}\, x^{2n-1} e^{-\lambda x}$; hence the pdf of the random variable S is given by $f_S(x) = \sum_{n=1}^{\infty} p_n\, f^{*n}(x)$, and the result follows by summing the series. It is also observed that the mean of the aggregate loss random variable S can be obtained as E(S) = E(X)E(N). Hence, if X ~ Erlang(2, λ) then E(S) = 2E(N)/λ.

Stress-strength parameter
In the context of reliability, the stress-strength model describes the life of a component which has a random strength $X_1$ and is subjected to a random stress $X_2$. The component fails at the instant when the stress applied to it exceeds the strength, and the component functions satisfactorily whenever $X_1 > X_2$. Therefore, $R = P(X_1 > X_2)$ is a measure of component reliability. It has many applications, especially in engineering, such as structures, deterioration of rocket motors, fatigue failure of aircraft structures, and the ageing of concrete pressure vessels.
Proposition 3.3: Suppose $X_1 \sim ZTDL(p_1, \beta_1)$ and $X_2 \sim ZTDL(p_2, \beta_2)$ are two independent discrete random variables; then the stress-strength parameter $R = P(X_1 > X_2) = \sum_{x=1}^{\infty} P(X_2 = x)\, P(X_1 > x)$ admits a closed-form expression in $(p_1, \beta_1, p_2, \beta_2)$.
Proof: The proof is obtained from the definition of R and simplification.

Figure 5: Stress Strength Parameter Graph for the indicated values of p and β
The plots in Figure 5 show that as the stress parameter $p_2$ approaches 1, the reliability of the system approaches zero for all values of $\beta_1$ and $\beta_2$.
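The behaviour of R = P(X1 > X2) can be reproduced numerically without the closed-form expression. The sketch below assumes both variables follow the pmf form (1 − p)^2 (1 + βx) p^(x−1) / (1 + β − p) and accumulates Σ P(X2 = x) P(X1 > x) with a running tail probability; the function names and the truncation point are illustrative.

```python
def ztdl_pmf(x, p, beta):
    """Assumed ZTDL pmf on x = 1, 2, ... (illustrative helper)."""
    return (1 - p) ** 2 * (1 + beta * x) * p ** (x - 1) / (1 + beta - p)


def stress_strength(p1, beta1, p2, beta2, x_max=3000):
    """R = P(X1 > X2) = sum_x P(X2 = x) * P(X1 > x), by truncated summation."""
    reliability = 0.0
    tail1 = 1.0  # P(X1 > 0) = 1 because the support starts at 1
    for x in range(1, x_max + 1):
        tail1 -= ztdl_pmf(x, p1, beta1)  # now equals P(X1 > x)
        reliability += ztdl_pmf(x, p2, beta2) * max(tail1, 0.0)
    return reliability


for p2 in (0.2, 0.5, 0.8, 0.99):
    print(p2, stress_strength(0.5, 1.0, p2, 1.0))
```

With the strength parameters held fixed, the printed values shrink as p2 grows, matching the pattern described for Figure 5.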

Test and evaluation statistics
In

Data Application
For comparison purposes, we have used three data sets which are reported by Mir  These smaller values of the model selection statistics indicate that the proposed model is the best probability model among the competing models for this under-dispersed data set.
Experimental data sets: The second and third data sets are given in Tables 2(a) and 3(a) in the appendix. The second data set consists of remission times in weeks for 20 leukemia patients randomly assigned to a certain treatment. Treatment for leukemia aims to achieve remission. Remission means that no leukemia cells can be found in the blood or bone marrow and that the bone marrow is working normally again. In people treated for acute leukemia, remission may last many years, and then they are considered cured. The third data set represents at least one goal scored directly from a penalty kick, foul kick or any other direct kick (all of them together will be called a kick goal) by any team. These data sets are taken from Bakouch et al. (2014) and Kundu and Gupta (2009), respectively. From Tables 2(b) and 3(b) in the appendix, it is evident that these data sets are over-dispersed. The KS, Anderson-Darling, AIC, BIC and AICc statistics are provided in Tables 2(d) and 3(d). From these results it is observed that the proposed model is also appropriate for over-dispersed data, showing the minimum negative log-likelihood, Anderson-Darling and KS statistics, with higher p-values. Moreover, the corresponding distribution plots comparing the fitted and observed distribution functions of the three data sets are shown in Figures 6-8. Also, the hazard function plots comparing the fitted and empirical functions are displayed in Figures 9-11. The distribution plots and empirical hazard functions show that the proposed ZTDL distribution provides better fits than the other competing distributions for all three data sets.
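The information criteria used for model comparison are simple functions of the maximized log-likelihood. The following sketch uses the standard definitions of AIC, BIC and AICc, with k the number of estimated parameters and n the sample size; the numerical inputs are hypothetical and are not taken from the paper's tables.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k log n - 2 log L."""
    return k * math.log(n) - 2 * log_lik


def aicc(log_lik, k, n):
    """Corrected AIC: AIC + 2k(k + 1) / (n - k - 1), defined for n > k + 1."""
    if n <= k + 1:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)


# Hypothetical fit: log-likelihood -100 with k = 2 parameters and n = 20 observations.
print(aic(-100.0, 2), bic(-100.0, 2, 20), aicc(-100.0, 2, 20))
```

Smaller values indicate a better trade-off between fit and complexity, and the AICc correction matters most for small samples such as the 20-patient remission data.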

Conclusions
A zero truncated discrete Lindley distribution with two parameters is proposed; it contains the zero truncated geometric and the one parameter zero truncated discrete Lindley distributions as submodels. Moreover, the distribution represents a two-component mixture of a zero truncated geometric distribution and a zero truncated negative binomial distribution with certain parameters. Various distributional properties, reliability characteristics and a characterization are studied. Also, the collective risk model is discussed by considering the proposed distribution as the primary distribution and the exponential and Erlang distributions as secondary distributions. Moreover, it is found that the distribution not only has a simple structure, is mathematically amenable, and is more flexible with a longer tail than other models, but is also more appropriate for modeling actuarial, ecology, health, psychology, sociology and engineering data sets.