The normal-tangent-G class of probabilistic distributions: properties and real data modelling

This paper introduces a novel class of probability distributions called normal-tangent-G, whose submodels are parsimonious and bring no additional parameters besides the baseline's. We demonstrate that these submodels are identifiable as long as the baseline is. We present some properties of the class, including the series representation of its probability density function (pdf), and two special cases. Monte Carlo simulations are carried out to study the behavior of the maximum likelihood estimates (MLEs) of the parameters for a particular submodel, and an application to a real dataset exemplifies the modelling benefits of the class.


Introduction
Works on new probability distributions are vast in the statistical literature; the authors generally intend to propose extended models that provide more flexibility than the commonly used ones do. For this purpose, many generalizations of probabilistic distributions are formulated by incorporating one or more additional parameters. Mudholkar and Srivastava (1993) presented one of the most common ways to perform this technique, wherein they raised the Weibull cumulative distribution function (cdf) to the power of a shape parameter. Besides making the distribution more flexible, it allowed the hazard rate function (hrf) to take on new shapes. The procedure mentioned above was used by many authors to create generalized distributions. For instance, De Gusmão et al. (2011) raised the cdf of the complementary Weibull model, discussed by Drapella (1993), to a shape parameter γ > 0 and created a three-parameter distribution called generalized inverse Weibull (GiW, for short), whose cdf is $F_{GiW}(x \mid \alpha, \beta, \gamma) = e^{-\gamma \alpha^{\beta} x^{-\beta}}$. The inclusion of γ brings more flexibility; however, distributions with many parameters frequently present major problems, like non-identifiability. Cordeiro

and Castro (2011) introduced a new family of generalized distributions called Kumaraswamy-G, whose cdf is $F(x) = 1 - [1 - G(x)^a]^b$, where a > 0, b > 0 and G(x) is the cdf of a parent distribution. This generator is versatile and able to produce powerful submodels, whose extra parameters are related to the skewness and tail weights. Now, let us consider the Kumaraswamy-Exponential model, where θ = (λ, a, b) is the corresponding parametric vector; since $F_G(x \mid \theta) = 1 - [1 - (1 - e^{-\lambda x})^a]^b$, it is easy to see that if $\theta_1 = (\lambda_1, 1, b)$ and $\theta_2 = (\lambda_2, 1, \frac{\lambda_1}{\lambda_2} b)$, then $F_G(x \mid \theta_1) = F_G(x \mid \theta_2)$ for $\{\lambda_1, \lambda_2, b\} \subset \mathbb{R}^{*}_{+}$. In other words, infinitely many parametric vectors can be associated with the same cdf. Such a property is undesirable and may entail complications regarding the estimation of parameters. The GiW suffers from the same problem for any pair of parametric vectors $(\alpha_1, \beta, \gamma)$ and $(\alpha_2, \beta, \gamma(\alpha_1/\alpha_2)^{\beta})$. The list of classes like the Kumaraswamy-G, that is, classes containing at least one unidentifiable submodel (even when the baseline G is identifiable), is large. Some recent examples are the exponentiated logarithmic generated family (Marinho et al., 2018), the Gompertz-G family (Alizadeh et al., 2017), the Lindley family (Cakmakyapan and Ozel, 2017) and the exponentiated Kumaraswamy-G class. This does not mean that all of the submodels from the aforementioned classes are unidentifiable, nor that one should refrain from using them. Even so, researching and studying classes of probability distributions that furnish identifiable submodels is worthwhile. For example, as long as the baseline G is identifiable, any distribution emerging from the normal-G class (Silveira et al., 2019) is identifiable; its submodels are generally more flexible than the original baselines, and they have the advantage of adding no extra parameters to the cdf.
Some models, like the normal-Weibull, are able to fit asymmetrical data with either positive or negative skewness. Moreover, the distributions from the normal-G class can be continuous or discrete, depending on the baseline. The normal-G class was designed according to the method of generating classes of probability distributions presented by Brito et al. (2019).
The method has a high power of generalization. It introduces a general expression, equation (1), in which a cdf F is built from a cdf H, an integer n ∈ ℕ, functions ζ, ν : ℝ → ℝ, and integration limits $L_j, U_j, M_j, V_j : \mathbb{R} \to \mathbb{R} \cup \{\pm\infty\}$; certain conditions on these functions must be satisfied to ensure that F is a cdf. Setting n = 1 and H(t) = Φ(t), where Φ is the standard normal cdf, this paper brings the normal-tangent-G class of probability distributions (NT-G, for short), whose cdf is:

$$F_G(x) = \Phi\left\{\tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\right\}, \qquad (2)$$

such that G(x) is the baseline cdf. This novel class shares some important features with the normal-G (particularly with regard to identifiability) but also presents new potentialities. We investigate some interesting models from the NT-G class and make comparisons with other well-known extended probability distributions.
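To make the construction concrete, here is a minimal Python sketch (function names are ours, not from the paper) of the NT-G transform applied to an arbitrary baseline cdf, illustrated with a unit-rate exponential baseline:

```python
import math

def std_normal_cdf(t):
    # Φ(t) expressed through the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ntg_cdf(x, baseline_cdf):
    # F_G(x) = Φ(tan(π[G(x) − 1/2]))
    return std_normal_cdf(math.tan(math.pi * (baseline_cdf(x) - 0.5)))

# Example: exponential baseline with rate 1; its median ln 2 maps to F = Φ(0) = 1/2
G_exp = lambda x: 1.0 - math.exp(-x)
print(round(ntg_cdf(math.log(2.0), G_exp), 6))  # 0.5
```

Note that the transform sends the baseline median G(x) = 1/2 to tan(0) = 0, so the NT-G distribution preserves the baseline's median.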

The normal-tangent-G class and some mathematical properties
If $T(x) = \tan[\pi(G(x) - \frac{1}{2})]$ and S(x) are the upper limits of the integral in (1) for the NT-G and the normal-G classes respectively, it is easy to verify that both tend to $-\infty$ as $G(x) \to 0^{+}$ and to $+\infty$ as $G(x) \to 1^{-}$. Considering that $\phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$ and $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$, one can write the NT-G cdf as follows:

$$F_G(x) = \int_{-\infty}^{\tan[\pi(G(x) - \frac{1}{2})]} \phi(t)\,dt = \Phi\left\{\tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\right\}. \qquad (3)$$

Notice that there are no additional parameters besides those of $G(x \mid \theta)$. In the case of a continuous G(x), one can take the derivative of (3) with respect to x to obtain the following pdf:

$$f_G(x) = \pi g(x) \sec^2\left[\pi\left(G(x) - \frac{1}{2}\right)\right] \phi\left\{\tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\right\}, \qquad (4)$$

where g(x) is the pdf of the random variable whose cdf is G(x). The hrf of the NT-G class is given by:

$$h_G(x) = \frac{f_G(x)}{1 - F_G(x)}. \qquad (5)$$

The lack of identifiability is a problem that may cause complications in parametric estimation, since the uniqueness of the estimates is then uncertain. Hence, inferences on the parameters of an identifiable distribution are more reliable. As Theorem 2.1 states, members of the NT-G class are identifiable as long as the baseline G is.
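As a sanity check on the pdf, the following sketch (unit-rate exponential baseline, function names ours) verifies numerically that the expression π g(x) sec²(u) φ(tan u), with u = π[G(x) − 1/2], agrees with the derivative of the cdf:

```python
import math

def phi(t):
    # standard normal pdf
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(t):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# unit-rate exponential baseline
G = lambda x: 1.0 - math.exp(-x)
g = lambda x: math.exp(-x)

def ntg_pdf(x):
    # f_G(x) = π g(x) sec²(u) φ(tan u), with u = π[G(x) − 1/2]
    u = math.pi * (G(x) - 0.5)
    return math.pi * g(x) * phi(math.tan(u)) / math.cos(u) ** 2

def ntg_cdf(x):
    return Phi(math.tan(math.pi * (G(x) - 0.5)))

# a central difference of the cdf should match the pdf
x, h = 1.2, 1e-5
numeric = (ntg_cdf(x + h) - ntg_cdf(x - h)) / (2.0 * h)
print(abs(numeric - ntg_pdf(x)) < 1e-6)  # True
```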
Theorem 2.1. If the cdf F G belongs to the NT-G class and the cdf G is identifiable, then F G is identifiable.

Quantile Function
The quantile function $Q_F(p)$ of the NT-G class is obtained without difficulty by inverting (3). That is:

$$Q_F(p) = Q_G\left(\frac{1}{2} + \frac{\arctan\left[\Phi^{-1}(p)\right]}{\pi}\right), \qquad (6)$$

where $Q_G$ is the quantile function of the parent distribution G. We use $Q_F$ along with a uniform random number generator to simulate random variables following (3). Namely, if $Z \sim U(0, 1)$, then $Q_F(Z) \sim$ NT-G. An advantage of the NT-G class over the normal-G is that (6) is continuous on the interval (0, 1), whereas the normal-G quantile function has a point of discontinuity at p = 0.5. Thus, the NT-G class has no obstacle in defining meaningful quantities that depend on the median, like Galton's skewness.
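A sketch of this inverse-transform simulation, specialized to the NT-log-logistic (Φ⁻¹ is computed by bisection to keep the example dependency-free; all names are ours):

```python
import math
import random

def probit(p):
    # Φ⁻¹(p) by bisection on Φ (adequate precision for simulation purposes)
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def ntll_quantile(p, alpha, beta):
    # Q_F(p) = Q_G(1/2 + arctan(Φ⁻¹(p))/π), Q_G the log-logistic quantile function
    u = 0.5 + math.atan(probit(p)) / math.pi
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)

def ntll_sample(n, alpha, beta, seed=42):
    rng = random.Random(seed)
    return [ntll_quantile(rng.random(), alpha, beta) for _ in range(n)]

sample = ntll_sample(1000, alpha=2.0, beta=3.0)
print(round(ntll_quantile(0.5, 2.0, 3.0), 6))  # 2.0 — the median equals α
```

Since the NT-G transform preserves the baseline median, the NT-log-logistic median is simply the log-logistic median α.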

Shapes
The characterization of the distribution shape and the number of modes can be determined by examining the critical points of the NT-G pdf (4); its first derivative is:

$$\frac{\partial}{\partial x} f_G(x) = \pi \sec^2\left[\pi\left(G(x) - \frac{1}{2}\right)\right] \phi\left\{\tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\right\} T(x), \qquad (7)$$

where

$$T(x) = g'(x) + \pi g(x)^2 \tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\left\{2 - \sec^2\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\right\}$$

and $g'(x)$ is the derivative of g(x). Since the terms multiplying T(x) in (7) are positive, we have $\frac{\partial}{\partial x} f_G(x) = 0 \Leftrightarrow T(x) = 0$. In other words, the critical points are the roots of T(x). If the sign of the second derivative of (4) evaluated at a particular critical point is non-positive, then the point is a mode. The explicit expression for $\frac{\partial^2}{\partial x^2} f_G(x)$ is presented in Appendix A.
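In practice the roots of T(x) rarely have closed form, so critical points can be located numerically. The sketch below (NT-Weibull with λ = 1; function names ours) scans a grid for interior local maxima of the pdf instead of solving T(x) = 0 directly:

```python
import math

def ntw_pdf(x, k, lam=1.0):
    # NT-Weibull pdf obtained by plugging the Weibull baseline into (4)
    Gx = 1.0 - math.exp(-((x / lam) ** k))
    gx = (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))
    u = math.pi * (Gx - 0.5)
    phi = math.exp(-math.tan(u) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    return math.pi * gx * phi / math.cos(u) ** 2

def interior_modes(pdf, grid):
    # grid points that are strict local maxima of the pdf
    return [grid[i] for i in range(1, len(grid) - 1)
            if pdf(grid[i]) > pdf(grid[i - 1]) and pdf(grid[i]) > pdf(grid[i + 1])]

grid = [i / 1000.0 for i in range(1, 4000)]
print(len(interior_modes(lambda x: ntw_pdf(x, k=2.0), grid)))  # number of modes found
print(len(interior_modes(lambda x: ntw_pdf(x, k=0.7), grid)))
```

A fine grid like this is only a heuristic; a root-finder applied to T(x) on each bracketing interval would refine the mode locations.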

Special submodels from the normal-tangent-G class
Here we discuss two distributions from the NT-G class.

Normal-tangent-Weibull distribution
The Weibull distribution is well known by practitioners of diverse scientific fields. It is widely used in reliability analysis, but also in hydrology and oceanography for modelling data. For example, in a recent paper, Huang and Dong (2019) constructed probability distributions of wave periods in mixed sea states using a Weibull mixture distribution. Given the Weibull cdf $G_W(x \mid \lambda, k) = 1 - e^{-(x/\lambda)^k}$, for x ≥ 0 and λ, k > 0, the NT-Weibull cdf is obtained by replacing G in (3) by $G_W$. Likewise, using (4), one gets the corresponding pdf, which is given below:

$$f_{NTW}(x \mid \lambda, k) = \pi \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k} \sec^2\left[\pi\left(\frac{1}{2} - e^{-(x/\lambda)^k}\right)\right] \phi\left\{\tan\left[\pi\left(\frac{1}{2} - e^{-(x/\lambda)^k}\right)\right]\right\}.$$

The Weibull hrf may be constant, increasing or decreasing, depending on the value of the shape parameter k. On the other hand, the NT-Weibull hrf presents new non-monotonic shapes, as we can see in Figure 2. Notice also that there are some values of the parameters for which the pdf is bimodal (Figure 1).
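A small numerical check of the NT-Weibull cdf (function names ours): since the NT-G transform fixes the point where G = 1/2, the baseline Weibull median λ(ln 2)^{1/k} is also the NT-Weibull median:

```python
import math

def ntw_cdf(x, k, lam=1.0):
    # F(x) = Φ(tan(π[G_W(x) − 1/2])), with G_W(x) = 1 − exp(−(x/λ)^k)
    Gx = 1.0 - math.exp(-((x / lam) ** k))
    t = math.tan(math.pi * (Gx - 0.5))
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

median = 1.0 * math.log(2.0) ** (1.0 / 2.0)  # λ(ln 2)^{1/k} with λ = 1, k = 2
print(round(ntw_cdf(median, k=2.0), 6))  # 0.5
```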

Normal-tangent-log-logistic distribution
The log-logistic distribution is commonly used in survival analysis, but also in different areas like flood frequency analysis (Ahmad et al., 1988) and economics and actuarial sciences (Kleiber and Kotz, 2003), where it is known as the Fisk distribution. Some important extensions of the log-logistic distribution have been used to model cancer data. Among them, one may cite the beta-log-logistic (Lemonte, 2014), used to model the remission times of bladder cancer patients, and the McDonald-log-logistic, which was applied by Tahir et al. (2014) to study the survival times of patients with breast cancer. The log-logistic cdf is $G_{LL}(x \mid \alpha, \beta) = 1 - [1 + (x/\alpha)^{\beta}]^{-1}$; replacing G in equation (3) by $G_{LL}$, we get the NT-log-logistic cdf

$$F_{NTLL}(x \mid \alpha, \beta) = \Phi\left\{\tan\left[\pi\left(\frac{1}{2} - \left[1 + (x/\alpha)^{\beta}\right]^{-1}\right)\right]\right\},$$

for x ≥ 0, where α, β > 0. According to (4), its corresponding pdf is:

$$f_{NTLL}(x \mid \alpha, \beta) = \pi \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1}\left[1 + \left(\frac{x}{\alpha}\right)^{\beta}\right]^{-2} \sec^2\left[\pi\left(\frac{1}{2} - \left[1 + (x/\alpha)^{\beta}\right]^{-1}\right)\right] \phi\left\{\tan\left[\pi\left(\frac{1}{2} - \left[1 + (x/\alpha)^{\beta}\right]^{-1}\right)\right]\right\}.$$

We present a simulation study and an application of $f_{NTLL}$ to a dataset related to a breast cancer biomarker in Sections 3 and 4. Plots of the NT-log-logistic pdf and hrf are exhibited in Figures 4 and 5 respectively, considering different values of α and β. It is worth remarking that the distribution can have different shapes of hrf, such as inverse "J", unimodal or increasing.
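The NT-log-logistic pdf and hrf can be evaluated directly from (4) and (5); a sketch (function names ours), with a crude Riemann-sum check that the pdf integrates to one:

```python
import math

def ntll_pdf(x, alpha, beta):
    # f(x) = π g_LL(x) sec²(u) φ(tan u), with u = π[G_LL(x) − 1/2]
    z = (x / alpha) ** beta
    G = z / (1.0 + z)  # equals 1 − [1 + (x/α)^β]⁻¹
    g = (beta / alpha) * (x / alpha) ** (beta - 1) / (1.0 + z) ** 2
    u = math.pi * (G - 0.5)
    phi = math.exp(-math.tan(u) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    return math.pi * g * phi / math.cos(u) ** 2

def ntll_hrf(x, alpha, beta):
    # h(x) = f(x) / [1 − F(x)]
    G = 1.0 - 1.0 / (1.0 + (x / alpha) ** beta)
    F = 0.5 * (1.0 + math.erf(math.tan(math.pi * (G - 0.5)) / math.sqrt(2.0)))
    return ntll_pdf(x, alpha, beta) / (1.0 - F)

# the pdf should integrate to (approximately) one
step = 0.01
total = step * sum(ntll_pdf(i * step, 1.0, 2.0) for i in range(1, 3000))
print(abs(total - 1.0) < 0.02)  # True
```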

Series representation
The standard normal cdf may be linearly represented by:

$$\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{2^n n! (2n+1)}. \qquad (8)$$

Using the result of equation (8) in equation (3), we have:

$$F_G(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n n! (2n+1)} \left\{\tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right]\right\}^{2n+1}. \qquad (9)$$

By the tangent series expansion

$$\tan(z) = \sum_{m=1}^{\infty} \frac{(-1)^{m-1}\, 2^{2m}(2^{2m} - 1)\, B_{2m}}{(2m)!}\, z^{2m-1}, \quad |z| < \frac{\pi}{2}, \qquad (10)$$

where the $B_{2m}$ are the Bernoulli numbers, the expression inside the wider brackets in equation (9) can be written as a series; that is:

$$\tan\left[\pi\left(G(x) - \frac{1}{2}\right)\right] = \sum_{m=1}^{\infty} a_{m-1}\left[G(x) - \frac{1}{2}\right]^{2m-1}, \qquad (11)$$

such that $a_{m-1} = \pi^{2m-1}(-1)^{m-1} 2^{2m}(2^{2m} - 1)B_{2m}/(2m)!$. Making m = k + 1 and then inserting (11) in (9), we get to:

$$F_G(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n n! (2n+1)} \underbrace{\left\{\sum_{k=0}^{\infty} a_k\left[G(x) - \frac{1}{2}\right]^{2k+1}\right\}^{2n+1}}_{A_1}. \qquad (12)$$

A well-known result on power series raised to integer powers states that:

$$\left(\sum_{k=0}^{\infty} a_k z^k\right)^{N} = \sum_{k=0}^{\infty} c_k z^k, \qquad (13)$$

where $c_0 = a_0^N$ and $c_k = \frac{1}{k a_0}\sum_{s=1}^{k}(sN - k + s)\, a_s c_{k-s}$ for k ≥ 1. Factoring out one power of $G(x) - \frac{1}{2}$ and using (13) with $N = 2n + 1$ and $z = [G(x) - \frac{1}{2}]^2$, the expression $A_1$ in equation (12) can be written as follows:

$$A_1 = \sum_{k=0}^{\infty} c_k\left[G(x) - \frac{1}{2}\right]^{2(n+k)+1}, \qquad (14)$$

such that $c_0 = a_0^{2n+1}$ and $c_k = \frac{1}{k a_0}\sum_{s=1}^{k}(2s[n+1] - k)\, a_s c_{k-s}$ for k ≥ 1. Now using the binomial theorem, we can rewrite (14) as:

$$A_1 = \sum_{k=0}^{\infty}\sum_{j=0}^{2(n+k)+1} c_k \binom{2(n+k)+1}{j}\left(-\frac{1}{2}\right)^{2(n+k)+1-j} G(x)^j. \qquad (15)$$

Replacing $A_1$ in equation (12) by the expression in (15), we get to:

$$F_G(x) = \frac{1}{2} + \sum_{n=0}^{\infty}\sum_{k=0}^{\infty}\sum_{j=0}^{2(n+k)+1} \delta_{n,k,j}\, G(x)^j. \qquad (16)$$

Finally, using Fubini's theorem on differentiation, we can write the derivative of (16) with respect to x:

$$f_G(x) = \sum_{n=0}^{\infty}\sum_{k=0}^{\infty}\sum_{j=1}^{2(n+k)+1} \delta_{n,k,j}\, g_j(x), \qquad (17)$$

where $\delta_{n,k,j} \in \mathbb{R}$ and $g_j(x) = jG(x)^{j-1}g(x)$ is the pdf of a random variable from the exponentiated family (Mudholkar and Srivastava, 1993). Thus, we can say that (17) is the NT-G pdf (4) expressed as a linear combination of pdfs of exponentiated distributions. This result allows many important quantities that depend on the pdf to be represented by a series expansion as well, like the raw moments, the moment generating function, the characteristic function, the Rényi entropy and the order statistics.
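The tangent expansion in (10) is easy to verify numerically; the sketch below uses the first four Bernoulli numbers (values taken as given from standard tables):

```python
from fractions import Fraction
import math

# first even-indexed Bernoulli numbers B_2, B_4, B_6, B_8
B = {2: Fraction(1, 6), 4: Fraction(-1, 30), 6: Fraction(1, 42), 8: Fraction(-1, 30)}

def tan_series(z, terms=4):
    # tan z = Σ (−1)^{m−1} 2^{2m}(2^{2m} − 1) B_{2m} z^{2m−1}/(2m)!
    total = 0.0
    for m in range(1, terms + 1):
        c = (-1) ** (m - 1) * 2 ** (2 * m) * (2 ** (2 * m) - 1) * B[2 * m] / math.factorial(2 * m)
        total += float(c) * z ** (2 * m - 1)
    return total

print(abs(tan_series(0.3) - math.tan(0.3)) < 1e-6)  # True well inside (−π/2, π/2)
```

The first terms reproduce the familiar expansion tan z = z + z³/3 + 2z⁵/15 + 17z⁷/315 + ….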

Inference
The maximum likelihood method is used to determine the estimates of the parameters of the NT-log-logistic distribution in the next two sections. Its interesting properties, such as asymptotic normality, efficiency and consistency, motivate us to use it. Here we present the estimates considering the general pdf (4). Let X be a random variable following a NT-G distribution, such that $G(x \mid \theta) = G_\theta(x)$ is the baseline, $g(x \mid \theta) = g_\theta(x)$ is its corresponding pdf and $\theta = (\theta_1, \ldots, \theta_m)^T$ is the m × 1 vector of parameters. Given that $X = (x_1, \ldots, x_n)$ is a complete random sample of size n from X, the log-likelihood function is:

$$\ell(\theta \mid X) = n \log \pi - \frac{n}{2}\log(2\pi) + \sum_{i=1}^{n} \log g_\theta(x_i) - 2\sum_{i=1}^{n} \log \cos\left[\pi\left(G_\theta(x_i) - \frac{1}{2}\right)\right] - \frac{1}{2}\sum_{i=1}^{n} \tan^2\left[\pi\left(G_\theta(x_i) - \frac{1}{2}\right)\right],$$

and the score vector is $U(\theta \mid X) = \nabla_\theta \ell(\theta \mid X) = (u_i)_{1 \le i \le m}$, where $u_i = \partial \ell(\theta \mid X)/\partial \theta_i$. One can obtain the MLEs by solving the system of equations $u_i = 0$, for i = 1, …, m. For interval estimation of θ, one requires the information matrix $J(\theta \mid X)$ (see Appendix B).
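A pure-Python sketch of the estimation step (the paper maximizes ℓ with simulated annealing via R's optim; here a crude grid search stands in, and all names are ours):

```python
import math
import random

def ntll_logpdf(x, alpha, beta):
    # log of the NT-log-logistic pdf obtained from (4)
    z = (x / alpha) ** beta
    G = z / (1.0 + z)
    log_g = math.log(beta / alpha) + (beta - 1) * math.log(x / alpha) - 2 * math.log1p(z)
    u = math.pi * (G - 0.5)
    t = math.tan(u)
    return (math.log(math.pi) + log_g - 2 * math.log(math.cos(u))
            - 0.5 * t * t - 0.5 * math.log(2 * math.pi))

def loglik(data, alpha, beta):
    return sum(ntll_logpdf(x, alpha, beta) for x in data)

def probit(p):
    # Φ⁻¹(p) by bisection, good enough for simulation
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        (lo, hi) = (mid, hi) if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p else (lo, mid)
    return (lo + hi) / 2.0

# simulate a sample with true (α, β) = (2, 3) via the quantile function (6)
rng = random.Random(7)
data = []
for _ in range(300):
    u = 0.5 + math.atan(probit(rng.random())) / math.pi
    data.append(2.0 * (u / (1 - u)) ** (1 / 3.0))

# maximize ℓ over a grid of candidate (α, β) values
best = max(((loglik(data, a / 10, b / 10), a / 10, b / 10)
            for a in range(10, 41) for b in range(10, 61)),
           key=lambda t: t[0])
print(best[1], best[2])  # grid MLE, expected near (2.0, 3.0)
```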
Under standard regularity conditions, $\sqrt{n}(\hat\theta - \theta)$ approximately follows a multivariate normal distribution $N_m(0_m, I_\theta^{-1})$, where $0_m$ is an m × 1 vector of zeros and $I_\theta$ is the expectation of $J(\theta \mid X)$, that is, the expected Fisher information matrix.

Simulation study
The software R version 3.4.4 (R Core Team, 2018) was used to perform the Monte Carlo simulation study. At each of the N = 10,000 replications, we generated samples of sizes n = 50, 100, 200, 500 from the NT-log-logistic distribution using the quantile function and a uniform random number generator, as mentioned in Section 2.1. We considered five different values for the parametric vector (α, β), as shown in the second and third columns of Table 1, and calculated the bias and the mean squared error (MSE) of the maximum likelihood estimates as follows:

$$\widehat{\mathrm{Bias}}(\hat\alpha) = \frac{1}{N}\sum_{j=1}^{N}(\hat\alpha_j - \alpha) \quad \text{and} \quad \widehat{\mathrm{MSE}}(\hat\alpha) = \frac{1}{N}\sum_{j=1}^{N}(\hat\alpha_j - \alpha)^2,$$

with analogous expressions for β, where $\hat\alpha_j$ and $\hat\beta_j$ are the estimates of α and β at the j-th replication. We used the simulated annealing technique to find the global maximum of the log-likelihood function. This metaheuristic is available in R through the optim function, for which one must previously define a vector of initial values, namely (α₀, β₀). To do so, we first set (α₀, β₀) = (1, 1) and ran a single replication with sample size n = 50; thereafter, the estimates produced in this step were assigned to (α₀, β₀) and used in all of the scenarios.
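The bias and MSE computations reduce to the following helper (a sketch; the estimates here are made-up numbers purely to illustrate the formulas, not results from the paper):

```python
def bias_and_mse(estimates, true_value):
    # Bias = (1/N) Σ_j (θ̂_j − θ);  MSE = (1/N) Σ_j (θ̂_j − θ)²
    n = len(estimates)
    bias = sum(e - true_value for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, mse

bias, mse = bias_and_mse([1.5, 2.5, 2.0, 2.0], true_value=2.0)
print(bias, mse)  # 0.0 0.125
```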
As we can see in the fourth column of Table 1, the bias for α is small even in the worst scenario (namely n = 50 and α = 4.5), where the bias is less than 3% of the actual value of the scale parameter; similar behavior occurs for the bias of β, which is small in all rows of the table. This indicates that the estimates are quite close to the real values of the parameters.
The MSE measures the quality of the estimates, and values close to zero are desirable. Furthermore, one expects the MSE to decrease as the sample size increases; the sixth and seventh columns of Table 1 show that this happens for all of the parametric vectors in study. These results indicate that the MLEs for the NT-log-logistic distribution behave appropriately.

Application
Certainly, one of the most significant factors that increases the probability of good results in the treatment of breast cancer is early detection, and an essential ally for a precise diagnosis is the use of biomarkers; the dataset in study refers to measurements of the leptin hormone. Since the information criteria are related to the amount of information lost by a given model, one expects that the smaller the values of AIC and BIC, the better the model. In this sense, we can say that the NTLL is the best choice among the distributions in study (see the fourth and fifth columns of Table 2). The NTLL also presents the smallest values of A* and W*. As both statistics measure the distance between the empirical distribution function and the fitted cdf, we may say that the NTLL fits the leptin hormone dataset better than the other options listed in the first column of Table 2. Figure 6 displays the histogram of the data and the overlapping pdfs of the three fitted models with the smallest values of A*. At first sight, the three curves seem to be plausible approximations, as they are quite close to the histogram. Notice, though, that the blue curve (NTLL) accommodates the data slightly better than the NLL and GoLL curves do, especially for x > 35 ng/mL. We have, therefore, good reasons to point to the NTLL as the best alternative, since it outperforms the competing models under all of the criteria considered.
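For reference, the two information criteria used in Table 2 are computed as below (the log-likelihood value and sample size are hypothetical placeholders, not the paper's figures):

```python
import math

def aic(loglik, k):
    # AIC = 2k − 2ℓ̂, with k the number of estimated parameters
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    # BIC = k ln(n) − 2ℓ̂, with n the sample size
    return k * math.log(n) - 2.0 * loglik

# hypothetical fit of a two-parameter model on a sample of size 100
print(aic(-300.0, 2))  # 604.0
print(round(bic(-300.0, 2, 100), 3))
```

BIC penalizes extra parameters more heavily than AIC once n > e², which is why parsimonious NT-G submodels are well placed under both criteria.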

Conclusions
A new class of probability distributions called normal-tangent-G is presented, and some of its mathematical properties are discussed, like the quantile function and the series representation of the pdf. The submodels generated by the class are parsimonious; they bring no additional parameters besides those of the baseline G and enjoy the property of identifiability whenever G is identifiable. The class can produce unimodal or bimodal distributions with different shapes of pdf and hrf. Monte Carlo simulation studies are presented to evince the good performance of the MLEs of a NT-G special case, namely, the NT-log-logistic distribution, and an application of this particular submodel to a real dataset is carried out to exemplify its modelling benefits. The fitted model is compared to seven other competing distributions (the baseline itself and six extensions of it) considering AIC, BIC, the Anderson-Darling and the Cramér-von Mises statistics as goodness-of-fit criteria. All the results attest that the distribution arising from the NT-G class outperforms the alternative models in study and allow us to regard it as a useful and flexible tool for modelling real data.