A linear programming-based approach to estimate discrete probability functions with given quantiles

The aim of this paper is to estimate probability distribution functions with maximum entropy and known quantiles. The paper formulates the problem as a nonlinear optimization problem, and converts it into a system of nonlinear equations by Lagrange multipliers method. Finally, an efficient method is proposed to obtain a solution of the nonlinear system. The method needs to solve a linear programming problem in each iteration. Since linear programming problems can be solved in a reasonable time, our proposed method is faster than generic methods of solving nonlinear programming problems. Several computational experiment are provided to demonstrate the performance and validation of our proposed method.


Introduction
In probability and statistics, a random variable is described informally as a variable whose values depend on outcomes of a random phenomenon. The formal mathematical definition of random variables is a topic in probability theory. A random variable is a measurable function defined on a probability space that maps from the sample space, the set of all outcomes of a random phenomenon, to the set of real numbers. Whenever the image, or range, of a random variable is countable, it called a discrete random variable. Moreover, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for a random variable.
Sometimes it is needed to find a probability distribution function by some initial observations. If some data are available, we achieve the goal by fitting a probability function on the data. There are a variety of methods, such as method of moments, maximum spacing estimation, method of L-moments (Hosking, 1990), maximum likelihood method (Aldrich et al., 1997), and statistical software (Oosterbaan, 2019) that can be used to obtain a probability distribution function. These methods may provide several different solutions for one problem because their approach is different.
In the case that some certain parameters of a distribution are known, instead of its data, there are also several methods to estimate the distribution. Since the selection of an appropriate distribution depends on several different factors, these methods usually do not provide a unique solution. For example, suppose that we know the data are symmetrically distributed around the mean while the frequency of occurrence of data farther away from the mean diminishes. One may select the normal distribution, the logistic distribution, or the student's t-distribution. The first two cases are very similar, while the last one, with one degree of freedom, has "heavier tails" meaning that values farther from the mean often occur relatively more, i.e., the kurtosis is higher.
By a pessimistic viewpoint, the maximum entropy provides an approach to estimate a unique solution to the problem (Cover and Thomas, 2006). Let us discuss it in the following. Shannon (1948) first introduced entropy for a random variable with a sample space = { 1 , 2 , … , } and corresponding probabilities , = 1,2, … . as follow: and ≥ 0 for = 1,2, … . In information theory, the maximum entropy problem is formulated as follows (Cover and Thomas, 2006): where 's are known real numbers and 's are real-value functions. The problem is to estimate a discrete probability function satisfying some given conditions (1c). Note that is the decision variable of the problem, which indicates the probability of observing . So its value must be non-negative (1d), and furthermore, the sum of its values must be equal to one (1b) to satisfy the principles of probability. Let us now review some papers in this field. At first, Jaynes (1957) estimated a probability distribution with minimal probability under some certain conditions using Shannon's maximum entropy. Subsequently, many researchers focused on this issue. Zografos (2008) studied the features, applications and generalization of the maximum entropy problem. Landsman and Makov (1999) and Najafabadi et al. (2012) used it in belief theory. Krvavych and Mergel (2000) and Sachlas and Papaioannou (2014) examined the distribution modeling using the maximum entropy method. Most works were focused on estimating continuous probability functions.  Basset (2015) summarized this issue. Many researchers have used maximum entropy method to find a probability function with highest uncertainty under some known conditions. For example, Zhao and Zhang (2011) proposed an ensemble neural network, which combines the component networks using the entropy theory. The entropy-based ensemble neural network searches the best structure of each component network first, and employs entropy as an automating design tool to determine the best combining weights. Bajgiran et al. (2020) explored the maximum entropy models that are the minimum elaborations of the uniform and moment-based ME models by quantiles. This property provided a diagnostic for the utility of elaboration in terms of the information value of each type of information over the other. They said "the maximum entropy model with quantiles and moments is represented as the mixture of truncated distributions on consecutive intervals whose shapes and existence are determined by the moments". In another article, the application of maximum entropy principle has been discussed to establish a probability distribution when the mentioned summary statistics are available, and its extension to moment constraints has been introduced to satisfy the requirements of metrology (Barzdajn, 2014). In this paper, we investigate the maximum entropy problem of estimating discrete probability functions whenever some quantiles are known. Similar to the above argument, we obtain a nonlinear system and propose a novel method to solve the system. Our proposed method has an iterative process and needs to solve a linear programming problem in each iteration. Since there are several efficient methods, such as interior point methods (Vanderbei, 2015), to solve linear programming problems, our proposed method obtain a solution of the system in a reasonable time.
Although there are many numbers of papers to estimate probability functions for continuous random variables, to the best of our knowledge, there is not any paper to investigate our proposed case (for discrete random variables with known quantities). Specially, the approach of solving the obtained nonlinear optimization problem is not presented in any paper. This is a bad news, because we cannot compare our results with the other available approaches. To resolve the problem, the validation of approach is evaluated by comparing its solutions with ones obtained from the nonlinear programming solver of Matlab software whose solutions are exact. The reminder of this paper is organized as follows. Section 2 defines the problem and formulates it, mathematically. In Section 3, our proposed approach is presented to solve the problem. In Section 4, some computational experiments are conducted to demonstrate the validation and performance of the proposed approach. Finally, Section 5 points some concluding remarks, and proposes some subjects to future works.

Problem statement
Consider a discrete random variable with a sample space = { 1 , 2 , … , } in which 1 < 2 < ⋯ < . One can find many different distribution functions with some predetermined information, such as mean, variance, moments or quantiles. In this paper, we want to obtain a probability distribution function of the discrete random variable with prescribed quantiles. By a pessimistic viewpoint, we select one that has the greatest entropy value among all distributions with this property. This function will have the highest possible uncertainty. Consequently, the actual probability distribution function will be more definite than this function. To formulate the problem, assume that , = 1,2, … , − 1, is the th quantile with rank such that Moreover, suppose that the value of , = 1,2, … , − 1, belongs to the interval [ , +1 ). So, we can formulate the problem of obtaining 's with prescribed quantiles as follows: This is an optimization problem containing nonlinear objective function (2a) as well as linear constraints (2b-2d). So it is a constrained nonlinear programming problem. By multiplying the objective function by −1, one can convert the problem into a minimization problem. Similar to the papers (Arandjelovi´ c et al., 2014), (Van der Straeten, 2009), and (Templeman and Xingsi, 1987), we neglect the constraint ≤ 1 because this is satisfied by the constraints (2b) and (2d).

Our proposed method
In this section, we convert problem (2) into a nonlinear system. Then, we propose a method based on linear programming to solve this system. To find an optimal solution of problem (2), we need to use nonlinear optimization techniques. One of the most popular techniques is Lagrange multipliers method. This method adds the constraints with some coefficients to the objective function. Hence, the problem changes to an unconstrained nonlinear programming problem. Then, by setting partial differentiations of the new objective function equal to zero, a critical point is found. Now, consider an arbitrary instance of problem (2). To construct the Lagrange function, we do neglect nonnegativity constraint (2d), although this constraint will be added later to the problem. The corresponding Lagrangian function is in which 's are Lagrange multipliers. Instead of solving problem (2), it is needed to find a non-negative solution that minimizes the function ( , ). Since the entropy function is concave (For a proof, see (Cover and Thomas, 2006), Theorem 2.7.3) and constraints (2b-2d) are linear, it follows that ( , ) is convex. Hence, any local minimum point of ( , ) is a global minimum. To find a local minimum, it is sufficient to look for a critical point. In the other words, we find a non-negative solution for the following system: ( , ) = 0, = 1,2, … , , ( , ) = 0, = 1,2, … , .
By substituting (3) in (4) and (5), we have where = [ 1 , . . . , ] and is the th column of the coefficient matrix of the problem (2). If = 0 , then its th element of is as Otherwise, its jth element is as follows: In system (6), only equation (6a) is nonlinear due to the existence of . Although one can use any method of nonlinear systems, such as Newton method (Deuflhard, 2011), solving the system, it is not an easy issue due to its nonlinearity. To solve this system, we propose an iterative procedure. It generates a vector ( ) in th iteration which is an estimation of . The process begins with an initial vector (0) > 0. Suppose that the vector ( −1) is found in ( − 1) th iteration. The following problem is solved in th iteration to generate ( ) .
Constraints (7b-7d) are the same equations of system (6) with this difference that a new variable is replaced with ( ). Constraint (7e) is added to the problem to satisfy the nonnegativity of . The purpose of solving the problem is to find the vectors * , * and * so that the distance between ( −1) * and * ( −1) is minimized. So * is an appropriate estimation of . After solving the problem, we set ( ) = * and repeat the process until the optimal objective value * becomes zero. It is remarkable that the condition > 0 is required to be added to the problem because the logarithm function is not defined at zero. However, we do not need to express explicitly it in practice because (0) is defined to be negative infinity, as a limitation value, in many programming languages. Hence, if this value appears in the objective function, it increases the objective value to positive infinity. This never occurs because it is a minimizing optimization problem. Theorem 3.1. If problem (7) has an optimal solution * , * and * with the optimal value * = 0, then the vectors * and * are a solution of system (6).
The reason for transforming the nonlinear optimization problem to a linear programming problem is that linear programming problems can be solved exactly in a finite number of iterations by several algorithms, such as active set, and interior point methods (Terlaky, 2013). This aids us to use the inherent simplicity of linear programming. Now, we are ready to explain our proposed approach in complete details. This approach begins with an initial solution (0) > 0. This solution can be obtained from solving the system of linear equations (8d-8f). Then, problem (8) is solved for every number = 1,2,3, . .. and the sequence { ( ) } is generated. The process is repeated until the optimal value becomes zero (see Theorem 3.1).

Computational results
In this section, some computational experiments are conducted to demonstrate the validation and the performance of our proposed approach. All experiments are performed on a 4-core computer with 8GB of RAM and Windows 10 operating system. We used the software of MATLAB for implementing programs. To perform computations, two stopping conditions are used. The first condition is * > 10 −10 , that is, the process terminates if the optimal value of problem (8) is sufficiently close to zero (see theorem 3.1). Results show that the accuracy of solutions is guaranteed even if the number of iterations is small. For this reason, the second stopping condition is that the number of iterations is at most equal to 5. To solve linear programming problems, the function "linprog" of Matlab is used. We compared our proposed approach with a nonlinear optimization method, called the intrinsic method that runs directly on problem (2). This approach is the default method used in the command "fmincon" of Matlab.

Validation
To check the validation of the approach, we have solved a small instance of the problem and have compared its solution with the one obtained from solving the problem by Matlab solver. For this purpose, we have supposed that is a random variable on a sample space = {1,2, . . . ,20}, moreover, 1 = 2, 2 = 5.5, 3 = 11, 4 = 12.5, with rank r = 5. Our approach solved the problem in 0.425796 seconds while Matlab solved it in 0.706412 seconds. The entropy values are respectively −2.7966 and −2.7967. Table 1 shows the results and Figure 1 depicts them. It is obvious that the solutions are very close together. The results show the validation of our approach for this small instance.
However, comparing entropy values of randomly generated large instances is also another useful tool to check the validation of our approach. This issue together with comparing running times are performed in the next subsection.  A linear programming-based approach to estimate discrete probability functions with given quantiles 845 All experiments were repeated ten times and the average of results (running time and entropy value) were reported. Two types of experiments were performed for comparison. In the first experiment, it is assumed that n = 100 and the number of quantiles varies from = 3 to = 10 (see Figure 2). In the second experiment, the number of quantiles is equal to 4, and = 50,75,100,125,150,175,200 (see Figure 3). As we expected, the running time of our proposed approach was significantly less than that of the generic nonlinear method because ours applies linear programming methods as fast tools to solve the nonlinear problem. On the other hand, the entropy values obtained from both the approaches were approximately the same. This establishes that our proposed approach has a fairly good performance both in running time and in solution accuracy.

Sensitivity Analysis
Here, we conduct a sensitivity analysis of our approach. Since the problem does not contain any single parameter, we run the sensitivity analysis by changing the range of randomly generated data. Recall that input data were generated randomly as ∼ (0, ), = 1, . . . , ∼ (0, 1 ) , = 1, . . . , .
for = 1. Now, assume that = 20 and varies from 1 to 4. This sequentially changes the range of . Specially, all values are in the range [0,5] for = 4 to show the probability distribution function has positive skewness. Table 2 and Figure 4 provide the numerical results as well as the graphs obtained from the sensitivity analysis. The graphs show that the skewness changes significantly whenever s varies from 1 to 4. By comparing the numerical results, it is easily observed that our approach performs correctly and its time is better than Matlab Solver.

Conclusions
In this paper, the problem of estimating a discrete probability function with given quantiles was investigated. The well-known Lagrange multipliers method was used to obtain a nonlinear system for finding a probability distribution with some known quantiles and maximum entropy value. Then, a novel approach was proposed to solve the nonlinear system. The use of linear programming approach is the most important feature of our approach because in spite of the fact that the problem is inherently nonlinear, we could design the approach that uses the intrinsic simplicity of linear programming problems at each iteration. This causes that the approach has a fairly good performance both in running time and in solution accuracy. Another feature of the approach is that the linear programming problems corresponding to two consecutive iterations are different only in a set of constraints. So one can apply the sensitivity analysis approach of linear programming problems to obtain the optimal solution of an iteration from that of the previous iteration. The problem has the property that any probability value cannot be zero because the logarithmic value at zero is assumed to be equal to −∞. This imposes the limitation of using the well-known simplex method because this method obtains a basic optimal solution which contains many nonbasic variables being equal to zero. So other methods have to be used for solving linear programming problems.
As an application of our approach, it will be meaningful to investigate the maximum entropy problem for estimating continuous probability functions whenever some quantiles are known. For this purpose, one can apply a numerical discretization method and then, uses our approach to obtain a near-optimal solution. Due to the existence of different discretization methods and time-consuming computations, this can be a vital issue in this field. As another suggestion, one can use nonlinear programming techniques for designing other approaches of estimating probability distribution functions.