The Poisson binomial distribution (in the following abbreviated as PBD) is becoming increasingly important, especially in the areas of statistics, finance, insurance mathematics and quality management. This package provides functions for two types of PBDs: ordinary and generalized PBDs (henceforth referred to as O-PBDs and G-PBDs).
The O-PBD is the distribution of the sum of n independent Bernoulli-distributed random indicators Xi ∈ {0, 1} (i = 1, ..., n): $$X := \sum_{i = 1}^{n}{X_i}.$$ Each of the Xi possesses a predefined probability of success pi := P(Xi = 1) (consequently, P(Xi = 0) = 1 − pi =: qi). With this, mean, variance and skewness can be expressed as $$E(X) = \sum_{i = 1}^{n}{p_i} \quad \quad Var(X) = \sum_{i = 1}^{n}{p_i q_i} \quad \quad Skew(X) = \frac{\sum_{i = 1}^{n}{p_i q_i(q_i - p_i)}}{\sqrt{Var(X)}^3}.$$ All possible observations are in {0, ..., n}.
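As a small illustrative sketch (using an arbitrary probability vector, not one taken from the package documentation), these formulas can be evaluated directly and cross-checked against the probability mass function returned by dpbinom():
library(PoissonBinomial)
p <- c(0.1, 0.2, 0.4, 0.8)  # arbitrary success probabilities (assumption for illustration)
q <- 1 - p
mu     <- sum(p)                                 # E(X)
sigma2 <- sum(p * q)                             # Var(X)
skew   <- sum(p * q * (q - p)) / sqrt(sigma2)^3  # Skew(X)
x <- 0:length(p)       # all possible observations
d <- dpbinom(NULL, p)  # exact probability mass function
all.equal(mu,     sum(x * d))                            # TRUE
all.equal(sigma2, sum((x - mu)^2 * d))                   # TRUE
all.equal(skew,   sum((x - mu)^3 * d) / sqrt(sigma2)^3)  # TRUE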
The G-PBD is defined very similarly. Again, it is the distribution of a sum of independent random variables, but here, each Xi ∈ {ui, vi} with P(Xi = ui) =: pi and P(Xi = vi) = 1 − pi =: qi. Using ordinary Bernoulli-distributed random variables Yi, Xi can be expressed as Xi = uiYi + vi(1 − Yi) = vi + Yi ⋅ (ui − vi). As a result, mean, variance and skewness are given by $$E(X) = \sum_{i = 1}^{n}{v_i} + \sum_{i = 1}^{n}{p_i (u_i - v_i)} \quad \quad Var(X) = \sum_{i = 1}^{n}{p_i q_i(u_i - v_i)^2} \\Skew(X) = \frac{\sum_{i = 1}^{n}{p_i q_i(q_i - p_i)(u_i - v_i)^3}}{\sqrt{Var(X)}^3}.$$ All possible observations are in {U, ..., V} with $U := \sum_{i = 1}^{n}{\min\{u_i, v_i\}}$ and $V := \sum_{i = 1}^{n}{\max\{u_i, v_i\}}$. Note that the size m := V − U of the distribution does not generally equal n.
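Analogously, a sketch with arbitrary values ui and vi (again, chosen only for illustration) cross-checks the generalized formulas against dgpbinom():
library(PoissonBinomial)
p <- c(0.1, 0.2, 0.4, 0.8)  # arbitrary probabilities P(Xi = ui)
q <- 1 - p
u <- c(3, 1, 2, 5)          # values observed with probability pi
v <- c(0, 4, 0, 1)          # values observed with probability qi
mu     <- sum(v) + sum(p * (u - v))                          # E(X)
sigma2 <- sum(p * q * (u - v)^2)                             # Var(X)
skew   <- sum(p * q * (q - p) * (u - v)^3) / sqrt(sigma2)^3  # Skew(X)
x <- sum(pmin(u, v)):sum(pmax(u, v))  # support {U, ..., V}
d <- dgpbinom(NULL, p, u, v)          # exact probability mass function
all.equal(mu,     sum(x * d))                            # TRUE
all.equal(sigma2, sum((x - mu)^2 * d))                   # TRUE
all.equal(skew,   sum((x - mu)^3 * d) / sqrt(sigma2)^3)  # TRUE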
Computing these distributions exactly is computationally demanding, but in the last few years, some efficient algorithms have been developed. Particularly significant in this respect are the works of Hong (2013), who derived the DFT-CF procedure for O-PBDs, of Biscarri, Zhao & Brunner (2018), who developed two immensely faster algorithms for O-PBDs, namely the DC and DC-FFT procedures, and of Zhang, Hong and Balakrishnan (2018), who further developed Hong’s (2013) DFT-CF algorithm for G-PBDs (in the following, this generalized procedure is referred to as G-DFT-CF). Still, only a few R packages exist for the calculation of either ordinary or generalized PBDs, e.g. poibin and poisbinom for O-PBDs and GPB for G-PBDs. Before the release of this PoissonBinomial package, there was no R package that implemented the DC and DC-FFT algorithms of Biscarri, Zhao & Brunner (2018), as they only published a reference implementation for R, but refrained from releasing it as a package. Additionally, there are no comparable approaches for G-PBDs to date.
The poibin package implements the DFT-CF algorithm along with the exact recursive method of Barlow & Heidtmann (1984) and Normal and Poisson approximations. However, both exact procedures of this package possess some disadvantages: they are relatively slow at computing very large distributions, with the recursive algorithm also being very memory consuming. The G-DFT-CF procedure is implemented in the GPB package and inherits this performance drawback. The poisbinom package provides a more efficient and much faster DFT-CF implementation. The performance improvement over the poibin package lies in the use of the FFTW C library. Unfortunately, it sometimes yields negative probabilities in the tail regions, especially for large distributions, and this numerical issue has not been addressed to date. This PoissonBinomial package also utilizes FFTW for both the DFT-CF and G-DFT-CF algorithms, but corrects that issue. In addition to the disadvantages regarding computational speed (poibin and GPB) or numerics (poisbinom), especially for very large distributions, the aforementioned packages do not provide headers for their internal C/C++ functions, so that they cannot be imported directly by C or C++ code of other packages that use, for example, Rcpp.
In some situations, one might have to deal with Poisson binomial distributions that include Bernoulli variables with pi ∈ {0, 1}. Calculation performance can be further optimized by handling these indicators before the actual computations; approximations also benefit from this in terms of accuracy. None of the aforementioned packages implements such optimizations. Therefore, the advantages of this PoissonBinomial package can be summarized as follows: it implements the DC and DC-FFT algorithms of Biscarri, Zhao & Brunner (2018), it corrects the numerical issues of the FFTW-based DFT-CF and G-DFT-CF implementations, it provides headers so that its internal C++ functions can be imported by other packages, and it handles Bernoulli indicators with pi ∈ {0, 1} before the actual computations.
In total, this package includes 10 different algorithms for computing ordinary Poisson binomial distributions, including optimized versions of the Normal, Refined Normal and Poisson approaches, and 5 approaches for generalized PBDs. In addition, the implementation of the exact recursive procedure for O-PBDs was rewritten so that it is considerably less memory intensive: the poibin implementation needs the memory equivalent of (n + 1)² values of type double, while ours only needs 3 ⋅ (n + 1).
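To illustrate the difference, a rough back-of-the-envelope calculation (assuming 8 bytes per double and, purely as an example, n = 10000):
n <- 10000
(n + 1)^2 * 8 / 2^20    # poibin implementation: roughly 763 MiB of doubles
3 * (n + 1) * 8 / 2^10  # this package: roughly 234 KiB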
This package implements the following exact algorithms for computing ordinary Poisson binomial distributions:
For generalized Poisson binomial distributions, this package provides:
Examples and performance comparisons of these procedures are presented in a separate vignette.
In addition, the following O-PBD approximation methods are included:
For G-PBDs, the following approximation methods are included:
Examples and performance comparisons of these approaches are provided in a separate vignette as well.
Handling special cases, such as ordinary binomial distributions or probabilities of zero and one, is useful for speeding up computations. Unfortunately, some approximations do not work well for Bernoulli trials with pi ∈ {0, 1}, e.g. the Geometric Mean Binomial Approximations. This is why handling these values before the actual computation of the distribution is not only a performance tweak, but sometimes even a necessity. It is achieved by some simple preliminary considerations.
These cases are illustrated in the following example:
# Case 1
dpbinom(NULL, rep(0.3, 7))
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
dbinom(0:7, 7, 0.3) # equal results
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
dpbinom(NULL, c(0, 0, 0, 0, 0, 0, 0)) # only 0 is observable
#> [1] 1 0 0 0 0 0 0 0
dpbinom(0, c(0, 0, 0, 0, 0, 0, 0)) # confirmation
#> [1] 1
dpbinom(NULL, c(1, 1, 1, 1, 1, 1, 1)) # only 7 is observable
#> [1] 0 0 0 0 0 0 0 1
dpbinom(7, c(1, 1, 1, 1, 1, 1, 1)) # confirmation
#> [1] 1
# Case 2
dpbinom(NULL, c(0, 0, 0, 0, 1, 1, 1)) # only 3 is observable
#> [1] 0 0 0 1 0 0 0 0
dpbinom(3, c(0, 0, 0, 0, 1, 1, 1)) # confirmation
#> [1] 1
# Case 3
dpbinom(NULL, c(0, 0, 0.1, 0.2, 0.4, 0.8, 1)) # only 1-5 are observable
#> [1] 0.0000 0.0864 0.4344 0.3784 0.0944 0.0064 0.0000 0.0000
dpbinom(1:5, c(0, 0, 0.1, 0.2, 0.4, 0.8, 1)) # confirmation
#> [1] 0.0864 0.4344 0.3784 0.0944 0.0064
dpbinom(NULL, c(0, 0, 0.4, 1)) # only 1 and 2 are observable
#> [1] 0.0 0.6 0.4 0.0 0.0
dpbinom(1:2, c(0, 0, 0.4, 1)) # confirmation
#> [1] 0.6 0.4
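For illustration, the reduction behind Case 3 can be reproduced manually; this is only a sketch of the preliminary considerations described above, not the package's internal implementation:
pp <- c(0, 0, 0.1, 0.2, 0.4, 0.8, 1)
shift <- sum(pp == 1)         # guaranteed successes shift the observable values
rand  <- pp[pp > 0 & pp < 1]  # only these indicators remain random
dpbinom(NULL, rand)           # distribution of the random part
#> [1] 0.0864 0.4344 0.3784 0.0944 0.0064
dpbinom(shift + 0:length(rand), pp)  # same probabilities inside the full distribution
#> [1] 0.0864 0.4344 0.3784 0.0944 0.0064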
These cases are illustrated in the following example:
set.seed(1)
pp <- runif(7)
va <- sample(0:6, 7, TRUE)
vb <- sample(0:6, 7, TRUE)
# Case 1
dgpbinom(NULL, pp, rep(1, 7), rep(0, 7))
#> [1] 8.114776e-05 3.112722e-03 4.063146e-02 2.115237e-01 3.793308e-01
#> [6] 2.735489e-01 8.297278e-02 8.798512e-03
dpbinom(NULL, pp) # equal results
#> [1] 8.114776e-05 3.112722e-03 4.063146e-02 2.115237e-01 3.793308e-01
#> [6] 2.735489e-01 8.297278e-02 8.798512e-03
dgpbinom(NULL, pp, rep(0, 7), rep(1, 7))
#> [1] 8.798512e-03 8.297278e-02 2.735489e-01 3.793308e-01 2.115237e-01
#> [6] 4.063146e-02 3.112722e-03 8.114776e-05
dpbinom(NULL, 1 - pp) # equal results
#> [1] 8.798512e-03 8.297278e-02 2.735489e-01 3.793308e-01 2.115237e-01
#> [6] 4.063146e-02 3.112722e-03 8.114776e-05
dgpbinom(NULL, pp, c(rep(1, 3), rep(0, 4)), c(rep(0, 3), rep(1, 4)))
#> [1] 3.062225e-02 1.998504e-01 3.769239e-01 2.828424e-01 9.450797e-02
#> [6] 1.426764e-02 9.620692e-04 2.331571e-05
dpbinom(NULL, c(pp[1:3], 1 - pp[4:7])) # reorder for 0 and 1; equal results
#> [1] 3.062225e-02 1.998504e-01 3.769239e-01 2.828424e-01 9.450797e-02
#> [6] 1.426764e-02 9.620692e-04 2.331571e-05
# Case 2 a)
dgpbinom(NULL, rep(0, 7), rep(4, 7), rep(2, 7)) # only 14 is observable
#> [1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dgpbinom(7 * 2, rep(0, 7), rep(4, 7), rep(2, 7)) # confirmation
#> [1] 1
# Case 2 b)
dgpbinom(NULL, rep(1, 7), rep(4, 7), rep(2, 7)) # only 28 is observable
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
dgpbinom(7 * 4, rep(1, 7), rep(4, 7), rep(2, 7)) # confirmation
#> [1] 1
# Case 2 c)
dgpbinom(NULL, rep(0.3, 7), rep(4, 7), rep(2, 7))
#> [1] 0.0823543 0.0000000 0.2470629 0.0000000 0.3176523 0.0000000 0.2268945
#> [8] 0.0000000 0.0972405 0.0000000 0.0250047 0.0000000 0.0035721 0.0000000
#> [15] 0.0002187
dbinom(0:7, 7, 0.3) # equal results, but on different support set
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
# Case 2 d)
dgpbinom(NULL, pp, rep(4, 7), rep(2, 7))
#> [1] 8.114776e-05 0.000000e+00 3.112722e-03 0.000000e+00 4.063146e-02
#> [6] 0.000000e+00 2.115237e-01 0.000000e+00 3.793308e-01 0.000000e+00
#> [11] 2.735489e-01 0.000000e+00 8.297278e-02 0.000000e+00 8.798512e-03
dpbinom(NULL, pp) # equal results, but on different support set
#> [1] 8.114776e-05 3.112722e-03 4.063146e-02 2.115237e-01 3.793308e-01
#> [6] 2.735489e-01 8.297278e-02 8.798512e-03
# Case 3 a)
dgpbinom(NULL, c(0, 0, 0, 0, 0, 0, 0), va, vb) # only sum(vb) is observable
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
dgpbinom(sum(vb), rep(0, 7), va, vb) # confirmation
#> [1] 1
# Case 3 b)
dgpbinom(NULL, c(1, 1, 1, 1, 1, 1, 1), va, vb) # only sum(va) is observable
#> [1] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dgpbinom(sum(va), rep(1, 7), va, vb) # confirmation
#> [1] 1
# Case 3 c)
dgpbinom(NULL, c(0, 0, 0, 1, 1, 1, 1), va, vb) # only sum(va[4:7], vb[1:3]) is observable
#> [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
dgpbinom(sum(va[4:7], vb[1:3]), c(0, 0, 0, 1, 1, 1, 1), va, vb) # confirmation
#> [1] 1
# Case 4
dgpbinom(NULL, c(0, 0, 0.3, 0.6, 1, 1, 1), va, vb)
#> [1] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.00 0.12 0.42 0.00 0.00 0.28
#> [16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sure <- sum(va[5:7], vb[1:2])
x.transf <- sum(pmin(va[3:4], vb[3:4])):sum(pmax(va[3:4], vb[3:4]))
dgpbinom(sure + x.transf, c(0, 0, 0.3, 0.6, 1, 1, 1), va, vb)
#> [1] 0.18 0.00 0.00 0.12 0.42 0.00 0.00 0.28
dgpbinom(x.transf, c(0.3, 0.6), va[3:4], vb[3:4]) # equal results
#> [1] 0.18 0.00 0.00 0.12 0.42 0.00 0.00 0.28
dgpbinom(NULL, c(0, 0, 0, 0.6, 1, 1, 1), va, vb)
#> [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.4 0.0 0.0 0.0 0.0
#> [20] 0.0 0.0 0.0 0.0 0.0 0.0
sure <- sum(va[5:7], vb[1:3])
x.transf <- va[4]:vb[4]
dgpbinom(sure + x.transf, c(0, 0, 0, 0.6, 1, 1, 1), va, vb)
#> [1] 0.6 0.0 0.0 0.4
dgpbinom(x.transf, 0.6, va[4], vb[4]) # equal results; essentially transformed Bernoulli
#> [1] 0.6 0.0 0.0 0.4
How to import and use the internal C++ functions in Rcpp
based packages is described in a separate
vignette.