DuMouchel used an “Empirical Bayes” (EB) method. That is, the data are a driving force behind the choice of the prior distribution (in contrast to typical Bayesian methods, where the prior is chosen without regard to the actual data). One outcome of this is a known posterior distribution that relies on a set of hyperparameters derived from the prior distribution. These hyperparameters are estimated by their most likely value in the context of Empirical Bayesian methods. One process by which this can occur involves maximizing the likelihood function of the marginal distributions of the counts (or in our case, minimizing the negative log-likelihood function).
Global optimization is a broad field, and many existing R packages provide minimization routines that can be used to estimate these hyperparameters. The hyperparameter estimation functions offered by openEBGM rely on Newton-type routines from R's stats package (nlminb(), nlm(), and optim() with the BFGS method) and on an EM-type algorithm implemented in hyperEM().
These are local optimization algorithms. The Newton-like approaches allow the user to choose multiple starting points to improve the chances of finding a global optimum. The user is encouraged to explore a variety of optimization approaches because the accuracy of a global optimization result is extremely difficult to verify, and other approaches might work better in some cases.
1. Meng X-L, Rubin D (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework." Biometrika, 80(2), 267-278.
2. DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
3. Millar RB (2011). "Maximum Likelihood Estimation and Inference." John Wiley & Sons, Ltd, pp. 111-112.
DuMouchel & Pregibon (2) discuss a methodology they call "data squashing", which "compacts" the data set to reduce the amount of computation needed to estimate the hyperparameters. openEBGM provides an implementation of data squashing.
The actual counts (\(N\)) and expected counts (\(E\)) are used to estimate the hyperparameters of the prior distribution. A large contingency table will have many cells, resulting in computational difficulties for the optimization routines needed for estimation. Data squashing (DuMouchel et al., 2001) transforms a set of 2-dimensional points to a smaller number of 3-dimensional points. The idea is to reduce a large number of points \((N, E)\) to a smaller number of points \((N_k, E_k, W_k)\), where \(k = 1,...,M\) and \(W_k\) is the weight of the \(k^{th}\) “superpoint”. To minimize information loss, only points close to each other should be squashed.
For a given \(N\), squashData() combines points with similar \(E\)s into bins using a specified bin size and uses the average \(E\) within each bin as the \(E\) for that bin's "superpoint". The new superpoints are weighted by bin size. For example, the points (1, 1.1) and (1, 1.3) could be squashed to the superpoint (1, 1.2, 2).
An example is given below using unstratified expected counts:
library(openEBGM)
data(caers)
processed <- processRaw(caers) #Calculate actual counts (N) and expected counts (E)
processed[1:4, 1:4]
#> var1 var2 N E
#> 1 1-PHENYLALANINE HEART RATE INCREASED 1 0.0360548272
#> 2 11 UNSPECIFIED VITAMINS ASTHMA 1 0.0038736591
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738
#> 4 11 UNSPECIFIED VITAMINS CHEST PAIN 1 0.0360548272
squashed <- squashData(processed) #Using defaults
head(squashed)
#> N E weight
#> 1 1 0.0002979738 50
#> 2 1 0.0002979738 50
#> 3 1 0.0002979738 50
#> 4 1 0.0002979738 50
#> 5 1 0.0002979738 50
#> 6 1 0.0002979738 50
nrow(processed)
#> [1] 17189
nrow(squashed)
#> [1] 1568
As shown above, the squashed data set contains only 9.12% as many observations as the full data set (1,568 vs. 17,189 rows). Using this squashed data, we can estimate the hyperparameters far more efficiently.
squashData() can be used iteratively for each count (\(N\)):
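For example, one might squash the \(N = 1\) points first and then the \(N = 2\) points with a smaller bin size. The object name squash2 and the bin_size/keep_pts values below are illustrative choices, not recommendations:
squash2 <- squashData(processed, count = 1, bin_size = 50, keep_pts = 100) #Squash the N = 1 points
squash2 <- squashData(squash2, count = 2, bin_size = 10, keep_pts = 50) #Then squash the N = 2 points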
Alternatively, autoSquash() can be used to squash all counts at once:
squash3 <- autoSquash(processed)
ftable(squash3[, c("N", "weight")])
#> weight 1 2 3 6 8 31
#> N
#> 1 100 0 0 1 0 514
#> 2 75 0 1 0 79 0
#> 3 51 71 0 0 0 0
#> 4 80 0 0 0 0 0
#> 5 41 0 0 0 0 0
#> 6 31 0 0 0 0 0
#> 7 17 0 0 0 0 0
#> 8 20 0 0 0 0 0
#> 9 14 0 0 0 0 0
#> 10 6 0 0 0 0 0
#> 11 7 0 0 0 0 0
#> 12 3 0 0 0 0 0
#> 13 3 0 0 0 0 0
#> 14 2 0 0 0 0 0
#> 16 4 0 0 0 0 0
#> 17 2 0 0 0 0 0
#> 18 5 0 0 0 0 0
#> 19 1 0 0 0 0 0
#> 21 2 0 0 0 0 0
#> 22 1 0 0 0 0 0
#> 23 1 0 0 0 0 0
#> 24 1 0 0 0 0 0
#> 27 1 0 0 0 0 0
#> 28 1 0 0 0 0 0
#> 42 1 0 0 0 0 0
#> 53 1 0 0 0 0 0
#> 54 1 0 0 0 0 0
As previously mentioned, the hyperparameters are estimated by minimizing the negative log-likelihood function. There are actually 4 different functions, depending on the use of data squashing and zero counts. All 4 functions, however, are based on the marginal distribution of the counts, which are mixtures of two negative binomial distributions. The hyperparameters are denoted by the vector \(\theta=(\alpha_1,\beta_1,\alpha_2,\beta_2,P)\), where \(P\) is the mixture fraction.
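For orientation, here is a sketch of the model behind these functions, assuming the shape-rate parameterization of the gamma prior used by DuMouchel and ignoring the truncation adjustment applied when counts below N_star are excluded. The marginal density of a count \(N = n\) with expected count \(E\) is the two-component negative binomial mixture
\[ f(n \mid E, \theta) = P \, g(n; \alpha_1, \beta_1, E) + (1 - P) \, g(n; \alpha_2, \beta_2, E), \qquad g(n; \alpha, \beta, E) = \binom{n + \alpha - 1}{n} \left( \frac{\beta}{\beta + E} \right)^{\alpha} \left( \frac{E}{\beta + E} \right)^{n}, \]
and, for squashed data, the quantity minimized is the weighted negative log-likelihood
\[ -\ell(\theta) = -\sum_{k=1}^{M} W_k \, \log f(N_k \mid E_k, \theta). \]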
The most commonly used likelihood function is negLLsquash(), which is used when squashing data and not using zero counts. negLLsquash() is not called directly, but rather by some optimization function:
theta_init <- c(alpha1 = 0.2, beta1 = 0.1, alpha2 = 2, beta2 = 4, p = 1/3) #Initial guess for theta
stats::nlm(negLLsquash, p = theta_init,
           ni = squashed$N, ei = squashed$E, wi = squashed$weight, N_star = 1)
#> $minimum
#> [1] 4162.496
#>
#> $estimate
#> [1] 3.25086926 0.39971401 2.02581887 1.90806636 0.06539159
#>
#> $gradient
#> [1] 3.076907e-05 -2.253273e-04 -1.156701e-04 2.452500e-04 -3.937066e-04
#>
#> $code
#> [1] 1
#>
#> $iterations
#> [1] 71
The N_star argument allows the user to choose the smallest value of \(N\) used for hyperparameter estimation. The user must be careful to match N_star with the actual minimum count in ni. If the user wishes to use a larger value for N_star, the vectors supplied to arguments ni, ei, and wi must be filtered. Here, we are using all counts except zeroes, so no filtering is needed. In general, N_star = 1 should be used whenever practical.
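For example (illustration only: the cutoff of 2 and the object name squashed_2plus are hypothetical), a run with a larger N_star would first filter the squashed data and then pass the matching value:
squashed_2plus <- squashed[squashed$N >= 2, ] #Keep only counts of at least 2
stats::nlm(negLLsquash, p = theta_init,
           ni = squashed_2plus$N, ei = squashed_2plus$E, wi = squashed_2plus$weight,
           N_star = 2)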
Note that the optimization function will likely produce warning messages; use your own judgment to determine whether they indicate real problems.
The other likelihood functions are negLL(), negLLzero(), and negLLzeroSquash(). Make sure to use the appropriate function for your choice of data squashing and use of zero counts.
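As a quick reference, the pairings below follow from the function names and the descriptions above; consult the package help files to confirm each function's argument interface:
#Choosing a likelihood function:
#  squashed data, zero counts excluded  ->  negLLsquash()
#  squashed data, zero counts included  ->  negLLzeroSquash()
#  full data,     zero counts excluded  ->  negLL()
#  full data,     zero counts included  ->  negLLzero()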
Hyperparameters can be estimated by exploring the parameter space of the likelihood function using either the full data set of \(N\)s and \(E\)s or the squashed set. The methodology implemented by this package maximizes the likelihood function (or, more precisely, minimizes the negative log-likelihood function). Starting points must be chosen to begin the exploration. DuMouchel (1999, 2001) provides a "lightly justified" set of initial hyperparameters; however, openEBGM's functions accept multiple sets of starting points to help reach convergence and reduce the chance of false convergence. We begin by defining some starting points:
theta_init <- data.frame(alpha1 = c(0.2, 0.1, 0.3),
                         beta1  = c(0.1, 0.2, 0.5),
                         alpha2 = c(2, 10, 6),
                         beta2  = c(4, 10, 6),
                         p      = c(1/3, 0.2, 0.5))
autoHyper()
Now that the initial guesses for the hyperparameters have been
defined, the function autoHyper()
can be used to determine
the actual hyperparameter estimates. autoHyper()
performs a
“verification check” by requiring the optimization routine to converge
at least twice within the bounds of the parameter space. The estimate
corresponding to the smallest negative log-likelihood is chosen as a
tentative result. By default, this estimate must be similar to at least
one other convergent solution. If the algorithm fails to consistently
converge, an error message is returned.
system.time(
hyper_estimates_full <- autoHyper(data = processed, theta_init = theta_init,
squashed = FALSE)
)
#> user system elapsed
#> 20.43 0.31 20.75
squashed2 <- squashData(squashed, count = 2, bin_size = 10, keep_pts = 50)
system.time(
hyper_estimates_squashed <- autoHyper(data = squashed2, theta_init = theta_init)
)
#> user system elapsed
#> 1.03 0.03 1.06
hyper_estimates_full
#> $method
#> [1] "nlminb"
#>
#> $estimates
#> alpha1 beta1 alpha2 beta2 P
#> 3.25596685 0.39998994 2.02376139 1.90614108 0.06530755
#>
#> $conf_int
#> NULL
#>
#> $num_close
#> [1] 2
#>
#> $theta_hats
#> guess_num a1_hat b1_hat a2_hat b2_hat p_hat code converge
#> 1 1 3.255974 0.3999893 2.023751 1.906132 0.06530713 0 TRUE
#> 2 2 3.256254 0.4000054 2.023724 1.906091 0.06530283 0 TRUE
#> 3 3 3.255967 0.3999899 2.023761 1.906141 0.06530755 0 TRUE
#> in_bounds minimum
#> 1 TRUE 4162.456
#> 2 TRUE 4162.456
#> 3 TRUE 4162.456
hyper_estimates_squashed
#> $method
#> [1] "nlminb"
#>
#> $estimates
#> alpha1 beta1 alpha2 beta2 P
#> 3.25374319 0.39988512 2.02611022 1.90806892 0.06534614
#>
#> $conf_int
#> NULL
#>
#> $num_close
#> [1] 2
#>
#> $theta_hats
#> guess_num a1_hat b1_hat a2_hat b2_hat p_hat code converge
#> 1 1 3.251938 0.3997615 2.026192 1.908225 0.06536609 0 TRUE
#> 2 2 3.237922 0.3989700 2.027493 1.910069 0.06557221 0 TRUE
#> 3 3 3.253743 0.3998851 2.026110 1.908069 0.06534614 0 TRUE
#> in_bounds minimum
#> 1 TRUE 4161.93
#> 2 TRUE 4161.93
#> 3 TRUE 4161.93
As seen above, the process is much faster when utilizing the squashed
data, with estimates that are nearly identical. Of course, the amount of
efficiency increase depends on the parameter values used in the call to
squashData()
and the size of the original data set. Another
factor affecting run time is the number of starting points used. In
general, you should use five or fewer starting points and limit the
number of data points to a maximum of about 20,000 if possible. Notice
that we squashed the same data again, which is allowed if we use a
different value for count
each time.
autoHyper()
utilizes multiple minimization techniques to
verify convergence. It first attempts the stats::nlminb()
function, which implements a quasi-Newton unconstrained optimization
technique using PORT routines. If nlminb()
fails to
consistently converge, autoHyper()
next attempts
stats::nlm(), which is a non-linear minimization algorithm based on a Newton-type method. Finally, if the first two approaches fail, a
quasi-Newton method (also known as a variable metric algorithm) is used,
which “…uses function values and gradients to build up a picture of the
surface to be optimized.” (source: R documentation for stats::optim)
This routine is implemented by the stats::optim()
function
with the method = "BFGS" argument.
autoHyper() can now return standard errors and confidence intervals for the hyperparameter estimates:
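A minimal sketch of requesting them is shown below; the conf_ints argument name is an assumption here, so check ?autoHyper for the exact interface:
#Sketch only: the 'conf_ints' argument is assumed; see ?autoHyper
autoHyper(data = squashed2, theta_init = theta_init, conf_ints = TRUE)$conf_int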
exploreHypers()
autoHyper()
is a semi-automated (but imperfect) approach
to hyperparameter estimation. The user is encouraged to explore various
optimization approaches as no single approach will always work (the
functions in openEBGM are not the only optimization functions
available in R). One way to explore is with openEBGM’s
exploreHypers()
function, which is actually called by
autoHyper()
behind-the-scenes. exploreHypers()
can now also return standard errors for the hyperparameter
estimates:
exploreHypers(data = squashed2, theta_init = theta_init, std_errors = TRUE)
#> $estimates
#> guess_num a1_hat b1_hat a2_hat b2_hat p_hat code converge
#> 1 1 3.251938 0.3997615 2.026192 1.908225 0.06536609 0 TRUE
#> 2 2 3.237922 0.3989700 2.027493 1.910069 0.06557221 0 TRUE
#> 3 3 3.253743 0.3998851 2.026110 1.908069 0.06534614 0 TRUE
#> in_bounds minimum
#> 1 TRUE 4161.93
#> 2 TRUE 4161.93
#> 3 TRUE 4161.93
#>
#> $std_errs
#> guess_num a1_se b1_se a2_se b2_se p_se
#> 1 1 2.280957 0.1435365 0.4512203 0.4323834 0.03573021
#> 2 2 2.235704 0.1410427 0.4504548 0.4304327 0.03543024
#> 3 3 2.287523 0.1438980 0.4513726 0.4327411 0.03578473
exploreHypers() offers three gradient-based optimization methods and is essentially a wrapper around the commonly used stats functions mentioned earlier: nlminb() (the default), nlm(), and optim().
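For instance, a different optimizer can be requested for a given run; the method argument name and the option string used below are assumptions, so consult ?exploreHypers for the exact interface:
#Sketch only: the 'method' argument and "nlm" option are assumed; see ?exploreHypers
exploreHypers(data = squashed2, theta_init = theta_init, method = "nlm")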
hyperEM()
hyperEM()
implements a version of the EM algorithm
referred to by Meng & Rubin (1) as the
Expectation/Conditional Maximization (ECM) algorithm.
hyperEM()
uses a single starting point to find a local
maximum likelihood estimate for \(\theta\) either by finding roots of the
score functions (partial derivatives of the log-likelihood function) or
by using stats::nlminb()
to directly minimize the negative
log-likelihood function. The method described by Millar (3) is
used to accelerate the estimate of \(\theta\) every 100 iterations of the
algorithm.
data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 100, keep_pts = 0)
squashed <- squashData(squashed, count = 2, bin_size = 12)
hyperEM_ests <- hyperEM(squashed, theta_init_vec = c(1, 1, 2, 2, .3),
conf_ints = TRUE, track = TRUE)
#> change in LL: 1.15e+00 | theta: 3.12 0.63 1.66 2.101 0.159
#> change in LL: 2.88e-01 | theta: 3.364 0.528 2.222 2.281 0.104
#> change in LL: 5.67e-02 | theta: 3.46 0.47 2.387 2.263 0.081
#> change in LL: 2.04e-02 | theta: 3.458 0.44 2.359 2.188 0.072
#> change in LL: 1.27e-02 | theta: 3.41 0.425 2.282 2.113 0.069
#> Current iterations: 50
#> change in LL: 7.42e-03 | theta: 3.357 0.416 2.21 2.055 0.068
#> change in LL: 3.93e-03 | theta: 3.312 0.411 2.155 2.014 0.067
#> change in LL: 1.98e-03 | theta: 3.276 0.407 2.116 1.985 0.067
#> change in LL: 9.7e-04 | theta: 3.248 0.404 2.089 1.966 0.067
#> change in LL: 1.24e-04 | theta: 3.135 0.395 2.032 1.927 0.067
#> Current iterations: 100
#> change in LL: 1.75e-05 | theta: 3.143 0.395 2.032 1.928 0.068
#> change in LL: 1.8e-05 | theta: 3.138 0.395 2.033 1.929 0.068
#> change in LL: 8.34e-06 | theta: 3.132 0.395 2.034 1.93 0.068
#> change in LL: 8.36e-06 | theta: 3.128 0.394 2.035 1.931 0.068
#> change in LL: 7.43e-06 | theta: 3.125 0.394 2.036 1.932 0.068
#> Current iterations: 150
#> change in LL: 6.08e-06 | theta: 3.121 0.394 2.036 1.933 0.068
#> change in LL: 5.56e-06 | theta: 3.118 0.394 2.037 1.933 0.068
#> change in LL: 5.15e-06 | theta: 3.115 0.394 2.038 1.934 0.068
#> change in LL: 4.79e-06 | theta: 3.112 0.394 2.038 1.935 0.068
#> change in LL: 4.45e-06 | theta: 3.109 0.394 2.039 1.936 0.068
#> Current iterations: 200
#>
#> Iterations used: 203
#>
#> Timing:
#> user system elapsed
#> 2.91 0.00 2.91
str(hyperEM_ests)
#> List of 9
#> $ estimates : num [1:5] 3.1083 0.3935 2.0392 1.9358 0.0683
#> $ conf_int :'data.frame': 5 obs. of 4 variables:
#> ..$ pt_est: num [1:5] 3.1083 0.3935 2.0392 1.9358 0.0683
#> ..$ SE : num [1:5] 1.9646 0.125 0.4645 0.4447 0.0356
#> ..$ LL_95 : num [1:5] -0.7422 0.1484 1.1288 1.0642 -0.0014
#> ..$ UL_95 : num [1:5] 6.959 0.639 2.95 2.807 0.138
#> $ maximum : num -4164
#> $ method : chr "score"
#> $ elapsed : num 2.91
#> $ iters : num 203
#> $ score : num [1:5] -0.01142 0.03072 0.01033 0.01029 0.00216
#> $ score_norm: num 0.0359
#> $ tracking :'data.frame': 204 obs. of 7 variables:
#> ..$ iter : int [1:204] 0 1 2 3 4 5 6 7 8 9 ...
#> ..$ logL : num [1:204] -4328 -4250 -4221 -4200 -4187 ...
#> ..$ alpha1: num [1:204] 1 1.89 2.02 2.25 2.49 ...
#> ..$ beta1 : num [1:204] 1 0.894 0.821 0.77 0.735 ...
#> ..$ alpha2: num [1:204] 2 2.54 2.06 1.74 1.55 ...
#> ..$ beta2 : num [1:204] 2 1.93 1.89 1.88 1.89 ...
#> ..$ P : num [1:204] 0.3 0.313 0.283 0.254 0.232 ...
Setting track = TRUE
tracks the log-likelihood and
hyperparameter estimates at each iteration, which we can plot to study
the behavior of the algorithm:
library(ggplot2)
library(tidyr)
pdat <- gather(hyperEM_ests$tracking, key = "metric", value = "value", logL:P)
pdat$metric <- factor(pdat$metric, levels = unique(pdat$metric), ordered = TRUE)
ggplot(pdat, aes(x = iter, y = value)) +
geom_line(size = 1.1, col = "blue") +
facet_grid(metric ~ ., scales = "free") +
ggtitle("Convergence Assessment",
subtitle = "Dashed red line indicates accelerated estimate") +
labs(x = "Iteration Count", y = "Estimate") +
geom_vline(xintercept = c(100, 200), size = 1, linetype = 2, col = "red")
Other packages that specialize in optimization, such as DEoptim, can alternatively be used to estimate the hyperparameters:
library(DEoptim)
set.seed(123456)
theta_hat <- DEoptim(negLLsquash,
lower = c(rep(1e-05, 4), .001),
upper = c(rep(5, 4), 1 - .001),
control = DEoptim.control(
itermax = 2000,
reltol = 1e-04,
steptol = 200,
NP = 100,
CR = 0.85,
F = 0.75,
trace = 25
),
ni = squashed$N, ei = squashed$E, wi = squashed$weight
)
#> Iteration: 25 bestvalit: 4165.801982 bestmemit: 3.422317 0.420197 2.321294 2.046197 0.072780
#> Iteration: 50 bestvalit: 4163.993567 bestmemit: 3.132083 0.401064 2.046266 1.935360 0.069050
#> Iteration: 75 bestvalit: 4163.975114 bestmemit: 3.093758 0.392458 2.051624 1.945084 0.068382
#> Iteration: 100 bestvalit: 4163.974353 bestmemit: 3.067418 0.391168 2.045820 1.943878 0.069060
#> Iteration: 125 bestvalit: 4163.974343 bestmemit: 3.063852 0.391105 2.047033 1.945090 0.069152
#> Iteration: 150 bestvalit: 4163.974343 bestmemit: 3.063464 0.391089 2.047299 1.945344 0.069157
#> Iteration: 175 bestvalit: 4163.974343 bestmemit: 3.063485 0.391089 2.047268 1.945317 0.069156
#> Iteration: 200 bestvalit: 4163.974343 bestmemit: 3.063492 0.391090 2.047261 1.945312 0.069156
#> Iteration: 225 bestvalit: 4163.974343 bestmemit: 3.063488 0.391090 2.047258 1.945309 0.069156
(theta_hat <- as.numeric(theta_hat$optim$bestmem))
#> [1] 3.06348827 0.39108968 2.04725750 1.94530893 0.06915639
Although the user is encouraged to explore various approaches for
maximum likelihood hyperparameter estimation, autoHyper()
will often give reasonable results with minimal effort. Once the
hyperparameters have been estimated, they can be used in the calculation
of the \(EBGM\) and quantile scores by
applying them to the posterior distribution. This process can be found
in the Empirical Bayes Metrics with openEBGM vignette.