---
title: "Mhorseshoe"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Mhorseshoe}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(Mhorseshoe)
library(ggplot2)
library(horseshoe)
```
In this vignette, the exact_horseshoe and approx_horseshoe functions are
explained in the following sections.

1. About the horseshoe estimator in this package
2. Modified approximate algorithm

# 1. About the horseshoe estimator in this package
The horseshoe prior is a continuous shrinkage prior frequently used in
high-dimensional Bayesian linear models, and it can be applied effectively to
sparse data; its shrinkage properties are theoretically well established.
Mhorseshoe provides estimators that apply the horseshoe prior. The likelihood
of the Gaussian linear model considered in this package is as follows:
$$L(y\ |\ X, \beta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^{N}
\exp\left\{-\frac{1}{2\sigma^2}(y-X\beta)^{T}(y-X\beta)\right\},\\ X \in
\mathbb{R}^{N \times p},\ y \in \mathbb{R}^{N},\ \beta \in \mathbb{R}^{p}.$$
The hierarchical model that applies the horseshoe prior to $\beta$ is
expressed as follows:
$$\beta_{j}\ |\ \sigma^{2}, \tau^{2}, \lambda_{j}^{2}
\overset{ind}{\sim} N\left(0, \sigma^{2}\tau^{2}\lambda_{j}^{2} \right),
\quad \lambda_{j}\overset{i.i.d.}{\sim}C^{+}(0,1),\quad j=1,2,\ldots,p, \\ \tau
\sim C^{+}(0,1),\quad \sigma^{2} \sim \mathrm{InvGamma}\left(
w/2,\ w/2\right).$$
where $C^{+}(0,1)$ is the half-Cauchy distribution,
$\lambda_{1}, \ldots, \lambda_{p}$ are local shrinkage parameters, and $\tau$
is the global shrinkage parameter. The exact_horseshoe function in this package
is a horseshoe estimator for the prior defined above, following the exact MCMC
algorithm of Johndrow et al. (2020). It adopts a blocked
Metropolis-within-Gibbs sampler that updates $\lambda$ and
$\left(\tau, \sigma, \beta \right)$ in turn. Since the algorithm is designed
for the $p \gg N$ case, the computational cost is lowered by using the fast
sampling method of Bhattacharya et al. (2016) in the step of sampling $\beta$.
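To make the $\beta$ step concrete, here is a minimal sketch of the fast
sampler described by Bhattacharya et al. (2016); it is written for this
vignette and is not the internal code of exact_horseshoe. The sampler draws
from $N(\Sigma\Phi^{T}\alpha,\ \Sigma)$ with
$\Sigma = (\Phi^{T}\Phi + D^{-1})^{-1}$ at $O(N^{2}p)$ cost instead of
$O(p^{3})$; for the $\beta$ step of the horseshoe, $\Phi = X/\sigma$,
$\alpha = y/\sigma$, and $D = \sigma^{2}\tau^{2}\Lambda$.
```{r}
# Minimal sketch of the fast sampler of Bhattacharya et al. (2016).
# Phi: N x p matrix, alpha: length-N vector, d: length-p vector of prior
# variances (the diagonal of D). Returns one draw of beta.
fast_sample <- function(Phi, alpha, d) {
  N <- nrow(Phi)
  u <- rnorm(ncol(Phi), mean = 0, sd = sqrt(d))          # u ~ N(0, D)
  v <- as.vector(Phi %*% u) + rnorm(N)                   # v = Phi u + delta
  w <- solve(Phi %*% (d * t(Phi)) + diag(N), alpha - v)  # (Phi D Phi^T + I)^{-1}(alpha - v)
  as.vector(u + d * as.vector(t(Phi) %*% w))             # u + D Phi^T w
}
```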
For simulation, the data size is set to $N = 300$, $p = 500$, and the design
matrix $X=\{x_{j} \}_{j=1}^{p}$ and the true value of $\beta$ are generated as
follows:
$$x_{j} \sim N_{N}\left(0, I_{N} \right), \\ y_{i} \sim N(x_{i}\beta,\ 4), \\
\beta_{j} = \begin{cases} 1, & \mbox{if }\ j < 51, \\ 0, &
\mbox{otherwise.}\end{cases}$$
Note that this data is intended to test the application of the algorithms; it
is not a simulation that strictly satisfies sparsity conditions.
```{r}
# making simulation data.
set.seed(123)
N <- 300
p <- 500
p_star <- 50
true_beta <- c(rep(1, p_star), rep(0, p-p_star))
# design matrix X.
X <- matrix(1, nrow = N, ncol = p)
for (i in 1:p) {
  X[, i] <- stats::rnorm(N, mean = 0, sd = 1)
}
# response variable y.
y <- vector(mode = "numeric", length = N)
e <- rnorm(N, mean = 0, sd = 2)
for (i in 1:p_star) {
  y <- y + true_beta[i] * X[, i]
}
y <- y + e
```
We compare the results on this simulated data with those of the horseshoe
function in the horseshoe package. We fit the data with both horseshoe and
exact_horseshoe and plot the results for the first 100 indices, which include
the 50 non-zero coefficients.
```{r fig.width = 6, fig.height= 4}
# horseshoe in horseshoe package.
horseshoe_result <- horseshoe::horseshoe(y, X, method.tau = "halfCauchy",
method.sigma = "Jeffreys",
burn = 0, nmc = 500)
# exact_horseshoe in Mhorseshoe package.
exact_horseshoe_result <- exact_horseshoe(y, X, burn = 0, iter = 500)
df <- data.frame(index = 1:100,
horseshoe_BetaHat = horseshoe_result$BetaHat[1:100],
horseshoe_LeftCI = horseshoe_result$LeftCI[1:100],
horseshoe_RightCI = horseshoe_result$RightCI[1:100],
exhorseshoe_BetaHat = exact_horseshoe_result$BetaHat[1:100],
exhorseshoe_LeftCI = exact_horseshoe_result$LeftCI[1:100],
exhorseshoe_RightCI = exact_horseshoe_result$RightCI[1:100],
true_beta = true_beta[1:100])
# Estimation results of the horseshoe in horseshoe package.
ggplot(data = df, aes(x = index, y = true_beta)) +
geom_point(size = 2) +
geom_point(aes(x = index, y = horseshoe_BetaHat), size = 2, col = "red") +
geom_errorbar(aes(ymin = horseshoe_LeftCI,
ymax = horseshoe_RightCI), width = .1, col = "red") +
labs(title = "95% Credible intervals of the horseshoe in horseshoe package",
y = "beta")
# Estimation results of the exact_horseshoe in Mhorseshoe package.
ggplot(data = df, aes(x = index, y = true_beta)) +
geom_point(size = 2) +
geom_point(aes(x = index, y = exhorseshoe_BetaHat),
size = 2, col = "red") +
geom_errorbar(aes(ymin = exhorseshoe_LeftCI,
ymax = exhorseshoe_RightCI), width = .1, col = "red") +
  labs(title = "95% Credible intervals of the exact_horseshoe in Mhorseshoe package",
       y = "beta")
```
The results of the two functions are essentially the same. For both, the 95%
credible intervals of all non-zero coefficients except indices 4, 36, 41, and
46 contain 1.
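This can be checked directly from the data frame built above; each line below
returns the indices among the first 50 whose 95% credible interval does not
contain 1.
```{r}
# Non-zero indices whose 95% credible interval excludes the true value 1.
which(df$horseshoe_LeftCI[1:50] > 1 | df$horseshoe_RightCI[1:50] < 1)
which(df$exhorseshoe_LeftCI[1:50] > 1 | df$exhorseshoe_RightCI[1:50] < 1)
```
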
# 2. Modified approximate algorithm
Johndrow et al. (2020) proposed a scalable approximate MCMC algorithm that
reduces computational cost by introducing a thresholding step while applying
the horseshoe prior. The columns that satisfy the condition
$\tau^{2}\lambda_{j}^{2} > \delta$ form the active set, and $S$ is defined as
the set of their indices:
$$S = \{j\ :\ \tau^{2}\lambda_{j}^{2} > \delta,\ j=1,2,\ldots,p\}.$$
If $\tau^{2}\lambda_{j}^{2}$ is very small, the conditional posterior of
$\beta_{j}$ has mean and variance close to 0. Let $s_{\delta}$ be the number
of elements of $S$ and let
$\Lambda = \mathrm{diag}(\lambda_{1}^{2}, \ldots, \lambda_{p}^{2})$ be the
diagonal matrix of local shrinkage parameters. The approximate algorithm
applies the reduced matrix
$\Lambda_{S} \in \mathbb{R}^{s_{\delta} \times s_{\delta}}$ in the main
sampling steps, which lowers the computational cost. This algorithm is
implemented in the approx_horseshoe function.
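As a minimal sketch of the thresholding step (with hypothetical shrinkage
values chosen for illustration, not the sampler's internal state), computing
the active set is a one-line operation:
```{r}
# Hypothetical shrinkage values: large for the 50 signals, tiny otherwise.
tau2    <- 0.01
lambda2 <- c(rep(100, 50), rep(0.001, p - 50))
delta   <- 1e-4
S <- which(tau2 * lambda2 > delta)  # indices in the active set
length(S)                           # s_delta
```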
Johndrow et al. (2020) use a fixed $\delta$ in the approximate algorithm. In
contrast, this package adds a new method that sets $\delta$ adaptively.
The adaptive algorithm developed in this package estimates a new threshold
$\delta^{New}$ from the shrinkage parameters updated by MCMC sampling and
adopts $\delta^{New}$ with an adaptive probability. The threshold update
applies the properties of the shrinkage weights $k_{j},\ j=1,2,\ldots,p$,
proposed by Piironen and Vehtari (2017). Under the prior
$\beta_{j} \sim N(0, \sigma^{2}\tau^{2}\lambda_{j}^{2})$, the quantity
$m_{eff}$ is defined as follows:
$$k_{j} = 1/\left(1+n\tau^{2}s_{j}^{2}\lambda_{j}^{2} \right), \quad m_{eff} = \sum_{j=1}^{p}{\left(1-k_{j} \right)}.$$
where $ns_{j}^{2},\ j=1,2,\ldots,p$, are the diagonal components of $X^{T}X$.
For zero components of $\beta$, $k_{j}$ is close to 1, while for non-zero
components $k_{j}$ is close to 0, so $m_{eff}$ is called the effective number
of non-zero coefficients. In this algorithm, $\delta$ is updated so that
$s_{\delta} = \lceil m_{eff} \rceil$. The adaptive probability is defined to
satisfy Theorem 5 (the diminishing adaptation condition) of Roberts and
Rosenthal (2007). At the $T$-th iteration,
$$p(T) = \exp\left[p_{0} + p_{1}T\right],\quad p_{1} < 0,
\quad u \sim U(0, 1), \\ \mbox{if}\ u < p(T),\ \mbox{accept}\ \delta^{New}
\mbox{ and set } s_{\delta^{New}} = \lceil m_{eff} \rceil.$$
The defaults are $p_{0} = 0$ and $p_{1} = -4.6 \times 10^{-4}$, under which
$p(10000) \approx 0.01$. We call the approximate algorithm extended with this
procedure the modified approximate algorithm.
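The sketch below (again with hypothetical shrinkage values, reusing the
illustrative tau2 and lambda2 defined above rather than the sampler's internal
state) shows how $m_{eff}$, the resulting $s_{\delta}$, and the acceptance
probability $p(T)$ could be computed.
```{r}
# Shrinkage weights k_j and the effective number of non-zero coefficients.
s2 <- colSums(X^2) / N                 # s_j^2, since diag(X^T X) = N * s_j^2
k <- 1 / (1 + N * tau2 * s2 * lambda2)
m_eff <- sum(1 - k)
ceiling(m_eff)                         # new s_delta
# Adaptive acceptance probability at iteration T with the default constants.
p0 <- 0
p1 <- -4.6e-4
exp(p0 + p1 * 10000)                   # p(10000), approximately 0.01
```
The simulation below runs both the fixed-threshold approximate algorithm and
the modified version on the same data.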
```{r fig.width = 6, fig.height= 4}
# approximate algorithm with fixed default threshold.
approx_horseshoe_result <- approx_horseshoe(y, X, burn = 0, iter = 500,
auto.threshold = FALSE)
# modified approximate algorithm.
mapprox_horseshoe_result <- approx_horseshoe(y, X, burn = 0, iter = 500,
auto.threshold = TRUE)
df2 <- data.frame(index = 1:100,
approx_BetaHat = approx_horseshoe_result$BetaHat[1:100],
approx_LeftCI = approx_horseshoe_result$LeftCI[1:100],
approx_RightCI = approx_horseshoe_result$RightCI[1:100],
mapprox_BetaHat = mapprox_horseshoe_result$BetaHat[1:100],
mapprox_LeftCI = mapprox_horseshoe_result$LeftCI[1:100],
mapprox_RightCI = mapprox_horseshoe_result$RightCI[1:100],
true_beta = true_beta[1:100])
# Estimation results of the approximate algorithm.
ggplot(data = df2, aes(x = index, y = true_beta)) +
geom_point(size = 2) +
geom_point(aes(x = index, y = approx_BetaHat), size = 2, col = "red") +
geom_errorbar(aes(ymin = approx_LeftCI,
ymax = approx_RightCI), width = .1, col = "red") +
labs(title = "95% Credible intervals of the approx_horseshoe", y = "beta")
# Estimation results of the modified approximate algorithm.
ggplot(data = df2, aes(x = index, y = true_beta)) +
geom_point(size = 2) +
geom_point(aes(x = index, y = mapprox_BetaHat),
size = 2, col = "red") +
geom_errorbar(aes(ymin = mapprox_LeftCI,
ymax = mapprox_RightCI),
width = .1, col = "red") +
labs(title = "95% Credible intervals of the modified_approx_horseshoe",
y = "beta")
```
In the case of exact_horseshoe, the full $\Lambda$ is always used, so
$length(S) = p = 500$ at every iteration. On the other hand, since the
approximate algorithm and the modified approximate algorithm set the threshold
differently, they differ in $length(S) = s_{\delta}$.
```{r fig.width = 6, fig.height= 4}
exact_activeset <- rep(p, 500)
approx_activeset <- apply(approx_horseshoe_result$ActiveSet, MARGIN = 1, sum)
mapprox_activeset <- apply(mapprox_horseshoe_result$ActiveSet, MARGIN = 1, sum)
# active set plot
ggplot(data = data.frame(X = 1:500,
exact_activeset = exact_activeset,
approx_activeset = approx_activeset,
mapprox_activeset = mapprox_activeset)) +
geom_line(mapping = aes(x = X, y = exact_activeset,
color = "exact")) +
geom_line(mapping = aes(x = X, y = approx_activeset,
color = "approx"),
alpha = 0.5) +
geom_line(mapping = aes(x = X, y = mapprox_activeset,
color = "modified_approx"),
alpha = 0.5) +
scale_color_manual(name = "algorithm",
values = c("black", "red", "blue"),
breaks = c("exact", "approx", "modified_approx"),
labels = c("exact", "approx", "modified_approx"))
```
# References
Bhattacharya, A., Chakraborty, A., & Mallick, B. K. (2016). Fast sampling with
Gaussian scale mixture priors in high-dimensional regression. Biometrika,
103(4), 985-991.

Johndrow, J., Orenstein, P., & Bhattacharya, A. (2020). Scalable approximate
MCMC algorithms for the horseshoe prior. Journal of Machine Learning Research,
21(73), 1-61.

Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in
the horseshoe and other shrinkage priors. Electronic Journal of Statistics,
11, 5018-5051.

Roberts, G. O., & Rosenthal, J. S. (2007). Coupling and ergodicity of adaptive
Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2),
458-475.