---
title: "COMPoissonReg: Usage, the Normalizing Constant, and Other Computational Details"
author:
- Andrew M. Raim^[Center for Statistical Research & Methodology, U.S. Census Bureau, Washington, DC, 20233, U.S.A. **Disclaimer**: This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed are those of the authors and not those of the U.S. Census Bureau.]
- Kimberly F. Sellers^[Center for Statistical Research & Methodology, U.S. Census Bureau and Department of Mathematics and Statistics, Georgetown University, Washington, DC, 20057, U.S.A.]
abstract: >
  `COMPoissonReg` is an R package which supports Conway-Maxwell Poisson (CMP) and Zero-Inflated Conway-Maxwell Poisson (ZICMP) models. This vignette describes fundamental computational details, especially those involving the normalizing constant and related quantities. The CMP normalizing constant does not have a general closed form; furthermore, it requires care to handle numerically, as its magnitude can vary greatly with changes in the parameters. Primary `COMPoissonReg` functions are demonstrated with examples, including those implementing basic distribution functions and regression modeling.
bibliography: references.bib
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  pdf_document:
    citation_package: natbib
    number_sections: yes
    toc: yes
    toc_depth: 3
    extra_dependencies:
      common: null
    keep_tex: yes
urlcolor: blue
linkcolor: blue
citecolor: blue
vignette: >
  %\VignetteIndexEntry{COMPoissonReg: Usage, the Normalizing Constant, and Other Computational Details}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  prompt = TRUE,
  comment = ""
)
```

```{r setup, include = FALSE}
library(COMPoissonReg)
set.seed(1235)
```

\newpage

# Introduction \label{sec:intro}

The R package `COMPoissonReg` supports Conway-Maxwell Poisson (CMP) and Zero-Inflated Conway-Maxwell Poisson (ZICMP) models for analysis of count data in a flexible manner, to account for data dispersion relative to a Poisson model. The package provides regression functionality in addition to basic distribution functions. Interested users can refer to @SellersShmueli2010 and @SellersRaim2016 regarding the underlying theoretical developments for the CMP and ZICMP regressions, respectively. A full specification of the public `COMPoissonReg` interface can be found in the manual. In addition to the package prerequisites `Rcpp` [@Eddelbuettel2013] and `numDeriv` [@numDeriv], this vignette also makes use of `ggplot2` [@Wickham2016].

One of the challenges of working with CMP and ZICMP lies in computing the normalizing constant and related quantities. The normalizing constant does not have a simple closed form in general and can quickly increase or decrease in magnitude as parameters are varied. `COMPoissonReg` takes a hybrid approach of either truncating the infinite series or making use of an approximation, depending on parameter values.

The remainder of the vignette proceeds as follows. Section \ref{sec:cmp} describes functions to support the CMP distribution, including numerical handling of the normalizing constant. Section \ref{sec:zicmp} describes functions for ZICMP. Finally, Section \ref{sec:reg} demonstrates regression functions; Sections \ref{sec:cmp-reg} and \ref{sec:zicmp-reg} give specific examples based on CMP and ZICMP outcomes, respectively. The `COMPoissonReg` package is available on CRAN, and its source code is hosted on GitHub.
# Conway-Maxwell Poisson Distribution \label{sec:cmp}

Let $Y \sim \text{CMP}(\lambda, \nu)$ be a Conway-Maxwell Poisson (CMP) random variable with density
\begin{align*}
f(y \mid \lambda, \nu) = \frac{\lambda^y}{(y!)^\nu Z(\lambda, \nu)}, \quad y \in \mathbb{N}, \quad
Z(\lambda, \nu) = \sum_{r=0}^\infty \frac{\lambda^r}{(r!)^\nu},
\end{align*}
where $\lambda > 0$, $\nu \geq 0$ (with $\lambda < 1$ required when $\nu = 0$), and $\mathbb{N}$ represents the nonnegative integers $\{ 0, 1, 2, \ldots \}$. Three notable special cases of $\text{CMP}(\lambda, \nu)$ help to demonstrate its flexibility in count modeling.

a. The case $\nu = 1$ corresponds to $\text{Poisson}(\lambda)$.
a. When $\lambda \in (0,1)$ and $\nu = 0$, the $\text{CMP}(\lambda, \nu)$ distribution is $\text{Geometric}(1-\lambda)$ with density $f(y \mid \lambda) = (1 - \lambda) \lambda^y$ for $y \in \mathbb{N}$, which is overdispersed relative to Poisson.
a. When $\nu \rightarrow \infty$, $\text{CMP}(\lambda, \nu)$ converges to a $\text{Bernoulli}(\lambda / (1 + \lambda))$ distribution, which is underdispersed relative to Poisson.

## Normalizing Constant \label{sec:cmp-normconst}

The normalizing constant $Z(\lambda, \nu)$ presents some challenges in the practical use of CMP models and has been a topic of interest in the CMP literature. In general, there is no simple closed form expression for the series $Z(\lambda, \nu)$. @ShmueliEtAl2005 give the approximation
\begin{align}
Z(\lambda, \nu) &= \frac{ \exp(\nu \lambda^{1/\nu}) }{ \lambda^{(\nu-1)/2\nu} (2\pi)^{(\nu-1)/2} \nu^{1/2} }
\left\{ 1 + O(\lambda^{-1/\nu}) \right\},
\label{eqn:approx}
\end{align}
where the $O(\cdot)$ term vanishes as $\lambda^{-1/\nu}$ becomes small. Approximations have been further studied and refined in subsequent literature; see for example @GillispieGreen2015, @DalyGaunt2016, and @GauntEtAl2019. The expression in \eqref{eqn:approx} emphasizes that the magnitude of $Z(\lambda, \nu)$ explodes as $\nu \rightarrow 0$ when $\lambda > 1$. For example, $Z(2, 0.075) \approx e^{780.515}$ is too large to store as a double-precision floating point number and may evaluate to infinity if care is not taken. In contrast, $Z(\lambda, \nu) \rightarrow 1/(1 - \lambda)$ when $\lambda < 1$ and $\nu \rightarrow 0$. In practice, the `COMPoissonReg` package does not place constraints on $\lambda$ and $\nu$, except to ensure that they are positive, so that their values are driven by the data or the user's selection.

`COMPoissonReg` takes a hybrid strategy motivated by \eqref{eqn:approx}. To compute $Z(\lambda, \nu)$, suppose we are given a small tolerance $\delta > 0$. If
\begin{align}
\lambda^{-1/\nu} < \delta,
\label{eqn:can_approx}
\end{align}
the $O(\lambda^{-1/\nu})$ term in \eqref{eqn:approx} is negligible relative to 1, and we take
\begin{align}
Z(\lambda, \nu) &\approx \frac{ \exp(\nu \lambda^{1/\nu}) }{ \lambda^{(\nu-1)/2\nu} (2\pi)^{(\nu-1)/2} \nu^{1/2} } \nonumber \\
&= \exp\left\{ \nu \lambda^{1/\nu} - \frac{\nu-1}{2\nu} \log \lambda - \frac{\nu-1}{2} \log(2\pi) - \frac{1}{2} \log \nu \right\}
\label{eqn:z-approx}
\end{align}
as an approximation. Otherwise, the series is computed by truncating to a finite number of terms, which is described next. In either case, computations are kept on the log-scale as much as possible to accommodate numbers with potentially very large and very small magnitudes. If condition \eqref{eqn:can_approx} fails, we approximate $Z(\lambda, \nu)$ by a finite summation $Z^{(M)}(\lambda, \nu) = \sum_{r=0}^M \lambda^r / (r!)^\nu$, choosing $M$ so that the remainder is smaller than a given tolerance.
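Before turning to the details of the truncation approach, here is a small R sketch of the approximation branch \eqref{eqn:z-approx} computed on the log-scale. The name `z_approx_log` is used here for illustration only; it is not part of the `COMPoissonReg` interface, which instead exposes this computation through `ncmp`.

```{r, prompt = FALSE}
# A sketch of the log-scale approximation (eqn:z-approx); illustration only,
# not the package's internal implementation.
z_approx_log = function(lambda, nu) {
  nu * lambda^(1/nu) - (nu - 1) / (2 * nu) * log(lambda) -
    (nu - 1) / 2 * log(2 * pi) - 0.5 * log(nu)
}
z_approx_log(2, 0.075)  # roughly 780.5, consistent with Z(2, 0.075) above
```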
The general approach is described in Appendix B of @ShmueliEtAl2005. @Robbins1955 gives bounds for Stirling's approximation as
\begin{align*}
\sqrt{2\pi} n^{n + 1/2} e^{-n} e^{1 / (12n + 1)} < n! < \sqrt{2\pi} n^{n + 1/2} e^{-n} e^{1 / (12n)}.
\end{align*}
Noting that $e^{1 / (12n + 1)} \geq 1$ for $n \geq 1$ and $\sqrt{2\pi} e^{1 / (12n)} \leq e$ for $n \geq 2$, we obtain the simpler bounds
\begin{align*}
\sqrt{2\pi} n^{n + 1/2} e^{-n} \leq n! \leq e n^{n + 1/2} e^{-n},
\end{align*}
which will be convenient in the following calculations.^[These bounds are also stated at , last accessed 2022-10-09.] We may then bound the truncation error for $Z^{(M)}(\lambda, \nu)$ using
\begin{align}
\lvert Z(\lambda, \nu) - Z^{(M)}(\lambda, \nu) \rvert
&= Z(\lambda, \nu) - Z^{(M)}(\lambda, \nu) \nonumber \\
&= \sum_{r=M+1}^\infty \frac{\lambda^r}{(r!)^\nu} \nonumber \\
&\leq \sum_{r=M+1}^\infty \frac{\lambda^r}{(2\pi)^{\nu/2} r^{\nu r + \nu/2} e^{-r \nu}} \nonumber \\
&\leq \sum_{r=M+1}^\infty \frac{\lambda^r}{(2\pi)^{\nu/2} (M+1)^{\nu r + \nu/2} e^{-r \nu}} \nonumber \\
&= (2\pi)^{-\nu/2} (M+1)^{-\nu/2} \sum_{r=M+1}^\infty \left( \frac{\lambda e^{\nu}}{(M+1)^{\nu}} \right)^r \label{eqn:geom} \\
&= (2\pi)^{-\nu/2} (M+1)^{-\nu/2} \sum_{r=0}^\infty \left( \frac{\lambda e^{\nu}}{(M+1)^{\nu}} \right)^{r+M+1} \nonumber \\
&= (2\pi)^{-\nu/2} (M+1)^{-\nu/2} \left( \frac{\lambda e^{\nu}}{(M+1)^{\nu}} \right)^{M+1} \frac{1}{1 - \frac{\lambda e^{\nu}}{(M+1)^{\nu}}} \nonumber \\
&=: \Delta_M, \nonumber
\end{align}
assuming that $|\lambda e^\nu / (M+1)^\nu| < 1$ so that the geometric series in \eqref{eqn:geom} converges. To ensure this convergence, we choose $M$ at least large enough so that
\begin{align*}
\lambda e^\nu / (M+1)^\nu < 1 \iff M > \lambda^{1/\nu} e - 1.
\end{align*}
For a small given number $\epsilon > 0$, we may consider bounding the relative error by
\begin{align}
&\frac{\lvert Z(\lambda, \nu) - Z^{(M)}(\lambda, \nu) \rvert}{Z^{(M)}(\lambda, \nu)} \leq \frac{\Delta_M}{Z^{(M)}(\lambda, \nu)} < \epsilon.
\label{eqn:rel_error_bound}
\end{align}
The second inequality of \eqref{eqn:rel_error_bound} can be expressed on the log-scale using
\begin{align}
\log \Delta_M - \log Z^{(M)}(\lambda, \nu) < \log \epsilon,
\label{eqn:adaptive-log-scale}
\end{align}
where
\begin{align*}
\log \Delta_M = -\frac{\nu}{2} \log(2\pi) - \nu \left( M+\frac{3}{2} \right) \log (M+1) + (M+1) (\nu + \log \lambda) - \log\left(1 - \frac{\lambda e^\nu}{(M+1)^\nu} \right).
\end{align*}
Therefore, we compute $Z^{(M)}(\lambda, \nu)$ until at least $M > \lambda^{1/\nu} e - 1$, then continue increasing $M$ and updating $Z^{(M)}(\lambda, \nu)$ until \eqref{eqn:adaptive-log-scale} is satisfied. This is summarized as Algorithm \ref{alg:normconst-trunc}.

\begin{algorithm}
\caption{Compute the CMP normalizing constant using the truncation approach.}
\label{alg:normconst-trunc}
\hspace*{\algorithmicindent}\textbf{Input}: $\lambda > 0$ rate parameter. \\
\hspace*{\algorithmicindent}\textbf{Input}: $\nu > 0$ dispersion parameter. \\
\hspace*{\algorithmicindent}\textbf{Input}: $\epsilon > 0$ tolerance.
\\
\hspace*{\algorithmicindent}\textbf{Input}: $y_\text{max} \in \mathbb{N}$ upper limit for $M$.
\begin{algorithmic}[1]
\Function{Truncated-$Z$}{$\lambda, \nu, \epsilon, y_\text{max}$}
\State $M = 0$, $Z^{(0)} = 1$
\While{$M \leq \lambda^{1/\nu} e - 1$ and $M \leq y_\text{max}$}
	\State $Z^{(M+1)} \leftarrow Z^{(M)} + \lambda^{M+1} / ((M+1)!)^\nu$
	\State $M \leftarrow M + 1$
\EndWhile
\While{$\log \Delta_M - \log Z^{(M)} \geq \log \epsilon$ and $M \leq y_\text{max}$}
	\State $Z^{(M+1)} \leftarrow Z^{(M)} + \lambda^{M+1} / ((M+1)!)^\nu$
	\State $M \leftarrow M + 1$
\EndWhile
\State \Return $\{ Z^{(M)}, M \}$
\EndFunction
\end{algorithmic}
\end{algorithm}

The individual terms $\lambda^r / (r!)^\nu$ in the summation may be too large to store at their original scale. Therefore, summation is carried out on the log-scale, wherever possible, using the identity
\begin{align}
\log(x + y) = \log x + \log(1 + \exp\{ \log y - \log x \});
\label{eqn:logadd}
\end{align}
this is especially helpful when $0 < y \ll x$, as the first term on the right-hand side of \eqref{eqn:logadd} keeps the result on the log-scale, and the standard library function `log1p` may be used to accurately compute $\log(1 + \phi)$ for very small $\phi > 0$.

Many of the functions in the user interface of `COMPoissonReg` take an optional `control` argument, which can be constructed as follows.

```{r}
control = get.control(
  ymax = 100000,
  hybrid.tol = 1e-2,
  truncate.tol = 1e-6
)
```

The tolerances $\delta$ and $\epsilon$ are specified as `hybrid.tol` and `truncate.tol`, respectively. Taking `hybrid.tol` to be a very small positive number results in use of the truncated sum $Z^{(M)}(\lambda, \nu)$, while `hybrid.tol = Inf` uses the approximation method \eqref{eqn:z-approx}, except in extreme cases where $\lambda^{-1/\nu}$ evaluates to zero or $\infty$ numerically. The argument `ymax` specifies the upper limit for $M$; this is a safety measure which prevents very large computations unless the user opts to allow them. When no control object is specified, a global default (via the option `COMPoissonReg.control`) is used.

```{r}
control = getOption("COMPoissonReg.control")
control$ymax
control$hybrid.tol
control$truncate.tol
```

The default may be replaced in your current session if desired.

```{r}
options(COMPoissonReg.control = control)
```

The control object contains several other useful arguments to be discussed later in the vignette. The `ncmp` function computes the normalizing constant $Z(\lambda, \nu)$ and returns its value either on the original scale or the log-scale.

```{r}
ncmp(lambda = 1.5, nu = 1.2)
ncmp(lambda = 1.5, nu = 1.2, log = TRUE)
ncmp(lambda = 1.5, nu = 1.2, log = TRUE, control = get.control(hybrid.tol = 1e10))
ncmp(lambda = 1.5, nu = 1.2, log = TRUE, control = get.control(hybrid.tol = 1e-10))
```

Before proceeding, let us define a function to display errors and warnings which are intentionally triggered in the remainder of the vignette.

```{r}
print_warning = function(x) {
  print(strwrap(x), quote = FALSE)
}
```

The function `tcmp` returns the truncation value $M$ obtained from Algorithm \ref{alg:normconst-trunc}.

```{r}
nu_seq = c(1, 0.5, 0.2, 0.1, 0.05, 0.03)
tryCatch({
  tcmp(lambda = 1.5, nu = nu_seq)
}, warning = print_warning)
```

Note that `tcmp` returns `1e6` and produces a warning for the smallest `nu` value `r min(nu_seq)` because Algorithm \ref{alg:normconst-trunc} has reached `ymax = 1e6` before the series could be bounded by a geometric series.
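For illustration, Algorithm \ref{alg:normconst-trunc} can be transcribed almost directly into R, accumulating the sum on the log-scale with identity \eqref{eqn:logadd}. This is only a sketch under the notation above; the names `log_add` and `truncated_z_log` are ours for illustration, and the package itself performs this computation in compiled code via `Rcpp`.

```{r, prompt = FALSE}
# A sketch of Algorithm 1 on the log-scale; illustration only.
log_add = function(lx, ly) {
  # log(x + y) computed from log(x) and log(y) via identity (eqn:logadd)
  m = max(lx, ly)
  m + log1p(exp(min(lx, ly) - m))
}
truncated_z_log = function(lambda, nu, eps = 1e-6, ymax = 1e6) {
  log_term = function(r) r * log(lambda) - nu * lgamma(r + 1)  # log of lambda^r / (r!)^nu
  log_delta = function(M) {
    -nu / 2 * log(2 * pi) - nu * (M + 3/2) * log(M + 1) +
      (M + 1) * (nu + log(lambda)) -
      log(1 - lambda * exp(nu) / (M + 1)^nu)
  }
  M = 0
  logZ = 0  # log of Z^(0) = 1
  # The first condition runs until the geometric series bound applies; the
  # second continues until the error bound (eqn:adaptive-log-scale) is met.
  while ((M <= exp(1) * lambda^(1/nu) - 1 ||
          log_delta(M) - logZ >= log(eps)) && M <= ymax) {
    logZ = log_add(logZ, log_term(M + 1))
    M = M + 1
  }
  list(logZ = logZ, M = M)
}
unlist(truncated_z_log(lambda = 1.5, nu = 1.2))
```

The result can be compared with the `ncmp` and `tcmp` outputs above.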
In the warning case above, it is likely that support values with non-negligible mass are being left out. Let us increase `ymax` to avoid this problem.

```{r}
tcmp(lambda = 1.5, nu = nu_seq, control = get.control(ymax = 3e6))
```

It is also possible to reach the second loop of Algorithm \ref{alg:normconst-trunc}, where the geometric series bound can be used, but with `ymax` not large enough to satisfy \eqref{eqn:adaptive-log-scale}. Here is an example where this occurs.

```{r}
tcmp(lambda = 1.2, nu = 0.03, control = get.control(ymax = 1200))
```

Now that we have a somewhat robust computation for the normalizing constant, let us create a plot of the interesting behavior when $\lambda > 1$ and $\nu$ is decreasing.

```{r, prompt = FALSE}
library(ggplot2)
nu_seq = seq(0.03, 1.5, length.out = 20)
nc1 = ncmp(lambda = 0.5, nu = nu_seq, log = TRUE)
nc2 = ncmp(lambda = 1.05, nu = nu_seq, log = TRUE)
nc3 = ncmp(lambda = 1.20, nu = nu_seq, log = TRUE)
```

```{r, fig.width = 5, fig.height = 3, fig.align = "center", prompt = FALSE, fig.cap = "Log of normalizing constant for $\\lambda = 0.5$ ($\\circ$), $\\lambda = 1.05$ ($\\Delta$), and $\\lambda = 1.20$ ($+$)."}
ggplot() +
  geom_point(data = data.frame(x = nu_seq, y = nc1), aes(x = x, y = y), pch = 1) +
  geom_point(data = data.frame(x = nu_seq, y = nc2), aes(x = x, y = y), pch = 2) +
  geom_point(data = data.frame(x = nu_seq, y = nc3), aes(x = x, y = y), pch = 3) +
  xlab("nu") +
  ylab("log of normalizing constant") +
  theme_bw()
```

We see that with $\lambda = 1.2$ a value of $Z(\lambda, `r round(nu_seq[1], 4)`) \approx e^{`r round(nc3[1], 2)`}$ is obtained, which is an extremely large jump from the next value in the series $Z(\lambda, `r round(nu_seq[2], 4)`) \approx e^{`r round(nc3[2], 2)`}$.

## Density, Generation, CDF, and Quantile Functions \label{sec:cmp-dist}

The respective functions for the CMP density, variate generation, CDF, and quantiles are `dcmp`, `rcmp`, `pcmp`, and `qcmp`. Their usage is similar to the distribution functions provided by the R `stats` package.

```{r}
dcmp(0, lambda = 10, nu = 0.9)
dcmp(0:17, lambda = 10, nu = 0.9, log = TRUE)
dcmp(c(0, 1, 2), lambda = c(10, 11, 12), nu = c(0.9, 1.0, 1.1), log = TRUE)
```

```{r}
rcmp(50, lambda = 10, nu = 0.9)
```

```{r}
pcmp(0:17, lambda = 10, nu = 0.9)
```

```{r}
qq = seq(0, 0.95, length.out = 10)
qcmp(qq, lambda = 10, nu = 0.9)
```

The CMP distribution functions compute the normalizing constant via the methods of Section \ref{sec:cmp-normconst}. The density `dcmp` uses the hybrid method, while `rcmp`, `pcmp`, and `qcmp` use the truncation method regardless of condition \eqref{eqn:can_approx}. Variate generation, CDF values, and quantiles are then computed by treating CMP as a discrete distribution on the sample space $\{ 0, \ldots, M \}$. As with `tcmp` and `ncmp`, a warning is emitted in cases where these functions may not produce reliable results.

```{r}
tryCatch({
  rcmp(1, lambda = 2, nu = 0.01)
}, warning = print_warning)
```

The truncation method for `qcmp` can result in unreliable quantiles when the requested probability is very close to 1. For example, the actual quantile for probability $1$ is $\infty$, which can be expressed with no computation, but the computed quantity will be the truncation value $M$. More generally, \eqref{eqn:rel_error_bound} implies that
\begin{align}
\frac{Z(\lambda, \nu) - Z^{(M)}(\lambda, \nu)}{Z(\lambda, \nu)} < \epsilon
&\iff \Prob(Y > M) < \epsilon \nonumber \\
&\iff 1 - F(M \mid \lambda, \nu) < \epsilon \nonumber \\
&\iff F^{-}(1 - \epsilon \mid \lambda, \nu) < M.
\label{eqn:cmp-quantile-limit}
\end{align}
Therefore, quantiles corresponding to probabilities larger than $1 - \epsilon$ may be inaccurately computed with the truncated support. `COMPoissonReg` gives a warning when these are requested.

```{r}
tryCatch({
  qcmp(0.9999999, lambda = 1.5, nu = 0.5)
}, warning = print_warning)
```

As a sanity check, let us ensure that the empirical density values, cumulative probabilities, and quantiles of draws from `rcmp` align with the respective results computed via `dcmp`, `pcmp`, and `qcmp`.

```{r, fig.width = 3, fig.height = 3, fig.align = "center", prompt = FALSE, fig.show = "hold"}
library(ggplot2)
n = 100000
lambda = 0.5
nu = 0.1
x = rcmp(n, lambda, nu)
xx = seq(-1, max(x))  ## Include -1 to ensure it gets probability zero
qq = seq(0, 0.99, length.out = 100)
fx = dcmp(xx, lambda, nu)
px = pcmp(xx, lambda, nu)
qx = qcmp(qq, lambda, nu)
qx_emp = quantile(x, probs = qq)
```

```{r, fig.width = 3, fig.height = 3, prompt = FALSE, fig.cap = "Empirical density of draws (histogram) versus density computed via the dcmp function (points)."}
ggplot() +
  geom_bar(data = data.frame(x = x), aes(x = x, y = ..prop..), fill = "NA", col = "black") +
  geom_point(data = data.frame(x = xx[-1], fx = fx[-1]), aes(x, fx)) +
  ylab("Density") +
  theme_bw()
```

```{r, fig.width = 3, fig.height = 3, prompt = FALSE, fig.cap = "Empirical CDF of draws (solid line) versus CDF computed via the pcmp function (points)."}
ggplot() +
  stat_ecdf(data = data.frame(x = x), aes(x), geom = "step") +
  geom_point(data = data.frame(x = xx, px = px), aes(x, px)) +
  ylab("Probability") +
  theme_bw()
```

```{r, fig.width = 3, fig.height = 3, prompt = FALSE, fig.cap = "Empirical quantiles of draws (`o`) versus quantiles computed via the qcmp function (`+`)."}
ggplot() +
  geom_point(data = data.frame(x = qq, qx_emp = qx_emp), aes(qq, qx_emp), pch = 1) +
  geom_point(data = data.frame(x = qq, qx = qx), aes(qq, qx), pch = 3) +
  xlab("Probability") +
  ylab("Quantile") +
  theme_bw()
```

## Expected Value and Variance \label{sec:cmp-ev}

For $Y \sim \text{CMP}(\lambda, \nu)$, we can consider computing the expectation and variance of $Y$ in two ways. First, if there is a moderately-sized $M$ where $\{ 0, \ldots, M \}$ contains all but a negligible amount of the mass of $Y$, we can compute the moments using truncated summations
\begin{align*}
\E(Y) = \sum_{y=0}^M y \frac{\lambda^y}{(y!)^\nu Z(\lambda, \nu)}, \quad
\E(Y^2) = \sum_{y=0}^M y^2 \frac{\lambda^y}{(y!)^\nu Z(\lambda, \nu)}, \quad
\Var(Y) = \E(Y^2) - [\E(Y)]^2.
\end{align*}
Otherwise, a different approach is taken. Notice that the expected value is related to the first derivative of the log-normalizing constant via
\begin{align*}
&\frac{\partial}{\partial \lambda} \log Z(\lambda, \nu)
= \frac{ \frac{\partial}{\partial \lambda} Z(\lambda, \nu) }{ Z(\lambda, \nu) }
= \frac{1}{Z(\lambda, \nu)} \sum_{y=0}^\infty y \frac{\lambda^{y-1}}{(y!)^\nu} \\
&\iff \E(Y) = \lambda \frac{\partial}{\partial \lambda} \log Z(\lambda, \nu).
\end{align*}
For the second derivative,
\begin{align*}
&\frac{\partial^2}{\partial \lambda^2} \log Z(\lambda, \nu)
= \frac{ Z(\lambda, \nu) \frac{\partial^2}{\partial \lambda^2} Z(\lambda, \nu) - [\frac{\partial}{\partial \lambda} Z(\lambda, \nu)]^2 }{ Z(\lambda, \nu)^2 }
= \frac{1}{Z(\lambda, \nu)} \sum_{y=0}^\infty y(y-1) \frac{\lambda^{y-2}}{(y!)^\nu} - \left[ \frac{\E(Y)}{\lambda} \right]^2 \\
&\iff \lambda^2 \frac{\partial^2}{\partial \lambda^2} \log Z(\lambda, \nu) = \E[Y(Y-1)] - [\E(Y)]^2 = \Var(Y) - \E(Y) \\
&\iff \Var(Y) = \lambda^2 \frac{\partial^2}{\partial \lambda^2} \log Z(\lambda, \nu) + \E(Y).
\end{align*}
Therefore, we may use first and second derivatives of $\log Z(\lambda, \nu)$, taken with respect to $\lambda$, to compute $\E(Y)$ and $\Var(Y)$. The `ecmp` and `vcmp` functions implement these computations of the expectation and variance, respectively. Condition \eqref{eqn:can_approx}, evaluated with the tolerance in the control object, determines whether the truncation approach (if the condition fails) or the differentiation approach (if it holds) is used.

```{r}
ecmp(lambda = 10, nu = 1.2)
ecmp(lambda = 1.5, nu = 0.5)
ecmp(lambda = 1.5, nu = 0.05)
ecmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e-10))
ecmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e10))
```

```{r}
vcmp(lambda = 10, nu = 1.2)
vcmp(lambda = 1.5, nu = 0.5)
vcmp(lambda = 1.5, nu = 0.05)
vcmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e-10))
vcmp(lambda = 1.5, nu = 0.05, control = get.control(hybrid.tol = 1e10))
```

Provided that an enormously large truncation value $M$ is not required, we may compute other moments by truncated sums using `tcmp`.

```{r}
M = tcmp(lambda = 1.5, nu = 0.05)
print(M)
xx = seq(0, M)
sum(xx^3 * dcmp(xx, lambda = 1.5, nu = 0.05))  # E(X^3)
sum(xx^4 * dcmp(xx, lambda = 1.5, nu = 0.05))  # E(X^4)
```

# Zero-Inflated Conway-Maxwell Poisson Distribution \label{sec:zicmp}

Let $S \sim \text{Bernoulli}(p)$ and $T \sim \text{CMP}(\lambda, \nu)$ be independent random variables. Then $Y = (1-S) T$ follows a Zero-Inflated Conway-Maxwell Poisson distribution $\text{ZICMP}(\lambda, \nu, p)$ with density
\begin{align*}
f(y \mid \lambda, \nu, p) = (1 - p) \frac{\lambda^{y}}{(y!)^{\nu} Z(\lambda, \nu)} + p \cdot \ind(y = 0), \quad y \in \mathbb{N}.
\end{align*}
As with CMP, several interesting special cases are obtained; a code sketch based on this construction is given after the following list.

a. Taking $\nu = 1$ corresponds to Zero-Inflated Poisson $\text{ZIP}(\lambda, p)$ with density $f(y \mid \lambda, p) = (1 - p) e^{-\lambda} \lambda^{y} / y! + p \cdot \ind(y = 0)$.
a. When $\lambda \in (0,1)$ and $\nu \rightarrow 0$, $\text{ZICMP}(\lambda, \nu, p)$ converges to a Zero-Inflated Geometric distribution with density $f(y \mid \lambda, p) = (1 - p) (1 - \lambda) \lambda^y + p \cdot \ind(y = 0)$ for $y \in \mathbb{N}$.
a. When $\nu \rightarrow \infty$, $\text{ZICMP}(\lambda, \nu, p)$ converges to a "Zero-Inflated Bernoulli" distribution with density
\begin{math}
f(y \mid \lambda, p) = (1 - p) \left[ \lambda/(1+\lambda) \right]^y \left[ 1/(1+\lambda) \right]^{1-y} + p \cdot \ind(y = 0),
\end{math}
which remains a Bernoulli distribution with an adjusted success probability. Here, the $\lambda$ and $p$ parameters are not individually identifiable. Therefore, users may want to avoid zero-inflation in CMP data analyses with extreme underdispersion.
a. Finally, $p = 0$ yields $\text{CMP}(\lambda, \nu)$.
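The construction $Y = (1-S) T$ also gives a direct way to simulate ZICMP draws using only Bernoulli and CMP generators. The following sketch uses the illustrative name `rzicmp_sketch`; the package's own `rzicmp`, described next, is the supported interface.

```{r, prompt = FALSE}
# A sketch of ZICMP generation via Y = (1 - S) T; illustration only.
rzicmp_sketch = function(n, lambda, nu, p) {
  s = rbinom(n, size = 1, prob = p)  # zero-inflation indicators
  t = rcmp(n, lambda, nu)            # CMP draws
  (1 - s) * t
}
mean(rzicmp_sketch(100000, lambda = 1.5, nu = 0.2, p = 0.25) == 0)
```

The observed proportion of zeros should be close to $p + (1 - p) f(0 \mid \lambda, \nu) = p + (1-p) / Z(\lambda, \nu)$.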
## Density, Generation, CDF, and Quantile Functions

There is a close relationship between the CMP and ZICMP distributions, as we have seen from the construction of ZICMP; its density and a variate generation mechanism were given in terms of CMP at the beginning of this section. Denote $F(x \mid \lambda, \nu)$ and $F(x \mid \lambda, \nu, p)$ as the CDFs of $\text{CMP}(\lambda, \nu)$ and $\text{ZICMP}(\lambda, \nu, p)$, respectively. We have
\begin{align*}
F(x \mid \lambda, \nu, p) = (1-p) F(x \mid \lambda, \nu) + p \cdot \ind(x \geq 0).
\end{align*}
For a given probability $\phi \in [0,1]$, the associated CMP and ZICMP quantile functions are related via
\begin{align}
F^{-}(\phi \mid \lambda, \nu, p)
&= \inf\{ x \in \mathbb{N} : F(x \mid \lambda, \nu, p) \geq \phi \} \nonumber \\
&= \inf\{ x \in \mathbb{N} : (1-p) F(x \mid \lambda, \nu) + p \cdot \ind(x \geq 0) \geq \phi \} \nonumber \\
&= \inf\{ x \in \mathbb{N} : F(x \mid \lambda, \nu) \geq (\phi - p) / (1 - p) \} \nonumber \\
&= F^{-}\left( \frac{\phi - p}{1 - p} \mid \lambda, \nu \right).
\label{eqn:zicmp-quantiles}
\end{align}
The respective functions for the ZICMP density, variate generation, CDF, and quantiles are `dzicmp`, `rzicmp`, `pzicmp`, and `qzicmp`. They make use of the CMP implementation described in Section \ref{sec:cmp}, including the criteria to either truncate or approximate the normalizing constant.

```{r}
qq = seq(0, 0.95, length.out = 20)
rzicmp(20, lambda = 1.5, nu = 0.2, p = 0.25)
dzicmp(c(0, 1, 2), lambda = 1.5, nu = 0.2, p = 0.25)
pzicmp(c(0, 1, 2), lambda = 1.5, nu = 0.2, p = 0.25)
qzicmp(qq, lambda = 1.5, nu = 0.2, p = 0.25)
```

As with `qcmp`, the `qzicmp` function is computed by the truncation method and cannot accurately compute quantiles for probabilities very close to 1. More specifically, from the CMP distribution, \eqref{eqn:cmp-quantile-limit} gives
\begin{align*}
M &> F^{-}(1 - \epsilon \mid \lambda, \nu) \\
&= F^{-}\left( \frac{\phi_\epsilon - p}{1 - p} \mid \lambda, \nu \right) \\
&= F^{-}\left( \phi_\epsilon \mid \lambda, \nu, p \right),
\end{align*}
where $\phi_\epsilon = (1 - \epsilon)(1-p) + p$ and the last equality uses \eqref{eqn:zicmp-quantiles}. This motivates a warning from `qzicmp` when the requested probability is larger than $\phi_\epsilon$.

```{r}
tryCatch({
  qzicmp(0.9999999, lambda = 1.5, nu = 0.5, p = 0.5)
}, warning = print_warning)
```

Let us repeat the sanity check from Section \ref{sec:cmp-dist} to ensure that the empirical density values, cumulative probabilities, and quantiles line up with the ones computed via `dzicmp`, `pzicmp`, and `qzicmp`.
```{r, fig.width = 3, fig.height = 3, fig.align = "center", prompt = FALSE, fig.show = "hold"}
library(ggplot2)
n = 100000
lambda = 0.5
nu = 0.1
p = 0.5
x = rzicmp(n, lambda, nu, p)
xx = seq(-1, max(x))  ## Include -1 to ensure it gets probability zero
qq = seq(0, 0.99, length.out = 100)
fx = dzicmp(xx, lambda, nu, p)
px = pzicmp(xx, lambda, nu, p)
qx = qzicmp(qq, lambda, nu, p)
qx_emp = quantile(x, probs = qq)
```

```{r, fig.width = 3, fig.height = 3, prompt = FALSE, fig.cap = "Empirical density of draws (histogram) versus density computed via the dzicmp function (points)."}
ggplot() +
  geom_bar(data = data.frame(x = x), aes(x = x, y = ..prop..), fill = "NA", col = "black") +
  geom_point(data = data.frame(x = xx[-1], fx = fx[-1]), aes(x, fx)) +
  ylab("Density") +
  theme_bw()
```

```{r, fig.width = 3, fig.height = 3, prompt = FALSE, fig.cap = "Empirical CDF of draws (solid line) versus CDF computed via the pzicmp function (points)."}
ggplot() +
  stat_ecdf(data = data.frame(x = x), aes(x), geom = "step") +
  geom_point(data = data.frame(x = xx, px = px), aes(x, px)) +
  ylab("Probability") +
  theme_bw()
```

```{r, fig.width = 3, fig.height = 3, prompt = FALSE, fig.cap = "Empirical quantiles of draws (`o`) versus quantiles computed via the qzicmp function (`+`)."}
ggplot() +
  geom_point(data = data.frame(x = qq, qx_emp = qx_emp), aes(qq, qx_emp), pch = 1) +
  geom_point(data = data.frame(x = qq, qx = qx), aes(qq, qx), pch = 3) +
  xlab("Probability") +
  ylab("Quantile") +
  theme_bw()
```

## Expectation and Variance

The expected value and variance of $Y \sim \text{ZICMP}(\lambda, \nu, p)$ are apparent from the construction $Y = (1-S) T$ given earlier in this section. Namely,
\begin{align*}
\E(Y) = (1-p) \E(T)
\quad \text{and} \quad
\Var(Y) = (1-p) \left\{ \Var(T) + p[\E(T)]^2 \right\}
\end{align*}
may be obtained using formulas for iterated conditional expectations and variances. They are evaluated in `COMPoissonReg` using the `ezicmp` and `vzicmp` functions, respectively. These functions make use of the `ecmp` and `vcmp` functions described in Section \ref{sec:cmp-ev} to compute $\E(T)$ and $\Var(T)$.

```{r}
ezicmp(lambda = 1.5, nu = 0.5, p = 0.1)
ezicmp(lambda = 1.5, nu = 0.5, p = c(0.1, 0.2, 0.5))
```

```{r}
vzicmp(lambda = 1.5, nu = 0.5, p = 0.1)
vzicmp(lambda = 1.5, nu = 0.5, p = c(0.1, 0.2, 0.5))
```

# Regression Modeling with CMP and ZICMP \label{sec:reg}

Suppose there are $n$ subjects with outcomes $y_1, \ldots, y_n \in \mathbb{N}$ and covariates $\vec{x}_i \in \mathbb{R}^{d_1}$, $\vec{s}_i \in \mathbb{R}^{d_2}$, and $\vec{w}_i \in \mathbb{R}^{d_3}$ for $i = 1, \ldots, n$. The `COMPoissonReg` package fits both CMP and ZICMP regression models. The CMP regression model assumes that
\begin{align*}
Y_i \indep \text{CMP}(\lambda_i, \nu_i), \quad i = 1, \ldots, n,
\end{align*}
where $\log \lambda_i = \vec{x}_i^\top \vec{\beta}$ and $\log \nu_i = \vec{s}_i^\top \vec{\gamma}$. Writing $\btheta = (\vec{\beta}, \vec{\gamma})$, the likelihood is
\begin{align}
L(\btheta) = \prod_{i=1}^n \left[ \frac{\lambda_i^{y_i}}{(y_i!)^{\nu_i} Z(\lambda_i, \nu_i)} \right].
\label{eqn:likelihood-cmp}
\end{align}
The ZICMP regression model assumes that
\begin{align*}
Y_i \indep \text{ZICMP}(\lambda_i, \nu_i, p_i), \quad i = 1, \ldots, n,
\end{align*}
where $\log \lambda_i = \vec{x}_i^\top \vec{\beta}$, $\log \nu_i = \vec{s}_i^\top \vec{\gamma}$, and $\logit p_i = \vec{w}_i^\top \vec{\zeta}$.
Writing $\btheta = (\vec{\beta}, \vec{\gamma}, \vec{\zeta})$, the likelihood is
\begin{align}
L(\btheta) = \prod_{i=1}^n \left[ (1 - p_i) \frac{\lambda_i^{y_i}}{(y_i!)^{\nu_i} Z(\lambda_i, \nu_i)} + p_i \ind(y_i = 0) \right].
\label{eqn:likelihood-zicmp}
\end{align}
We will write $d = d_1 + d_2 + d_3$ for the total dimension of $\btheta$. The $n \times d_1$ design matrix whose rows consist of $\vec{x}_i$ will be denoted $\vec{X}$. Similarly, we will write $\vec{S}$ and $\vec{W}$ as the $n \times d_2$ and $n \times d_3$ design matrices constructed from $\vec{s}_i$ and $\vec{w}_i$, respectively.

The `glm.cmp` function provides a formula interface to fit models of both types: \eqref{eqn:likelihood-cmp} and \eqref{eqn:likelihood-zicmp}.

```{r, eval = FALSE, prompt = FALSE}
out = glm.cmp(formula.lambda, formula.nu = ~ 1, formula.p = NULL,
  data = NULL, init = NULL, fixed = NULL, control = NULL, ...)
```

The interface contains three formulas: `formula.lambda` specifies the regression $\vec{x}_i^\top \vec{\beta}$ used for $\lambda_i$, while `formula.nu` and `formula.p` correspond to $\vec{s}_i^\top \vec{\gamma}$ for $\nu_i$ and $\vec{w}_i^\top \vec{\zeta}$ for $p_i$, respectively. ZICMP regression is utilized when `formula.p` is set to something other than its default `NULL` value; otherwise, CMP regression is assumed. The `data` argument is used to pass a `data.frame` explicitly rather than having the data be read from the local environment. The `init`, `fixed`, and `control` arguments and associated helper functions are described below.

The `init` argument represents an initial value for the optimizer. The following functions can be used to construct it.

```{r}
get.init(beta = c(1, 2, 3), gamma = c(-1, 1), zeta = c(-2, -1))
get.init.zero(d1 = 3, d2 = 2, d3 = 2)
```

The `fixed` argument is used to specify indices of the three coefficients which will remain fixed at their initial value during optimization.

```{r}
get.fixed(beta = c(1L, 2L), gamma = c(1L))
```

The specification above requests the first two elements of `beta` and the first element of `gamma` to be fixed. Notice that indices must be integers and that the default value is an empty integer vector which is interpreted as "no elements are fixed". The `fixed` argument can usually be disregarded but may be useful in some circumstances; an example is given in Section \ref{sec:zicmp-reg}.

Specifying the elements of `init` and `fixed` may be somewhat awkward with the formula interface, as they require knowledge of how formulas will be expanded into design matrices and coefficients. It can be helpful to produce the design matrices using R's `model.matrix` function.

```{r, eval=FALSE}
model.matrix(formula.lambda, data = data)
model.matrix(formula.nu, data = data)
model.matrix(formula.p, data = data)
```

The `control` argument has been introduced in Section \ref{sec:cmp}; regression modeling makes use of several additional arguments. `COMPoissonReg` uses `optim` to compute the maximum likelihood estimate (MLE) $\hat{\btheta}$ for $\btheta$ under the specified model. Several controls are provided to influence how `COMPoissonReg` invokes `optim`; here are their default values.

```{r}
control = getOption("COMPoissonReg.control")
control$optim.method
control$optim.control
```

The element `optim.method` is a string which is passed as the `method` argument to `optim`, while `optim.control` is a list passed as the `control` argument of `optim`.
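For example, a control object which selects a different optimizer and increases the iteration limit can be constructed as follows; the particular values shown are arbitrary choices for illustration, not recommendations.

```{r, prompt = FALSE}
# Illustrative optimizer settings; the values here are arbitrary.
control = get.control(
  optim.method = "BFGS",
  optim.control = list(maxit = 500, trace = 0)
)
```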
Note that if an entry is given for `fnscale` in `optim.control`, it will be ignored and overwritten internally by `COMPoissonReg`. The covariance of $\hat{\btheta}$ is estimated by $\hat{\vec{V}}(\hat{\btheta}) = -[\vec{H}(\hat{\btheta})]^{-1}$, where $\vec{H}(\btheta) = \partial^2 \log L(\btheta) / [\partial \btheta \partial \btheta^\top]$ is the Hessian of the log-likelihood computed by `optim`. The standard error for coefficient $\theta_j$ in $\btheta$ is then obtained as the square root of the $j$th diagonal entry of $\hat{\vec{V}}(\hat{\btheta})$.

We will now illustrate use of the regression tools with two examples whose data are included in the package. Note that these demonstrations are not intended to be complete regression analyses, and results may differ slightly from previously published analyses due to differences in the computations.

## CMP Regression \label{sec:cmp-reg}

### Freight Dataset \label{sec:cmp-reg-freight}

The `freight` dataset [@KutnerEtAl2003] was analyzed using CMP regression by @SellersShmueli2010 and found to exhibit underdispersion. The data describe $n = 10$ instances where 1,000 ampules were transported via air shipment. The outcome of interest is the variable `broken`, which describes the number of broken ampules in each shipment. The covariate `transfers` describes the number of times the carton was transferred from one aircraft to another during the shipment. Let us load and view the dataset.

```{r}
data(freight)
print(freight)
```

Before fitting a CMP regression, let us fit a Poisson regression model $Y_i \indep \text{Poisson}(\lambda_i)$ with
\begin{align*}
\log \lambda_i = \beta_0 + \beta_1 \cdot \text{transfers}_i.
\end{align*}
This can be carried out with the standard `glm` function.

```{r}
glm.out = glm(broken ~ transfers, data = freight, family = poisson)
summary(glm.out)
```

### CMP Regression \label{sec:cmp-reg-initial}

Next, let us fit a similar CMP regression model with
\begin{align*}
&\log \lambda_i = \beta_0 + \beta_1 \cdot \text{transfers}_i, \\
&\log \nu_i = \gamma_0,
\end{align*}
using only an intercept for $\nu_i$.

```{r}
cmp.out = glm.cmp(broken ~ transfers, data = freight)
print(cmp.out)
```

The coefficients used in the $\lambda_i$ formula are prefixed with an `X:` label, while an `S:` label is used for coefficients of the $\nu_i$ formula. Notice that estimates for `X:` coefficients from the CMP fit are dissimilar to those from the Poisson fit; this may occur when the estimate of $\nu$ deviates from the value of 1. Similarly to the `glm` output, the output of `glm.cmp` displays several quantities for each coefficient $\theta_j$, $j = 1, \ldots, d$: a point estimate $\hat{\theta}_j$, an associated standard error $\widehat{\text{SE}}(\hat{\theta}_j)$, a z-value $z_j = \hat{\theta}_j / \widehat{\text{SE}}(\hat{\theta}_j)$, and a p-value $2\Phi(-|z_j|)$ for the test $H_0: \theta_j = 0$ versus $H_1: \theta_j \neq 0$. Here, $\Phi$ is the CDF of the standard normal distribution. Because an intercept-only formula was specified for $\nu_i$, $\hat{\nu} = \exp(\hat{\gamma})$ does not vary with $i$, and its estimate and associated standard error are added to the display. Here we see evidence of underdispersion with $\hat{\nu} > 1$.

A test for equidispersion is displayed to determine whether there is a significant amount of over- or underdispersion in the data. In particular, a likelihood ratio test is used to decide between $H_0: \vec{\gamma} = \vec{0}$ and $H_1: \vec{\gamma} \neq \vec{0}$.
The test statistic is displayed along with the degrees of freedom and associated p-value. Here we have fairly strong evidence to reject the null hypothesis of equidispersion. The AIC for the CMP model `cmp.out` shows some improvement over that of `glm.out`.

Let us also consider a model with a slope coefficient for the $\nu$ component,
\begin{align*}
\log \nu_i = \gamma_0 + \gamma_1 \cdot \text{transfers}_i,
\end{align*}
via the following call to `glm.cmp`.

```{r}
cmp2.out = glm.cmp(broken ~ transfers, formula.nu = ~ transfers, data = freight)
print(cmp2.out)
```

Model `cmp2.out` provides a slight improvement over `cmp.out` in AIC and BIC, but we may prefer `cmp.out` in the interest of simplicity.

### Adjustments to Optim \label{sec:cmp-reg-optim}

To gain some insight into the optimization, we may wish to increase the trace level, which can be done as follows.

```{r}
control = get.control(optim.control = list(maxit = 5, trace = 3, REPORT = 1))
cmp3.out = glm.cmp(broken ~ transfers, data = freight, control = control)
```

Data from the local environment may be passed to the `glm.cmp` function without explicitly using the `data` argument.

```{r, results = 'hide'}
y = freight$broken
x = freight$transfers
glm.cmp(y ~ x)
```

### Offset Term \label{sec:cmp-reg-offset}

In a count regression model, it may be desirable to include offset terms such as
\begin{align*}
\log \lambda_i = \vec{x}_i^\top \vec{\beta} + \text{offx}_i, \quad
\log \nu_i = \vec{s}_i^\top \vec{\gamma} + \text{offs}_i.
\end{align*}
An `offset` term may be used in the formula interface to accomplish this.

```{r, results = 'hide'}
freight$offx = 13
freight$offs = 1
glm.cmp(broken ~ transfers + offset(offx), data = freight)
glm.cmp(broken ~ transfers + offset(offx), formula.nu = ~1 + offset(offs), data = freight)
```

For users who wish to bypass the formula interface and prepare the $\vec{X}$ and $\vec{S}$ design matrices manually, a "raw" interface to the regression functionality is also provided.

```{r, results = 'hide'}
y = freight$broken
X = model.matrix(~ transfers, data = freight)
S = model.matrix(~ 1, data = freight)
offs = get.offset(x = rep(13, nrow(freight)), s = rep(1, nrow(freight)))
cmp.raw.out = glm.cmp.raw(y, X, S, offset = offs)
```

### Accessor Functions \label{sec:cmp-reg-accessors}

Several accessors are provided to extract results from the output object.

```{r}
logLik(cmp.out)               ## Log-likelihood evaluated at MLE.
AIC(cmp.out)                  ## AIC evaluated at MLE.
BIC(cmp.out)                  ## BIC evaluated at MLE.
coef(cmp.out)                 ## Estimates of theta as a flat vector
coef(cmp.out, type = "list")  ## Estimates of theta as a named list
vcov(cmp.out)                 ## Estimated covariance matrix of theta hat
sdev(cmp.out)                 ## Standard deviations from vcov(...) diagonals
sdev(cmp.out, type = "list")  ## Standard deviations as a named list
```

The `predict` function computes several useful quantities evaluated at the estimate $\hat{\btheta}$. The default argument `type = "response"` produces a vector of estimates of the response $\hat{y}_i = \E(Y_i)$ for $i = 1, \ldots, n$ using the method described in Section \ref{sec:cmp-ev}. The argument `type = "link"` produces a `data.frame` with columns for the linked parameters $\lambda_i$ and $\nu_i$.

```{r}
predict(cmp.out)
predict(cmp.out, type = "link")
```

Note that the estimated `nu` values are equal for all observations because the model assumed only an intercept term for the dispersion component. We can also use `predict` on new covariate values.
Note that models fit with the formula interface expect the new data to be provided as a `data.frame` which is interpreted using the formula used to fit the model. If the raw interface was used to fit the model, use the `get.modelmatrix` function to specify design matrices to use for prediction.

```{r, fig.width = 3, fig.height = 3, fig.align = "center", prompt=FALSE}
# Prepare new data to fit by formula interface
new.df = data.frame(transfers = 0:10)

# Prepare new data to fit by raw interface
X = model.matrix(~ transfers, data = new.df)
S = model.matrix(~ 1, data = new.df)
new.data = get.modelmatrix(X = X, S = S)

# Pass new data to model fit by formula interface
y.hat.new = predict(cmp.out, newdata = new.df)

# Pass new data to model fit by raw interface
y.hat.new = predict(cmp.raw.out, newdata = new.data)

# Compute predictions for links
predict.out = predict(cmp.out, newdata = new.df, type = "link")

# Plot predictions
ggplot() +
  geom_point(data = new.df, aes(transfers, y.hat.new)) +
  xlab("Number of transfers") +
  ylab("Predicted number broken") +
  theme_bw()
```

```{r}
print(y.hat.new)
print(predict.out)
```

The `leverage` function computes the diagonal entries of a "hat" matrix which can be formulated in CMP regression. These can be used to diagnose influential observations. For details, see Section 3.6 of @SellersShmueli2010.

```{r}
leverage(cmp.out)
```

The `residuals` function provides either raw (the default) or quantile-based residuals [@DunnSmyth1996]. In a CMP regression setting, raw residuals $y_i - \hat{y}_i$ generally do not work well with traditional regression diagnostics, such as Q-Q plots. Quantile-based residuals often produce interpretable diagnostics; however, a random element is used in the computation of quantile residuals for discrete distributions. This aids interpretability but gives slightly different residual values each time they are computed. See @DunnSmyth1996 for details.

```{r}
res.raw = residuals(cmp.out)
res.qtl = residuals(cmp.out, type = "quantile")
```

Pearson residuals may be preferred over raw residuals for diagnostics; these can be obtained by standardizing raw residuals using leverage values and variance estimates.

```{r}
link.hat = predict(cmp.out, type = "link")
vv = vcmp(link.hat$lambda, link.hat$nu)
hh = leverage(cmp.out)
res.pearson = res.raw / sqrt(vv*(1-hh))
```

For each type of residual---raw, Pearson, and quantile---we now plot fitted values versus residuals and Q-Q plots.

```{r, fig.width = 3, fig.height = 3, fig.align = "center", prompt = FALSE, fig.show = "hold"}
plot.fit.res = function(y.hat, res) {
  ggplot(data.frame(y = y.hat, res = res)) +
    geom_point(aes(y, res)) +
    xlab("Fitted Value") +
    ylab("Residual Value") +
    theme_bw() +
    theme(plot.title = element_text(size = 10))
}
plot.qq.res = function(res) {
  ggplot(data.frame(res = res), aes(sample = res)) +
    stat_qq() +
    stat_qq_line() +
    theme_bw() +
    theme(plot.title = element_text(size = 10))
}
y.hat = predict(cmp.out)
plot.fit.res(y.hat, res.raw) + ggtitle("Fitted Values vs. Raw Residuals")
plot.qq.res(res.raw) + ggtitle("Q-Q Plot of Raw Residuals")
plot.fit.res(y.hat, res.pearson) + ggtitle("Fitted Values vs. Pearson Residuals")
plot.qq.res(res.pearson) + ggtitle("Q-Q Plot of Pearson Residuals")
plot.fit.res(y.hat, res.qtl) + ggtitle("Fitted Values vs. Quantile Residuals")
Quantile Residuals") plot.qq.res(res.qtl) + ggtitle("Q-Q Plot of Quantile Residuals") ``` In this example, with only `r nrow(freight)` observations, it is difficult to see an advantage of using quantile residuals; the benefit will be more apparent in Section \ref{sec:zicmp-reg}. One benefit of raw residuals is that they may be used to compute a mean-squared error. ```{r} mean(res.raw^2) ``` To access the results of the equidispersion test shown in the output of `cmp.out`, we may use the `equitest` accessor function. ```{r} equitest(cmp.out) ``` The `deviance` function computes the deviance quantities $D_i = -2 [\log L_i(\hat{\btheta}) - \log L_i(\tilde{\btheta}_i)]$ for $i = 1, \ldots, n$, where $L_i(\btheta)$ is the term of the likelihood corresponding to the $i$th observation, $\hat{\btheta}$ is the MLE computed under the full likelihood $L(\btheta) = \prod_{i=1}^n L_i(\btheta)$, and $\tilde{\btheta}_i$ is the maximizer of $L_i(\btheta)$. ```{r} deviance(cmp.out) ``` The `parametric.bootstrap` function carries out a parametric bootstrap with $R$ repetitions. Using the fitted MLE $\hat{\btheta}$, bootstrap samples $\vec{y}^{(r)} = (y_1^{(r)}, \ldots, y_n^{(r)})$ are drawn from the likelihood $L(\hat{\btheta})$ for $r = 1, \ldots, R$. Estimate $\hat{\btheta}^{(r)}$ is fitted from bootstrap sample $\vec{y}^{(r)}$. An $R \times d$ matrix is returned whose $r$th row is $\hat{\btheta}^{(r)}$. ```{r} cmp.boot = parametric.bootstrap(cmp.out, reps = 100) head(cmp.boot) ``` We used $R = `r nrow(cmp.boot)`$ in the display above to keep vignette computations small, but a larger number may be desired in practice. Bootstrap repetitions can be used, for example, to compute 95% confidence intervals for each of the coefficients. ```{r} t(apply(cmp.boot, 2, quantile, c(0.025,0.975))) ``` ### Large Covariates \label{sec:cmp-reg-large-covariates} Large covariates can present numerical difficulties in fitting CMP regression. We will briefly demonstrate the difficulties and some possible workarounds. First let us generate a new dataset based on a large covariate in the regression for $\lambda_i$. ```{r, prompt=FALSE} set.seed(1234) n = 200 x = rnorm(n, 500, 10) X = cbind(intercept = 1, slope = x) S = matrix(1, n, 1) beta_true = c(-0.05, 0.05) gamma_true = 2 lambda_true = exp(X %*% beta_true) nu_true = exp(S %*% gamma_true) y = rcmp(n, lambda_true, nu_true) ``` Notice that the generated counts $y_1, \ldots, y_n$ are relatively small compared to the covariate $x_1, \ldots, x_n$. ```{r} summary(x) summary(y) ``` An initial attempt to fit the true data-generating model fails. ```{r} tryCatch({ glm.cmp(y ~ x, formula.nu = ~ 1) }, error = print_warning) ``` Internally, the linked rate parameter $\lambda_i = \exp(\beta_0 + \beta_1 x_i)$ may evaluate to `Inf` or become very close to zero as the optimizer moves $\beta_1$ away from zero in a positive or negative direction, respectively. Some possible ways to address this are as follows. Standardize the covariate to have mean zero and variance one. ```{r} glm.cmp(y ~ scale(x), formula.nu = ~ 1) ``` Use a logarithmic transformation on the covariate. ```{r} glm.cmp(y ~ log(x), formula.nu = ~ 1) ``` Change optimization method or other `optim` arguments. ```{r} control = get.control(optim.method = "BFGS", optim.control = list(maxit = 200)) suppressWarnings({ cmp.out = glm.cmp(y ~ x, formula.nu = ~ 1, control = control) print(cmp.out) }) ``` In this case, standardization and logarithmic transformation produce a usable fit. 
Changing the optimization method to `BFGS` allows the optimization to finish, but there are further numerical problems in computing the Hessian for standard errors.

### Large Outcomes \label{sec:cmp-reg-large-outcomes}

Now consider a generated dataset with large outcomes but a relatively small covariate. This situation can also present numerical difficulties.

```{r, prompt=FALSE}
set.seed(1234)
n = 200
x = runif(n, 1, 2)
X = cbind(intercept = 1, slope = x)
S = matrix(1, n, 1)
beta_true = c(1, 1)
gamma_true = -0.95
lambda_true = exp(X %*% beta_true)
nu_true = exp(S %*% gamma_true)
y = rcmp(n, lambda_true, nu_true)
```

```{r}
summary(x)
summary(y)
```

An initial attempt to fit the data-generating model fails.

```{r}
tryCatch({
  glm.cmp(y ~ x, formula.nu = ~ 1)
}, error = print_warning)
```

Informative starting values help the optimizer to initially make progress. True data-generating parameters will not be available in a real data analysis situation but help to illustrate the idea.

```{r}
init = get.init(beta = beta_true, gamma = gamma_true)
glm.cmp(y ~ x, formula.nu = ~ 1, init = init)
```

Without choosing an initial value, changing the optimization method to `Nelder-Mead` and increasing the maximum number of iterations also helps the optimizer find a solution.

```{r}
control = get.control(optim.method = "Nelder-Mead", optim.control = list(maxit = 1000))
glm.cmp(y ~ x, formula.nu = ~ 1, control = control)
```

Note that this solution is different from the previous one; the log-likelihood of the previous one is slightly better.

## ZICMP Regression \label{sec:zicmp-reg}

### Couple Dataset \label{sec:zicmp-reg-couple}

The `couple` dataset [@LoeysEtAl2012] was analyzed with ZICMP regression in @SellersRaim2016 and found to exhibit overdispersion. The data concern separation trajectories of $n = 387$ couples. The variable `UPB` records the number of unwanted pursuit behavior perpetrations and is considered the outcome of interest. Included covariates are the binary variable `EDUCATION`, which is 1 if at least a bachelor's degree was attained, and a continuous variable `ANXIETY`, which measures anxious attachment. A zero-inflated count model is considered for these data because 246 of the 387 records have an outcome of zero. Let us load and view the first few records in the dataset.

```{r}
data(couple)
head(couple)
```

As a preliminary model, let us fit a standard Poisson model $Y_i \indep \text{Poisson}(\lambda_i)$ with
\begin{align*}
\log \lambda_i = \beta_0 + \beta_1 \cdot \text{EDUCATION}_i + \beta_2 \cdot \text{ANXIETY}_i.
\end{align*}
We may use the standard `glm` function.

```{r}
glm.out = glm(UPB ~ EDUCATION + ANXIETY, data = couple, family = poisson)
summary(glm.out)
```

### ZICMP Regression \label{sec:zicmp-reg-initial}

Now consider a ZICMP regression with
\begin{align*}
&\log \lambda_i = \beta_0 + \beta_1 \cdot \text{EDUCATION}_i + \beta_2 \cdot \text{ANXIETY}_i, \\
&\log \nu_i = \gamma_0, \\
&\logit p_i = \zeta_0 + \zeta_1 \cdot \text{EDUCATION}_i + \zeta_2 \cdot \text{ANXIETY}_i.
\end{align*}
We use the `glm.cmp` function as follows.

```{r}
zicmp0.out = glm.cmp(UPB ~ EDUCATION + ANXIETY,
  formula.nu = ~ 1,
  formula.p = ~ EDUCATION + ANXIETY,
  data = couple)
print(zicmp0.out)
```

There are now three sets of coefficients reported in the output: the `X:`, `S:`, and `W:` prefixes label estimates for the $\lambda_i$, $\nu_i$, and $p_i$ formulas, respectively.

### Comments about Results

The AIC of the ZICMP model is drastically smaller than that of the Poisson model, indicating a greatly improved fit.
However, there are signs of possible numerical issues. The estimate for $\gamma_0$ is a large negative number, but with an extremely large associated SE, which suggests that the effect may not be statistically significant. On the other hand, the estimate of $\nu$ is nearly zero with a small SE, which suggests that the dispersion parameter is indeed statistically significant. On the surface, this seems to be a contradiction.

The issue is that the Hessian of the log-likelihood becomes insensitive to small changes in $\gamma_0$ when $\lambda_i < 1$ and $\gamma_0$ is a large negative number. Let us first verify that the estimates for $\lambda_i$ are indeed smaller than 1.

```{r}
pred.out = predict(zicmp0.out, type = "link")
summary(pred.out$lambda)
```

To show the insensitivity of the Hessian, let us consider a simpler setting with $Y \sim \text{CMP}(\lambda, \nu)$, $\lambda = \exp\{-0.25\} \approx 0.7788$ fixed, and $\nu = \exp\{\gamma_0\}$. We then have log-density
\begin{align*}
\log f(y \mid \gamma_0) = y \log \lambda - e^{\gamma_0} \log(y!) - \log Z(\lambda, e^{\gamma_0}),
\end{align*}
with first and second derivatives, respectively,
\begin{align*}
&\frac{\partial}{\partial \gamma_0} \log f(y \mid \gamma_0) = -e^{\gamma_0} \log(y!) - \frac{\partial}{\partial \gamma_0} \log Z(\lambda, e^{\gamma_0}), \\
&\frac{\partial^2}{\partial \gamma_0^2} \log f(y \mid \gamma_0) = -e^{\gamma_0} \log(y!) - \frac{\partial^2}{\partial \gamma_0^2} \log Z(\lambda, e^{\gamma_0}).
\end{align*}
For a given value of $y$, $-e^{\gamma_0} \log(y!)$ approaches zero as $\gamma_0$ decreases. Therefore, let us focus on the function
\(
g(\gamma_0) = -\log Z(\lambda, e^{\gamma_0})
\)
and its first and second derivatives. The following code illustrates their behavior.

```{r, prompt=FALSE}
library(numDeriv)
g = function(gamma0) {
  -ncmp(lambda = exp(-0.25), nu = exp(gamma0), log = TRUE)
}
dat = data.frame(gamma0 = seq(0, -13), g = NA, d_g = NA, d2_g = NA)
for (j in 1:nrow(dat)) {
  gamma0 = dat$gamma0[j]
  dat$g[j] = g(gamma0)
  dat$d_g[j] = numDeriv::grad(func = g, x = gamma0)
  dat$d2_g[j] = numDeriv::hessian(func = g, x = gamma0)
}
```

Here is the result.

```{r}
print(dat)
```

Notice that $g(\gamma_0)$ approaches a limit as $\gamma_0 \rightarrow -\infty$, which coincides with the CMP distribution approaching a Geometric distribution. It may not be surprising, then, that the first and second derivatives approach zero accordingly. This explains the large SE for $\gamma_0$ in the model `zicmp0.out`. With estimates tending to this region of the parameter space, it may be preferable to fix $\gamma_0$ at a value such as $-\infty$, which will be done in the next section.

### Fixed Coefficients \label{sec:zicmp-reg-fixed}

Our attempt to fit the previous model strongly tended to the Zero-Inflated Geometric special case of ZICMP, but SEs computed via the Hessian become large in this region. In this section, we fix $\gamma_0$ at the extreme $-\infty$ and fit the remaining coefficients. Let us use the raw interface to do this.
```{r, prompt=FALSE}
init = coef(zicmp0.out, type = "list")

y = couple$UPB
X = model.matrix(~ EDUCATION + ANXIETY, data = couple)
S = model.matrix(~ 1, data = couple)
W = model.matrix(~ EDUCATION + ANXIETY, data = couple)

control = get.control(optim.method = "BFGS")
zicmp.out = glm.zicmp.raw(y, X, S, W,
  init = get.init(beta = c(-1,0,0), gamma = -Inf, zeta = c(-1,0,0)),
  fixed = get.fixed(gamma = 1L), control = control)
```

```{r}
print(zicmp.out)
```

Notice that an additional `Fixed` column has been added to the display, indicating that the coefficient `gamma` is fixed. Furthermore, its `Estimate` column is set to the initial value, and the columns `SE`, `z-value`, and `p-value` are set to `NA`. This model achieves a log-likelihood value similar to that of our first attempt using `L-BFGS-B`, but does not exhibit signs of numerical issues.

### Accessor Functions \label{sec:zicmp-reg-accessors}

Here are several of the accessors provided to extract model outputs.

```{r}
logLik(zicmp.out)               ## Log-likelihood evaluated at MLE
AIC(zicmp.out)                  ## AIC evaluated at MLE
BIC(zicmp.out)                  ## BIC evaluated at MLE
coef(zicmp.out)                 ## Estimates of theta as a flat vector
coef(zicmp.out, type = "list")  ## Estimates of theta as a named list
vcov(zicmp.out)                 ## Estimated covariance matrix of theta hat
sdev(zicmp.out)                 ## Standard deviations from vcov(...) diagonals
sdev(zicmp.out, type = "list")  ## Standard deviations as a named list
equitest(zicmp0.out)            ## Likelihood ratio test for H_0: gamma = 0
tryCatch({
  ## An error is thrown for a model with fixed gamma
  equitest(zicmp.out)
}, error = print_warning)
```

Because we fixed $\gamma_0 = -\infty$ to obtain `zicmp.out`, the `equitest` function throws an error instead of proceeding with an equidispersion test.

The `predict` function behaves similarly as in CMP regression; however, the `link` type here also includes a column with the estimated $p_i$.

```{r}
y.hat = predict(zicmp.out)  ## Fitted values based on ecmp
link.hat = predict(zicmp.out, type = "link")
head(y.hat)
head(link.hat)
```

In this example, we can see the benefit of using quantile residuals rather than raw residuals for diagnostic plots. The functions `plot.fit.res` and `plot.qq.res` were defined in Section \ref{sec:cmp-reg}.

```{r, fig.width = 3, fig.height = 3, fig.align = "center", prompt = FALSE, fig.show = "hold"}
res.raw = residuals(zicmp.out, type = "raw")
res.qtl = residuals(zicmp.out, type = "quantile")
plot.fit.res(y.hat, res.raw) + ggtitle("Fitted Values vs. Raw Residuals")
plot.qq.res(res.raw) + ggtitle("Q-Q Plot of Raw Residuals")
plot.fit.res(y.hat, res.qtl) + ggtitle("Fitted Values vs. Quantile Residuals")
plot.qq.res(res.qtl) + ggtitle("Q-Q Plot of Quantile Residuals")
```

Here is an example of computing fitted values for new covariate data.

```{r, prompt = FALSE}
new.df = data.frame(EDUCATION = round(1:20 / 20), ANXIETY = seq(-3, 3, length.out = 20))
X.new = model.matrix(~ EDUCATION + ANXIETY, data = new.df)
S.new = model.matrix(~ 1, data = new.df)
W.new = model.matrix(~ EDUCATION + ANXIETY, data = new.df)
new.data = get.modelmatrix(X.new, S.new, W.new)

# For a model fit using the raw interface, use get.modelmatrix to prepare new
# design matrices, offsets, etc.
y.hat.new = predict(zicmp.out, newdata = new.data)

# For models fit with the formula interface, pass a data.frame with the same
# structure as used in the fit.
y.hat.new = predict(zicmp0.out, newdata = new.df)
```

```{r}
print(y.hat.new)
```

As with CMP regression, a `parametric.bootstrap` function is provided for convenience to obtain a bootstrap sample $\hat{\btheta}^{(r)}$ of $\btheta$ based on the estimate $\hat{\btheta}$. Because it is too time consuming to run this example within the vignette, we show the code without output below. As in Section \ref{sec:cmp-reg}, we consider using the bootstrap samples to construct a 95% confidence interval for each of the coefficients.

```{r, eval = FALSE}
zicmp.boot = parametric.bootstrap(zicmp.out, reps = 100)
head(zicmp.boot)
apply(zicmp.boot, 2, quantile, c(0.025,0.975))
```

# Acknowledgements {.unlisted .unnumbered}

We acknowledge Thomas Lotze for significant contributions to the initial development of the `COMPoissonReg` package and for service as initial maintainer on CRAN. We thank Darcy Steeg Morris, Eric Slud, and Tommy Wright for reviewing the manuscript. We are grateful to the users of `COMPoissonReg` for their interest and for bringing to light some of the issues which have been considered in this work.

# References