\[X_{j} = u_{1} + \epsilon_{j}^{x}, \quad j=1,2,3\]
\[Y_{j} = u_{2} + \epsilon_{j}^{y}, \quad j=1,2,3\]
and with a structural model given by
\[u_{2} = f(u_{1}) + Z + \zeta_{2}\]
\[u_{1} = Z + \zeta_{1}\]
with iid measurement errors and residual terms \(\epsilon_{j}^{x},\epsilon_{j}^{y},\zeta_{1},\zeta_{2}\sim\mathcal{N}(0,1)\), \(j=1,2,3\), and a standard normally distributed covariate \(Z\). To simulate from this model we use the following syntax:
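The data-generating process can also be written out directly, which makes the role of each equation explicit. The following is a minimal base R sketch that does not use lava; the function f, the sample size n, the seed and the object name dsim are assumptions chosen for illustration and are not taken from the text.

```r
## Minimal base-R sketch of the data-generating process (no lava required).
## ASSUMPTIONS: f, n and the seed are illustrative choices, not the author's.
set.seed(42)
n <- 200
f <- function(x) cos(1.25 * x) + x - 0.25 * x^2  # hypothetical non-linear effect

z  <- rnorm(n)              # covariate Z ~ N(0,1)
u1 <- z + rnorm(n)          # u1 = Z + zeta1
u2 <- f(u1) + z + rnorm(n)  # u2 = f(u1) + Z + zeta2

## Three indicators per latent variable: loading 1, intercept 0, N(0,1) errors
X <- sapply(1:3, function(j) u1 + rnorm(n))
Y <- sapply(1:3, function(j) u2 + rnorm(n))
colnames(X) <- paste0("x", 1:3)
colnames(Y) <- paste0("y", 1:3)

## The latent variables are kept in the data so that they can be plotted later
dsim <- data.frame(X, Y, z = z, u1 = u1, u2 = u2)
str(dsim)
```

In a real analysis only the indicators and the covariate would be observed; the latent columns are retained here purely for plotting and checking.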

We refer to (Holst and Budtz-Jørgensen 2013) for details on the syntax for model specification.

Estimation

To estimate the parameters using the two-stage estimator described in (Holst and Budtz-Jørgensen 2020), the first step is to specify the measurement models

m1 <- lvm(x1+x2+x3 ~ u1, u1 ~ z, latent=~u1)
m2 <- lvm(y1+y2+y3 ~ u2, u2 ~ z, latent=~u2)

Next, we specify a quadratic relationship between the two latent variables

nonlinear(m2, type="quadratic") <- u2 ~ u1

and the model can then be estimated using the two-stage estimator

e1 <- twostage(m1, m2, data=d)
e1
#>                     Estimate Std. Error Z-value  P-value
#> Measurements:                                           
#>    y2~u2               0.977      0.035  28.309   <1e-12
#>    y3~u2               1.045      0.035  29.982   <1e-12
#> Regressions:                                            
#>    u2~z                0.885      0.208   4.260    2e-05
#>    u2~u1_1             1.141      0.174   6.552  5.7e-11
#>    u2~u1_2            -0.451      0.072  -6.292  3.1e-10
#> Intercepts:                                             
#>    y2                 -0.122      0.109  -1.117     0.26
#>    y3                 -0.099      0.105  -0.937     0.35
#>    u2                  0.678      0.174   3.906  9.4e-05
#> Residual Variances:                                     
#>    y1                  1.307      0.177   7.368         
#>    y2                  1.111      0.145   7.671         
#>    y3                  0.810      0.132   6.132         
#>    u2                  2.085      0.290   7.193

We see a clearly statistically significant effect of the second-order term (u2~u1_2). For comparison we can also estimate the full MLE of the linear model:

e0 <- estimate(regression(m1%++%m2, u2~u1), d)
estimate(e0,keep="^u2~[a-z]",regex=TRUE) ## Extract coef. matching reg.ex.
#>       Estimate Std.Err    2.5% 97.5%   P-value
#> u2~u1   1.4140  0.2261 0.97083 1.857 4.014e-10
#> u2~z    0.6374  0.2778 0.09291 1.182 2.177e-02

Next, we calculate predictions from the quadratic model using the estimated parameter coefficients \[ \mathbb{E}_{\widehat{\theta}_{2}}(u_{2} \mid u_{1}, Z=0), \]

newd <- expand.grid(u1=seq(-4, 4, by=0.1), z=0)
pred1 <- predict(e1, newdata=newd, x=TRUE)
head(pred1)
#>           y1      y2      y3      u2
#> [1,] -11.094 -10.959 -11.690 -11.094
#> [2,] -10.624 -10.500 -11.199 -10.624
#> [3,] -10.163 -10.049 -10.717 -10.163
#> [4,]  -9.711  -9.608 -10.245  -9.711
#> [5,]  -9.268  -9.175  -9.782  -9.268
#> [6,]  -8.834  -8.751  -9.329  -8.834
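As a quick check on how the coefficients enter the prediction, the u2 column above can be approximately reproduced by hand. The sketch below assumes that u2~u1_1 and u2~u1_2 multiply u1 and u1^2 respectively, and it uses the rounded estimates printed for e1, so agreement is only to about two decimals. The object names are illustrative.

```r
## Rounded estimates copied from the printed summary of e1 (z = 0 throughout).
b0 <- 0.678   # intercept of u2
b1 <- 1.141   # u2~u1_1, assumed to multiply u1
b2 <- -0.451  # u2~u1_2, assumed to multiply u1^2

u1_grid <- seq(-4, 4, by = 0.1)
u2_hand <- b0 + b1 * u1_grid + b2 * u1_grid^2

## Indicator predictions follow from the measurement model:
## y1 has loading 1 and intercept 0, y2 uses its estimated intercept and loading.
y2_hand <- -0.122 + 0.977 * u2_hand

round(head(cbind(u1 = u1_grid, u2 = u2_hand, y2 = y2_hand)), 2)
```

The first two hand-computed values of u2 are about -11.10 and -10.63, against -11.094 and -10.624 in pred1; the small gap is due to rounding of the coefficients.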

To obtain a potentially better fit we next proceed with a natural cubic spline

kn <- seq(-3,3,length.out=5)
nonlinear(m2, type="spline", knots=kn) <- u2 ~ u1
e2 <- twostage(m1, m2, data=d)
#> Warning in check_ic_mean_zero(icz): IC does not have mean zero (max |mean|/rms
#> = 2.6e-06). Using lava.options(check.ic = FALSE) disables the warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 2.6e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 2.6e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
e2
#>                     Estimate Std. Error Z-value  P-value
#> Measurements:                                           
#>    y2~u2               0.978      0.035  28.313   <1e-12
#>    y3~u2               1.045      0.035  29.971   <1e-12
#> Regressions:                                            
#>    u2~z                0.867      0.203   4.278  1.9e-05
#>    u2~u1_1             2.862      0.673   4.255  2.1e-05
#>    u2~u1_2             0.003      0.101   0.034     0.97
#>    u2~u1_3            -0.263      0.294  -0.894     0.37
#>    u2~u1_4             0.508      0.352   1.443     0.15
#> Intercepts:                                             
#>    y2                 -0.122      0.109  -1.116     0.26
#>    y3                 -0.099      0.105  -0.936     0.35
#>    u2                  1.838      1.664   1.104     0.27
#> Residual Variances:                                     
#>    y1                  1.313      0.178   7.396         
#>    y2                  1.104      0.145   7.639         
#>    y3                  0.811      0.132   6.153         
#>    u2                  1.994      0.269   7.402

Confidence limits can be obtained via the Delta method using the estimate method:

p <- cbind(u1=newd$u1,
      estimate(e2,f=function(p) predict(e2,p=p,newdata=newd))$coefmat)
head(p)
#>      u1 Estimate Std.Err   2.5%  97.5%   P-value
#> p1 -4.0   -9.611  1.2647 -12.09 -7.132 2.978e-14
#> p2 -3.9   -9.325  1.2051 -11.69 -6.963 1.012e-14
#> p3 -3.8   -9.039  1.1464 -11.29 -6.792 3.152e-15
#> p4 -3.7   -8.752  1.0886 -10.89 -6.619 8.959e-16
#> p5 -3.6   -8.466  1.0319 -10.49 -6.444 2.320e-16
#> p6 -3.5   -8.180  0.9766 -10.09 -6.266 5.494e-17
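For readers unfamiliar with the delta method, the idea is to propagate the covariance of the parameter estimates through a first-order approximation of the function of interest. The following base R sketch illustrates the calculation with a numerical Jacobian; it is not lava's implementation, and the function name delta_method as well as all numbers in the toy example are made up for illustration.

```r
## Generic delta-method sketch (base R only; not lava's implementation).
## g:     function mapping a parameter vector to one or more quantities
## theta: parameter estimates;  Sigma: their estimated covariance matrix
delta_method <- function(g, theta, Sigma, eps = 1e-6, level = 0.95) {
  est <- g(theta)
  ## Central finite-difference Jacobian, one column per parameter
  J <- sapply(seq_along(theta), function(k) {
    h <- replace(numeric(length(theta)), k, eps)
    (g(theta + h) - g(theta - h)) / (2 * eps)
  })
  J <- matrix(J, nrow = length(est))
  se <- sqrt(diag(J %*% Sigma %*% t(J)))
  q <- qnorm(1 - (1 - level) / 2)
  cbind(Estimate = est, Std.Err = se,
        lower = est - q * se, upper = est + q * se)
}

## Toy example with made-up numbers: a quadratic curve evaluated at three points
theta <- c(0.5, 1, -0.5)
Sigma <- diag(c(0.04, 0.01, 0.0025))  # hypothetical covariance matrix
x <- c(-1, 0, 1)
res <- delta_method(function(b) b[1] + b[2] * x + b[3] * x^2, theta, Sigma)
res
```

Because the toy curve is linear in the parameters, the delta-method standard errors coincide with the exact ones; at x = 0 only the intercept contributes, giving a standard error of 0.2.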

The fitted function, together with pointwise 95% confidence limits, can be plotted with the following code:

plot(I(u2-z) ~ u1, data=d, col=Col("black",0.5), pch=16,
     xlab=expression(u[1]), ylab=expression(u[2]), xlim=c(-4,4))
lines(Estimate ~ u1, data=as.data.frame(p), col="darkblue", lwd=5)
confband(p[,1], lower=p[,4], upper=p[,5], polygon=TRUE,
     border=NA, col=Col("darkblue",0.2))

Cross-validation

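A more formal comparison of the different models can be obtained by cross-validation. Here we specify linear, quadratic, and natural cubic spline models, the latter with 5 and 8 equidistant knots on [-3, 3] (the 5-knot spline has 4 degrees of freedom, matching the four u2~u1 terms estimated above).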

m2a <- nonlinear(m2, type="linear", u2~u1)
m2b <- nonlinear(m2, type="quadratic", u2~u1)
kn1 <- seq(-3,3,length.out=5)
kn2 <- seq(-3,3,length.out=8)
m2c <- nonlinear(m2, type="spline", knots=kn1, u2~u1)
m2d <- nonlinear(m2, type="spline", knots=kn2, u2~u1)

To assess the model fit, the average RMSE is estimated with 5-fold cross-validation repeated two times

## Scale models in stage 2 to allow for a fair RMSE comparison
d0 <- d
for (i in endogenous(m2))
    d0[,i] <- scale(d0[,i],center=TRUE,scale=TRUE)
## Repeated 5-fold cross-validation:
ff <- lapply(list(linear=m2a,quadratic=m2b,spline4=m2c,spline6=m2d),
        function(m) function(data,...) twostage(m1,m,data=data,stderr=FALSE,control=list(start=coef(e0),contrain=TRUE)))
fit.cv <- lava:::cv(ff,data=d,K=5,rep=2,mc.cores=parallel::detectCores(),seed=1)

fit.cv$coef
#>            RMSE
#> linear    2.138
#> quadratic 1.806
#> spline4   1.747
#> spline6   1.761
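The mechanics of repeated K-fold cross-validation can be illustrated on a simpler problem. The following base R sketch compares a linear and a quadratic regression on simulated, fully observed data; it is not the procedure used by lava for latent variable models, and the helper cv_rmse, the toy data and the model list are all assumptions made for illustration.

```r
## Repeated K-fold cross-validation of prediction RMSE (base R sketch).
## fits: named list of functions, each taking training data and returning a
##       fitted model that supports predict(fit, newdata = ...)
cv_rmse <- function(fits, data, response, K = 5, reps = 2, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  res <- matrix(NA_real_, nrow = reps * K, ncol = length(fits),
                dimnames = list(NULL, names(fits)))
  row <- 0
  for (r in seq_len(reps)) {
    fold <- sample(rep(seq_len(K), length.out = n))  # random fold labels
    for (k in seq_len(K)) {
      train <- data[fold != k, , drop = FALSE]
      test  <- data[fold == k, , drop = FALSE]
      row <- row + 1
      for (j in seq_along(fits)) {
        fit <- fits[[j]](train)
        err <- test[[response]] - predict(fit, newdata = test)
        res[row, j] <- sqrt(mean(err^2))
      }
    }
  }
  colMeans(res)  # average RMSE over all folds and repetitions
}

## Toy data with a truly quadratic mean function
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- toy$x - 0.5 * toy$x^2 + rnorm(200)

fits <- list(linear    = function(d) lm(y ~ x, data = d),
             quadratic = function(d) lm(y ~ x + I(x^2), data = d))
rmse <- cv_rmse(fits, toy, "y")
rmse
```

Each model is refitted on the training folds and evaluated only on the held-out fold, so a more flexible model is rewarded only when it predicts new data better.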

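Here the RMSE favours the spline model with 4 degrees of freedom (spline4, RMSE 1.747), with the larger spline model (spline6, RMSE 1.761) close behind. To compare the fitted curves, the four models are refitted to the full data: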

fit <- lapply(list(m2a,m2b,m2c,m2d),
         function(x) {
         e <- twostage(m1,x,data=d)
         pr <- cbind(u1=newd$u1,predict(e,newdata=newd$u1,x=TRUE))
         return(list(estimate=e,predict=as.data.frame(pr)))
         })
#> Warning in check_ic_mean_zero(icz): IC does not have mean zero (max |mean|/rms
#> = 2.6e-06). Using lava.options(check.ic = FALSE) disables the warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 2.6e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 2.6e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
#> Warning in check_ic_mean_zero(icz): IC does not have mean zero (max |mean|/rms
#> = 6.9e-06). Using lava.options(check.ic = FALSE) disables the warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 6.9e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 6.8e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.

plot(I(u2-z) ~ u1, data=d, col=Col("black",0.5), pch=16,
     xlab=expression(u[1]), ylab=expression(u[2]), xlim=c(-4,4))
col <- c("orange","darkred","darkgreen","darkblue")
lty <- c(3,4,1,5)
for (i in seq_along(fit)) {
    with(fit[[i]]$predict, lines(u2 ~ u1, col=col[i], lwd=4, lty=lty[i]))
}
legend("bottomright",
      c("linear","quadratic","spline(df=4)","spline(df=6)"),
      col=col, lty=lty, lwd=3)

For convenience, the function twostageCV can be used to do the cross-validation (also for choosing the mixture distribution via the nmix argument, see the section below). For example,

set.seed(1)
selmod <- twostageCV(m1, m2, data=d, df=2:4, nmix=1:2,
        nfolds=5, rep=2, mc.cores=parallel::detectCores())

applies 5-fold cross-validation, repeated twice, to select the best spline model with degrees of freedom ranging from 2 to 4 (the linear model, df=1, is automatically included), and compares stage 1 models with one and two mixture components

selmod
#> ────────────────────────────────────────────────────────────────────────────────
#> Selected mixture model: 2 components
#>   AIC1
#> 1 1962
#> 2 1959
#> ────────────────────────────────────────────────────────────────────────────────
#> Selected spline model degrees of freedom: 4
#> Knots: -3.958 -1.968 0.02149 2.011 4.001 
#> 
#>      RMSE(nfolds=, rep=)
#> df:1               2.135
#> df:2               1.883
#> df:3               1.918
#> df:4               1.861
#> ────────────────────────────────────────────────────────────────────────────────
#> 
#>                     Estimate Std. Error Z-value  P-value   std.xy  
#> Measurements:                                                      
#>    y1~u2             1.00000                                0.93509
#>    y2~u2             0.97827  0.03464   28.24164   <1e-12   0.94291
#>    y3~u2             1.04529  0.03482   30.01849   <1e-12   0.96175
#> Regressions:                                                       
#>    u2~z              1.02726  0.22350    4.59630 4.301e-06  0.34700
#>    u2~u1_1           2.61264  0.90770    2.87830 0.003998   1.13566
#>    u2~u1_2           0.01368  0.06464    0.21164 0.8324     0.30984
#>    u2~u1_3          -0.19029  0.17477   -1.08879 0.2762    -1.48961
#>    u2~u1_4           0.35252  0.19173    1.83859 0.06597    0.52314
#> Intercepts:                                                        
#>    y1                0.00000                                0.00000
#>    y2               -0.12170  0.10925   -1.11391 0.2653    -0.03871
#>    y3               -0.09870  0.10546   -0.93592 0.3493    -0.02997
#>    u2                1.54968  2.64293    0.58635 0.5576     0.51143
#> Residual Variances:                                                
#>    y1                1.31889  0.17659    7.46873            0.12560
#>    y2                1.09634  0.14483    7.56960            0.11093
#>    y3                0.81386  0.13260    6.13772            0.07504
#>    u2                1.99291  0.28189    7.06988            0.21706

Specification of general functional forms

Next, we show how to specify a general functional relation involving several latent or exogenous variables. This is achieved via the predict.fun argument. To illustrate this we include interactions between the latent variable \(u_{1}\) and a dichotomized version of the covariate \(z\)

d$g <- (d$z<0)*1 ## Group variable
mm1 <- regression(m1, ~g)  # Add grouping variable as exogenous variable (effect specified via 'predict.fun')
mm2 <- regression(m2, u2~ u1+u2+u1:g+u2:g+z)
pred <- function(mu,var,data,...) {
    cbind("u1"=mu[,1],"u2"=mu[,1]^2+var[1],
      "u1:g"=mu[,1]*data[,"g"],"u2:g"=(mu[,1]^2+var[1])*data[,"g"])
}
ee1 <- twostage(mm1, model2=mm2, data=d, predict.fun=pred)
#> Warning in estimate.lvm(model2, data = newd, ...): Lack of convergence.
#> Increase number of iteration or change starting values.
#> Warning in estimate.lvm(model2, data = newd, ...): Near-singular covariance
#> matrix, using pseudo-inverse!
estimate(ee1,keep="u2~u",regex=TRUE)
#>         Estimate Std.Err     2.5%    97.5%   P-value
#> u2~u2     0.4307 0.04398  0.34446  0.51687 1.223e-22
#> u2~u1     0.2398 0.12799 -0.01104  0.49066 6.097e-02
#> u2~u1:g   0.6140 0.26070  0.10301  1.12494 1.852e-02
#> u2~u2:g  -0.2044 0.09051 -0.38180 -0.02702 2.392e-02
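The function pred above supplies the stage 2 regressors as conditional expectations given the stage 1 data. Assuming that mu[,1] and var[1] hold the stage 1 conditional mean and variance of \(u_{1}\) (our reading of the code, not a statement taken from the package documentation), the squared term follows from the usual variance decomposition
\[\mathbb{E}(u_{1}^{2} \mid X, Z) = \{\mathbb{E}(u_{1} \mid X, Z)\}^{2} + \operatorname{Var}(u_{1} \mid X, Z)\]
so that the column labelled u2 in pred, which here stands for the squared latent variable and not for the response \(u_{2}\), equals mu^2 + var. The two interaction columns multiply the same quantities by the group indicator g.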

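A formal Wald test of the two interaction terms rejects the null hypothesis of no interaction (chisq = 28.36 on 2 degrees of freedom, p < 0.001). Note, however, that the stage 2 fit above reported lack of convergence and a near-singular covariance matrix, so this result should be interpreted with caution.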

summary(estimate(ee1,keep="(:g)", regex=TRUE))
#> Call: estimate.default(f = FALSE, contrast = contrast, vcov = vcov(object), 
#>     coef = p)
#> ────────────────────────────────────────────────────────────
#>         Estimate Std.Err    2.5%    97.5% P-value
#> u2~u1:g   0.6140 0.26070  0.1030  1.12494 0.01852
#> u2~u2:g  -0.2044 0.09051 -0.3818 -0.02702 0.02392
#> ────────────────────────────────────────────────────────────
#> Null Hypothesis: 
#>   [u2~u1:g] = 0
#>   [u2~u2:g] = 0 
#>  
#> chisq = 28.36, df = 2, p-value = 6.935e-07
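The joint test combines the estimates with their full covariance matrix, which is why the statistic cannot be recovered from the two standard errors alone. The following base R sketch shows the generic form of such a Wald test; the helper wald_test and its toy input are made up for illustration and do not reproduce lava's code path.

```r
## Generic Wald test of H0: theta = 0 (base R sketch, not lava's code path).
wald_test <- function(est, V) {
  stat <- drop(crossprod(est, solve(V, est)))  # est' V^{-1} est
  df <- length(est)
  c(chisq = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

## Toy input with a made-up (identity) covariance matrix
w <- wald_test(est = c(1, 2), V = diag(2))
w

## p-value implied by the reported statistic; with 2 degrees of freedom the
## chi-square upper tail probability equals exp(-chisq / 2)
p_reported <- pchisq(28.36, df = 2, lower.tail = FALSE)
p_reported
```

The last value is approximately 6.9e-07, in line with the printed p-value up to rounding of the test statistic.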

Mixture models

Lastly, we demonstrate how the distributional assumptions of the stage 1 model can be relaxed by letting the conditional distribution of the latent variable given covariates follow a Gaussian mixture distribution. The following code explicitly defines the parameter constraints of the model by setting the intercept of the first indicator variable, \(x_{1}\), to zero and the factor loading parameter of the same variable to one.

m1 <- baptize(m1)  ## Label all parameters
intercept(m1, ~x1+u1) <- list(0,NA) ## Set intercept of x1 to zero. Remove the label of u1
regression(m1,x1~u1) <- 1 ## Factor loading fixed to 1

The mixture model may then be estimated using the mixture method (note that this requires the mets package to be installed). Parameters that share a name across the mixture components are constrained to be identical in the mixture model. Thus, only the intercept of \(u_{1}\), whose label was removed above, is allowed to vary between the components.

set.seed(1)
em0 <- mixture(m1, k=2, data=d)

To decrease the risk of using a local maximizer of the likelihood we can rerun the estimation with different random starting values

em0 <- NULL
ll <- c()
for (i in 1:5) {
    set.seed(i)
    em <- mixture(m1, k=2, data=d, control=list(trace=0))
    ll <- c(ll,logLik(em))
    if (is.null(em0) || logLik(em0)<tail(ll,1))
        em0 <- em
}

summary(em0)
#> Cluster 1 (n=162, Prior=0.776):
#> --------------------------------------------------
#>                     Estimate Std. Error Z value Pr(>|z|)
#> Measurements:                                           
#>    x1~u1             1.000                              
#>    x2~u1             0.996    0.079     12.541    <1e-12
#>    x3~u1             1.063    0.084     12.605    <1e-12
#> Regressions:                                            
#>    u1~z              1.067    0.085     12.510    <1e-12
#> Intercepts:                                             
#>    x1                0.000                              
#>    x2                0.038    0.099      0.389  0.7     
#>    x3               -0.025    0.103     -0.247  0.81    
#>    u1                0.209    0.132      1.590  0.11    
#> Residual Variances:                                     
#>    x1                0.985    0.133      7.400          
#>    x2                0.972    0.132      7.387          
#>    x3                1.013    0.143      7.088          
#>    u1                0.290    0.111      2.610          
#> 
#> Cluster 2 (n=38, Prior=0.224):
#> --------------------------------------------------
#>                     Estimate Std. Error Z value Pr(>|z|)
#> Measurements:                                           
#>    x1~u1             1.000                              
#>    x2~u1             0.996    0.079     12.541    <1e-12
#>    x3~u1             1.063    0.084     12.605    <1e-12
#> Regressions:                                            
#>    u1~z              1.067    0.085     12.510    <1e-12
#> Intercepts:                                             
#>    x1                0.000                              
#>    x2                0.038    0.099      0.389  0.7     
#>    x3               -0.025    0.103     -0.247  0.81    
#>    u1               -1.443    0.259     -5.578  2.4e-08 
#> Residual Variances:                                     
#>    x1                0.985    0.133      7.400          
#>    x2                0.972    0.132      7.387          
#>    x3                1.013    0.143      7.088          
#>    u1                0.290    0.111      2.610          
#> --------------------------------------------------
#> AIC= 1959 
#> ||score||^2= 7.906e-06
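To see what the two components imply for the distribution of \(u_{1}\) given \(z\), the printed estimates can be combined by hand. The sketch below uses the rounded values from the summary and assumes, as the output suggests, that the components differ only in the intercept of u1 and share a common residual variance. The object names are illustrative.

```r
## Rounded values copied from summary(em0)
prior  <- c(0.776, 0.224)   # mixing proportions
mu     <- c(0.209, -1.443)  # component-specific intercepts of u1
sigma2 <- 0.290             # residual variance of u1, shared by both components

## Mean at z = 0 and conditional variance of the mixture (law of total variance)
mix_mean <- sum(prior * mu)
mix_var  <- sigma2 + sum(prior * (mu - mix_mean)^2)
c(mean = mix_mean, variance = mix_var)

## Mixture density of u1 at z = 0, evaluated on a grid
grid <- seq(-4, 3, by = 0.05)
dens <- prior[1] * dnorm(grid, mean = mu[1], sd = sqrt(sigma2)) +
        prior[2] * dnorm(grid, mean = mu[2], sd = sqrt(sigma2))
```

The implied mean is about -0.16 and the implied conditional variance about 0.76; plotting dens against grid shows the left-skewed shape that a single Gaussian cannot capture.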

Measured by AIC, there is a slight improvement in the model fit using the mixture model (1959 versus 1962 for the Gaussian model)

e0 <- estimate(m1,data=d)
AIC(e0,em0)
#>     df  AIC
#> e0  10 1962
#> em0 12 1959

The spline model may then be estimated as before with the two-stage method

em2 <- twostage(em0,m2,data=d)
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 2.8e-05). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
#> Warning in check_ic_mean_zero(icz): IC does not have mean zero (max |mean|/rms
#> = 2.8e-05). Using lava.options(check.ic = FALSE) disables the warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 2.8e-05). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
#> Warning in check_ic_mean_zero(ic_theta): IC does not have mean zero (max
#> |mean|/rms = 9.5e-06). Using lava.options(check.ic = FALSE) disables the
#> warning globally.
em2
#>                     Estimate Std. Error Z-value  P-value
#> Measurements:                                           
#>    y2~u2               0.978      0.035  28.237   <1e-12
#>    y3~u2               1.045      0.035  30.040   <1e-12
#> Regressions:                                            
#>    u2~z                1.029      0.223   4.607  4.1e-06
#>    u2~u1_1             2.804      0.655   4.278  1.9e-05
#>    u2~u1_2            -0.022      0.100  -0.225     0.82
#>    u2~u1_3            -0.173      0.289  -0.599     0.55
#>    u2~u1_4             0.387      0.340   1.138     0.26
#> Intercepts:                                             
#>    y2                 -0.122      0.109  -1.114     0.27
#>    y3                 -0.099      0.105  -0.936     0.35
#>    u2                  2.124      1.666   1.275      0.2
#> Residual Variances:                                     
#>    y1                  1.319      0.177   7.470         
#>    y2                  1.097      0.145   7.564         
#>    y3                  0.813      0.133   6.135         
#>    u2                  1.996      0.283   7.055

In this example the results are very similar to the Gaussian model:

plot(I(u2-z) ~ u1, data=d, col=Col("black",0.5), pch=16,
     xlab=expression(u[1]), ylab=expression(u[2]))

lines(Estimate ~ u1, data=as.data.frame(p), col="darkblue", lwd=5)
confband(p[,1], lower=p[,4], upper=p[,5], polygon=TRUE,
     border=NA, col=Col("darkblue",0.2))

pm <- cbind(u1=newd$u1,
        estimate(em2, f=function(p) predict(e2,p=p,newdata=newd))$coefmat)
lines(Estimate ~ u1, data=as.data.frame(pm), col="darkred", lwd=5)
confband(pm[,1], lower=pm[,4], upper=pm[,5], polygon=TRUE,
     border=NA, col=Col("darkred",0.2))
legend("bottomright", c("Gaussian","Mixture"),
       col=c("darkblue","darkred"), lwd=2, bty="n")

SessionInfo

sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/kkzh/.asdf/installs/r/4.5.3/lib/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Copenhagen
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] survival_3.8-6 lava_1.9.1    
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.3.2            sass_0.4.10            future_1.70.0         
#>  [4] generics_0.1.4         lattice_0.22-9         listenv_0.10.1        
#>  [7] digest_0.6.39          magrittr_2.0.4         evaluate_1.0.5        
#> [10] grid_4.5.3             mvtnorm_1.3-7          fastmap_1.2.0         
#> [13] jsonlite_2.0.0         Matrix_1.7-4           backports_1.5.0       
#> [16] graph_1.88.1           purrr_1.2.1            Rgraphviz_2.54.0      
#> [19] codetools_0.2-20       numDeriv_2016.8-1.1    jquerylib_0.1.4       
#> [22] cli_3.6.6              rlang_1.2.0            mets_1.3.10           
#> [25] parallelly_1.47.0      future.apply_1.20.2    splines_4.5.3         
#> [28] RcppArmadillo_15.2.6-1 geepack_1.3.13         cachem_1.1.0          
#> [31] yaml_2.3.12            otel_0.2.0             tools_4.5.3           
#> [34] parallel_4.5.3         dplyr_1.2.0            globals_0.19.1        
#> [37] BiocGenerics_0.56.0    broom_1.0.12           vctrs_0.7.1           
#> [40] R6_2.6.1               stats4_4.5.3           lifecycle_1.0.5       
#> [43] MASS_7.3-65            pkgconfig_2.0.3        timereg_2.0.7         
#> [46] progressr_0.19.0       bslib_0.10.0           pillar_1.11.1         
#> [49] glue_1.8.0             Rcpp_1.1.1-1.1         tidyselect_1.2.1      
#> [52] xfun_0.56              tibble_3.3.1           knitr_1.51            
#> [55] htmltools_0.5.9        rmarkdown_2.31         compiler_4.5.3

Non-linear latent variable models and error-in-variable models

Klaus Kähler Holst

2026-05-14
