Getting Started with np

This vignette is the smallest useful introduction to np that ships with the package. It focuses on one clean workflow that users can run right after installation: choose a bandwidth, fit a model, inspect the result, and plot it.

Broader worked examples, package comparisons, and method-specific articles are better carried by the gallery site.

The basic workflow

In np, the bandwidth object is often the key object in the analysis. A typical workflow has three steps:

  1. compute or inspect a bandwidth object,
  2. fit the model,
  3. summarize or plot the result.

A simple regression example

library(np)
data(cps71, package = "np")

bw <- npregbw(logwage ~ age, data = cps71)
summary(bw)
#> 
#> Regression Data (205 observations, 1 variable(s)):
#> 
#> Regression Type: Local-Constant
#> Bandwidth Selection Method: Least Squares Cross-Validation
#> Formula: logwage ~ age
#> Bandwidth Type: Fixed
#> Objective Function Value: 0.316055 (achieved on multistart 1)
#> Number of Function Evaluations: 47 (fast = 17)
#> 
#> Exp. Var. Name: age Bandwidth: 1.892158  Scale Factor: 0.4487743
#> 
#> Continuous Kernel Type: Second-Order Gaussian
#> No. Continuous Explanatory Vars.: 1
#> Estimation Time: 0.075 seconds

fit <- npreg(bws = bw)
summary(fit)
#> 
#> Regression Data: 205 training points, in 1 variable(s)
#>                    age
#> Bandwidth(s): 1.892158
#> 
#> Kernel Regression Estimator: Local-Constant
#> Bandwidth Type: Fixed
#> Residual standard error: 0.5307943
#> R-squared: 0.3108675
#> 
#> Continuous Kernel Type: Second-Order Gaussian
#> No. Continuous Explanatory Vars.: 1
#> Estimation Time: 0.075 seconds (optim 0.075s, fit 0s)
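
To make the role of the bandwidth concrete, here is a minimal base-R sketch of a local-constant estimator with a second-order Gaussian kernel and a fixed bandwidth, the estimator type reported in the summary above. The helper nw_fit and the simulated data are assumptions made for illustration only; this is not np's implementation, which is far more general.

```r
# Local-constant (Nadaraya-Watson) estimate with a Gaussian kernel
# and a fixed bandwidth h. Hypothetical helper, not part of np.
nw_fit <- function(x, y, h, at = x) {
  sapply(at, function(x0) {
    w <- dnorm((x - x0) / h)  # Gaussian kernel weights centred at x0
    sum(w * y) / sum(w)       # locally weighted average of y
  })
}

# Simulated data (an assumption, used so the sketch runs without np)
set.seed(1)
x <- sort(runif(100, 20, 65))
y <- sin(x / 10) + rnorm(100, sd = 0.2)

smooth_small <- nw_fit(x, y, h = 1)    # small h: follows local structure
smooth_large <- nw_fit(x, y, h = 1e6)  # huge h: collapses towards mean(y)
```

With a very large bandwidth every observation receives nearly the same weight, so the fit is essentially the sample mean of y. This is why bandwidth selection, rather than the fitting step itself, tends to drive the result.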

Plotting the fitted relationship

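# Scatter of the raw data, then the fitted curve drawn in order of age
plot(cps71$age, cps71$logwage, cex = 0.25, col = "grey")
ord <- order(cps71$age)
lines(cps71$age[ord], fitted(fit)[ord], col = 2, lwd = 2)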

Mixed data

One important feature of np is that it handles mixed data directly. Variable class matters: unordered categorical variables should be factors, and ordered categorical variables should be ordered factors when appropriate.

set.seed(42)
mydat <- data.frame(
  y = rnorm(200),
  x_cont = runif(200),
  x_unordered = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  x_ordered = ordered(sample(1:4, 200, replace = TRUE))
)

bw_mixed <- npregbw(y ~ x_cont + x_unordered + x_ordered, data = mydat)
fit_mixed <- npreg(bws = bw_mixed)
summary(fit_mixed)
#> 
#> Regression Data: 200 training points, in 3 variable(s)
#> Search Parameter(s):
#>          x_cont  x_unordered  x_ordered
#> Type  Bandwidth       Lambda     Lambda
#> Value   1718322    0.6636654  0.9981613
#> Max          --    0.6666667          1
#> 
#> Kernel Regression Estimator: Local-Constant
#> Bandwidth Type: Fixed
#> Residual standard error: 0.9721457
#> R-squared: 0
#> 
#> Continuous Kernel Type: Second-Order Gaussian
#> No. Continuous Explanatory Vars.: 1
#> 
#> Unordered Categorical Kernel Type: Aitchison and Aitken
#> No. Unordered Categorical Explanatory Vars.: 1
#> 
#> Ordered Categorical Kernel Type: Li and Racine
#> No. Ordered Categorical Explanatory Vars.: 1
#> Estimation Time: 0.18 seconds (optim 0.179s, fit 0.001s)

Because y is simulated independently of the regressors, cross-validation selects a very large bandwidth for x_cont and lambdas at or near their upper bounds for the two categorical variables. This effectively smooths all three regressors out of the fit, so the R-squared of 0 is the expected outcome here and not a sign of a problem.

Data preparation matters

In np, the formula interface tells the function which variable is the response and which are the regressors. It does not impose an ordinary linear-additive model.

It is also important not to pass blocks of 0/1 dummies as if this were a standard linear-model workflow. If the underlying variable is categorical, it is usually better to keep it as a single factor or ordered factor.
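
As a minimal sketch of that preparation step, the base-R code below collapses a block of 0/1 dummies into one unordered factor and declares integer codes as an ordered factor. The data frame raw and its column names are hypothetical and exist only for illustration.

```r
# Hypothetical raw data: region stored as 0/1 dummies,
# education stored as integer codes
raw <- data.frame(
  region_north = c(1, 0, 0, 1),
  region_south = c(0, 1, 0, 0),
  region_west  = c(0, 0, 1, 0),
  educ_code    = c(2L, 1L, 3L, 2L)
)

# Collapse the dummy block into a single unordered factor:
# max.col() returns, for each row, the column that holds the 1
dummies <- as.matrix(raw[, c("region_north", "region_south", "region_west")])
region <- factor(c("north", "south", "west")[max.col(dummies, ties.method = "first")])

# Declare the integer codes as an ordered factor with explicit levels
educ <- factor(raw$educ_code, levels = 1:3,
               labels = c("low", "mid", "high"), ordered = TRUE)

clean <- data.frame(region = region, educ = educ)
str(clean)
```

A data frame prepared this way can be passed to the formula interface with region and educ as single regressors, so each is treated according to its class.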

Other common starting points

This vignette keeps the package-side introduction intentionally narrow. Other common starting points are better covered by the help pages and website articles than by a single shipped vignette.

Where to go next