Introduction to crossfit

library(crossfit)
set.seed(1)

1. Motivation: why another cross-fitting engine?

Many modern estimators (double / debiased machine learning, meta-learners, etc.) share the same pattern: first fit one or more nuisance models, then evaluate a target quantity that depends on their predictions.

If we fit the nuisances and evaluate the target on the same observations, we usually overfit: the target inherits the in-sample optimism of the nuisance fits, which biases the estimate and invalidates standard inference.

Cross-fitting fixes this by:

  1. splitting the data into \(K\) folds,
  2. fitting nuisance models on training folds,
  3. evaluating the target on held-out folds where the nuisances were not trained.
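
For intuition, here is what steps 1–3 look like when hand-rolled in base R for a simple MSE example; the package automates and generalizes exactly this schedule.

# Hand-rolled K-fold cross-fitting of an MSE (illustration only)
d <- data.frame(x = rnorm(100))
d$y <- d$x + rnorm(100)

K <- 4
fold <- sample(rep_len(1:K, nrow(d)))             # 1. split into K folds
mse_k <- vapply(1:K, function(k) {
  fit  <- lm(y ~ x, data = d[fold != k, ])        # 2. fit on training folds
  pred <- predict(fit, newdata = d[fold == k, ])  # 3. evaluate held out
  mean((d$y[fold == k] - pred)^2)
}, numeric(1))
mean(mse_k)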

The crossfit package generalizes this logic to arbitrary user-defined nuisances and targets, to multiple methods that share fitted nuisance models, to both estimation ("estimate") and prediction ("predict") modes, and to configurable fold geometries and allocation strategies.

2. Basic concepts

2.1 Nuisances

A nuisance is defined via create_nuisance(), which bundles a fit function, a predict function, and the number of consecutive folds to train on (train_fold).

Example: regression \(m(x) = E[Y \mid X]\):

nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2  # this nuisance will train on 2 consecutive folds
)
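
Nothing ties create_nuisance() to linear models; any fit/predict pair with this interface should work. For example, a loess smoother (a sketch following the same pattern):

nuis_smooth <- create_nuisance(
  fit = function(data, ...) loess(y ~ x, data = data),
  predict = function(model, data, ...) {
    # note: predict.loess returns NA outside the training x-range
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)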

2.2 Target

The target is just a function of the evaluation data and the (cross-fitted) nuisance predictions, passed by name.

Example: cross-fitted mean squared error (MSE) of \(m(x)\):

target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}

During cross-fitting, the engine fits each nuisance on its training folds, computes its predictions on the held-out evaluation data, and calls the target with that data and those predictions (here, nuis_y is the vector of held-out predictions). You don't have to manage folds manually in the target.

2.3 Methods

A method bundles a target, its nuisances, the fold geometry (folds, repeats, eval_fold), the mode, the fold-allocation strategy, and the aggregation functions:

mse_method <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4, # total number of folds K
  repeats = 3, # how many times to re-draw fold splits
  eval_fold = 1, # evaluation window width (in folds)
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

Conceptually: within each repetition, the data are split into K folds; each admissible position of the evaluation window defines a panel, the nuisances are trained on folds outside the window, and the target is evaluated inside it. Panel-level results are combined with aggregate_panels, and per-repetition results with aggregate_repeats.

3. A simple regression example

Let’s walk through a full workflow on a toy regression problem.

n <- 200
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)

We reuse the nuisance and target defined above (nuis_y, target_mse), and the method mse_method.

3.1 Single-method cross-fitting with crossfit()

res <- crossfit(data, mse_method)

str(res$estimates)
res$estimates[[1]]

The result is a list with elements estimates (the final aggregated value(s)) and per_method (per-repetition details for each method).

We can inspect the per-repetition values:

res$per_method$method$values

Each element in values is the aggregated MSE over panels for that repetition.
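
Since each repetition draws an independent fold split, the spread across repetitions gives a rough sense of split-to-split variability (assuming, as above, that values is a list of numeric scalars):

vals <- unlist(res$per_method$method$values)
c(mean = mean(vals), sd = sd(vals))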

4. Multiple methods and shared nuisances

Very often, you want to compare several targets or configurations that share the same nuisance models. crossfit_multi() is built for that.

Here we estimate simultaneously the cross-fitted MSE (target_mse above) and the mean of the cross-fitted predictions, via a second target:

target_mean <- function(data, nuis_y, ...) {
  mean(nuis_y)
}

m_mse <- create_method(
  target        = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds         = 4,
  repeats       = 3,
  eval_fold     = 1,
  mode          = "estimate",
  fold_allocation   = "independence",
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

m_mean <- create_method(
  target        = target_mean,
  list_nuisance = list(nuis_y = nuis_y),
  folds         = 4,
  repeats       = 3,
  eval_fold     = 1,
  mode          = "estimate",
  fold_allocation   = "overlap",
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi <- crossfit_multi(
  data    = data,
  methods = list(mse = m_mse, mean = m_mean),
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi$estimates

The two methods share fitted nuisances whenever their structure and training folds coincide (internally via structural signatures and caching), which can significantly reduce computation when you have many methods.

5. Predict mode: build a cross-fitted ensemble predictor

In "predict" mode, the engine returns a prediction function instead of a numeric estimate. This is useful when you want:

Here we build a cross-fitted ensemble predictor that averages a linear and a quadratic regression for \(E[Y \mid X]\).

We simulate a slightly nonlinear regression problem:

n2 <- 300
x2 <- runif(n2, -2, 2)
y2 <- sin(x2) + rnorm(n2, sd = 0.3)
data2 <- data.frame(x = x2, y = y2)

Two nuisances:

nuis_lin <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)

nuis_quad <- create_nuisance(
  fit = function(data, ...) lm(y ~ poly(x, 2), data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)

Now define a target in predict mode that combines the two nuisance predictions into an ensemble prediction:

target_ensemble <- function(data, m_lin, m_quad, ...) {
  0.5 * m_lin + 0.5 * m_quad
}

We build a method in "predict" mode:

m_ens <- create_method(
  target        = target_ensemble,
  list_nuisance = list(
    m_lin  = nuis_lin,
    m_quad = nuis_quad
  ),
  folds         = 4,
  repeats       = 3,
  eval_fold     = 0, # no eval window in predict mode
  mode          = "predict",
  fold_allocation   = "independence"
)

Run cross-fitting in predict mode, using mean_predictor() to aggregate panel-level and repetition-level predictors:

res_pred <- crossfit_multi(
  data = data2,
  methods = list(ensemble = m_ens),
  aggregate_panels = mean_predictor,
  aggregate_repeats = mean_predictor
)

# estimates$ensemble is now a prediction function
f_hat <- res_pred$estimates$ensemble

newdata <- data.frame(x = seq(-2, 2, length.out = 7))
cbind(x = newdata$x, y_hat = f_hat(newdata))

Here, f_hat is a genuine prediction function: evaluated on new data, it returns the ensemble prediction averaged over panels and repetitions.
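
As a quick visual check (assuming, as above, that f_hat accepts a data frame with an x column):

grid <- data.frame(x = seq(-2, 2, length.out = 101))
plot(grid$x, f_hat(grid), type = "l", xlab = "x", ylab = "prediction")
lines(grid$x, sin(grid$x), lty = 2)  # true E[Y | X = x]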

This is the typical pattern in "predict" mode: your target combines one or several nuisance predictors into a derived predictor (pseudo-outcome, CATE, ensemble, …), and the engine returns a cross-fitted version of that predictor.
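
For instance, a doubly robust (AIPW-style) pseudo-outcome target could look like the following sketch; the nuisance names m1, m0, e and the columns a, y are purely illustrative assumptions, not objects defined above:

# Sketch: pseudo-outcome target for a binary treatment `a` and outcome `y`,
# with hypothetical nuisances m1 = E[Y | X, A = 1], m0 = E[Y | X, A = 0],
# and e = P(A = 1 | X)
target_pseudo <- function(data, m1, m0, e, ...) {
  m1 - m0 +
    data$a * (data$y - m1) / e -
    (1 - data$a) * (data$y - m0) / (1 - e)
}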

6. Fold allocation strategies

The fold_allocation argument controls how training blocks are placed relative to the evaluation window.

For each method, the K folds are arranged into an evaluation window of width eval_fold and training blocks of width train_fold; the strategy determines where the training blocks may sit relative to the window.

The engine supports three strategies; the examples in this vignette use "independence" and "overlap".

You choose the strategy per method:

mse_overlap <- create_method(
  target        = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds         = 4,
  repeats       = 3,
  eval_fold     = 1,
  mode          = "estimate",
  fold_allocation   = "overlap",
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)
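
To see the effect in practice, you can run the same target under both allocations and compare (using objects defined earlier):

res_ind <- crossfit(data, mse_method)   # "independence" allocation
res_ovl <- crossfit(data, mse_overlap)  # "overlap" allocation
c(independence = res_ind$estimates[[1]],
  overlap      = res_ovl$estimates[[1]])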

7. Customization

7.1 Custom fold splitting

By default, fold assignments are a random permutation of balanced fold labels:

fold_split = function(data, K) sample(rep_len(1:K, nrow(data)))

You can override this in crossfit() or crossfit_multi() if you need, for example, grouped folds for clustered data (as below) or any other custom assignment scheme.

Example: simple grouped folds by an integer id:

# toy group variable
group_id <- sample(1:10, size = nrow(data), replace = TRUE)

fold_split_grouped <- function(data, K) {
  # assign folds at group level, then expand to rows
  groups <- unique(group_id)
  gfolds <- sample(rep_len(1:K, length(groups)))
  g2f    <- setNames(gfolds, groups)
  # index by name, not by position: group ids need not equal their
  # positions in `groups`
  unname(g2f[as.character(group_id)])
}

res_grouped <- crossfit(
  data = data,
  method = mse_method,
  fold_split = fold_split_grouped
)

res_grouped$estimates[[1]]

The only requirement is that fold_split(data, K) returns a vector of length nrow(data) with integer labels in {1, …, K}, and that all folds are non-empty.
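
A quick sanity check for a custom fold_split, based directly on those requirements (a sketch):

check_fold_split <- function(fold_split, data, K) {
  f <- fold_split(data, K)
  stopifnot(
    length(f) == nrow(data),   # one label per row
    all(f %in% seq_len(K)),    # labels in {1, ..., K}
    all(seq_len(K) %in% f)     # every fold non-empty
  )
  invisible(f)
}
check_fold_split(fold_split_grouped, data, K = 4)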

7.2 Aggregation functions

You can plug in any aggregation you like: aggregate_panels combines the per-panel values within one repetition, and aggregate_repeats combines the per-repetition results. In "estimate" mode, each receives a list of numeric values and returns an aggregate.

For example, a simple trimmed mean over panels:

trimmed_mean_estimate <- function(xs, trim = 0.1) {
  x <- unlist(xs)
  mean(x, trim = trim)
}

m_trim <- create_method(
  target        = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds         = 4,
  repeats       = 5,
  eval_fold     = 1L,
  mode          = "estimate",
  fold_allocation   = "independence",
  aggregate_panels  = trimmed_mean_estimate,
  aggregate_repeats = trimmed_mean_estimate
)

res_trim <- crossfit(data, m_trim)
res_trim$estimates[[1]]
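
In "predict" mode the analogous hooks combine prediction functions instead. Assuming, as the name mean_predictor suggests, that the aggregators receive a list of predictors, a pointwise median combiner might look like this sketch:

median_predictor <- function(fs) {
  # fs: a list of prediction functions newdata -> numeric vector (assumed)
  function(newdata) {
    preds <- vapply(fs, function(f) f(newdata), numeric(nrow(newdata)))
    apply(preds, 1, median)  # pointwise median across predictors
  }
}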

8. Where to go next

crossfit is meant to be a small, flexible engine: you define the nuisances and targets; it takes care of the cross-fitting schedule, reuse of models, and basic safety checks (cycles, coverage of dependencies, fold geometry).

If you encounter edge cases or have ideas for higher-level helpers (e.g., ready-made DML ATE wrappers), they can be built conveniently on top of this core.