```r
library(crossfit)
set.seed(1)
```
Many modern estimators (double / debiased ML, meta-learners, etc.) share the same pattern: fit one or more nuisance models, then plug their predictions into a target quantity.

If we fit the nuisances and evaluate the target on the same observations, we usually bias the result: the target is evaluated on data the nuisance models have already overfit.

Cross-fitting fixes this by training the nuisances and evaluating the target on disjoint folds of the data.

The crossfit package generalizes this logic to:

- multiple, possibly interdependent nuisances,
- flexible fold geometry (per-nuisance `train_fold`, per-target `eval_fold`),
- two modes: `"estimate"` (numeric target) and `"predict"` (cross-fitted predictor).

A nuisance is defined via `create_nuisance()`:
- `fit(data, ...)` → trains a model on (a subset of) the data,
- `predict(model, data, ...)` → returns predictions on (a subset of) the data,
- `train_fold` → how many folds the nuisance trains on,
- `fit_deps`, `pred_deps` → which other nuisances it depends on.

Example: regression \(m(x) = E[Y \mid X]\):
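A minimal sketch of such a nuisance, following the `create_nuisance()` pattern above (a simple linear fit; the quadratic nuisance later in this vignette uses the same structure):

```r
nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)
```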
The target is just a function of:

- `data` (the evaluation observations),
- the cross-fitted predictions of the nuisances it uses, passed as named arguments.

Example: cross-fitted mean squared error (MSE) of \(m(x)\):
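A minimal sketch, assuming the engine passes the held-out predictions of `nuis_y` as a numeric vector (matching the calling convention described just below):

```r
# nuis_y: cross-fitted predictions on the evaluation observations
target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}
```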
During cross-fitting, the engine will:

- fit `nuis_y` on its training folds,
- evaluate `nuis_y`'s `predict()` on held-out folds,
- call `target_mse(data_eval, nuis_y = predicted_values_on_eval)`.

You don't have to manage folds manually in the target.
A method bundles the target, its nuisances, and the cross-fitting schedule:

```r
mse_method <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,        # total number of folds K
  repeats = 3,      # how many times to re-draw fold splits
  eval_fold = 1,    # evaluation window width (in folds)
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)
```

Conceptually:
- `folds` and `repeats` define K-fold cross-fitting repeated R times,
- `eval_fold` tells how many folds to reserve for evaluating the target,
- `mode` controls whether we return a numeric estimate (`"estimate"`) or a prediction function (`"predict"`),
- `fold_allocation` controls how training windows are laid out across folds,
- `aggregate_panels` combines panel-wise results (within one repetition),
- `aggregate_repeats` combines repetition-wise results.

Let's walk through a full workflow on a toy regression problem.
We reuse the nuisance and target defined above (`nuis_y`, `target_mse`), and the method `mse_method`.
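Running the engine is then a single call (a sketch; `data` is assumed to be a data frame with columns `x` and `y`, simulated analogously to `data2` in the predict-mode example below):

```r
res <- crossfit(
  data = data,
  method = mse_method
)
res$estimates[[1]]
```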
The result of `crossfit()` is a list with elements:
- `estimates` – one entry per method (here only one),
- `per_method` – panel-wise and repetition-wise values and errors,
- `repeats_done` – how many repetitions successfully ran,
- `K`, `K_required`, `methods`, `plan` – extra diagnostics.

We can inspect the per-repetition values:
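For example (a sketch; `res` is the `crossfit()` result, and `str()` just shows its structure without assuming a particular layout of `per_method`):

```r
str(res$per_method, max.level = 2)
```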
Each element in `values` is the aggregated MSE over panels for that repetition.
In `"predict"` mode, the engine returns a prediction function instead of a numeric estimate. This is useful when you want the cross-fitted predictor itself, e.g. a pseudo-outcome, a CATE function, or an ensemble, rather than a scalar summary.
Here we build a cross-fitted ensemble predictor that averages a linear and a quadratic regression for \(E[Y \mid X]\).
We simulate a slightly nonlinear regression problem:
```r
n2 <- 300
x2 <- runif(n2, -2, 2)
y2 <- sin(x2) + rnorm(n2, sd = 0.3)
data2 <- data.frame(x = x2, y = y2)
```

Two nuisances:

- `nuis_lin`: linear regression,
- `nuis_quad`: quadratic regression via `poly(x, 2)`.

```r
nuis_lin <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)

nuis_quad <- create_nuisance(
  fit = function(data, ...) lm(y ~ poly(x, 2), data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)
```

Now define a target in predict mode that combines the two nuisance predictions into an ensemble prediction:
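A sketch of what such a target could look like, assuming that in `"predict"` mode the engine passes each nuisance as a cross-fitted predictor function taking a data frame (the exact calling convention is documented in `?create_method`):

```r
# average the two nuisance predictors into one ensemble predictor
target_ensemble <- function(data, m_lin, m_quad, ...) {
  function(newdata) {
    (m_lin(newdata) + m_quad(newdata)) / 2
  }
}
```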
We build a method in `"predict"` mode:

- `eval_fold = 0L` (no dedicated evaluation window),
- two nuisances, `m_lin` and `m_quad`.

```r
m_ens <- create_method(
  target = target_ensemble,
  list_nuisance = list(
    m_lin = nuis_lin,
    m_quad = nuis_quad
  ),
  folds = 4,
  repeats = 3,
  eval_fold = 0,  # no eval window in predict mode
  mode = "predict",
  fold_allocation = "independence"
)
```

Run cross-fitting in predict mode, using `mean_predictor()` to aggregate panel-level and repetition-level predictors:
```r
res_pred <- crossfit_multi(
  data = data2,
  methods = list(ensemble = m_ens),
  aggregate_panels = mean_predictor,
  aggregate_repeats = mean_predictor
)

# estimates$ensemble is now a prediction function
f_hat <- res_pred$estimates$ensemble
newdata <- data.frame(x = seq(-2, 2, length.out = 7))
cbind(x = newdata$x, y_hat = f_hat(newdata))
```

Here:
- the engine cross-fits `m_lin`, `m_quad` and the ensemble `target_ensemble`,
- `mean_predictor()` aggregates predictors over panels and repetitions,
- `f_hat(newdata)` gives cross-fitted ensemble predictions on new data.

This is the typical pattern in `"predict"` mode: your target combines one or several nuisance predictors into a derived predictor (pseudo-outcome, CATE, ensemble, …), and the engine returns a cross-fitted version of that predictor.
The `fold_allocation` argument controls how training blocks are placed relative to the evaluation window. For each method:

- `eval_fold` folds are reserved for evaluating the target,
- each nuisance occupies a training block of `train_fold` width,
- `fold_allocation` decides how the training blocks for nuisances occupy the K folds.

The engine supports three strategies:
- `"independence"`,
- `"overlap"` (in `"estimate"` mode),
- `"disjoint"` (a stricter variant of `independence` and `overlap`).

You choose the strategy per method:
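For example (a sketch; all arguments except `fold_allocation` are as in `mse_method` above):

```r
m_disjoint <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "disjoint"  # instead of "independence"
)
```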
By default, fold assignments are drawn at random by the engine.

You can override this in `crossfit()` or `crossfit_multi()` if you need, for example, grouped, stratified, or time-aware folds.

Example: simple grouped folds by an integer id:
```r
# toy group variable
group_id <- sample(1:10, size = nrow(data), replace = TRUE)

fold_split_grouped <- function(data, K) {
  # assign folds at group level, then expand to rows
  groups <- unique(group_id)
  gfolds <- sample(rep_len(1:K, length(groups)))
  g2f <- setNames(gfolds, groups)
  # index by name, not position: group labels need not equal
  # their position in unique(group_id)
  unname(g2f[as.character(group_id)])
}

res_grouped <- crossfit(
  data = data,
  method = mse_method,
  fold_split = fold_split_grouped
)
res_grouped$estimates[[1]]
```

The only requirement is that `fold_split(data, K)` returns a vector of length `nrow(data)` with integer labels in {1, …, K}, and that all folds are non-empty.
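A quick sanity check that a custom splitter meets these requirements (a sketch, using `fold_split_grouped` from above):

```r
f <- fold_split_grouped(data, K = 4)
stopifnot(
  length(f) == nrow(data),
  all(f %in% 1:4),
  all(tabulate(f, nbins = 4) > 0)  # all folds non-empty
)
```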
You can plug in any aggregation you like: `aggregate_panels` and `aggregate_repeats` each receive a list of results and must return a single combined result.

For example, a simple trimmed mean over panels:
```r
trimmed_mean_estimate <- function(xs, trim = 0.1) {
  x <- unlist(xs)
  mean(x, trim = trim)
}

m_trim <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 5,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = trimmed_mean_estimate,
  aggregate_repeats = trimmed_mean_estimate
)

res_trim <- crossfit(data, m_trim)
res_trim$estimates[[1]]
```

Use `?crossfit`, `?crossfit_multi`, `?create_method`, `?create_nuisance` for a detailed argument reference.
Explore the `per_method` and `plan` components in the result if you need to debug failed repetitions or inspect the fold geometry the engine actually used.
crossfit is meant to be a small, flexible engine: you
define the nuisances and targets; it takes care of the cross-fitting
schedule, reuse of models, and basic safety checks (cycles, coverage of
dependencies, fold geometry).
If you encounter edge cases or have ideas for higher-level helpers (e.g., ready-made DML ATE wrappers), they can be built conveniently on top of this core.