--- title: "Introduction to crossfit" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to crossfit} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` library(crossfit) set.seed(1) # 1. Motivation: why another cross-fitting engine? Many modern estimators (**double / debiased ML**, meta-learners, etc. ) share the same pattern: - We care about a **low-dimensional target** (ATE, regression risk, a parameter, …) - This target depends on one or several **high-dimensional nuisance functions**\ $m(x) = E[Y \mid X]$, propensity scores, conditional means, … If we fit the nuisances and evaluate the target on the **same observations**, we usually: - overfit the nuisances, - and introduce bias in the target that doesn’t vanish nicely. **Cross-fitting** fixes this by: 1. splitting the data into $K$ folds, 2. fitting nuisance models on **training folds**, 3. evaluating the target on **held-out folds** where the nuisances were *not* trained. The `crossfit` package generalizes this logic to: - an arbitrary **DAG of nuisances**, - multiple methods in parallel, - flexible fold geometry (per-node `train_fold`, per-target `eval_fold`), - two modes: `"estimate"` (numeric target) and `"predict"` (cross-fitted predictor). # 2. Basic concepts ## 2.1 Nuisances A **nuisance** is defined via `create_nuisance()`: - `fit(data, ...)` → trains a model on (a subset of) the data, - `predict(model, data, ...)` → returns predictions on (a subset of) the data, - `train_fold` → how many folds the nuisance trains on, - optional `fit_deps`, `pred_deps` → which other nuisances it depends on. Example: regression $m(x) = E[Y \mid X]$: ``` r nuis_y <- create_nuisance( fit = function(data, ...) lm(y ~ x, data = data), predict = function(model, data, ...) { as.numeric(predict(model, newdata = data)) }, train_fold = 2 # this nuisance will train on 2 consecutive folds ) ``` ## 2.2 Target The **target** is just a function of: - `data`, - and some nuisance outputs (passed as arguments). Example: cross-fitted mean squared error (MSE) of $m(x)$: ``` r target_mse <- function(data, nuis_y, ...) { mean((data$y - nuis_y)^2) } ``` During cross-fitting, the engine will: - call `nuis_y`’s `predict()` on held-out folds, - then call `target_mse(data_eval, nuis_y = predicted_values_on_eval)`. You don’t have to manage folds manually in the target. ## 2.3 Methods A **method** bundles: - a `target`, - a list of nuisances, - cross-fitting configuration: ``` r mse_method <- create_method( target = target_mse, list_nuisance = list(nuis_y = nuis_y), folds = 4, # total number of folds K repeats = 3, # how many times to re-draw fold splits eval_fold = 1, # evaluation window width (in folds) mode = "estimate", fold_allocation = "independence", aggregate_panels = mean_estimate, aggregate_repeats = mean_estimate ) ``` Conceptually: - `folds` and `repeats` define **K-fold cross-fitting repeated R times**, - `eval_fold` tells how many folds to reserve for evaluating the target, - `mode` controls whether we return a **numeric estimate** (`"estimate"`) or a **prediction function** (`"predict"`), - `fold_allocation` controls how training windows are laid out across folds, - `aggregate_panels` combines panel-wise results (within one repetition), - `aggregate_repeats` combines repetition-wise results. # 3. A simple regression example Let’s walk through a full workflow on a toy regression problem. 
``` r
n <- 200
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)
```

We reuse the nuisance and target defined above (`nuis_y`, `target_mse`), and the method `mse_method`.

## 3.1 Single-method cross-fitting with `crossfit()`

``` r
res <- crossfit(data, mse_method)

str(res$estimates)
res$estimates[[1]]
```

The result is a list with elements:

- `estimates` – one entry per method (here only one),
- `per_method` – panel-wise and repetition-wise values and errors,
- `repeats_done` – how many repetitions successfully ran,
- `K`, `K_required`, `methods`, `plan` – extra diagnostics.

We can inspect the per-repetition values:

``` r
res$per_method$method$values
```

Each element in `values` is the aggregated MSE over panels for that repetition.

# 4. Multiple methods and shared nuisances

Very often, you want to compare several targets or configurations that share the **same nuisance models**. `crossfit_multi()` is built for that.

Here we estimate simultaneously:

- the cross-fitted MSE of $m(x)$,
- the cross-fitted mean of $m(x)$.

``` r
target_mean <- function(data, nuis_y, ...) {
  mean(nuis_y)
}

m_mse <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

m_mean <- create_method(
  target = target_mean,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "overlap",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi <- crossfit_multi(
  data = data,
  methods = list(mse = m_mse, mean = m_mean),
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi$estimates
```

The two methods share fitted nuisances whenever their **structure** and **training folds** coincide (internally via structural signatures and caching), which can significantly reduce computation when you have many methods.

# 5. Predict mode: build a cross-fitted ensemble predictor

In `"predict"` mode, the engine returns a **prediction function** instead of a numeric estimate. This is useful when you want:

- a cross-fitted regression / classifier you can re-use on new data,
- possibly built from an **ensemble** of several nuisance models.

Here we build a cross-fitted **ensemble predictor** that averages a linear and a quadratic regression for $E[Y \mid X]$.

We simulate a slightly nonlinear regression problem:

``` r
n2 <- 300
x2 <- runif(n2, -2, 2)
y2 <- sin(x2) + rnorm(n2, sd = 0.3)
data2 <- data.frame(x = x2, y = y2)
```

Two nuisances:

- `nuis_lin`: linear regression,
- `nuis_quad`: quadratic regression via `poly(x, 2)`.

``` r
nuis_lin <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)

nuis_quad <- create_nuisance(
  fit = function(data, ...) lm(y ~ poly(x, 2), data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)
```

Now define a **target in predict mode** that combines the two nuisance predictions into an ensemble prediction:

``` r
target_ensemble <- function(data, m_lin, m_quad, ...) {
  0.5 * m_lin + 0.5 * m_quad
}
```

We build a method in `"predict"` mode:

- `eval_fold = 0L` (no dedicated evaluation window),
- the target depends on `m_lin` and `m_quad`,
- results will be aggregated into a single prediction function.
``` r
m_ens <- create_method(
  target = target_ensemble,
  list_nuisance = list(
    m_lin = nuis_lin,
    m_quad = nuis_quad
  ),
  folds = 4,
  repeats = 3,
  eval_fold = 0, # no eval window in predict mode
  mode = "predict",
  fold_allocation = "independence"
)
```

Run cross-fitting in predict mode, using `mean_predictor()` to aggregate panel-level and repetition-level predictors:

``` r
res_pred <- crossfit_multi(
  data = data2,
  methods = list(ensemble = m_ens),
  aggregate_panels = mean_predictor,
  aggregate_repeats = mean_predictor
)

# estimates$ensemble is now a prediction function
f_hat <- res_pred$estimates$ensemble

newdata <- data.frame(x = seq(-2, 2, length.out = 7))
cbind(x = newdata$x, y_hat = f_hat(newdata))
```

Here:

- Each repetition builds cross-fitted predictors for `m_lin`, `m_quad` and the ensemble `target_ensemble`.
- `mean_predictor()` aggregates predictors over panels and repetitions.
- `f_hat(newdata)` gives cross-fitted ensemble predictions on new data.

This is the typical pattern in `"predict"` mode: your `target` combines one or several nuisance predictors into a **derived predictor** (pseudo-outcome, CATE, ensemble, …), and the engine returns a cross-fitted version of that predictor.

# 6. Fold allocation strategies

The `fold_allocation` argument controls how training blocks are placed relative to the evaluation window. For each method:

- `eval_fold` folds are reserved for evaluating the target,
- each nuisance has a `train_fold` width,
- `fold_allocation` decides how the training blocks for the nuisances occupy the $K$ folds.

The engine supports three strategies:

- `"independence"`
  - Each *instance* (possibly duplicated by context) gets its own **disjoint** training window after the eval window.
  - Strongest notion of out-of-sample independence for all nodes.
- `"overlap"`
  - All non-target nuisances **share** the same training window starting after the eval window.
  - Training data for different nuisances may overlap, but they still avoid the eval folds in `"estimate"` mode.
- `"disjoint"`
  - Unique nuisances (by name) each get one disjoint training window after the eval window, without duplicating instances by context.
  - Intermediate between `independence` and `overlap`.

You choose the strategy per method:

``` r
mse_overlap <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "overlap",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)
```

# 7. Customization

## 7.1 Custom fold splitting

By default, fold assignments are:

``` r
fold_split = function(data, K) sample(rep_len(1:K, nrow(data)))
```

You can override this in `crossfit()` or `crossfit_multi()` if you need:

- stratification,
- time-series blocks,
- grouped folds, etc.

Example: simple grouped folds by an integer `id`:

``` r
# toy group variable
group_id <- sample(1:10, size = nrow(data), replace = TRUE)

fold_split_grouped <- function(data, K) {
  # assign folds at group level, then expand to rows
  groups <- unique(group_id)
  gfolds <- sample(rep_len(1:K, length(groups)))
  g2f <- setNames(gfolds, groups)
  unname(g2f[as.character(group_id)]) # look up by group name, not by position
}

res_grouped <- crossfit(
  data = data,
  method = mse_method,
  fold_split = fold_split_grouped
)

res_grouped$estimates[[1]]
```

The only requirement is that `fold_split(data, K)` returns a vector of length `nrow(data)` with integer labels in `{1, …, K}`, and that all folds are non-empty.
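Another case from the list above is time-series data, where folds should respect temporal ordering. Below is a minimal sketch, assuming the rows of `data` are already sorted in time; the function `fold_split_blocks` is illustrative and not part of the package.

``` r
fold_split_blocks <- function(data, K) {
  # contiguous blocks: fold 1 = earliest rows, ..., fold K = latest rows;
  # all folds are non-empty as long as nrow(data) >= K
  sort(rep_len(1:K, nrow(data)))
}

# plugged in the same way as any other fold function
# (on the toy i.i.d. data this only illustrates the plumbing)
res_blocks <- crossfit(
  data = data,
  method = mse_method,
  fold_split = fold_split_blocks
)
```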
## 7.2 Aggregation functions

You can plug in any aggregation you like:

- for numeric estimates: trimmed means, medians, robust summaries,
- for predictors: custom ensembles, stacking, etc.

For example, a simple **trimmed mean** over panels:

``` r
trimmed_mean_estimate <- function(xs, trim = 0.1) {
  x <- unlist(xs)
  mean(x, trim = trim)
}

m_trim <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 5,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = trimmed_mean_estimate,
  aggregate_repeats = trimmed_mean_estimate
)

res_trim <- crossfit(data, m_trim)
res_trim$estimates[[1]]
```

# 8. Where to go next

- Use `?crossfit`, `?crossfit_multi`, `?create_method`, `?create_nuisance` for a detailed argument reference.
- Explore the `per_method` and `plan` components of the result if you need to:
  - debug dependency graphs,
  - inspect allocated folds,
  - or introspect which nuisances are used where.

`crossfit` is meant to be a small, flexible engine: you define the nuisances and targets; it takes care of the cross-fitting schedule, reuse of models, and basic safety checks (cycles, coverage of dependencies, fold geometry). If you encounter edge cases or have ideas for higher-level helpers (e.g., ready-made DML ATE wrappers), they can be built conveniently on top of this core.
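To make that last remark concrete, here is a rough sketch of how a DML / AIPW estimate of the ATE could be assembled from the pieces introduced in this vignette. Nothing in it is shipped with the package: the simulated data layout (outcome `y`, binary treatment `d`, covariate `x`), the `lm()`/`glm()` learners, and the fold settings are all illustrative assumptions.

``` r
# Simulated observational data (illustrative only; true ATE = 1)
n3 <- 500
x3 <- rnorm(n3)
d3 <- rbinom(n3, 1, plogis(0.5 * x3))
y3 <- d3 + x3 + rnorm(n3)
data3 <- data.frame(y = y3, d = d3, x = x3)

# Outcome regressions E[Y | X, D = 1] and E[Y | X, D = 0]
nuis_m1 <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data[data$d == 1, ]),
  predict = function(model, data, ...) as.numeric(predict(model, newdata = data)),
  train_fold = 2
)
nuis_m0 <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data[data$d == 0, ]),
  predict = function(model, data, ...) as.numeric(predict(model, newdata = data)),
  train_fold = 2
)

# Propensity score P(D = 1 | X)
nuis_ps <- create_nuisance(
  fit = function(data, ...) glm(d ~ x, family = binomial(), data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data, type = "response"))
  },
  train_fold = 2
)

# AIPW / DML score for the ATE, averaged over the evaluation folds
target_ate <- function(data, m1, m0, ps, ...) {
  mean(m1 - m0 +
         data$d * (data$y - m1) / ps -
         (1 - data$d) * (data$y - m0) / (1 - ps))
}

ate_method <- create_method(
  target = target_ate,
  list_nuisance = list(m1 = nuis_m1, m0 = nuis_m0, ps = nuis_ps),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "overlap",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

res_ate <- crossfit(data3, ate_method)
res_ate$estimates[[1]]
```

A production-grade wrapper would, among other things, truncate extreme propensity scores and expose the learners as user-supplied arguments.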