---
title: "Feature definitions workshop"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Feature definitions workshop}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Purpose of this workshop

The first vignette showed how `featdelta` helps move features created in R into a database table. This workshop focuses on the part of the workflow that data scientists edit most often: the feature definitions.

Feature definitions are the bridge between exploratory R code and a repeatable feature pipeline. When features are kept only as lines inside an ad-hoc script, they are harder to review, reuse, test, and refresh. With `fd_define()`, the feature logic becomes a structured object. That object can then be computed locally with `fd_compute()` or passed to `fd_run()` for the full database pipeline.

In this workshop, we will build feature definitions step by step:

1. simple one-column features;
2. features that depend on earlier features;
3. reusable definitions supplied programmatically;
4. multi-column scripts with `fd_block()`;
5. function-based blocks that generate an unknown number of columns.

## Workshop data

We will use a trimmed version of the built-in `mtcars` dataset. The data is simple enough to inspect directly, but it lets us demonstrate the same patterns used in larger feature engineering projects.

```{r}
library(featdelta)

raw_cars <- mtcars
raw_cars$car_id <- seq_len(nrow(raw_cars))
raw_cars <- raw_cars[, c("car_id", "mpg", "cyl", "disp", "hp", "wt", "am")]

head(raw_cars)
```

The `car_id` column is the key. It identifies each row and will be preserved in the computed feature table.

## Start with simple feature definitions

The most direct use of `fd_define()` is to write one expression per feature. Each named expression becomes one output column.
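For orientation, the same idea exists in plain base R with `transform()`, where each named argument likewise becomes one derived column evaluated with the data frame's columns in scope. This sketch does not use `featdelta` at all; it only shows the evaluation model that one-expression-per-feature definitions follow.

```r
# Base-R baseline: each named argument to transform() is one
# derived column, evaluated with mtcars' columns in scope.
baseline <- transform(
  mtcars,
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp
)

head(baseline[, c("hp_per_cyl", "wt_per_hp")])
```

The difference is that `transform()` computes immediately, while a definitions object stores the expressions so they can be recomputed later on fresh data.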
```{r}
defs_basic <- fd_define(
  transmission = ifelse(am == 1, "manual", "automatic"),
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp
)

defs_basic
```

The printed object gives a quick overview of the feature set. Note that the feature names and expressions are stored together and can be passed around as one object. (In `mtcars`, `am = 1` codes a manual transmission and `am = 0` an automatic one.)

Now compute the definitions on the raw data.

```{r}
features_basic <- fd_compute(
  data = raw_cars,
  defs = defs_basic,
  key = "car_id"
)

head(features_basic)
```

The output contains the key column plus the computed features. This is the feature table shape that can later be written to the database by `fd_run()`.

## Use earlier features in later features

Definitions are evaluated in order. This means a later feature can use columns created by earlier definitions in the same `fd_compute()` call.

```{r}
defs_ordered <- fd_define(
  hp_per_cyl = hp / cyl,
  strong_engine = hp_per_cyl > 30,
  engine_label = ifelse(strong_engine, "strong", "regular")
)

features_ordered <- fd_compute(
  data = raw_cars,
  defs = defs_ordered,
  key = "car_id"
)

head(features_ordered)
```

This is useful when your feature engineering naturally has stages. You can first create a base transformation, then reuse it in flags, labels, scores, or other derived features.

The important habit is to keep the order intentional. If a later feature uses `hp_per_cyl`, then `hp_per_cyl` must be defined earlier.

## Keep programmatic definitions reusable

Sometimes feature definitions are created outside the `fd_define()` call. For example, you might keep a small library of expressions, generate definitions from a configuration file, or reuse the same expression across projects.
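This pattern builds on a base-R capability that is worth seeing on its own: an unevaluated expression can be stored in a variable and evaluated later against any data frame whose columns it references. A minimal base-R sketch, independent of `featdelta`:

```r
# Capture an expression once, without evaluating it.
log_hp_expr <- expression(log(hp))

# Later, evaluate it with a data frame supplying the columns:
# eval() resolves `hp` to the matching column of mtcars.
log_hp_values <- eval(log_hp_expr[[1]], envir = mtcars)

head(log_hp_values)
```

Because the expression is an ordinary R object, it can live in a named list, be loaded from a file, or be shared between projects before it is ever evaluated.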
```{r}
log_hp_expr <- expression(log(hp))
heavy_car_expr <- expression(wt > 3.5)

defs_programmatic <- fd_define(
  log_hp = log_hp_expr,
  heavy_car = heavy_car_expr
)

features_programmatic <- fd_compute(
  data = raw_cars,
  defs = defs_programmatic,
  key = "car_id"
)

head(features_programmatic)
```

The benefit is administrative: the feature set can be built from named pieces instead of being rewritten by hand each time. This matters when the feature catalog becomes larger than a few simple columns.

## Use fd_block() when one feature step returns several columns

One expression per feature is convenient for small feature sets. In real projects, however, a single conceptual feature step may naturally produce several columns. That is what `fd_block()` is for.

An `fd_block()` is a multi-column definition step. It must return a `data.frame`, and each column of that data frame becomes a feature.

```{r}
defs_block <- fd_define(
  engine_ratios = fd_block({
    data.frame(
      hp_per_cyl = hp / cyl,
      disp_per_cyl = disp / cyl,
      wt_per_hp = wt / hp
    )
  })
)

features_block <- fd_compute(
  data = raw_cars,
  defs = defs_block,
  key = "car_id"
)

head(features_block)
```

This is useful when you want one named definition step, such as `engine_ratios`, to produce a small family of related columns.

## Write a small script inside fd_block()

The block body does not have to be a single `data.frame()` call. It can be a small R script. You can create temporary variables, reuse intermediate calculations, and return only the final columns you want to store.
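The scoping behavior this relies on can be sketched with base R's `local()`, with no `featdelta` involved: intermediate variables exist only inside the braces, and only the returned data frame escapes.

```r
# local() evaluates the braces in a throwaway environment, so the
# intermediates hp_per_cyl and disp_per_cyl never reach the caller.
engine_cols <- local({
  hp_per_cyl <- mtcars$hp / mtcars$cyl
  disp_per_cyl <- mtcars$disp / mtcars$cyl

  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl
  )
})

head(engine_cols)
```

Keeping temporaries out of the surrounding environment is exactly why a script-style block stays readable: the returned data frame is the only contract with the rest of the pipeline.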
```{r}
defs_script_block <- fd_define(
  engine_script = fd_block({
    hp_per_cyl <- hp / cyl
    disp_per_cyl <- disp / cyl

    ratio_average <- (hp_per_cyl + disp_per_cyl) / 2
    high_ratio <- ratio_average > stats::median(ratio_average, na.rm = TRUE)

    data.frame(
      hp_per_cyl = hp_per_cyl,
      disp_per_cyl = disp_per_cyl,
      engine_ratio_average = ratio_average,
      high_engine_ratio = high_ratio
    )
  })
)

features_script_block <- fd_compute(
  data = raw_cars,
  defs = defs_script_block,
  key = "car_id"
)

head(features_script_block)
```

This pattern is often easier to read than forcing every intermediate expression into a separate top-level feature. Temporary variables stay inside the block, while the returned data frame defines the columns that become part of the final feature table.

## Use function-based blocks for larger feature scripts

As feature logic grows, it is often better to move it into a regular R function. This is especially helpful when you want to test the feature script separately, reuse it across projects, or keep the `fd_define()` call compact.

```{r}
make_engine_features <- function(data) {
  hp_per_cyl <- data$hp / data$cyl
  disp_per_cyl <- data$disp / data$cyl

  data.frame(
    hp_per_cyl = hp_per_cyl,
    disp_per_cyl = disp_per_cyl,
    engine_index = hp_per_cyl + disp_per_cyl
  )
}

defs_function_block <- fd_define(
  engine_features = fd_block(make_engine_features)
)

features_function_block <- fd_compute(
  data = raw_cars,
  defs = defs_function_block,
  key = "car_id"
)

head(features_function_block)
```

Function-based blocks are a good fit for code that already looks like a small feature-engineering script. The function receives the current working data and returns a `data.frame` of feature columns.

## Generate an unknown number of features in a loop

Some feature sets are not known column by column in advance. For example, you might want to apply the same transformation to a selected group of numeric variables. A function-based block can generate those columns in a loop.
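A `for` loop is not the only way to build such columns. In plain base R (this sketch is independent of `featdelta`), the same column generation can be written with `lapply()`, which maps one scaled column per input name and assembles the result at the end. The `scale_cols` helper below is a hypothetical name used only for this illustration.

```r
# Functional variant of loop-generated columns: one scaled column
# per input name, assembled into a data frame at the end.
scale_cols <- function(data, vars) {
  scaled <- lapply(vars, function(var) {
    (data[[var]] - mean(data[[var]], na.rm = TRUE)) /
      stats::sd(data[[var]], na.rm = TRUE)
  })
  names(scaled) <- paste0(vars, "_scaled")
  as.data.frame(scaled)
}

head(scale_cols(mtcars, c("hp", "wt")))
```

Either style satisfies the same contract: a data frame with one row per input row, however many columns the input names dictate.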
```{r}
make_scaled_features <- function(data) {
  vars <- c("hp", "disp", "wt")
  out <- list()

  for (var in vars) {
    center <- mean(data[[var]], na.rm = TRUE)
    spread <- stats::sd(data[[var]], na.rm = TRUE)
    out[[paste0(var, "_scaled")]] <- (data[[var]] - center) / spread
  }

  as.data.frame(out)
}

defs_loop_block <- fd_define(
  scaled_inputs = fd_block(make_scaled_features)
)

features_loop_block <- fd_compute(
  data = raw_cars,
  defs = defs_loop_block,
  key = "car_id"
)

head(features_loop_block)
```

This pattern is useful when the number of output columns depends on a vector of input names, a configuration object, or another piece of project logic. The important rule remains the same: the block must return a data frame with one row per input row.

## Combine ordinary features and blocks

You do not have to choose between ordinary definitions and blocks. A single definition object can contain both. Later steps can also use columns produced by earlier steps, including columns produced by blocks.

```{r}
defs_combined <- fd_define(
  transmission = ifelse(am == 1, "manual", "automatic"),
  engine_features = fd_block(make_engine_features),
  scaled_inputs = fd_block(make_scaled_features),
  engine_per_weight = engine_index / wt
)

features_combined <- fd_compute(
  data = raw_cars,
  defs = defs_combined,
  key = "car_id"
)

head(features_combined)
```

This is where feature definitions become a practical organizing tool. You can keep simple expressions simple, move related feature families into blocks, and still evaluate the full set as one ordered pipeline.

## Declare expected block outputs when useful

Sometimes you want a block to have an expected output schema. This is useful when the block may return only some columns in some situations, but the database feature table should still have a stable set of columns.
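The underlying idea can be sketched in a few lines of base R: given an expected set of column names, add any that are missing as `NA` so the output schema stays stable. This is only an illustration of the behavior, not `featdelta` internals; `pad_to_schema` is a hypothetical helper named for this example.

```r
# Pad a feature data frame so it always carries the expected columns,
# filling any missing ones with NA and fixing the column order.
pad_to_schema <- function(df, expected_names) {
  missing_cols <- setdiff(expected_names, names(df))
  for (col in missing_cols) {
    df[[col]] <- NA
  }
  df[, expected_names, drop = FALSE]
}

partial <- data.frame(high_hp = mtcars$hp > 150)
padded <- pad_to_schema(partial, c("high_hp", "high_disp"))

head(padded)  # high_disp is all NA, but the schema is stable
```

A stable schema matters most downstream: database writes and model-scoring code can rely on the same columns being present on every refresh.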
```{r}
defs_expected <- fd_define(
  optional_engine_flags = fd_block(
    {
      data.frame(
        high_hp = hp > 150
      )
    },
    expected_names = c("high_hp", "high_disp")
  )
)

features_expected <- fd_compute(
  data = raw_cars,
  defs = defs_expected,
  key = "car_id"
)

head(features_expected)
```

The block returned `high_hp`, but `high_disp` was declared as an expected output. `fd_compute()` includes the missing expected column and fills it with `NA`. This can help when you want the feature table to keep a predictable schema.

## What to remember

Feature definitions are where the package lets you turn R feature engineering into a reusable pipeline component.

Use ordinary `fd_define()` expressions when each feature is simple and readable on one line. Use `fd_block()` when a feature step naturally produces several columns, needs temporary variables, or belongs in a reusable function. Use function-based blocks when the logic is long enough to test separately or when the output columns are generated programmatically.

Once the definitions object is ready, the same object can be used in two ways:

```{r, eval = FALSE}
# Local computation while developing feature logic
fd_compute(raw_data, defs, key = "id")

# Full database pipeline once the definitions are ready
fd_run(con, sql, defs, key = "id", feat_table_name = "feature_table")
```

That is the main workflow: develop feature logic in R, store it as a definitions object, test it locally, and then use it in the incremental database pipeline.