--- title: "Getting started with featdelta" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with featdelta} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Why featdelta exists `featdelta` is for R users who create features in R but need those features to live in a database. In many analytics projects, the raw observations are stored in a database and new rows arrive over time. The feature engineering logic, however, may be easier to write, test, and maintain in R. This creates a practical problem: after the R code computes the features, those values still need to be stored in a database table so they can be reused by modelling scripts, dashboards, monitoring jobs, or other downstream systems. `featdelta` provides an automated pipeline for that workflow. You define feature expressions in R, and the package handles the database-oriented steps around them: 1. storing feature definitions in a reusable object; 2. finding raw rows that do not yet have features; 3. computing features for those missing rows; 4. creating or extending the database feature table; 5. inserting or updating the computed feature values. The goal is to reduce the amount of repeated code needed to refresh feature tables. Instead of rebuilding an analysis dataset by hand each time new raw data arrives, you can keep a persistent feature table in the database and update it incrementally. ## Where this helps This pattern appears in many applied data workflows. For example, in a daily market-data workflow, a database table may be updated with the latest closing prices after each trading day. Once the new prices are available, an analyst may calculate indicators in R and store them in a separate feature table. On the next day, the pipeline should process only the newly available observations rather than rebuilding the entire history. In a credit-risk workflow, new loan or credit-card applications may arrive every day. Feature engineering often involves transformed variables that are more useful for modelling than the raw input columns. For example, an analyst may derive ratios, delay summaries, utilization measures, or other scorecard inputs from the raw application and bureau data. If these features are recreated only inside ad-hoc modelling scripts, refreshing the dataset with recent applications can become a recurring manual task. A persistent feature table makes the transformed variables easier to reuse for model development, validation, and monitoring. In a business-intelligence workflow, a reporting tool may need variables that are difficult to create directly in the BI layer. These might come from an R model, a specialized matching algorithm, or domain-specific transformation code. If the results are pushed back to the database, the BI tool can read them like ordinary columns. The details differ across domains, but the workflow is similar: raw data changes over time, feature logic is maintained in R, and the computed features are more useful when they are available in the database. ## A small running example This vignette uses an in-memory SQLite database and the built-in `mtcars` dataset. The database is tiny and the example features are intentionally simple, but the workflow mirrors a real production pattern: 1. finding which raw observations are new; 2. recomputing only the features that are missing; 3. keeping feature definitions organized; 4. 
4. creating or extending the database table where the features live;
5. pushing the newly computed values back to the database safely.

Before demonstrating `featdelta`, we first set up the database. In this example, we pretend that the rows arrive in two batches:

* at the end of day one, we have observed the first 20 rows of the `mtcars` dataset;
* at the end of day two, we observe the next 10 rows, corresponding to ids 21 to 30; these rows are not available on day one.

Our goal is to add new features to a database feature table at the end of each day:

* whether the transmission is automatic or manual, stored as a text field;
* the horsepower-to-cylinder ratio;
* the weight per horsepower unit.

```{r}
library(DBI)
library(RSQLite)
library(featdelta)

cars <- mtcars
cars$id <- seq_len(nrow(cars))

# Name of the unique row identifier used throughout the pipeline.
key <- "id"

day_one <- 1:20
day_two <- 21:30
```

We use an in-memory SQLite database to mimic a real R-to-database workflow. We establish the database connection with `DBI::dbConnect()`; for this small example, no database credentials are needed. Next, `DBI::dbWriteTable()` pushes the first batch of rows from R into the database and creates a table named `raw_cars`. This table represents the raw observations for which we want to create separate features. After the raw data has been written to the database, we can inspect it with a simple SQL query.

```{r}
con <- dbConnect(SQLite(), ":memory:")

dbWriteTable(con, "raw_cars", cars[day_one, ], overwrite = TRUE)

dbGetQuery(con, "SELECT id, mpg, cyl, hp, wt, am FROM raw_cars ORDER BY id")
```

In real-world scenarios, the data needed for feature engineering is rarely stored in one simple table. It is often selected from several database tables with joins, filters, and other selection criteria. For that reason, the database-to-R interface in this vignette uses a SQL query instead of referring directly to the `raw_cars` table. In the snippet below, we define the source dataset as the rows where `id > 15`, which shows that the dataset to be processed can be created with a custom query. There is one important requirement: each returned row must have a unique identifier. That is why we defined the `key` variable earlier. We can bring this source dataset into the R session with `DBI::dbGetQuery(con, source_sql)`.

```{r}
source_sql <- "SELECT * FROM raw_cars WHERE id > 15 ORDER BY id"

dbGetQuery(con, source_sql)
```

At this point, the source table contains raw observations, but the feature table does not exist yet. The rest of the vignette demonstrates how the feature table can be created and refreshed with `featdelta`. Later, we summarize which manual steps this pipeline replaces.

## Feature definitions live in R

Start by defining features. These are ordinary R expressions evaluated against the raw data returned by `source_sql`.

```{r}
defs <- fd_define(
  # In mtcars, am = 0 codes an automatic transmission and am = 1 a manual one.
  transmission = ifelse(am == 0, "automatic", "manual"),
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp
)

defs
```

This is one of the first conveniences of `featdelta`: feature definitions are stored as an object. They can be printed, reviewed, reused, tested, and passed to the pipeline. The expressions are not scattered across one-off scripts.

Before writing to the database, we can also compute the features locally to see the shape of the feature table that will be produced.

```{r}
raw_preview <- dbGetQuery(con, source_sql)

fd_compute(
  data = raw_preview,
  defs = defs,
  key = key
)
```

The output contains the key column plus the computed features. This is the table shape we want to persist.
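The local preview is also a convenient place for quick sanity checks before anything is written to the database. The chunk below is a minimal sketch, assuming `fd_compute()` returns an ordinary data frame with the key column and one column per feature; it verifies that the key values are unique and counts missing values per feature column.

```{r, eval = FALSE}
preview <- fd_compute(data = raw_preview, defs = defs, key = key)

# The key must identify each row uniquely before it can drive incremental updates.
stopifnot(anyDuplicated(preview[[key]]) == 0)

# Missing values per feature column; an all-NA column usually points to a typo
# in a feature expression or a raw column that the query did not return.
colSums(is.na(preview[setdiff(names(preview), key)]))
```

Checks like these are ordinary R code on an ordinary data frame; nothing about them is specific to `featdelta`.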
In this example, we define features in the simplest possible way: one expression per feature. In real projects, feature definitions may be generated by longer R scripts and may produce many columns. Later vignettes will cover richer definition patterns, including multi-column feature blocks.

## First run: create the feature table

Now let `featdelta` run the full pipeline. The important result is the database table it creates. In this example, we call the new database table `car_features`.

```{r}
run_day_one <- fd_run(
  con = con,
  sql = source_sql,
  defs = defs,
  key = key,
  feat_table_name = "car_features",
  verbose = FALSE
)

dbGetQuery(con, "SELECT * FROM car_features ORDER BY id")
```

The R expressions were evaluated against the selected raw rows, and the result was stored as a persistent table. You can inspect that table with an ordinary SQL query, just as you would inspect any other database table.

That is the central idea: the data scientist can write feature logic in R, while the computed features are stored in a database table that other workflows can reuse.

## Second run: process only new rows

Now pretend time passes and more raw observations arrive.

```{r}
dbAppendTable(con, "raw_cars", cars[day_two, ])

dbGetQuery(
  con,
  "SELECT id, mpg, cyl, hp, wt, am FROM raw_cars WHERE id > 15 ORDER BY id DESC"
)
```

The query above shows that new rows have been added to the raw table. The feature table, however, has not been refreshed yet. If we inspect it now, it still ends at id 20, even though raw rows up to id 30 are available.

```{r}
dbGetQuery(
  con,
  "SELECT * FROM car_features ORDER BY id DESC"
)
```

This is the point where manual pipelines become repetitive: you need to compare raw keys with feature-table keys, build feature rows only for the missing raw observations, and append them without duplicating existing rows. With `featdelta`, run the same pipeline again and inspect the database table.

```{r}
run_day_two <- fd_run(
  con = con,
  sql = source_sql,
  defs = defs,
  key = key,
  feat_table_name = "car_features",
  verbose = FALSE
)

dbGetQuery(con, "SELECT * FROM car_features ORDER BY id DESC")
```

The existing feature rows were not duplicated. `featdelta` detected the raw rows whose keys were missing from `car_features`, computed features only for those rows, and inserted them into the database table.

## Add a new feature later

Feature engineering is rarely finished on the first day. Suppose we decide that we also want a power-to-weight feature. We can define a new version of the feature set and run the same pipeline again. This time, the definitions include the new `power_to_weight` feature.

```{r}
defs_v2 <- fd_define(
  transmission = ifelse(am == 0, "automatic", "manual"),
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp,
  power_to_weight = hp / wt
)

run_refresh <- fd_run(
  con = con,
  sql = source_sql,
  defs = defs_v2,
  key = key,
  feat_table_name = "car_features",
  fetch_mode = "all",
  verbose = FALSE
)

dbGetQuery(con, "SELECT * FROM car_features ORDER BY id DESC")
```

Here we use `fetch_mode = "all"`, which tells `fd_run()` to refresh the features for all rows returned by `source_sql`, including rows that already existed in the feature table. This ensures that old rows are not missing the newly added feature. In a real workflow, this is the difference between maintaining a database feature table over time and repeatedly rebuilding analysis datasets by hand.

## What the pipeline replaces

Without `featdelta`, the workflow usually grows into a script that does all of this manually:

1. run a source SQL query;
2. check which keys already exist in the feature table;
3. keep only the missing rows;
4. evaluate feature expressions in R;
5. create the target table if it does not exist;
6. alter the target table when new feature columns appear;
7. insert or update rows;
8. inspect whether the run did what you expected.

That script can be written, but it tends to become repetitive and fragile.
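For comparison, the chunk below is a condensed, hand-rolled sketch of the incremental part of such a script, written directly against `DBI` using the tables from this vignette. It is not how `featdelta` is implemented; it only illustrates the bookkeeping the package handles, and it skips the harder steps: creating the table on the first run, adding new feature columns, and updating existing rows.

```{r, eval = FALSE}
# 1. Fetch the source rows defined by the same SQL query.
raw <- dbGetQuery(con, source_sql)

# 2. Find keys that are already present in the feature table.
done <- dbGetQuery(con, "SELECT id FROM car_features")$id

# 3. Keep only the raw rows whose keys are still missing.
todo <- raw[!(raw$id %in% done), , drop = FALSE]

# 4. Evaluate the feature expressions in R.
feats <- data.frame(
  id = todo$id,
  transmission = ifelse(todo$am == 0, "automatic", "manual"),
  hp_per_cyl = todo$hp / todo$cyl,
  wt_per_hp = todo$wt / todo$hp
)

# 5. Append the new feature rows (assumes the table and columns already exist).
if (nrow(feats) > 0) {
  dbAppendTable(con, "car_features", feats)
}
```

Each of the skipped steps adds more conditional database code, which is exactly where hand-written pipelines tend to drift apart between projects.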
`featdelta` reduces those steps to a feature definition and a single pipeline call:

```{r, eval = FALSE}
defs <- fd_define(
  feature_a = some_r_expression,
  feature_b = another_r_expression
)

fd_run(
  con = con,
  sql = "SELECT * FROM raw_table",
  defs = defs,
  key = "id",
  feat_table_name = "feature_table"
)
```

For most users, this is the main workflow: define features in R, then run the database pipeline.

## Inspect the run when needed

The database table is the main result, but `fd_run()` also returns a structured report. The report is useful when developing, debugging, or monitoring the pipeline.

```{r}
run_report <- fd_run(
  con = con,
  sql = source_sql,
  defs = defs_v2,
  key = key,
  feat_table_name = "car_features",
  fetch_mode = "all",
  return_data = "features",
  preview_n = 3,
  verbose = FALSE
)

run_report
```

The report tells you how many rows were fetched, which features were computed, and how many rows were targeted for insert or update. It is there when you need visibility, but it does not get in the way of the core workflow.

## Where to go next

This vignette introduced the main story: create feature logic in R, keep a database feature table up to date, and avoid rebuilding the same feature dataset by hand. The next topics are:

1. writing richer feature definitions with `fd_define()` and `fd_block()`;
2. using `fd_fetch()` and `fd_upsert()` directly;
3. production patterns for failures, previews, refreshes, and reports.

```{r cleanup, include = FALSE}
dbDisconnect(con)
```