--- title: "Organising Large Projects with Sub-Pipelines" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{sub-pipelines} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette introduces `rxp_pipeline()`, a function for organising large projects into logical sub-pipelines. This feature is particularly useful when working on complex projects with multiple phases (e.g., ETL, Modelling, Reporting) or when collaborating in teams where different members work on different parts of the pipeline. ## Large Pipelines Become Unwieldy As pipelines grow, a single `gen-pipeline.R` file can become difficult to manage. Consider a data science project with: - Data extraction and cleaning (ETL) - Feature engineering - Model training - Model evaluation - Report generation Putting all derivations in one file makes it hard to: - Navigate the code - Understand which derivations belong to which phase - Collaborate across team members - Reuse pipeline components in other projects To solve this issue, you can define your project using sub-pipelines and join them into a master pipeline using `rxp_pipeline()`. This allows you to: 1. **Organise** derivations into named groups 2. **Colour-code** groups for visual distinction in DAG visualisations 3. **Modularise** your code across multiple R scripts ### Basic Usage A project with sub-pipelines would look something like this: ``` my-project/ ├── default.nix # Nix environment (generated by rix) ├── gen-env.R # Script to generate default.nix ├── gen-pipeline.R # MASTER SCRIPT: combines all sub-pipelines └── pipelines/ ├── 01_data_prep.R # Data preparation sub-pipeline ├── 02_analysis.R # Analysis sub-pipeline └── 03_reporting.R # Reporting sub-pipeline ``` Each sub-pipeline file returns a list of derivations: ```{r sub-pipeline-1, eval = FALSE} # Data Preparation Sub-Pipeline # pipelines/01_data_prep.R library(rixpress) list( rxp_r(name = raw_mtcars, expr = mtcars), rxp_r(name = clean_mtcars, expr = dplyr::filter(raw_mtcars, am == 1)), rxp_r(name = selected_mtcars, expr = dplyr::select(clean_mtcars, mpg, cyl, hp, wt)) ) ``` The `rxp_pipeline()` function takes: - **name**: A descriptive name for this group of derivations - **path**: Either a **file path** to an R script returning a list of derivations (recommended), or a list of derivation objects. - **color**: Optional CSS color name or hex code for DAG visualisation The second sub-pipeline: ```{r sub-pipeline-2, eval = FALSE} # Analysis Sub-Pipeline # pipelines/02_analysis.R library(rixpress) list( rxp_r(name = summary_stats, expr = summary(selected_mtcars)), rxp_r(name = mpg_model, expr = lm(mpg ~ hp + wt, data = selected_mtcars)), rxp_r(name = model_coefs, expr = coef(mpg_model)) ) ``` The master script becomes very clean, as `rxp_pipeline` handles sourcing the files: ```{r master-script, eval = FALSE} # gen-pipeline.R library(rixpress) # Create named pipelines with colours by pointing to the files pipe_data_prep <- rxp_pipeline( name = "Data Preparation", path = "pipelines/01_data_prep.R", color = "#E69F00" ) pipe_analysis <- rxp_pipeline( name = "Statistical Analysis", path = "pipelines/02_analysis.R", color = "#56B4E9" ) # Build combined pipeline rxp_populate(list(pipe_data_prep, pipe_analysis), project_path = ".", build = TRUE) ``` ## Visualising Sub-Pipelines When sub-pipelines are defined, visualisation tools use pipeline colours: 1. 
## Visualising Sub-Pipelines

When sub-pipelines are defined, visualisation tools use pipeline colours (see the sketch after this list):

1. **Interactive Network** (`rxp_visnetwork()`) and **Static DAG** (`rxp_ggdag()`) both use a dual-encoding approach:
   - **Node fill (interior)**: Derivation type colour (R = blue, Python = yellow, etc.)
   - **Node border (thick stroke)**: Pipeline group colour

   This allows you to see both what *type* of computation each node is and which *pipeline* it belongs to.
2. Sub-pipelines are coloured according to the colour assigned in `rxp_pipeline()`.
3. **Trace**: `rxp_trace()` output in the console is coloured by pipeline (using the `cli` package). If your terminal supports it, derivation names are coloured according to the chosen sub-pipeline colour.
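
For example, once the combined pipeline from `gen-pipeline.R` has been generated and built, you could inspect it along these lines. This is a minimal sketch: all three functions are shown with default arguments, and the exact arguments they accept may differ.

```{r inspect-pipelines, eval = FALSE}
# Interactive DAG: node borders pick up the pipeline colours defined above
rxp_visnetwork()

# Static DAG: same dual encoding (fill = derivation type, border = pipeline)
rxp_ggdag()

# Console trace: derivation names coloured by sub-pipeline (via cli)
rxp_trace()
```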
### Switching Between Colour Modes

```{r color-modes, eval = FALSE}
# Dual encoding: fill = type, border = pipeline (default when pipelines are defined)
rxp_ggdag(color_by = "pipeline")

# Colour entirely by derivation type (rxp_r, rxp_py, etc.) - original behaviour
rxp_ggdag(color_by = "type")
```

## How It Works Internally

When you call `rxp_populate()` with `rxp_pipeline` objects:

1. **Flattening**: Pipelines are flattened to a single list of derivations
2. **Metadata Preservation**: Each derivation retains `pipeline_group` and `pipeline_color`
3. **DAG Generation**: `dag.json` includes pipeline metadata
4. **Visualisation**: `rxp_visnetwork()` and `rxp_ggdag()` read this metadata

## Best Practices

1. **Use descriptive pipeline names**: "Data Preparation" is better than "ETL"
2. **Choose contrasting colours**: Use [ColorBrewer](https://colorbrewer2.org/) palettes
3. **Keep sub-pipelines focused**: One logical phase per sub-pipeline
4. **Order your files**: Use numeric prefixes (01_, 02_, etc.)

## Conclusion

`rxp_pipeline()` provides a simple yet powerful way to organise complex pipelines. By grouping derivations into logical units, you can:

- Keep your code organised and maintainable
- Enable team collaboration on different parts of the pipeline
- Visualise the structure of your workflow with meaningful colours
- Reuse sub-pipelines across projects

For a working example, see the `subpipelines` demo in the [rixpress_demos](https://github.com/b-rodrigues/rixpress_demos) repository.