---
title: "Getting Started with rmake"
author: "Michal Burda"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with rmake}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r gs-setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(rmake)
```

## Introduction

R is a mature scripting language for statistical computation and data processing. An important advantage of R is that it supports **repeatable** statistical analyses: all steps of data processing are programmed in scripts, so the whole process can be re-executed after any change in the data or in the processing steps.

Several R packages help to achieve repeatability of statistical computations, such as `knitr` and `rmarkdown`. These tools allow writing R scripts that generate reports combining text with tables and figures produced from data.

However, as analyses grow in complexity, manual re-execution of the whole process may become tedious, error-prone, and computationally demanding. Complex analyses typically involve:

- Many pre-processing steps on large datasets
- Repetitive execution of commands differing only in parameters
- Production of multiple output files in various formats

It is inefficient to re-run all pre-processing steps to refresh the final report after every change. The caching mechanism provided by `knitr` is helpful but limited to a single report. Splitting a complex analysis into several parts and saving intermediate results into files is rational, but it brings another challenge: **management of dependencies** between inputs, outputs, and the underlying scripts.

This is where **Make** comes in. Make is a tool that controls the generation of files from source data and script files by reading dependencies from a `Makefile` and comparing timestamps to determine which files need to be refreshed.
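For readers unfamiliar with Make, a hand-written rule illustrating this mechanism might look as follows (an illustrative sketch, not output generated by `rmake`):

```makefile
# Rebuild sums.csv whenever script.R or data.csv carries a newer
# timestamp than sums.csv; the recipe line then re-runs the R script.
sums.csv: script.R data.csv
	Rscript script.R
```

Make compares the timestamp of the target on the left of the colon with those of its prerequisites on the right and executes the recipe only when a prerequisite is newer than the target.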
The `rmake` package provides tools for easy generation of Makefiles for statistical and data-manipulation tasks in R.

## Key Features

The main features of `rmake` are:

- Use of the well-known Make tool
- Easy definition of file dependencies in the R language
- High flexibility through parameterized execution and programmatic rule generation
- Simple, short code thanks to the `%>>%` pipeline operator and templating
- Support for R scripts and R Markdown files
- Extensibility with user-defined rule types
- Isolated and parallel execution via Make's parallel processing
- Support for all platforms: Unix (Linux), macOS, Windows, and Solaris
- Compatibility with RStudio

## Why Use rmake?

R allows the development of **repeatable** statistical analyses. However, when analyses grow in complexity, manual re-execution after any change may become tedious and error-prone. **Make** is a widely accepted tool for managing the generation of resulting files from source data and script files. `rmake` makes it easy to generate Makefiles for R analytical projects.

## Installation

To install `rmake` from CRAN:

```{r gs-install, eval=FALSE}
install.packages("rmake")
```

Alternatively, install the development version from GitHub:

```{r install_github, eval=FALSE}
install.packages("devtools")
devtools::install_github("beerda/rmake")
```

Load the package:

```{r load}
library(rmake)
```

## Prerequisites

### System Requirements

- **R**: version 3.5.0 or higher
- **Make**: GNU Make or a compatible make tool
  - On Linux/macOS: usually pre-installed
  - On Windows: install Rtools (which includes make)

### Environment Variables

The package requires the `R_HOME` environment variable to be properly set. This variable indicates the directory where R is installed and is set automatically when running from within R or RStudio.

#### When is R_HOME needed?

When running `make` from the command line (outside of R), you may need to set `R_HOME` manually.
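The reason is that recipes in the generated Makefile locate `Rscript` through this variable, so an empty `R_HOME` makes the build fail when `make` is invoked from a plain terminal. A simplified sketch of such a recipe (the Makefile actually produced by `rmake` may differ in detail):

```makefile
# Simplified sketch: Rscript is resolved relative to R_HOME, which is
# why R_HOME must be set when make runs outside of an R session.
sums.csv: script.R data.csv
	"$(R_HOME)/bin/Rscript" script.R
```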
#### Finding R_HOME

To find the correct value for your system, run this in R:

```{r check_env, eval=FALSE}
R.home()
```

You can also check the current value of the environment variable:

```{r check_env_vars, eval=FALSE}
Sys.getenv("R_HOME")
```

#### Setting R_HOME

**On Linux/macOS:**

```bash
export R_HOME=/usr/lib/R  # Use the path from R.home()
```

**On Windows (Command Prompt):**

```cmd
set R_HOME=C:\Program Files\R\R-4.3.0
```

**On Windows (PowerShell):**

```powershell
$env:R_HOME = "C:\Program Files\R\R-4.3.0"
```

For a permanent setup, add the export command to your shell configuration file (`.bashrc`, `.zshrc`, etc. on Unix-like systems) or set a system environment variable on Windows.

For more information on R environment variables, see the [official R documentation](https://stat.ethz.ch/R-manual/R-devel/library/base/html/EnvVar.html).

## Project Initialization

### Creating Skeleton Files

To start a new project with `rmake`:

```{r gs-skeleton, eval=FALSE}
library(rmake)
rmakeSkeleton(".")
```

This creates two files:

- `Makefile.R` - an R script that generates the Makefile
- `Makefile` - the generated Makefile (initially minimal)

The initial `Makefile.R` contains:

```{r skeleton_content, eval=FALSE}
library(rmake)
job <- list()
makefile(job, "Makefile")
```

## Basic Example

Let's walk through a simple example.
Suppose we have:

- `data.csv` - the input data file
- `script.R` - an R script that processes the data
- `sums.csv` - the computed results (output)

### Step 1: Create the Data File

Create `data.csv`:

```
ID,V1,V2
a,2,8
b,9,1
c,3,3
```

### Step 2: Create the Processing Script

Create `script.R`:

```{r script_content, eval=FALSE}
d <- read.csv("data.csv")
sums <- data.frame(ID = "sum", V1 = sum(d$V1), V2 = sum(d$V2))
write.csv(sums, "sums.csv", row.names = FALSE)
```

### Step 3: Define the Build Rule

Edit `Makefile.R`:

```{r define_rule, eval=FALSE}
library(rmake)
job <- list(rRule(target = "sums.csv", script = "script.R", depends = "data.csv"))
makefile(job, "Makefile")
```

### Step 4: Run the Build

Execute make:

```{r run_make, eval=FALSE}
make()
```

Make will:

1. Regenerate `Makefile` (if `Makefile.R` changed)
2. Execute `script.R` to create `sums.csv`

Subsequent calls to `make()` will do nothing unless some files change.

## Using the Pipe Operator

The `%>>%` pipe operator makes rule definitions more readable:

```{r pipe_example, eval=FALSE}
library(rmake)
job <- "data.csv" %>>% rRule("script.R") %>>% "sums.csv"
makefile(job, "Makefile")
```

This is equivalent to the previous example but more concise.

## Adding a Markdown Report

Let's extend our example to create a PDF report.
Create `analysis.Rmd`:

````markdown
---
title: "Analysis"
output: pdf_document
---

# Sums of data rows

```{r, echo=FALSE, results='asis'}`r ''`
sums <- read.csv('sums.csv')
knitr::kable(sums)
```
````

Update `Makefile.R`:

```{r add_markdown, eval=FALSE}
library(rmake)
job <- list(
  rRule(target = "sums.csv", script = "script.R", depends = "data.csv"),
  markdownRule(target = "analysis.pdf", script = "analysis.Rmd", depends = "sums.csv")
)
makefile(job, "Makefile")
```

Or using pipes:

```{r pipe_chain, eval=FALSE}
library(rmake)
job <- "data.csv" %>>% rRule("script.R") %>>% "sums.csv" %>>%
  markdownRule("analysis.Rmd") %>>% "analysis.pdf"
makefile(job, "Makefile")
```

Run make again:

```{r run_make2, eval=FALSE}
make()
```

## Running Make

### From R

```{r make_options, eval=FALSE}
# Run all tasks
make()

# Run a specific task
make("all")

# Clean generated files
make("clean")

# Parallel execution (8 jobs)
make("-j8")
```

### From Command Line

```bash
make          # Run all tasks
make clean    # Clean generated files
make -j8      # Parallel execution
```

### From RStudio

1. Go to **Build** > **Configure Build Tools**
2. Set **Project build tools** to **Makefile**
3. Use the **Build All** button

## Visualizing Dependencies

Visualize the dependency graph:

```{r gs-visualize, eval=FALSE}
visualize(job, legend = FALSE)
```

This creates an interactive graph showing:

- **Squares**: data files
- **Diamonds**: script files
- **Ovals**: rules
- **Arrows**: dependencies

## Multiple Dependencies

Handle complex dependencies:

```{r multiple_deps}
chain1 <- "data1.csv" %>>% rRule("preprocess1.R") %>>% "intermed1.rds"
chain2 <- "data2.csv" %>>% rRule("preprocess2.R") %>>% "intermed2.rds"
chain3 <- c("intermed1.rds", "intermed2.rds") %>>% rRule("merge.R") %>>%
  "merged.rds" %>>% markdownRule("report.Rmd") %>>% "report.pdf"
job <- c(chain1, chain2, chain3)
```

Alternatively, you can define all chains directly without intermediate variables:

```{r multiple_deps_alt, eval=FALSE}
job <- c(
  "data1.csv" %>>% rRule("preprocess1.R") %>>% "intermed1.rds",
  "data2.csv" %>>% rRule("preprocess2.R") %>>% "intermed2.rds",
  c("intermed1.rds", "intermed2.rds") %>>% rRule("merge.R") %>>%
    "merged.rds" %>>% markdownRule("report.Rmd") %>>% "report.pdf"
)
```

## Rule Types

`rmake` provides several pre-defined rule types:

- **`rRule()`**: execute R scripts
- **`markdownRule()`**: render R Markdown documents
- **`knitrRule()`**: process knitr documents
- **`copyRule()`**: copy files
- **`offlineRule()`**: manual tasks with reminders

For detailed documentation on all rule types, including `depRule()`, `subdirRule()`, and custom rules, see the [Build Rules](build-rules.html) vignette.
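As a small illustration of `copyRule()`, the sketch below copies the computed file into a `www/` deployment folder. The `www/` path is an assumption made up for this example, and the `target`/`depends` argument names are assumed to follow the same pattern as `rRule()` above:

```{r copy_rule_sketch, eval=FALSE}
library(rmake)

job <- list(
  rRule(target = "sums.csv", script = "script.R", depends = "data.csv"),
  # Hypothetical deployment step: keep www/sums.csv in sync with sums.csv
  copyRule(target = "www/sums.csv", depends = "sums.csv")
)
makefile(job, "Makefile")
```

Because `www/sums.csv` depends on `sums.csv`, the copy is refreshed automatically whenever the computation re-runs.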
## Next Steps

For more information on specific topics, see these vignettes:

- [rmake Project Management](project-management.html): Learn about project initialization, running builds, cleaning, and parallel execution
- [Build Rules](build-rules.html): Comprehensive reference for all rule types (`rRule`, `markdownRule`, `knitrRule`, `copyRule`, `depRule`, `subdirRule`, `offlineRule`)
- [Tasks and Templates](tasks-and-templates.html): Advanced features including tasks, parameterized execution, and rule templates

## Summary

Key takeaways:

1. Use `rmakeSkeleton()` to initialize projects
2. Define rules in `Makefile.R`
3. Use `%>>%` for readable rule chains
4. Run `make()` to execute the build process
5. Use `visualize()` to understand dependencies

## Resources

- Package documentation: `?rmake`
- GitHub: https://github.com/beerda/rmake
- Issues: https://github.com/beerda/rmake/issues