--- title: "Specifying Custom Node Types in a DAG" output: rmarkdown::html_vignette: toc: true toc_depth: 4 author: "Robin Denz" vignette: > %\VignetteIndexEntry{Specifying Custom Node Types in a DAG} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse=TRUE, comment="#>", fig.align="center" ) ``` # Introduction In this small vignette, we give a detailed explanation on how to define custom functions that can be used in the `type` argument of `node()` or `node_td()` calls. Although `simDAG` includes a large number of different node types that can be used in this argument directly, it also allows the user to pass any function to this argument, as long as that function meets some limited criteria (as described below). This is an advanced feature that most users probably don't need for standard simulation studies. We strongly recommend reading the documentation and the other vignettes first, because this vignette assumes that the reader is already familiar with the `simDAG` syntax and general features. The support for custom functions in `type` allows users to create root nodes, child nodes or time-dependent nodes that are not directly implemented in this package. By doing so, users may create data with any functional dependence they can think of. The requirements for each node type are listed below. Some simple examples for each node type are given in each section. If you think that your custom node type might be useful to others, please contact the maintainer of this package via the supplied e-mail address or github and we might add it to this package. ```{r} library(simDAG) set.seed(1234) ``` # Root Nodes ## Requirements Any function that generates some vector of size `n` with `n==nrow(data)`, or a `data.frame()` with as many rows as the current data can be used as a child node. The only requirement is: * **1.)** The function should have an argument called `n` which controls how many samples to generate. Some examples that are already implemented in R outside of this package are `stats::rnorm()`, `stats::rgamma()` and `stats::rbeta()`. The function may take any amount of further arguments, which will be passed through the three-dot (`...`) syntax. Note that whenever the supplied function produces a `data.frame()` (or similar object), the user has to ensure that the included columns are named properly. ## Examples Using external functions that fulfill the requirements which are already defined by some other package can be done this way: ```{r} dag <- empty_dag() + node("A", type="rgamma", shape=0.1, rate=2) + node("B", type="rbeta", shape1=2, shape2=0.3) ``` Of course users may also define an appropriate root node function themselves. The code below defines a function that takes the sum of a normally distributed random number and a uniformly distributed random number for each simulated individual: ```{r} custom_root <- function(n, min=0, max=1, mean=0, sd=1) { out <- runif(n, min=min, max=max) + rnorm(n, mean=mean, sd=sd) return(out) } # the function may be supplied as a string dag <- empty_dag() + node("A", type="custom_root", min=0, max=10, mean=5, sd=2) # equivalently, the function can also be supplied directly # This is the recommended way! dag <- empty_dag() + node("A", type=custom_root, min=0, max=10, mean=5, sd=2) data <- sim_from_dag(dag=dag, n_sim=100) head(data) ``` # Child Nodes ## Requirements Again, almost any function may be used to generate a child node. Only four things are required for this to work properly: * **1.)** Its' name should start with `node_` (if you want to use a string to define it in `type`). * **2.)** It should contain an argument called `data` (contains the already generated data). * **3.)** It should contain an argument called `parents` (contains a vector of the child nodes parents). * **4.)** It should return either a vector of length `n_sim` or a `data.frame()` (or similar object) with any number of columns and `n_sim` rows. The function may include any amount of additional arguments specified by the user. ## Examples Below we define a custom child node type that is basically just a gaussian node with some (badly done) truncation, limiting the range of the resulting variable to be between `left` and `right`. ```{r} node_gaussian_trunc <- function(data, parents, betas, intercept, error, left, right) { out <- node_gaussian(data=data, parents=parents, betas=betas, intercept=intercept, error=error) out <- ifelse(out <= left, left, ifelse(out >= right, right, out)) return(out) } ``` Please note that this is a terrible form of truncation in most cases, because it artificially distorts the resulting normal distribution at the `left` and `right` values. It is only meant as an illustration. Here is another example of a custom child node function, which simply returns the sum of its parents: ```{r} parents_sum <- function(data, parents, betas=NULL) { out <- rowSums(data[, parents, with=FALSE]) return(out) } ``` We can use both of these functions in a DAG like this: ```{r} dag <- empty_dag() + node("age", type="rnorm", mean=50, sd=4) + node("sex", type="rbernoulli", p=0.5) + node("custom_1", type="gaussian_trunc", parents=c("sex", "age"), betas=c(1.1, 0.4), intercept=-2, error=2, left=10, right=25) + node("custom_2", type=parents_sum, parents=c("age", "custom_1")) data <- sim_from_dag(dag=dag, n_sim=100) head(data) ``` # Time-Dependent Nodes ## Requirements By time-dependent nodes we mean nodes that are created using the `node_td()` function. In general, this works in essentially the same way as for simple root nodes or child nodes. The requirements are: * **1.)** Its' name should start with `node_` (if you want to use a string to define it in `type`). * **2.)** It should contain an argument called `data` (contains the already generated data). * **3.)** If it is a child node, it should contain an argument called `parents` (contains a vector of the child nodes parents). This is not necessary for nodes that are independently generated. * **4.)** It should return either a vector of length `n_sim` or a `data.frame()` (or similar object) with any number of columns and `n_sim` rows. Again, any number of additional arguments is allowed and will be passed through the three-dot syntax. Additionally, there are two build-in arguments that users may specify in custom time-dependent nodes, which are then used internally. First, users may add an argument to this function called `sim_time`. If included in the function definition, the current time of the simulation will be passed to the function on every call made to it. Secondly, the argument `past_states` may be added. If done so, a list containing all previous states of the simulation (as saved using the `save_states` argument of the `sim_discrete_time()` function) will be passed to it internally, giving the user access to the data generated at previous points in time. ## Examples ### Time-Dependent Root Nodes An example for a custom time-dependent root node is given below: ```{r} node_custom_root_td <- function(data, n, mean=0, sd=1) { return(rnorm(n=n, mean=mean, sd=sd)) } ``` This function simply draws a new value from a normal distribution at each point in time of the simulation. A DAG using this node type could look like this: ```{r} n_sim <- 100 dag <- empty_dag() + node_td(name="Something", type=node_custom_root_td, n=n_sim, mean=10, sd=5) ``` ### Time-Dependent Child Nodes Below is an example for a function that can be used to define a custom time-dependent child node: ```{r} node_custom_child <- function(data, parents) { out <- numeric(nrow(data)) out[data$other_event] <- rnorm(n=sum(data$other_event), mean=10, sd=3) out[!data$other_event] <- rnorm(n=sum(!data$other_event), mean=5, sd=10) return(out) } dag <- empty_dag() + node_td("other", type="time_to_event", prob_fun=0.1) + node_td("whatever", type="custom_child", parents="other_event") ``` This function takes a random draw from a normal distribution with different specifications based on whether a previously updated time-dependent node called `other` is currently `TRUE` or `FALSE`. ### Using the `sim_time` Argument Below we give an example on how the `sim_time` argument may be used. The following function simply returns the square of the current simulation time as output: ```{r} node_square_sim_time <- function(data, sim_time, n_sim) { return(rep(sim_time^2, n=n_sim)) } dag <- empty_dag() + node_td("unclear", type=node_square_sim_time, n_sim=100) ``` Note that we did not (and should not!) actually define the `sim_time` argument in the `node_td()` definition of the node, because it will be passed internally, just like `data` is. As long as `sim_time` is a named argument of the function the user is passing, it will be handled automatically. In real simulation studies this feature may be used to create time-scale dependent risks or effects for some time-dependent events of interest. ### Using the `past_states` Argument As stated earlier, another special kind of argument is the `past_states` argument, which allows users direct access to past states of the simulation. Below is an example of how this might be used: ```{r} node_prev_state <- function(data, past_states, sim_time) { if (sim_time < 3) { return(rnorm(n=nrow(data))) } else { return(past_states[[sim_time-2]]$A + rnorm(n=nrow(data))) } } dag <- empty_dag() + node_td("A", type=node_prev_state, parents="A") ``` This function simply returns the value used two simulation time steps ago plus a normally distributed random value. To make this happen, we actually use both the `sim_time` argument **and** the `past_states` argument. Note that, again, we do not (and cannot!) define these arguments in the `node_td()` definition of the node. They are simply used internally. A crucial thing to make the previous code work in an actual simulation is the `save_states` argument of the `sim_discrete_time()` function. This argument controls which states should be saved internally. If users want to use previous states, these need to be saved, so the argument should in almost all cases be set to `save_states="all"`, as shown below: ```{r} sim <- sim_discrete_time(dag, n_sim=100, max_t=10, save_states="all") ``` # Using the Formula Interface Users may also use the enhanced `formula` interface directly with custom child nodes and custom time-dependent nodes. This is described in detail in the vignette on specifying formulas (see `vignette(topic="v_using_formulas", package="simDAG")`). # Some General Comments Using custom functions as node types is an advanced technique to obtain specialized simulated data. It is sadly impossible to cover all user cases here, but we would like to give some general recommendations nonetheless: * When using custom nodes, pass the function to `type` directly, do not use a string. This might avoid some weird scoping issues, depending on which environment the simulation is performed in. * Keep it simple, if u can. Particularly in time-dependent simulations, the computational complexity of the node function matters a lot. * Consider if `node_identity()` might be used instead. In many cases, it is a lot easier to just use a node of type `identity` instead of defining a new function. * The structural equations printed for custom nodes may be uninformative.