Implementing custom filters

cohortBuilder itself provides five types of filters that are suitable to perform most common filtering tasks:

If any of the above filters doesn’t meet your need, you may want to create a custom one. Below we describe in details how new filters can be created and provide an example for creating a new one - logical filter.

Filter structure

Before we start let’s make a closer look about what R object the filter is.

The filter function itself is S3 method taking type as a first argument.

filter
#> function(type, ...) {
#>   UseMethod("filter", type)
#> }
#> <bytecode: 0x55d4b954b5b0>
#> <environment: namespace:cohortBuilder>

So in case of discrete filter, the proper used method is:

cohortBuilder:::filter.discrete
#> function(type, id, name, ..., active = getOption("cb_active_filter", default = TRUE)) {
#>   args <- append(
#>     environment() %>% as.list() %>% purrr::keep(~ !is.symbol(.x)),
#>     list(...)
#>   )
#> 
#>   .as_constructor(
#>     function(source) {
#>       do.call(
#>         cb_filter.discrete,
#>         append(list(source = source), args)
#>       )
#>     }
#>   )
#> }
#> <bytecode: 0x55d4ba424de0>
#> <environment: namespace:cohortBuilder>

that gathers provided parameters and returns function of source argument:

spec_filter <- filter("discrete", value = "setosa", dataset = "iris", variable = "Species")
spec_filter
#> function(source) {
#>       do.call(
#>         cb_filter.discrete,
#>         append(list(source = source), args)
#>       )
#>     }
#> <bytecode: 0x55d4ba42db38>
#> <environment: 0x55d4bd16e050>
#> attr(,"class")
#> [1] "function"              "cb_filter_constructor"

So whenever we define filter of specific type it returns unevaluated function of source parameter.

We can realize, the function itself calls another S3 method cb_filter.discrete, that takes into account type of the provided source.

cb_filter.discrete
#> function(source, ...) {
#>   UseMethod("cb_filter.discrete", source)
#> }
#> <bytecode: 0x55d4b9536b58>
#> <environment: namespace:cohortBuilder>

With such approach for having two layers of S3 methods, we are allowed to to build various filter types, for various source types. For example discrete filter for tblist source, discrete filter for db source, range filter for raw source etc.

Now, let’s check what object stores filter evaluated on source:

iris_source <- set_source(
  tblist(iris = iris)
)
str(
  spec_filter(iris_source),
  give.attr = FALSE
)
#> List of 10
#>  $ id          : chr "GDWIM1727265547182"
#>  $ type        : 'discrete' chr "discrete"
#>  $ name        : chr "GDWIM1727265547182"
#>  $ input_param : chr "value"
#>  $ filter_data :function (data_object)  
#>  $ get_stats   :function (data_object, name)  
#>  $ plot_data   :function (data_object)  
#>  $ get_params  :function (name)  
#>  $ get_data    :function (data_object)  
#>  $ get_defaults:function (data_object, cache_object)

We can see, the evaluated filter is a list of 10 elements:

Where:

Logical filter

In case you want to create a new filter definition, you may use new_filter to initialize it from template.

Below we’ll create a new filter that takes logical value and filters logical column accordingly. Let’s name type of the filter as ‘logical’. The filter will work on ‘tblist’ source data (list of data frames).

To do so, we need to:

  1. Create filter.logical S3 method that is called when type of filter is ‘logical’.
  2. Create generic cb_filter.logical that operates based on source type.
filter.logical <- function(type, id, name, ..., active = getOption("cb_active_filter", default = TRUE)) {
  # Skip missing parameters passed and attach `...`
  args <- append(
    environment() %>% as.list() %>% purrr::keep(~ !is.symbol(.x)),
    list(...)
  )

  # Return function of source parameter calling valid S3 method based on source type
  function(source) {
    do.call(
      cb_filter.logical,
      append(list(source = source), args)
    )
  }
}

Create cb_filter.logical generic used in the above method (skip when method already exists).

cb_filter.logical <- function(source, ...) {
  UseMethod("cb_filter.logical", source)
}

Create S3 method for specific source data type (‘tblist’ in this case). Here we define list of parameters required by filter.

The obligatory parameters are:

More to that you should define parameters that allow filter configuration related to data.

In our case we need to configure:

We can also add an extra parameter keep_na which determines whether NA values should be included or not. It’s also worth to add description parameter, storing helpful information about the defined filter.

Now we create source specific S3 method for cb_filter.logical. Inside of the method we call def_filter completing all of the parameters based on filter configuration.

cb_filter.logical.tblist <- function(
  source, type = "logical", id = .gen_id(), name = id, dataset, variable, 
  value = NA, keep_na = TRUE, description = NULL, ..., active = TRUE) {
  args <- list(...)

  def_filter(
    type = type,
    id = id,
    name = name,
    input_param = "value",
    filter_data = function(data_object) {

      selected_value <- value # code include
      if (keep_na && !identical(selected_value, NA)) {
        # keep_na !value_na start
        data_object[[dataset]] <- data_object[[dataset]] %>%
          dplyr::filter(!!sym(variable) == !!selected_value | is.na(variable))
        # keep_na !value_na end
      }
      if (!keep_na && identical(selected_value, NA)) {
        # !keep_na value_na start
        data_object[[dataset]] <- data_object[[dataset]] %>%
          dplyr::filter(!is.na(!!sym(variable)))
        # !keep_na value_na end
      }
      if (!keep_na && !identical(selected_value, NA)) {
        # !keep_na !value_na start
        data_object[[dataset]] <- data_object[[dataset]] %>%
          dplyr::filter(!!sym(variable) %in% !!selected_value & !is.na(variable))
        # !keep_na !value_na end
      }
      attr(data_object[[dataset]], "filtered") <- TRUE # code include
      data_object
    },
    get_stats = function(data_object, name) {
      if (missing(name)) {
        name <- c("n_data", "choices", "n_missing")
      }
      stats <- list(
        choices = if ("choices" %in% name) data_object[[dataset]][[variable]] %>% 
          stats::na.omit() %>% table() %>% as.list(),
        n_data = if ("n_data" %in% name)  data_object[[dataset]][[variable]] %>% 
          stats::na.omit() %>% 
          length(),
        n_missing = if ("n_missing" %in% name) data_object[[dataset]][[variable]] %>% is.na() %>% sum()
      )
      if (length(name) == 1) {
        return(stats[[name]])
      } else {
        return(stats[name])
      }
    },
    plot_data = function(data_object) {
      if (nrow(data_object[[dataset]])) {
        data_object[[dataset]][[variable]] %>% table %>% prop.table() %>% graphics::barplot()
      } else {
        graphics::barplot(0, ylim = c(0, 0.1), main = "No data")
      }
    },
    get_params = function(name) {
      params <- list(
        dataset = dataset,
        variable = variable,
        value = value,
        description = description,
        keep_na = keep_na,
        active = active,
        ...
      )
      if (!missing(name)) return(params[[name]])
      return(params)
    },
    get_data = function(data_object) {
      data_object[[dataset]][[variable]]
    },
    get_defaults = function(data_object, cache_object) {
      list(value = names(cache_object$choices))
    }
  )
}

Please note that:

  1. When filter parameter value equals NA, we assume no filtering is done at all (unless we define keep_na = TRUE).
  2. While defining filter_data we add a few comment blocks that affect reproducible code output:
  1. When defining get_stats, we took care to evaluate only the selected stat (if name is not missing). Such implementation is not obligatory, but helps to improve performance if we operate with large source data.
  2. filter_data method doesn’t change the structure of data_object.
  3. filter_data attaches filtered attribute to affected dataset. This way, whenever filter_data is called (filter is active), we can handle such information while running bindings. If no bindings are used, this step can be skipped.

Now we can use our filter for building cohort.

For the example we’ll use extended iris table:

iris2 <- dplyr::mutate(iris, is_setosa = Species == "setosa")
coh <- set_source(tblist(iris = iris2)) %>%
  cohort(
    filter("logical", dataset = "iris", variable = "is_setosa", value = TRUE)
  ) %>%
  run()