cohortBuilder
itself provides five types of filters that
are suitable to perform most common filtering tasks:
If any of the above filters doesn’t meet your need, you may want to create a custom one. Below we describe in details how new filters can be created and provide an example for creating a new one - logical filter.
Before we start let’s make a closer look about what R object the filter is.
The filter
function itself is S3 method taking
type
as a first argument.
filter#> function(type, ...) {
#> UseMethod("filter", type)
#> }
#> <bytecode: 0x55d4b954b5b0>
#> <environment: namespace:cohortBuilder>
So in case of discrete
filter, the proper used method
is:
:::filter.discrete
cohortBuilder#> function(type, id, name, ..., active = getOption("cb_active_filter", default = TRUE)) {
#> args <- append(
#> environment() %>% as.list() %>% purrr::keep(~ !is.symbol(.x)),
#> list(...)
#> )
#>
#> .as_constructor(
#> function(source) {
#> do.call(
#> cb_filter.discrete,
#> append(list(source = source), args)
#> )
#> }
#> )
#> }
#> <bytecode: 0x55d4ba424de0>
#> <environment: namespace:cohortBuilder>
that gathers provided parameters and returns function of
source
argument:
<- filter("discrete", value = "setosa", dataset = "iris", variable = "Species")
spec_filter
spec_filter#> function(source) {
#> do.call(
#> cb_filter.discrete,
#> append(list(source = source), args)
#> )
#> }
#> <bytecode: 0x55d4ba42db38>
#> <environment: 0x55d4bd16e050>
#> attr(,"class")
#> [1] "function" "cb_filter_constructor"
So whenever we define filter of specific type it returns unevaluated
function of source
parameter.
We can realize, the function itself calls another S3 method
cb_filter.discrete
, that takes into account type of the
provided source.
cb_filter.discrete#> function(source, ...) {
#> UseMethod("cb_filter.discrete", source)
#> }
#> <bytecode: 0x55d4b9536b58>
#> <environment: namespace:cohortBuilder>
With such approach for having two layers of S3 methods, we are allowed to to build various filter types, for various source types. For example discrete filter for tblist source, discrete filter for db source, range filter for raw source etc.
Now, let’s check what object stores filter evaluated on source:
<- set_source(
iris_source tblist(iris = iris)
)str(
spec_filter(iris_source),
give.attr = FALSE
)#> List of 10
#> $ id : chr "GDWIM1727265547182"
#> $ type : 'discrete' chr "discrete"
#> $ name : chr "GDWIM1727265547182"
#> $ input_param : chr "value"
#> $ filter_data :function (data_object)
#> $ get_stats :function (data_object, name)
#> $ plot_data :function (data_object)
#> $ get_params :function (name)
#> $ get_data :function (data_object)
#> $ get_defaults:function (data_object, cache_object)
We can see, the evaluated filter is a list of 10 elements:
type
- Filter type.id
- Filter id.name
- Filter name.input_param
- Name of the parameter taking filtering
value.filter_data
- Function of ‘data_object’ parameter
defining filtering logic on Source data object.get_stats
- Function of ‘data_object’ and ‘name’
parameters defining what and how data statistics should be
calculated.plot_data
- Function of ‘data_object’ parameter
defining how filter data should be plotted.get_params
- Function of ‘name’ parameter returning
filter parameter (or all parameters when name is missing).get_data
- Function of ‘data_object’ returning filter
related data.get_defaults
- Function of ‘data_object’ and
‘cache_object’ parameters returning default ‘input_param’ parameter
value.Where:
data_object
is an object passed to and returned from
each filtering step (in more details, following the structure returned
by init_source
S3 method).cache_object
is result of get_stats
method
for the previous step.In case you want to create a new filter definition, you may use
new_filter
to initialize it from template.
Below we’ll create a new filter that takes logical value and filters logical column accordingly. Let’s name type of the filter as ‘logical’. The filter will work on ‘tblist’ source data (list of data frames).
To do so, we need to:
filter.logical
S3 method that is called when
type of filter is ‘logical’.cb_filter.logical
that operates based on
source type.<- function(type, id, name, ..., active = getOption("cb_active_filter", default = TRUE)) {
filter.logical # Skip missing parameters passed and attach `...`
<- append(
args environment() %>% as.list() %>% purrr::keep(~ !is.symbol(.x)),
list(...)
)
# Return function of source parameter calling valid S3 method based on source type
function(source) {
do.call(
cb_filter.logical,append(list(source = source), args)
)
} }
Create cb_filter.logical
generic used in the above
method (skip when method already exists).
<- function(source, ...) {
cb_filter.logical UseMethod("cb_filter.logical", source)
}
Create S3 method for specific source data type (‘tblist’ in this case). Here we define list of parameters required by filter.
The obligatory parameters are:
source
,type
- equals filter type by default,id
- should be randomly generated by default,
.gen_id()
,name
- used only for improving readiness of various
outputs. Can be reassigned from id,active
- the parameter that configures whether filter
should be used or not (even if defined).More to that you should define parameters that allow filter configuration related to data.
In our case we need to configure:
dataset
.variable
.value
. Regarding value,
cohortBuilder
assumes value = NA
means no
filtering is applied. This needs to be handled while writing filtering
logic....
. Accessible via
get_params
method - see below.We can also add an extra parameter keep_na
which
determines whether NA
values should be included or not.
It’s also worth to add description
parameter, storing
helpful information about the defined filter.
Now we create source specific S3 method for
cb_filter.logical
. Inside of the method we call
def_filter
completing all of the parameters based on filter
configuration.
<- function(
cb_filter.logical.tblist type = "logical", id = .gen_id(), name = id, dataset, variable,
source, value = NA, keep_na = TRUE, description = NULL, ..., active = TRUE) {
<- list(...)
args
def_filter(
type = type,
id = id,
name = name,
input_param = "value",
filter_data = function(data_object) {
<- value # code include
selected_value if (keep_na && !identical(selected_value, NA)) {
# keep_na !value_na start
<- data_object[[dataset]] %>%
data_object[[dataset]] ::filter(!!sym(variable) == !!selected_value | is.na(variable))
dplyr# keep_na !value_na end
}if (!keep_na && identical(selected_value, NA)) {
# !keep_na value_na start
<- data_object[[dataset]] %>%
data_object[[dataset]] ::filter(!is.na(!!sym(variable)))
dplyr# !keep_na value_na end
}if (!keep_na && !identical(selected_value, NA)) {
# !keep_na !value_na start
<- data_object[[dataset]] %>%
data_object[[dataset]] ::filter(!!sym(variable) %in% !!selected_value & !is.na(variable))
dplyr# !keep_na !value_na end
}attr(data_object[[dataset]], "filtered") <- TRUE # code include
data_object
},get_stats = function(data_object, name) {
if (missing(name)) {
<- c("n_data", "choices", "n_missing")
name
}<- list(
stats choices = if ("choices" %in% name) data_object[[dataset]][[variable]] %>%
::na.omit() %>% table() %>% as.list(),
statsn_data = if ("n_data" %in% name) data_object[[dataset]][[variable]] %>%
::na.omit() %>%
statslength(),
n_missing = if ("n_missing" %in% name) data_object[[dataset]][[variable]] %>% is.na() %>% sum()
)if (length(name) == 1) {
return(stats[[name]])
else {
} return(stats[name])
}
},plot_data = function(data_object) {
if (nrow(data_object[[dataset]])) {
%>% table %>% prop.table() %>% graphics::barplot()
data_object[[dataset]][[variable]] else {
} ::barplot(0, ylim = c(0, 0.1), main = "No data")
graphics
}
},get_params = function(name) {
<- list(
params dataset = dataset,
variable = variable,
value = value,
description = description,
keep_na = keep_na,
active = active,
...
)if (!missing(name)) return(params[[name]])
return(params)
},get_data = function(data_object) {
data_object[[dataset]][[variable]]
},get_defaults = function(data_object, cache_object) {
list(value = names(cache_object$choices))
}
) }
Please note that:
keep_na = TRUE
).filter_data
we add a few comment blocks
that affect reproducible code output:(!)keep_na (!)value_na (start|end)
-
states which filtering case are we in.code include
comments, or
code include (start|end)
designates which code lines to
include in reproducible code. This way we can easily control which parts
of filtering logic should reproducible code include. When writing custom
filters in a separate package please add keepSource: true
in the description to preserve comments after compilation.get_stats
, we took care to evaluate only
the selected stat (if name
is not missing). Such
implementation is not obligatory, but helps to improve performance if we
operate with large source data.filter_data
method doesn’t change the structure of
data_object
.filter_data
attaches filtered
attribute to
affected dataset. This way, whenever filter_data
is called
(filter is active), we can handle such information while running
bindings. If no bindings are used, this step can be skipped.Now we can use our filter for building cohort.
For the example we’ll use extended iris
table:
<- dplyr::mutate(iris, is_setosa = Species == "setosa")
iris2 <- set_source(tblist(iris = iris2)) %>%
coh cohort(
filter("logical", dataset = "iris", variable = "is_setosa", value = TRUE)
%>%
) run()