SCDB: Benchmarks

```r
library(magrittr)
```

To showcase the performance of SCDB on different database backends, this vignette summarises a simple benchmark: A sample dataset is created from the datasets::iris dataset. This data is repeated 10 times, and each row is given a unique ID (its row number in the data).
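
A minimal sketch of this construction (the ID column name, `row_id`, is illustrative and not fixed by the benchmark):

```r
library(dplyr)

# Repeat the iris dataset 10 times and give each row a unique ID
data <- bind_rows(replicate(10, datasets::iris, simplify = FALSE)) %>%
  mutate(row_id = row_number())
```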

This data forms the basis for three “snapshots” used in the benchmarks (sketched in code after the list):

  1. The data as described above.

  2. As 1. but where any Sepal.Length below the median is halved.

  3. As 2. but where any Sepal.Width below the median is halved.
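
In code, the snapshots can be sketched as follows, building on the `data` object above (exact details may differ from the benchmark code):

```r
# Snapshot 1: the repeated iris data with unique IDs
snapshot_1 <- data

# Snapshot 2: as 1., but Sepal.Length values below the median are halved
snapshot_2 <- snapshot_1 %>%
  mutate(Sepal.Length = if_else(Sepal.Length < median(Sepal.Length),
                                Sepal.Length / 2, Sepal.Length))

# Snapshot 3: as 2., but Sepal.Width values below the median are halved
snapshot_3 <- snapshot_2 %>%
  mutate(Sepal.Width = if_else(Sepal.Width < median(Sepal.Width),
                               Sepal.Width / 2, Sepal.Width))
```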

The benchmark function uses three consecutive calls to update_snapshot() to create the table with the first snapshot and then update it to the second and third snapshots. Finally, the table is deleted.
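
A sketch of this benchmark function; the table name and timestamps are placeholders, and `conn` is assumed to be an open DBI connection to the backend under test:

```r
library(SCDB)

benchmark_fn <- function(conn) {
  # Create the table with the first snapshot, then update it twice
  update_snapshot(snapshot_1, conn, "SCDB_benchmark", timestamp = "2024-01-01")
  update_snapshot(snapshot_2, conn, "SCDB_benchmark", timestamp = "2024-01-02")
  update_snapshot(snapshot_3, conn, "SCDB_benchmark", timestamp = "2024-01-03")

  # Finally, delete the table
  DBI::dbRemoveTable(conn, "SCDB_benchmark")
}
```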

The performance of this benchmark function is timed with the {microbenchmark} package using 10 replicates. All benchmarks are run on the same machine.
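
For example, using the benchmark_fn() sketch above:

```r
library(microbenchmark)

# Time the benchmark function with 10 replicates
timings <- microbenchmark(benchmark_fn(conn), times = 10)
summary(timings)
```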

We include another benchmark to highlight how update_snapshot() scales with the size of the input data. The datasets are similar to the first benchmark, but the number of repeats is varied to show the impact of increasing data size. These benchmarks are run from a “clean” state, where the target_table does not exist. The benchmark measures both the time to create the table and the time to remove it again afterwards (to restore the clean state).
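
A sketch of a single clean-state run and its timing, reusing the pieces above (the repeat counts are illustrative):

```r
# Build a dataset with a configurable number of repeats of iris
make_data <- function(n_repeats) {
  bind_rows(replicate(n_repeats, datasets::iris, simplify = FALSE)) %>%
    mutate(row_id = row_number())
}

# One clean-state run: create the table from scratch, then remove it again
scaling_fn <- function(data_n, conn) {
  update_snapshot(data_n, conn, "SCDB_benchmark", timestamp = "2024-01-01")
  DBI::dbRemoveTable(conn, "SCDB_benchmark")
}

# Time each data size with 5 replicates
timings <- lapply(c(1, 10, 100), function(n) {
  data_n <- make_data(n)
  microbenchmark(scaling_fn(data_n, conn), times = 5)
})
```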

The performance of this benchmark function is timed with the {microbenchmark} package using 5 replicates. All benchmarks are run on the same machine.

The results of the benchmarks are shown graphically below (mean and standard deviation), with a linear-scaling reference (dotted line); we compare the current development version of SCDB with the current CRAN version.

NOTE: There are reports of superlinear complexity for very large data sets. If you experience such problems, consider batching the updates via the filters argument (a hypothetical sketch is given below).
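
A hypothetical sketch of such batching, here by Species; the documentation describes filters as subsetting via an inner join on the filter's columns, but verify this against your SCDB version:

```r
# Hypothetical batching: apply the update one Species at a time
for (species in unique(as.character(data$Species))) {
  # Copy the filter to the connection; the name "batch_filter" is arbitrary
  batch_filter <- dplyr::copy_to(
    conn, data.frame(Species = species),
    name = "batch_filter", overwrite = TRUE
  )

  # Restrict the update (data and target table) to the current batch
  update_snapshot(
    data, conn, "SCDB_benchmark",
    timestamp = "2024-01-01",
    filters = batch_filter
  )
}
```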