--- title: "Introduction to diffdf" author: "Craig Gower & Kieran Martin" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to diffdf} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The purpose of `diffdf` is to provide `proc compare` like functionality to R for use in second line programming. In particular we focus on raising warnings if any differences are found whilst providing in-depth diagnostics to highlight where these differences have occurred. ## Basic usage Here we show the basic functionality of `diffdf` using a dummy data set. ```{r} library(diffdf) LENGTH <- 30 suppressWarnings(RNGversion("3.5.0")) set.seed(12334) test_data <- tibble::tibble( ID = 1:LENGTH, GROUP1 = rep(c(1, 2), each = LENGTH / 2), GROUP2 = rep(c(1:(LENGTH / 2)), 2), INTEGER = rpois(LENGTH, 40), BINARY = sample(c("M", "F"), LENGTH, replace = TRUE), DATE = lubridate::ymd("2000-01-01") + rnorm(LENGTH, 0, 7000), DATETIME = lubridate::ymd_hms("2000-01-01 00:00:00") + rnorm(LENGTH, 0, 200000000), CONTINUOUS = rnorm(LENGTH, 30, 12), CATEGORICAL = factor(sample(c("A", "B", "C"), LENGTH, replace = TRUE)), LOGICAL = sample(c(TRUE, FALSE), LENGTH, replace = TRUE), CHARACTER = stringi::stri_rand_strings(LENGTH, rpois(LENGTH, 13), pattern = "[ A-Za-z0-9]") ) test_data diffdf(test_data, test_data) ``` As you would expect no differences are found. We now look to introduce various types differences into the data in order to show how `diffdf` highlights them. Note that for the purposes of this vignette we have used the `suppress_warnings` argument to stop errors being raised; it is recommended however that this option is not used in production code as it may mask problems. ### Missing Columns ```{r} test_data2 <- test_data test_data2 <- test_data2[, -6] diffdf(test_data, test_data2, suppress_warnings = TRUE) ``` ### Missing Rows ```{r} test_data2 <- test_data test_data2 <- test_data2[1:(nrow(test_data2) - 2), ] diffdf(test_data, test_data2, suppress_warnings = TRUE) ``` ### Different Values ```{r} test_data2 <- test_data test_data2[5, 2] <- 6 diffdf(test_data, test_data2, suppress_warnings = TRUE) ``` ### Different Types ```{r} test_data2 <- test_data test_data2[, 2] <- as.character(test_data2[, 2]) diffdf(test_data, test_data2, suppress_warnings = TRUE) ``` ### Different Labels ```{r} test_data2 <- test_data attr(test_data$ID, "label") <- "This is a interesting label" attr(test_data2$ID, "label") <- "what do I type here?" diffdf(test_data, test_data2, suppress_warnings = TRUE) ``` ### Different Factor Levels ```{r} test_data2 <- test_data levels(test_data2$CATEGORICAL) <- c(1, 2, 3) diffdf(test_data, test_data2, suppress_warnings = TRUE) ``` ## Grouping Variables A key feature of `diffdf` that enables easier diagnostics is the ability to specify which variables form a unique row i.e. which rows should be compared against each other based upon a key. By default if no key is specified `diffdf` will use the row numbers as the key however in general this isn't recommended as it means two identical datasets simply sorted differently can lead to incomprehensible error messages as every observation is flagged as different. In `diffdf` keys can be specified as character vectors using the `keys` argument. ```{r} test_data2 <- test_data test_data2$INTEGER[c(5, 2, 15)] <- 99L diffdf(test_data, test_data2, keys = c("GROUP1", "GROUP2"), suppress_warnings = TRUE) ``` ## Misc ### Accessing problem rows As an additional utility `diffdf` comes with the function `diffdf_issuerows()` which can be used to subset your dataset against the issue object to return just the rows that are flagged as containing issues. ```{r} iris2 <- iris for (i in 1:3) iris2[i, i] <- 99 diff <- diffdf(iris, iris2, suppress_warnings = TRUE) diffdf_issuerows(iris, diff) diffdf_issuerows(iris2, diff) ``` Bear in mind that the `vars` option can be used to just subset down to issues associated with particular variables. ```{r} diffdf_issuerows(iris2, diff, vars = "Sepal.Length") diffdf_issuerows(iris2, diff, vars = c("Sepal.Length", "Sepal.Width")) ``` ### Are there issues ? Sometimes it can be useful to use the comparison result to fuel further checks or programming logic. To assist with this `diffdf` offers two pieces of functionality namely the `suppress_warnings` argument (which has already been shown) and the `diffdf_has_issues()` helper function which simply returns TRUE if differences have been found else FALSE. ```{r} iris2 <- iris for (i in 1:3) iris2[i, i] <- 99 diff <- diffdf(iris, iris2, suppress_warnings = TRUE) diffdf_has_issues(diff) ``` ```{r eval = FALSE} if (diffdf_has_issues(diff)) { # } ``` ### Tolerance You can use the `tolerance` argument of `diffdf` to define how sensitive the comparison should be to decimal place inaccuracies. This important as very often floating point numbers will not compare equal due to machine rounding as they cannot be perfectly represented in binary. By default tolerance is set to `sqrt(.Machine$double.eps)` ```{r} dsin1 <- data.frame(x = 1.1e-06) dsin2 <- data.frame(x = 1.1e-07) diffdf(dsin1, dsin2, suppress_warnings = TRUE) diffdf(dsin1, dsin2, tolerance = 0.001, suppress_warnings = TRUE) ``` ### Strictness By default, the function will note a difference between integer and double columns, and factor and character columns. It can be useful in some contexts to prevent this from occurring. We can do so with the `strict_numeric = FALSE` and `strict_factor = FALSE` arguments. ```{r} dsin1 <- data.frame(x = as.integer(c(1, 2, 3))) dsin2 <- data.frame(x = as.numeric(c(1, 2, 3))) diffdf(dsin1, dsin2, suppress_warnings = TRUE) diffdf(dsin1, dsin2, suppress_warnings = TRUE, strict_numeric = FALSE) dsin1 <- data.frame(x = as.character(c(1, 2, 3)), stringsAsFactors = FALSE) dsin2 <- data.frame(x = as.factor(c(1, 2, 3))) diffdf(dsin1, dsin2, suppress_warnings = TRUE) diffdf(dsin1, dsin2, suppress_warnings = TRUE, strict_factor = FALSE) ```