Getting Started with autoFlagR

Introduction

autoFlagR is an R package for automated data quality auditing using unsupervised machine learning. It provides AI-driven anomaly detection for data quality assessment, primarily designed for Electronic Health Records (EHR) data, with benchmarking capabilities for validation and publication.

Installation

Install the package from CRAN:

install.packages("autoFlagR")

Basic Workflow

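The core workflow consists of three main steps, covered in Steps 2 to 4 below; Step 1 loads the package and Step 5 adds reporting:

  1. Preprocess your data
  2. Score anomalies with Isolation Forest or Local Outlier Factor (LOF)
  3. Flag top anomalies for review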

Step 1: Load the Package

library(autoFlagR)
library(dplyr)

Step 2: Prepare Your Data

The prep_for_anomaly() function automatically handles:

  - Identifier columns (patient_id, encounter_id, etc.)
  - Missing value imputation
  - Numerical feature scaling (MAD or min-max)
  - Categorical variable encoding (one-hot)

# Example healthcare data
data <- data.frame(
  patient_id = 1:200,
  age = rnorm(200, 50, 15),
  cost = rnorm(200, 10000, 5000),
  length_of_stay = rpois(200, 5),
  gender = sample(c("M", "F"), 200, replace = TRUE),
  diagnosis = sample(c("A", "B", "C"), 200, replace = TRUE)
)

# Introduce some anomalies
data$cost[1:5] <- data$cost[1:5] * 20  # Unusually high costs
data$age[6:8] <- c(200, 180, 190)  # Impossible ages

# Prepare data for anomaly detection
prepared <- prep_for_anomaly(data, id_cols = "patient_id")
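
The preprocessing operations listed above can be illustrated in base R. This is a conceptual sketch only, not autoFlagR's internal code: the helper names (impute_median, scale_mad, scale_minmax) and the choice of median imputation are assumptions made for the example.

```r
# Illustrative base R versions of the preprocessing ideas.
# Helper names and the median-imputation strategy are hypothetical.
set.seed(42)
toy <- data.frame(
  age    = c(rnorm(9, 50, 15), NA),   # one missing value
  gender = rep(c("M", "F"), 5)
)

# Replace missing numeric values with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Robust scaling: centre on the median, divide by the MAD
scale_mad <- function(x) (x - median(x)) / mad(x)

# Min-max scaling to the [0, 1] interval
scale_minmax <- function(x) (x - min(x)) / (max(x) - min(x))

age_filled <- impute_median(toy$age)
age_mad    <- scale_mad(age_filled)
age_minmax <- scale_minmax(age_filled)

# One-hot encoding: one indicator column per category level
gender_onehot <- model.matrix(~ gender - 1, data = toy)
```

MAD scaling is less sensitive to extreme values than mean and standard deviation scaling, which matters when the data are expected to contain the very outliers being searched for.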

Step 3: Score Anomalies

Use either Isolation Forest (default) or Local Outlier Factor (LOF):

# Score anomalies using Isolation Forest
scored_data <- score_anomaly(
  data, 
  method = "iforest", 
  contamination = 0.05
)

# View anomaly scores
head(scored_data[, c("patient_id", "anomaly_score")], 10)
#>    patient_id anomaly_score
#> 1           1    0.15034167
#> 2           2    0.21395292
#> 3           3    0.00000000
#> 4           4    0.02693202
#> 5           5    0.23670251
#> 6           6    0.04638215
#> 7           7    0.11533699
#> 8           8    0.15881136
#> 9           9    0.92531753
#> 10         10    0.71809012
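
For readers unfamiliar with Local Outlier Factor, the idea can be sketched in a few lines of base R. This is a textbook-style illustration under simplifying assumptions (Euclidean distance, ties ignored, no duplicate rows); lof_scores is a hypothetical helper and not the implementation autoFlagR uses.

```r
# Minimal Local Outlier Factor sketch (hypothetical helper, k >= 2).
lof_scores <- function(X, k = 5) {
  D <- as.matrix(dist(X))      # pairwise Euclidean distances
  n <- nrow(D)
  diag(D) <- Inf               # a point is not its own neighbour

  # Indices of the k nearest neighbours of each point (n x k matrix)
  nn <- t(apply(D, 1, function(d) order(d)[seq_len(k)]))

  # Distance from each point to its k-th nearest neighbour
  kdist <- D[cbind(seq_len(n), nn[, k])]

  # Local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    j <- nn[i, ]
    1 / mean(pmax(kdist[j], D[i, j]))
  }, numeric(1))

  # LOF: neighbours' average density relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i], numeric(1))
}

# 50 clustered points plus one point placed far from the cluster
set.seed(123)
X <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))
lof <- lof_scores(X, k = 5)
which.max(lof)   # index of the most isolated point
```

Scores near 1 indicate a point whose neighbourhood is about as dense as its neighbours' neighbourhoods; values well above 1 indicate a point in a comparatively sparse region.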

Step 4: Flag Top Anomalies

Flag records as anomalous based on a score threshold or a contamination rate:

# Flag top anomalies
flagged_data <- flag_top_anomalies(
  scored_data, 
  contamination = 0.05
)

# View flagged anomalies
anomalies <- flagged_data[flagged_data$is_anomaly, ]
head(anomalies[, c("patient_id", "anomaly_score", "is_anomaly")], 10)
#>     patient_id anomaly_score is_anomaly
#> 39          39     0.9697503       TRUE
#> 56          56     0.9862881       TRUE
#> 63          63     0.9727825       TRUE
#> 73          73     0.9998179       TRUE
#> 135        135     0.9830231       TRUE
#> 157        157     1.0000000       TRUE
#> 175        175     0.9912094       TRUE
#> 184        184     0.9810962       TRUE
#> 191        191     0.9733082       TRUE
#> 192        192     0.9776592       TRUE
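
The two flagging strategies can be sketched on a plain vector of scores. This is an assumption about how such rules are commonly applied (flag the top fraction of scores, or everything above a fixed cutoff), not a description of what flag_top_anomalies() does internally; the variable names and the 0.95 cutoff are illustrative.

```r
# Illustrative flagging rules on simulated scores in [0, 1].
set.seed(1)
scores <- runif(200)

# Contamination-based: flag the top 5% of records by score
contamination <- 0.05
n_flag <- round(length(scores) * contamination)
flag_by_rate <- rank(-scores, ties.method = "first") <= n_flag

# Threshold-based: flag every record at or above a fixed cutoff
cutoff <- 0.95   # hypothetical value
flag_by_cutoff <- scores >= cutoff

sum(flag_by_rate)     # number of records flagged by rate
sum(flag_by_cutoff)   # number flagged by cutoff; depends on the scores
```

A contamination rate fixes how many records are sent for review regardless of how extreme they are, while a threshold fixes how extreme a record must be regardless of how many qualify.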

Step 5: Generate Audit Report

Generate comprehensive PDF, HTML, or DOCX reports:

# Generate PDF report (saves to tempdir() by default)
generate_audit_report(
  data,
  filename = "my_audit_report",
  output_dir = tempdir(),
  output_format = "pdf",
  method = "iforest",
  contamination = 0.05
)

Key Features

Next Steps