--- title: "Evaluating Emotion Classification with evaluate_emotions()" author: "transforEmotion Team" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Evaluating Emotion Classification with evaluate_emotions()} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction The `evaluate_emotions()` function provides comprehensive evaluation capabilities for discrete emotion classification tasks. This vignette demonstrates how to use the function to assess model performance using standard metrics and visualizations. ## Installation and Setup ```{r, eval=FALSE} # Install transforEmotion if not already installed # devtools::install_github("your-repo/transforEmotion") library(transforEmotion) ``` ```{r} library(transforEmotion) ``` ## Basic Usage ### Creating Sample Data First, let's create some sample evaluation data to demonstrate the function: ```{r} # Create synthetic evaluation data set.seed(42) n_samples <- 200 # Generate ground truth labels emotions <- c("anger", "joy", "sadness", "fear", "surprise") eval_data <- data.frame( id = 1:n_samples, truth = sample(emotions, n_samples, replace = TRUE, prob = c(0.2, 0.3, 0.2, 0.15, 0.15)), stringsAsFactors = FALSE ) # Generate realistic predictions (correlated with truth but with some errors) eval_data$pred <- eval_data$truth # Introduce some classification errors error_indices <- sample(1:n_samples, size = 0.25 * n_samples) eval_data$pred[error_indices] <- sample(emotions, length(error_indices), replace = TRUE) # Generate probability scores for (emotion in emotions) { # Higher probability for correct class, lower for others eval_data[[paste0("prob_", emotion)]] <- ifelse( eval_data$truth == emotion, runif(n_samples, 0.6, 0.95), # Higher prob for correct class runif(n_samples, 0.01, 0.4) # Lower prob for incorrect classes ) } # Normalize probabilities to sum to 1 prob_cols <- paste0("prob_", emotions) prob_sums <- rowSums(eval_data[, prob_cols]) eval_data[, prob_cols] <- eval_data[, prob_cols] / prob_sums # Display sample data head(eval_data) ``` ### Basic Evaluation Now let's evaluate the model performance with basic metrics: ```{r} # Basic evaluation with default metrics results <- evaluate_emotions( data = eval_data, truth_col = "truth", pred_col = "pred" ) # Print results print(results) ``` ### Evaluation with Probabilities For more comprehensive evaluation including calibration metrics: ```{r} # Full evaluation with probability scores results_full <- evaluate_emotions( data = eval_data, truth_col = "truth", pred_col = "pred", probs_cols = prob_cols, classes = emotions, return_plot = TRUE ) # Display summary summary(results_full) ``` ## Understanding the Metrics ### Classification Metrics The function computes several standard classification metrics: - **Accuracy**: Overall classification accuracy - **Precision**: Per-class and macro/micro averages - **Recall**: Per-class and macro/micro averages - **F1-Score**: Harmonic mean of precision and recall ```{r} # Access per-class metrics results_full$per_class_metrics ``` ### Probabilistic Metrics When probability scores are provided: - **AUROC**: Area under the ROC curve for each class - **ECE**: Expected Calibration Error measuring probability calibration ```{r} # AUROC results results_full$auroc # Calibration error cat("Expected Calibration Error:", round(results_full$ece, 3)) ``` ### Inter-rater Reliability 
### Inter-rater Reliability

Krippendorff's α measures chance-corrected agreement between the human (ground-truth) labels and the model's predictions, treating the two as raters:

```{r}
cat("Krippendorff's α:", round(results_full$krippendorff_alpha, 3))
```

## Visualization

The function provides built-in plotting capabilities:

```{r, eval=FALSE}
# Plot confusion matrix and metrics (requires ggplot2)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plots <- plot(results_full)

  # Display confusion matrix
  print(plots$confusion_matrix)

  # Display per-class metrics
  print(plots$metrics)
}
```

## Integration with transforEmotion Workflow

### Complete Pipeline Example

Here's how to integrate `evaluate_emotions()` into a complete emotion analysis workflow:

```{r, eval=FALSE}
# Step 1: Get emotion predictions using transforEmotion
text_data <- c(
  "I am so happy today!",
  "This makes me really angry.",
  "I feel very sad about this news."
)

# Get transformer-based predictions
predictions <- transformer_scores(
  x = text_data,
  classes = c("anger", "joy", "sadness"),
  return_prob = TRUE
)

# Step 2: Prepare evaluation data (assuming you have ground truth)
ground_truth <- c("joy", "anger", "sadness")  # Your ground-truth labels

eval_df <- data.frame(
  id = seq_along(text_data),
  truth = ground_truth,
  pred = predictions$predicted_class,
  prob_anger = predictions$prob_anger,
  prob_joy = predictions$prob_joy,
  prob_sadness = predictions$prob_sadness,
  stringsAsFactors = FALSE
)

# Step 3: Evaluate performance
evaluation <- evaluate_emotions(
  data = eval_df,
  probs_cols = c("prob_anger", "prob_joy", "prob_sadness")
)

print(evaluation)
```

### Using with CSV Data

You can also evaluate predictions stored in a CSV file by passing the file path as `data`:

```{r, eval=FALSE}
# Save evaluation data to CSV
write.csv(eval_data, "model_evaluation.csv", row.names = FALSE)

# Load and evaluate from CSV
csv_results <- evaluate_emotions(
  data = "model_evaluation.csv",
  probs_cols = prob_cols
)
```

## Advanced Usage

### Custom Metrics Selection

Select only specific metrics for faster computation:

```{r}
# Evaluate only accuracy and F1 scores
quick_eval <- evaluate_emotions(
  data = eval_data,
  metrics = c("accuracy", "f1_macro", "f1_micro"),
  return_plot = FALSE
)

print(quick_eval$metrics)
```

### Handling Missing Data

The function automatically handles missing values:

```{r}
# Create data with missing values
eval_data_missing <- eval_data
eval_data_missing$truth[1:5] <- NA
eval_data_missing$pred[6:10] <- NA

# Evaluate with automatic missing value removal
results_clean <- evaluate_emotions(
  data = eval_data_missing,
  na_rm = TRUE  # Default behavior
)

cat("Original samples:", nrow(eval_data_missing), "\n")
cat("Samples after cleaning:", results_clean$summary$n_instances, "\n")
```

### Custom Column Names

If your data uses different column names, point `truth_col` and `pred_col` at them:

```{r}
# Rename columns in your data
custom_data <- eval_data
names(custom_data)[names(custom_data) == "truth"] <- "ground_truth"
names(custom_data)[names(custom_data) == "pred"] <- "model_prediction"

# Evaluate with custom column names
custom_results <- evaluate_emotions(
  data = custom_data,
  truth_col = "ground_truth",
  pred_col = "model_prediction",
  metrics = c("accuracy", "f1_macro")
)

print(custom_results)
```

## Best Practices

### 1. Always Include Probability Scores

When possible, include probability scores for a more comprehensive evaluation:

```{r, eval=FALSE}
# Good: Include probabilities for calibration analysis
results_with_probs <- evaluate_emotions(
  data = eval_data,
  probs_cols = prob_cols
)
```

### 2. Use Appropriate Metrics

Choose metrics based on your use case:

- **Balanced datasets**: Accuracy and micro F1 (every instance weighted equally)
- **Imbalanced datasets**: Macro F1 and per-class recall (every class weighted equally)
- **Probabilistic models**: AUROC and ECE
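To make the macro/micro distinction concrete, here is a small base-R sketch (an illustration only, independent of `evaluate_emotions()`) that derives both averages from the confusion matrix of the synthetic data created earlier. Micro-averaging pools the counts across classes, so micro F1 coincides with overall accuracy for single-label data, while macro F1 averages the per-class scores and therefore gives rare classes equal weight.

```{r}
# Hand-rolled macro vs. micro F1 (illustration only, not the package implementation)
class_levels <- sort(unique(eval_data$truth))
cm <- table(
  truth = factor(eval_data$truth, levels = class_levels),
  pred  = factor(eval_data$pred,  levels = class_levels)
)

# Per-class counts
tp <- diag(cm)          # true positives
fp <- colSums(cm) - tp  # false positives
fn <- rowSums(cm) - tp  # false negatives

# Macro F1: average the per-class F1 scores (each class counts equally)
f1_per_class <- 2 * tp / (2 * tp + fp + fn)
f1_macro <- mean(f1_per_class)

# Micro F1: pool the counts first (each instance counts equally);
# for single-label data this equals overall accuracy
f1_micro <- 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))

round(c(macro = f1_macro, micro = f1_micro), 3)
```

Whenever per-class performance differs, the two averages disagree, which is why the choice matters most for imbalanced data.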
### 3. Validate Data Quality

Always check your evaluation data before analysis:

```{r}
# Check class distribution
table(eval_data$truth)
table(eval_data$pred)

# Check for missing values
sum(is.na(eval_data$truth))
sum(is.na(eval_data$pred))
```

### 4. Report Multiple Metrics

Don't rely on a single metric; report comprehensive results:

```{r}
# Get comprehensive evaluation
comprehensive_eval <- evaluate_emotions(
  data = eval_data,
  probs_cols = prob_cols,
  metrics = c("accuracy", "precision", "recall", "f1_macro", "f1_micro",
              "auroc", "ece", "krippendorff", "confusion_matrix")
)

# Report key metrics
key_metrics <- comprehensive_eval$metrics[
  comprehensive_eval$metrics$metric %in% c("accuracy", "f1_macro", "f1_micro"),
]

print(key_metrics)
```

## Conclusion

The `evaluate_emotions()` function provides a comprehensive toolkit for evaluating emotion classification models. It integrates seamlessly with the transforEmotion package workflow and follows best practices from the machine learning evaluation literature.

Key features:

- Standard classification metrics (accuracy, precision, recall, F1)
- Probabilistic evaluation (AUROC, calibration)
- Inter-rater reliability (Krippendorff's α)
- Built-in visualization capabilities
- Flexible input handling and data validation

For more information, see the function documentation with `?evaluate_emotions`.
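The help page, along with any other vignettes installed with the package, can be opened directly from the R console:

```{r, eval=FALSE}
# Open the help page for evaluate_emotions()
?evaluate_emotions

# Browse all vignettes shipped with transforEmotion
browseVignettes(package = "transforEmotion")
```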