---
title: "Data Wrangling & Visualization"
author: "Bernardo Lares"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Wrangling & Visualization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
library(dplyr)
```

## Install and Load

Install `lares` from CRAN with `install.packages("lares")`, or get the development version from GitHub with `remotes::install_github("laresbernardo/lares")`. Then, load the package:

```{r}
library(lares)
```

## Quick Start with Built-in Data

We'll use the Titanic dataset included in `lares`:

```{r}
data(dft)
head(dft, 3)
```

## Frequency Analysis

### Basic Frequencies

The `freqs()` function provides quick frequency tables with percentages and cumulative values:

```{r}
# How many survived?
freqs(dft, Survived)
```

### Multi-variable Frequencies

```{r}
# Survival by passenger class
freqs(dft, Pclass, Survived)
```

### Visual Frequencies

```{r fig.width=7, fig.height=4}
# Visualize survival by class
freqs(dft, Pclass, Survived, plot = TRUE)
```

### Dataframe-wide Frequencies

Analyze all variables at once:

```{r fig.width=7, fig.height=5}
freqs_df(dft, plot = TRUE, top = 10)
```

## Correlation Analysis

### Correlation Matrix

Get correlations between all variables (categorical variables are handled automatically):

```{r}
# Correlation matrix of numeric variables
cors <- corr(dft[, 2:5], method = "pearson")
head(cors, 3)
```

### Correlate One Variable with All Others

```{r fig.width=7, fig.height=5}
# Which variables correlate most with Survival?
corr_var(dft, Survived, top = 10)
```

### Cross-Correlations

Find the strongest correlations across the entire dataset:

```{r fig.width=7, fig.height=5}
# Top cross-correlations
corr_cross(dft[, 2:6], top = 8)
```

## Data Transformation

### Categorical Reduction

Reduce categories in high-cardinality variables:

```{r}
# Reduce ticket categories (keep the top 5, group the rest as "other")
dft_reduced <- categ_reducer(dft, Ticket, top = 5)
freqs(dft_reduced, Ticket, top = 10)
```

### Normalization

Normalize numeric variables to the [0, 1] range:

```{r}
# Normalize age
dft$Age_norm <- normalize(dft$Age)
head(dft[, c("Age", "Age_norm")], 5)
```

### One-Hot Encoding

Convert categorical variables to binary columns:

```{r}
# One-hot encode passenger class
dft_encoded <- ohse(dft[, c("Pclass", "Survived")], limit = 5)
colnames(dft_encoded)
```

## Date Manipulation

Create date features for time series analysis:

```{r}
# Create sample dates
dates <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day")

# Extract year-month
ym <- year_month(dates[1:5])
ym

# Extract year-quarter
yq <- year_quarter(dates[1:5])
yq

# Cut dates into quarters
quarters <- date_cuts(dates[c(1, 100, 200, 300)], type = "Q")
quarters
```

## Visualization with theme_lares

### Custom ggplot2 Theme

`lares` includes a clean, professional theme:

```{r fig.width=7, fig.height=4}
library(ggplot2)

ggplot(dft, aes(x = Age, y = Fare * 1000, color = Survived)) +
  geom_point(alpha = 0.6) +
  labs(title = "Age vs Fare by Survival") +
  # Customize the theme with several available options
  theme_lares(legend = "top", grid = "Yy", pal = 2, background = "#f2f2f2") +
  # Customize axis scales to look nicer
  scale_y_abbr()
```

### Distribution Plots

Visualize distributions quickly:

```{r fig.width=7, fig.height=5}
# Analyze the Fare distribution
distr(dft, Fare, breaks = 20)
```

### Number Formatting

Format numbers for better readability:

```{r}
# Format large numbers
formatNum(c(1234567, 987654.321), decimals = 2)

# Abbreviate numbers
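# As a point of reference, here is a rough base-R sketch of what a
# thousand-step abbreviation does (an illustrative assumption for
# intuition only, not lares' actual implementation):
abbr_sketch <- function(x) {
  suffixes <- c("", "K", "M", "B")
  i <- pmin(floor(log10(abs(x)) / 3), 3)
  paste0(round(x / 1000^i, 1), suffixes[i + 1])
}
abbr_sketch(c(1500, 2500000, 1.5e9)) # "1.5K" "2.5M" "1.5B"

# The real helper: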
num_abbr(c(1500, 2500000, 1.5e9))

# Convert abbreviations back to numbers
num_abbr(c("1.5K", "2.5M", "1.5B"), numeric = TRUE)
```

### Custom Scales

Use lares scales for better axis formatting:

```{r fig.width=7, fig.height=4}
df_summary <- dft %>%
  group_by(Pclass) %>%
  summarize(avg_fare = mean(Fare, na.rm = TRUE), .groups = "drop")

ggplot(df_summary, aes(x = factor(Pclass), y = avg_fare)) +
  geom_col(fill = "#00B1DA") +
  labs(title = "Average Fare by Class", x = "Class", y = NULL) +
  scale_y_dollar() + # Format as currency
  theme_lares()
```

## Text and Vector Utilities

### Vector to Text

Convert vectors to readable text:

```{r}
# Simple comma-separated output
vector2text(c("apple", "banana", "cherry"))

# With "and" before the last item
vector2text(c("red", "green", "blue"), and = "and")

# Shorter alias
v2t(LETTERS[1:5])
```

## Putting It All Together

Here's a complete analysis workflow:

```{r fig.width=7, fig.height=5}
library(dplyr)

# 1. Load and prepare data
data(dft)

# 2. Clean and transform
dft_clean <- dft %>%
  mutate(Age_Group = cut(Age,
    breaks = c(0, 18, 35, 60, 100),
    labels = c("Child", "Young", "Adult", "Senior")
  ))

# 3. Analyze frequencies
freqs(dft_clean, Age_Group, Survived, plot = TRUE)

# 4. Check correlations
corr_var(dft_clean, Survived_TRUE, top = 8, max_pvalue = 0.05)
```

## Further Reading

### Package Resources

- **Package documentation:** [https://laresbernardo.github.io/lares/](https://laresbernardo.github.io/lares/)
- **GitHub repository:** [https://github.com/laresbernardo/lares](https://github.com/laresbernardo/lares)
- **Report issues:** [https://github.com/laresbernardo/lares/issues](https://github.com/laresbernardo/lares/issues)

### Blog Posts & Tutorials

- **Find Insights with Ranked Cross-Correlations:** [`corr_cross()` reference](https://laresbernardo.github.io/lares/reference/corr_cross.html)
- **Visualize Monthly Income Distribution and Spend Curve:** [`distr()` reference](https://laresbernardo.github.io/lares/reference/distr.html)
- **All lares articles:** [package articles index](https://laresbernardo.github.io/lares/articles/)

## Next Steps

- Explore machine learning with `h2o_automl()` (see the Machine Learning vignette)
- Learn about API integrations with ChatGPT and Gemini (see the API Integrations vignette)
- Check individual function documentation: `?freqs`, `?corr`, `?theme_lares`