---
title: "explore"
author: "Roland Krasser"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{explore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!

explore package on GitHub: [https://github.com/rolkra/explore](https://github.com/rolkra/explore)

As the explore functions fit well into the tidyverse, we load the dplyr package as well.

```{r message=FALSE, warning=FALSE}
library(dplyr)
library(explore)
```

### Interactive data exploration

Explore your data set (in this case the iris data set) in one line of code:

```{R eval=FALSE, echo=TRUE}
explore(iris)
```

A shiny app is launched. You can inspect individual variables, explore their relation to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few "mouse clicks".

![](../man/figures/explore-shiny-iris-target-species.png){width=600px}

You can choose any variable as a target, as long as it is binary (0/1, FALSE/TRUE or "no"/"yes"), categorical or numeric.

### Report variables

Create a rich HTML report of all variables with one line of code:

```{R eval=FALSE, echo=TRUE}
# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())
```

![](../man/figures/report-attributes.png){width=600px}

Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.

```{R eval=FALSE, echo=TRUE}
# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>%
  report(output_file = "report.html",
         output_dir = tempdir(),
         target = is_versicolor)
```

If you use a binary target, the parameter `split = FALSE` (or `targetpct = TRUE`) will give you a different view of the data.

![](../man/figures/report-target.png){width=600px}

### Grow a decision tree

Grow a decision tree with one line of code:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
iris %>% explain_tree(target = Species)
```

You can grow a decision tree with a binary target too.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
# derive a binary target from Species
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
# drop Species so the tree cannot split on the column the target was derived from
iris %>% select(-Species) %>% explain_tree(target = is_versicolor)
```

Or use a numerical target. The syntax stays the same.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
iris %>% explain_tree(target = Sepal.Length)
```

You can control the growth of the tree using the parameters `maxdepth`, `minsplit` and `cp`.

To create other types of models use `explain_forest()`, `explain_xgboost()` and `explain_logreg()`.

### Explore dataset

Explore your table with one line of code to see which types of variables it contains.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore_tbl()
```

You can also use `describe_tbl()` if you just need the main facts without visualization.

```{r message=FALSE, warning=FALSE}
iris %>% describe_tbl()
```

### Explore variables

Explore a variable with one line of code. You don't have to care if a variable is numerical or categorical.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Species)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length)
```

### Explore variables with a target

Explore a variable and its relationship with a binary target with one line of code. You don't have to care if a variable is numerical or categorical.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = is_versicolor)
```

Using `split = FALSE` will change the plot to %target:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
```

The target can have more than two levels:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = Species)
```

Or the target can even be numeric:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = Petal.Length)
```

### Explore multiple variables

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>%
  select(Sepal.Length, Sepal.Width) %>%
  explore_all()
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>%
  select(Sepal.Length, Sepal.Width, is_versicolor) %>%
  explore_all(target = is_versicolor)
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>%
  select(Sepal.Length, Sepal.Width, is_versicolor) %>%
  explore_all(target = is_versicolor, split = FALSE)
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>%
  select(Sepal.Length, Sepal.Width, Species) %>%
  explore_all(target = Species)
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>%
  select(Sepal.Length, Petal.Width, Petal.Length) %>%
  explore_all(target = Petal.Length)
```

```{r message=FALSE, warning=FALSE}
# reload iris to drop the helper column is_versicolor
data(iris)
```

To use a high number of variables with `explore_all()` in an R Markdown file, it is necessary to set a meaningful fig.width and fig.height in the chunk header. The function `total_fig_height()` helps to set fig.height automatically: `fig.height=total_fig_height(iris)`

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=total_fig_height(iris, size=2.5)}
iris %>%
  explore_all()
```

If you use a target: `fig.height=total_fig_height(iris, var_name_target = "Species")`

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=total_fig_height(iris, var_name_target = "Species", size=2.5)}
iris %>% explore_all(target = Species)
```

You can control `total_fig_height()` with the parameters `ncols` (number of plot columns) and `size` (height of one plot).

### Explore correlation between two variables

Explore correlation between two variables with one line of code:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, Petal.Length)
```

You can add a target too:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, Petal.Length, target = Species)
```

### Explore options

If you use `explore()` to explore a variable and want to set lower and upper limits for values, you can use the `min_val` and `max_val` parameters. All values below `min_val` will be set to `min_val`. All values above `max_val` will be set to `max_val`.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
```

`explore` uses auto-scale by default. To deactivate it, use the parameter `auto_scale = FALSE`.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, auto_scale = FALSE)
```

### Describing data

Describe your data in one line of code:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% describe()
```

The result is a data frame in which each row describes one variable of your data.

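Because the result is an ordinary data frame, you can store it and work with it like any other table. The following is a minimal sketch that is not part of the original vignette. It assumes that `describe()` returns one row per column of `iris` and that the result has a `unique` column (the column used for filtering in this section).

```{r message=FALSE, warning=FALSE}
library(dplyr)
library(explore)

# store the description instead of printing it
d <- iris %>% describe()

# assumption: the result is a regular data frame with one row per variable
is.data.frame(d)
nrow(d) == ncol(iris)

# dplyr verbs work on it, e.g. sort by the number of unique values
d %>% arrange(desc(unique))
```

Storing the description in a variable keeps it available for further checks without calling `describe()` again.
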
You can use `filter` from dplyr for quick checks:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# show all variables that contain less than 5 unique values
iris %>% describe() %>% filter(unique < 5)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# show all variables that contain NA values
iris %>% describe() %>% filter(na > 0)
```

You can use `describe` for describing single variables too. You don't need to care if a variable is numerical or categorical. The output is text.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# describe a categorical variable
iris %>% describe(Species)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# describe a numerical variable
iris %>% describe(Sepal.Length)
```

### Use data

Use one of the prepared datasets to explore:

* `use_data_beer()`
* `use_data_diamonds()`
* `use_data_iris()`
* `use_data_mpg()`
* `use_data_mtcars()`
* `use_data_penguins()`
* `use_data_starwars()`
* `use_data_titanic()`
* `use_data_wordle()`

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
use_data_beer() %>% describe()
```

### Create data

Use one of the `create_data_*()` functions to generate an artificial dataset to explore:

* `create_data_abtest()`
* `create_data_app()`
* `create_data_buy()`
* `create_data_churn()`
* `create_data_esoteric()`
* `create_data_newsletter()`
* `create_data_person()`
* `create_data_unfair()`
* `create_data_random()`

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# create dataset and describe it
data <- create_data_app(obs = 100)
describe(data)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# create dataset and describe it
data <- create_data_random(obs = 100, vars = 5)
describe(data)
```

You can build your own random dataset by using `create_data_empty()` and the `add_var_random_*()` functions:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# create dataset and describe it
data <- create_data_empty(obs = 1000) %>%
  add_var_random_01("target") %>%
  add_var_random_dbl("age", min_val = 18, max_val = 80) %>%
  add_var_random_cat("gender",
                     cat = c("male", "female", "other"),
                     prob = c(0.4, 0.4, 0.2)) %>%
  add_var_random_starsign() %>%
  add_var_random_moon()

describe(data)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
data %>% select(random_starsign, random_moon) %>% explore_all()
```

### Basic data cleaning

To clean a variable you can use `clean_var`. With one line of code you can rename a variable, replace NA values and set a minimum and maximum for its values.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>%
  clean_var(Sepal.Length,
            min_val = 4.5,
            max_val = 7.0,
            na = 5.8,
            name = "sepal_length") %>%
  describe()
```

To drop variables or observations you can use the `drop_var_*()` and `drop_obs_*()` functions.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
use_data_penguins() %>%
  describe_tbl()
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
use_data_penguins() %>%
  drop_obs_with_na() %>%
  describe_tbl()
```

### Create notebook

Create an R Markdown template to explore your own data. Set `output_dir` (an existing file may be overwritten).

```{r eval=FALSE, message=FALSE, warning=FALSE}
create_notebook_explore(
  output_dir = tempdir(),
  output_file = "notebook-explore.Rmd")
```

![](../man/figures/notebook-explore.png){width=600px}