---
title: "Yield data preprocessing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Yield data preprocessing}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r eval = !requireNamespace("ggplot2"), echo = FALSE, comment = NA}
message("No package ggplot2 available. Code chunks using that package will not be evaluated.")
```

Usually, yield data comes with many noisy observations. This vignette shows how
to preprocess yield data to remove both spatial and global outliers. The
error-removal protocol follows the one proposed by @Vega2019. Functions from
this package are used in the FastMapping software [@Paccioretti2020].

For this tutorial we will use the `barley` dataset that comes with the `paar`
package. It contains barley grain yield data obtained using calibrated
commercial yield monitors mounted on combines equipped with DGPS. The data is
not stored as an `sf` object, so we will convert it to one first.

First, we load the `paar` package, the `sf` package for spatial data
manipulation, `ggplot2` for plotting, and the `barley` dataset.

```{r setup}
library(paar)
library(sf)
require(ggplot2)
data("barley", package = 'paar')
```

The `barley` dataset is a `data.frame`. We convert it to an `sf` object using
the `st_as_sf` function. The `coords` argument specifies the columns that
contain the coordinates, and the `crs` argument specifies the coordinate
reference system. The `barley` dataset is in UTM zone 20S (EPSG:32720).

```{r}
barley_sf <- st_as_sf(barley, coords = c("X", "Y"), crs = 32720)
```

`barley_sf` is now an `sf` object. We can plot it to visualize the yield data.

- The `plot` function can be used to plot the data.

```{r}
plot(barley_sf["Yield"])
```

- The `ggplot2` package can be used to plot the data.

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_sf) +
  geom_sf(aes(color = Yield)) +
  scale_color_viridis_c() +
  theme_minimal()
```

Let's look at the distribution of the yield values.

- The `hist` function can be used to plot the histogram.

```{r}
hist(barley_sf$Yield, main = 'Yield values distribution')
```

- The `ggplot2` package can be used to plot the histogram.

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_sf) +
  geom_histogram(aes(x = Yield)) +
  theme_minimal()
```

The protocol proposed by @Vega2019 is implemented in the function `depurate`
and consists of three steps:

1. Remove border observations (*edges*).
2. Remove global outliers (*outliers*).
3. Remove spatial outliers (*inliers*).

The `depurate` function takes an `sf` object as input and returns an object of
class `paar`. Any combination of the three steps can be performed. The
`toremove` argument specifies which steps to perform, and the `y` argument
specifies the column name of the variable to be cleaned. A field boundary is
needed to remove the *edges* observations. If a polygon is not provided in the
`poly_border` argument, the function builds a hull around the data and removes
the observations that lie within 10 m of it. The hull is built with
`concaveman::concaveman` if that package is installed; otherwise,
`sf::st_convex_hull` is used.

```{r}
barley_clean_paar <- depurate(barley_sf,
                              y = 'Yield',
                              toremove = c("edges", "outlier", "inlier"))
```

## Summary of the cleaning process

The `depurate` function returns an object of class `paar`. The `paar` object
contains the cleaned data (`$depurated_data`) and the condition of each
observation (`$condition`). A condition of `NA` means that the observation was
not removed.

```{r}
barley_clean_paar
```

The `summary` function reports the number of observations removed and the
percentage considered outliers. It returns a `data.frame`.

```{r}
summary_table <- summary(barley_clean_paar)
summary_table
```

The filtered dataset can be extracted from the `paar` object using
`$depurated_data`.

```{r}
barley_clean <- barley_clean_paar$depurated_data
```

The spatial distribution of the final yield values can be plotted.

- The `plot` function can be used to plot yield values.

```{r}
plot(barley_clean["Yield"])
```

- The `ggplot2` package can be used to plot yield values.

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_clean) +
  geom_sf(aes(color = Yield)) +
  scale_color_viridis_c() +
  theme_minimal()
```

The original data can be compared with the cleaned data.

```{r, eval = !requireNamespace("ggplot2"), echo = FALSE, comment = NA}
message('Package ggplot2 is not available.')
```

- Original data

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_sf) +
  geom_sf(aes(color = Yield)) +
  scale_color_viridis_c() +
  theme_minimal()
```

- Cleaned data

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_clean) +
  geom_sf(aes(color = Yield)) +
  scale_color_viridis_c() +
  theme_minimal()
```

The distributions of the yield values can also be compared.

- Original data

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_sf, aes(x = Yield)) +
  geom_histogram()
```

- Cleaned data

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_clean, aes(x = Yield)) +
  geom_histogram()
```

## Plotting the condition of each observation

The condition of each observation can be combined with the original data using
the `cbind` function. The `paar` object must be the first argument to `cbind`.

```{r}
barley_sf <- cbind(barley_clean_paar, barley_sf)
```

The `barley_sf` object now includes a `condition` column that holds the
condition of each observation.
The condition can be `NA` if the observation was not removed, `edges` if it was
removed in the *edges* step, `outlier` if it was removed in the *outliers*
step, and `inlier` if it was removed in the *inliers* step. The results can be
plotted to see where the removed observations are located.

- The `plot` function can be used to plot the condition of each observation.

```{r}
plot(barley_sf[,'condition'],
     col = as.numeric(as.factor(barley_sf$condition)))
# One legend colour per condition level, matching the codes used in `col`
legend("topright",
       legend = levels(as.factor(barley_sf$condition)),
       fill = seq_along(levels(as.factor(barley_sf$condition))))
```

- The `ggplot2` package can be used to plot the condition of each observation.

```{r, eval = requireNamespace("ggplot2")}
ggplot(barley_sf) +
  geom_sf(aes(color = condition)) +
  scale_fill_viridis_d() +
  scale_color_discrete(
    labels = function(k) {k[is.na(k)] <- "normal"; k},
    na.value = "#44214234") +
  theme_minimal()
```
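
Besides plotting, the `condition` column can be summarized numerically. The chunk below is a minimal sketch using only base R. The `toy` data frame and its values are hypothetical stand-ins for the structure described above, not output from `depurate`. With real data, the same calls could be applied to `barley_sf$condition`.

```{r}
# Hypothetical toy data mimicking a `condition` column:
# NA marks observations that were kept, other values name the removal step.
toy <- data.frame(
  Yield = c(3.1, 0.2, 3.4, 9.8, 3.0, 2.9),
  condition = c(NA, "edges", NA, "outlier", "inlier", NA),
  stringsAsFactors = FALSE
)

# Relabel NA as "normal" so kept observations appear in the table
status <- ifelse(is.na(toy$condition), "normal", toy$condition)

# Number and percentage of observations per condition
counts <- table(status)
percent <- round(100 * prop.table(counts), 1)
counts
percent

# Keep only the observations that were not removed
kept <- toy[is.na(toy$condition), ]
nrow(kept)
```

Relabeling `NA` before calling `table` is a design choice that mirrors the `"normal"` label used in the `ggplot2` legend above. Calling `table(toy$condition, useNA = "ifany")` would work as well but reports the kept observations under `<NA>`.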