--- title: dfidx author: Yves Croissant date: 2024/08/22 number-sections: true output: pdf_document: number-sections: true html_document: toc: true toc_float: true bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{dfidx and tibbles} %\VignetteEngine{quarto::pdf} --- In some situations, series from a data frame have a natural two-dimensional (tabular) representation because each observation can be uniquely characterized by a combination of two indexes. Two major cases of this situations in applied econometrics are: - panel data, where the same individuals are observed for several time periods, - random utility models, where each observation describes the features of an alternative among a set of alternatives for a given choice situation. The idea of **dfidx** is to keep in the same object the data and the information about its structure. A `dfidx` is a data frame with an `idx` column, which is a data frame that contains the series that defines the indexes. This vignette supersede the preceding vignette of the **dfidx** package by showing the advantages of creating `dfidx` objects from a tibble and not from an ordinary data frame.^[The advantage of attaching the **dplyr** package [@WICK:FRAN:23] is that the **magrittr**'s pipe [@BACH:WICK:22] and functions from the **tibble** package [@MULL:WICK:23] are exported]. It also introduces a new vector interface to define the indexes. # Basic use of the `dfidx` function The `dfidx` package is loaded using: ```{r } #| label: setup #| include: false knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, widht = 50) old_options <- options(width = 70, tibble.print_max = 3, tibble.print_min = 3 ) ``` ```{r } #| label: load_dfidx library(dfidx) ``` We also attach the **dplyr** package because we'll use throughout this vignette tibbles and not ordinary data frames and we'll show how **dplyr**'s verbs can be used with `dfidx` objects thanks to appropriate methods. ```{r} #| label: load_dplyr library(dplyr) ``` To illustrate the features of **dfidx**, we'll use the `munnell` data set [@MUNN:90] that is used in @BALT:13's famous book and is part of the **plm** package as `Produc`. It contains several economic series for American states from 1970 to 1986. We've added to the initial data set a `president` series which indicates the name of the American president in power for the given year. ```{r} #| label: print.tibble munnell ``` The two indexes are `state` and `year` and both are nested in another variable: `state` in `region` and `year` in `president`. A `dfidx` object is created with the `dfidx` function: the first argument should be a data frame (or a tibble) and the second argument `idx` is used to indicate the indexes. As, in the `munnell` data set, the first two columns contain the two indexes, the `idx` argument is not mandatory and a `dfidx` can be obtained from the `munnell` tibble simply by using: ```{r} #| label: print.dfidx munnell %>% dfidx ``` The resulting object is of class `dfidx` and is a tibble with an `idx` column, which is a tibble containing the two indexes. Note that the two indexes are now longer standalone series in the resulting tibble, because the default value of the `drop.index` argument is `TRUE`. The header of the tibble indicates the names and the cardinal of the two indexes. It also indicated whether the data set is balanced ie, in this panel data context, whether all the states are observed for the same set of years (which is the case for the `munnell` data set). The `idx` column can be retrieved using the `idx` function: ```{r } #| label: extract_idx munnell %>% dfidx %>% idx ``` If the first two columns don't contain the indexes, the `idx` argument should be set. If the observations are ordered first by the first index and then by the second one and if the data set is *balanced*, `idx` can be an integer, the number of distinct values of the first index: ```{r } #| label: dfidx_integer munnell %>% dfidx(48) ``` Then the two indexes are created with the default names `id1` and `id2`. More relevant names can be indicated using the `idnames` argument and the values of the second index can be indicated, using the `levels` argument. ```{r } #| label: dfidx_integer_pretty munnell %>% dfidx(48, idnames = c("state", "year"), levels = 1970:1986) ``` The `idx` argument can also be a character of length one or two. In the first case, only the first index is indicated: ```{r } #| label: dfidx_one_index munnell %>% dfidx("state", idnames = c(NA, "date"), levels = 1970:1986) ``` Note that we've only provided a name for the second index, the `NA` in the first position of the `idnames` argument meaning that we want to keep the original name for the first index. Finally, if the `idx` argument is a character of length 2, it should contain the name of the two indexes. ```{r } #| label: dfidx_two_indexes munnell %>% dfidx(c("state", "year")) ``` # More advanced use of `dfidx` ## Nesting structure One or both of the indexes may be nested in another series. In this case, the `idx` argument is still a character of length two, but the nesting series is indicated as the name of the corresponding index: ```{r} #| label: one_or_two_nests mn <- munnell %>% dfidx(c(region = "state", "year")) mn <- munnell %>% dfidx(c(region = "state", president = "year")) mn ``` The `idx` column is now a tibble containing the two indexes and the nesting variables. ```{r} #| label: idx_two_nests mn %>% idx ``` ## Data frames in wide format `dfidx` can deal with data frames in wide format, *i.e* for which each series for a given value of the second index is a column of the data frame. This is the case of the `munnell_wide` tibble that contains two series of the original data set (`gsp` and `unemp`). ```{r} #| label: munnell_wide munnell_wide ``` Each line is now an American state and, apart the indexes, there are now 34 series with names obtained by the concatenation of the name of the series and the year (for example `gsp_1988`). In this case a supplementary argument called `varying` should be provided. It is a vector of integers indicating the position of the columns that should be merged in the resulting long formatted data frame. The `stats::reshape` function is then used and the `sep` argument can be also provided to indicate the separating character in the names of the series (the default value being `"."`). ```{r} #| label: varying munnell_wide %>% dfidx(varying = 3:36, sep = "_") ``` Better results can be obtained using the `idx` and `idnames` previously described: ```{r} #| label: varying_pretty munnell_wide %>% dfidx(idx = c(region = "state"), varying = 3:36, sep = "_", idnames = c(NA, "year")) ``` # Getting the indexes or their names The name (and the position) of the `idx` column can be obtained as a named integer (the integer being the position of the column and the name its name) using the `idx_name` function: ```{r} #| label: idx_names #| collapse: true idx_name(mn) ``` To get the name of one of the indexes, the second argument, `n`, is set either to 1 or 2 to get the first or the second index, ignoring the nesting variables: ```{r } #| label: one_idx #| collapse: true idx_name(mn, 2) idx_name(idx(mn), 2) ``` Not that `idx_name` can be in this case applied to a `dfidx` or to a `idx` object. To get a nesting variable, the third argument, called `m`, is set to 2: ```{r } #| label: nested_idx #| collapse: true idx_name(mn, 1, 1) idx_name(mn, 1, 2) ``` To extract one or all the indexes, the `idx` function is used. This function has already been encountered when one wants to extract the `idx` column of a `dfidx` object. The same `n` and `m` arguments as for the `idx_name` function can be used in order to extract a specific series. For example, to extract the region index, which nests the state index: ```{r } #| collapse: true #| label: extract_index_with_idx id_index1 <- idx(mn, n = 1, m = 2) id_index2 <- idx(idx(mn), n = 1, m = 2) head(id_index1) identical(id_index1, id_index2) ``` # Data frames subsetting Subsets of data frames are obtained using the `[` and the `[[` operators. The former returns most of the time a data frame as the second one always returns a series. ## Commands that return a data frame Consider first the use of `[`. If one argument is provided, it indicates the columns that should be selected. The result is always a data frame, even if a single column is selected. If two arguments are provided, the first one indicates the subset of lines and the second one the subset of columns that should be returned. If only one column is selected, the result depends on the value of the `drop` argument. If `TRUE`, a series is returned and if `FALSE`, a one series data frame is returned. An important difference between tibbles and ordinary data frames is that the default value of `drop` is `FALSE` for the former and `TRUE` for the later. Therefore, with tibbles, the use of `[` will always by default return a data frame. A specific `dfidx` method is provided for one reason: the column that contains the indexes should be "sticky" (we borrow this idea from the `sf` package^[@PEBE:BIVA:23 and @PEBE:18.]), which means that it should be always returned while using the extractor operator, even if it is not explicitly selected. ```{r} #| label: one_bracket mn[mn$unemp > 10, ] mn[mn$unemp > 10, c("highway", "utilities")] mn[mn$unemp > 10, "highway"] ``` All the previous commands extract the observations where the unemployment rate is greater than 10% and, in the first case all the series, in the second case two of them and in the third case only one series. ## Commmands that return a series A series can be extracted using any of the following commands: ```{r } #| collapse: true #| label: return_series mn1 <- mn[, "highway", drop = TRUE] mn2 <- mn[["highway"]] mn3 <- mn$highway c(identical(mn1, mn2), identical(mn1, mn3)) ``` The result is a `xseries` which inherits the `idx` column from the data frame it has been extracted from as an attribute : ```{r } #| label: xseries mn1 %>% print(n = 3) class(mn1) idx(mn1) %>% print(n = 3) ``` Note that, except when `dfidx` hasn't been used with `drop.index = FALSE`, a series which defines the indexes is dropped from the data frame (but is one of the column of the `idx` column of the data frame). It can be therefore retrieved using: ```{r} #| label: extract_index_1 mn$idx$president %>% head ``` or ```{r } #| label: extract_index_2 idx(mn)$president %>% head ``` or more simply by applying the `$` operator as if the series were a stand-alone series in the data frame : ```{r } #| label: extract_index_3 mn$president %>% print(n = 3) ``` In this last case, the resulting series is a `xseries`, *ie* it inherits the index data frame as an attribute. ## User defined class for extracted series While creating the `dfidx`, a `pkg` argument can be indicated, so that the resulting `dfidx` object and its series are respectively of class `c("dfidx_pkg", "dfidx")` and `c("xseries_pkg", "xseries")` which enables the definition of special methods for `dfidx` and `xseries` objects. For example, consider the hypothetical **pnl** package for panel data: ```{r } #| label: pkg_series #| collapse: true mn <- dfidx(munnell, idx = c(region = "state", president = "year"), pkg = "pnl") mn1 <- mn$gsp class(mn) class(mn1) ``` For example, we want to define a `lag` method for `xseries_pnl` objects. While lagging there should be a `NA` not only on the first position of the resulting vector like for time-series, but each time we encounter a new individual. A minimal `lag` method could therefore be written as: ```{r } #| label: pnl_seriex lag.xseries_pnl <- function(x, ...){ .idx <- idx(x) class <- class(x) x <- unclass(x) id <- .idx[[1]] lgt <- length(id) lagid <- c("", id[- lgt]) sameid <- lagid == id x <- c(NA, x[- lgt]) x[! sameid] <- NA structure(x, class = class, idx = .idx) } lmn1 <- stats::lag(mn1) lmn1 %>% print(n = 3) class(lmn1) rbind(mn1, lmn1)[, 1:20] ``` Note the use of `stats::lag` instead of `lag` which ensures that the `stats::lag` function is used, even if the **dplyr** (or **tidyverse**) package is attached. # tidyverse ## dplyr **dfidx** supports some of the verbs of **dplyr**, namely, for the current version: - `select` to select columns, - `filter` to select some rows using logical conditions, - `arrange` to sort the lines according to one or several variables, - `mutate` and `transmute` for creating new series, - `slice` to select some rows using their position. **dplyr**'s verbs don't work with `dfidx` objects for two main reasons: - the first one is that with most of the verbs (`select` is an exception), the returned object is a `data.frame` (or a `tibble`) and not a `dfidx`, - the second one is that the index column should be "sticky", which means that it should be always returned, even while selecting a subset of columns which doesn't include the index column or while using `transmute`. Therefore, specific methods are provided for **dplyr**'s verb. The general strategy consists on: @. first save the original attributes of the argument (a `dfidx` object), @. coerce to a `data.frame` or a tibble using the `as.data.frame` method, @. use `dplyr`'s verb, @. add the column containing the index if necessary (*i.e.* while using `transmute` or while `select`ing a subset of columns which don't contain the index column), @. change some of the attributes if necessary, @. attach the attributes to the `data.frame` and returns the result. The following code illustrates the use of **dplyr**'s verbs applied to `dfidx` objects. ```{r } #| label: dplyr_verbs select(mn, highway, utilities) arrange(mn, desc(unemp)) mutate(mn, lgsp = log(gsp), lgsp2 = lgsp ^ 2) transmute(mn, lgsp = log(gsp), lgsp2 = lgsp ^ 2) filter(mn, unemp > 10, gsp > 150000) slice(mn, 1:3) ``` To extract a series, the `pull` function can be used: ```{r} #| label: pull mn %>% pull(utilities) ``` # Model building The two main steps in **R** in order to estimate a model are to use the `model.frame` function to construct a data frame, using a formula and a data frame and then to extract from it the matrix of covariates using the `model.matrix` function. ## Model frame The default method of `model.frame` has as first two arguments `formula` and `data`. It returns a data frame with a `terms` attribute. Some other methods exist in the **stats** package, for example for `lm` and `glm` object with a first and main argument called `formula`. This is quite unusual and misleading as for most of the generic functions in **R**, the first argument is called either `x` or `object`. Another noticeable method for `model.frame` is provided by the **Formula** package and, in this case, the first argument is a `Formula` object, which is an extended formula which can contain several parts on the left and/or on the right hand side of the formula. We provide a `model.frame` method for `dfidx` objects, mainly because the `idx` column should be returned in the resulting data frame. This leads to an unusual order of the arguments, the data frame first and then the formula. The method then first extract (and subset if necessary the `idx` column), call the `formula`/`Formula` method and then add to the resulting data frame the `idx` column. The resulting data frame is a `dfidx` object. ```{r} #| label: model_frame mf_mn <- mn %>% model.frame(gsp ~ utilities + highway | unemp | labor, subset = unemp > 10) mf_mn formula(mf_mn) ``` ## Model matrix `model.matrix` is a generic function and for the default method, the first two arguments are a `terms` object and a data frame. In `lm`, the `terms` attribute is extracted from the `model.frame` internally constructed using the `model.frame` function. This means that, at least in this context, `model.matrix` doesn't need a `formula`/`term` argument and a `data.frame`, but only a data frame returned by the model frame method, i.e. a data frame with a `terms` attribute. We use this idea for the `model.matrix` method for `dfidx` object; the only required argument is a `dfidx` returned by the `model.frame` function. The formula is then extracted from the `dfidx` and the `Formula` or default method is then called. The result is a matrix of class `dfidx_matrix`, with a printing method that allows the use of the `n` argument: ```{r} #| label: model_matrix mf_mn %>% model.matrix(rhs = 1) mf_mn %>% model.matrix(rhs = 2:3) %>% print(n = 5) ``` ```{r } mn <- dfidx(munnell, idx = c(region = "state", president = "year"), name = "index", position = 4) mn ``` # References