---
title: "Retrieve classifications and correspondence tables stored as Linked Open Data"
output:
  rmarkdown::html_vignette:
    toc: TRUE
vignette: >
  %\VignetteIndexEntry{Retrieve classifications and correspondence tables stored as Linked Open Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="asis"}
cat(" ")
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.align = "center")
old <- getOption("useLocalDataForVignettes")
options(useLocalDataForVignettes = TRUE)
on.exit(options(useLocalDataForVignettes = old), add = TRUE)
```

## Overview
Statistical classifications and correspondence tables are published as Linked Open Data (LOD) by several organisations, notably the Publications Office of the European Union (OP, via CELLAR) and the Food and Agriculture Organization (FAO). While these resources can be accessed directly using SPARQL, this requires specific technical expertise. The **correspondenceTables** package provides high-level R functions that retrieve these data as standard R data frames, so users do not have to write SPARQL queries themselves.

Two core data retrieval functions are provided:

- `retrieveClassificationTable()`: retrieves the structure of a statistical classification (codes, labels, hierarchy).
- `retrieveCorrespondenceTable()`: retrieves a correspondence (mapping) table between two classifications.

Optionally, both functions can return the SPARQL query used for the retrieval, making the process transparent, inspectable, and reproducible.

In addition, the `dataStructure()` utility allows users to inspect the hierarchical structure of a classification (e.g. available levels and code depth) before retrieving the data. This step is optional but recommended when working with hierarchical classifications, particularly when the desired level is not known in advance. It is covered in this vignette, with illustrative examples for both the CELLAR and FAO endpoints.

```{r}
library(correspondenceTables)
```

```{r, echo=FALSE, results="asis"}
cat("")
```

## Discovering available data
Before using the core retrieval functions `retrieveClassificationTable()` and `retrieveCorrespondenceTable()`, it is necessary to know which data can be retrieved and how it is identified. In practice, this means answering the following questions:

- Which classifications or correspondence tables are available?
- From which endpoint (`CELLAR` or `FAO`)?
- Which identifiers (`prefix`, `conceptScheme`, `ID_table`) should be used?
- For hierarchical classifications, which levels are available?

The package provides lightweight discovery utilities to support this step before data retrieval.

### Gathering information necessary for classification retrieval
To retrieve a statistical classification, users first need to know which classifications are available at a given endpoint and how they are identified. The `classificationList()` utility provides this information.
### Example 1: Available classifications (CELLAR)
The example below illustrates the typical output structure using a static snapshot of the `CELLAR` classification list bundled with the package. To retrieve up-to-date information about available classifications, call `classificationList()` directly.

```{r}
list_data <- read.csv(
  system.file("extdata/test", "classificationList_CELLAR.csv",
              package = "correspondenceTables"),
  stringsAsFactors = FALSE
)

knitr::kable(
  head(list_data, 3),
  caption = "Example output of classificationList() (retrieved from CELLAR)"
)
```

For each classification, three identifiers are required to retrieve the data:

- `endpoint`: `"CELLAR"` or `"FAO"`
- `prefix`: namespace prefix used in the SPARQL endpoint
- `conceptScheme`: unique identifier of the classification

For example, the NACE Rev. 2 classification:

- is available from the `CELLAR` repository,
- uses prefix `"nace2"`,
- uses concept scheme `"nace2"`.

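As a rough illustration of this lookup step, the sketch below filters a classification list for a keyword and returns the identifiers needed for retrieval. The data frame is a hand-made stand-in, the column names `Prefix`, `ConceptScheme` and `Title` are assumptions about the output of `classificationList()`, and `find_classification()` is a hypothetical helper that is not part of the package.

```r
# Hand-made stand-in for a classification list (illustrative rows only).
# Column names are assumptions, not a guaranteed output format.
cl <- data.frame(
  Prefix        = c("nace2", "cn2022", "cpa21"),
  ConceptScheme = c("nace2", "cn2022", "cpa21"),
  Title         = c("NACE Rev. 2", "Combined Nomenclature 2022", "CPA 2.1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose title or prefix matches a keyword.
find_classification <- function(cl, keyword) {
  hit <- grepl(keyword, cl$Title, ignore.case = TRUE) |
    grepl(keyword, cl$Prefix, ignore.case = TRUE)
  cl[hit, c("Prefix", "ConceptScheme"), drop = FALSE]
}

ids <- find_classification(cl, "nace")
ids
```

The returned `Prefix` and `ConceptScheme` values can then be passed to the retrieval functions as `prefix` and `conceptScheme`.
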
## Inspecting the structure of hierarchical classifications
Many statistical classifications are hierarchical. If only a specific level is required (e.g. divisions or classes), it is recommended to inspect the classification structure first. The `dataStructure()` function provides this information.
### Example 2: Classification structure (CN 2022, CELLAR)
This example illustrates how to inspect the structural characteristics of a classification stored in the `CELLAR` repository. The `dataStructure()` function can be used to retrieve either a **summary view**, a **detailed view**, or **both**. To keep the vignette reproducible and independent of live SPARQL endpoints, the function calls below are shown for documentation purposes only.
#### Summary view of the classification structure
The `summary` output provides an overview of the hierarchical organisation of the classification. For each level, it reports:

- the classification scheme identifier,
- the hierarchical depth,
- the level label,
- the number of classification items defined at that level.

This view is useful for quickly understanding the overall structure of a classification and identifying which hierarchical levels are available.

```{r, eval = FALSE}
ds_cn <- dataStructure(
  endpoint      = "CELLAR",
  prefix        = "cn2022",
  conceptScheme = "cn2022",
  language      = "en",
  return        = "summary"
)

knitr::kable(head(ds_cn, 20), caption = "CN 2022: dataStructure(summary)")
```

The Combined Nomenclature (CN 2022) follows a hierarchical product classification structure defined at several levels. The summary output shows that it consists of five hierarchical levels:

- **Level 1: Sections**: broad groupings of goods;
- **Level 2: Chapters**: main product divisions;
- **Level 3: Headings**: four-digit product categories;
- **Level 4: HS subheadings**: six-digit Harmonized System categories;
- **Level 5: CN subheadings**: eight-digit CN-specific product codes.

The `Count` column indicates the number of classification items defined at each hierarchical depth.

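To show how a summary table can feed the `level` argument of `retrieveClassificationTable()`, the sketch below maps a level label to its depth as a character string. It uses a hand-made stand-in for the summary output; the column names `Depth` and `Level` are assumptions, and `level_for()` is a hypothetical helper.

```r
# Stand-in for a dataStructure(return = "summary") result.
# `Depth` and `Level` are assumed column names; the labels follow the
# five CN 2022 levels described in the text.
ds_summary <- data.frame(
  Depth = 1:5,
  Level = c("Sections", "Chapters", "Headings",
            "HS subheadings", "CN subheadings"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate a level label into the depth,
# returned as a character string as expected by `level`.
level_for <- function(ds_summary, label) {
  depth <- ds_summary$Depth[ds_summary$Level == label]
  if (length(depth) != 1) stop("Level label not found or ambiguous: ", label)
  as.character(depth)
}

level_for(ds_summary, "Headings")
```
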
#### Detailed view of classification items
The `details` output returns one row per classification item. It provides item-level metadata, including:

- the classification code,
- the preferred label,
- the hierarchical level and depth,
- links to broader (parent) concepts where available.

This view is intended for detailed inspection of classification content, for example when analysing parent-child relationships or validating code hierarchies.

```{r, eval = FALSE}
ds_cn_det <- dataStructure(
  endpoint      = "CELLAR",
  prefix        = "cn2022",
  conceptScheme = "cn2022",
  language      = "en",
  return        = "details"
)

knitr::kable(head(ds_cn_det, 20), caption = "CN 2022: dataStructure(details)")
```

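The kind of hierarchy validation mentioned above could look roughly like the following sketch. It works on a hand-made stand-in; the column names `Code` and `Broader` are assumptions, and the real output may express parent links as URIs rather than codes.

```r
# Stand-in for a dataStructure(return = "details") result.
# `Code` and `Broader` are assumed column names; codes are illustrative.
details <- data.frame(
  Code    = c("A", "01", "02", "01.1"),
  Broader = c(NA, "A", "A", "01"),
  stringsAsFactors = FALSE
)

# Parent codes that are referenced but not present as items themselves.
orphan_parents <- setdiff(na.omit(details$Broader), details$Code)

# Number of direct children per parent code (NA parents are ignored).
children_per_parent <- table(details$Broader)

orphan_parents
children_per_parent
```

An empty `orphan_parents` vector suggests that every parent link points to an item that exists in the table.
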
#### Summary and detailed views combined
When `return = "both"`, the function returns a list containing both summary and detailed outputs. This option can be convenient when both a structural overview and item-level information are required within a single workflow.

```{r, eval = FALSE}
ds_cn_both <- dataStructure(
  endpoint      = "CELLAR",
  prefix        = "cn2022",
  conceptScheme = "cn2022",
  language      = "en",
  return        = "both"
)

knitr::kable(head(ds_cn_both$summary, 20), caption = "CN 2022: summary (from both)")
knitr::kable(head(ds_cn_both$details, 20), caption = "CN 2022: details (from both)")
```

This inspection step can be skipped if the required classification level is already known in advance.

### Example 3: Classification structure (CPC 2.1, FAO)
The same approach can be applied to classifications hosted in the FAO repository. This example illustrates how to inspect the structure of the Central Product Classification (CPC), version 2.1. As with CELLAR, the `dataStructure()` function can return a **summary view**, a **detailed view**, or **both** representations of the classification structure. In practice, the choice depends on whether a high-level overview or item-level information is required. To keep the vignette reproducible and independent of live SPARQL endpoints, the function call below is provided for documentation purposes only.
#### Summary view of the classification structure
The *summary* output provides a compact overview of the hierarchical organisation of CPC 2.1. For each level, it reports:

- the classification scheme identifier,
- the hierarchical depth,
- the level label,
- the number of classification items defined at that level.

This view is useful for understanding the overall structure of the classification before retrieving detailed content.

```{r, eval = FALSE}
endpoint      <- "FAO"
prefix        <- "CPC21"
conceptScheme <- "CPC21"

ds_cpc <- dataStructure(
  endpoint      = endpoint,
  prefix        = prefix,
  conceptScheme = conceptScheme,
  language      = "en",
  showQuery     = FALSE,
  return        = "summary"
)

knitr::kable(
  head(ds_cpc, 20),
  caption = "CPC 2.1: dataStructure(summary, FAO)"
)
```

As in the CELLAR example, `return = "details"` retrieves item-level information, while `return = "both"` returns both summary and detailed outputs in a single call.

## Retrieving classification tables
Once the classification identifiers and (optionally) the desired level are known, the `retrieveClassificationTable()` function can be used to retrieve the data. The function returns a flat data frame suitable for:

- browsing and documentation;
- validation of codes and hierarchy;
- downstream correspondence analysis.

**Main arguments**

- `endpoint`: `"CELLAR"` or `"FAO"`.
- `prefix`: Character. Classification prefix used for matching and URI resolution (e.g. `"cn2022"`, `"cpc21"`, `"isic4"`).
- `conceptScheme`: Character. Local identifier of the scheme (often identical to `prefix`). The function automatically resolves this to the canonical ConceptScheme URI published in the endpoint.
- `language`: Character. Preferred label language as a BCP47 code. Defaults to `"en"` (English). Examples: `"fr"`, `"de"`.
- `level`: Character. One of:
    + `"ALL"` (default): return all levels in the hierarchy;
    + a specific depth value (e.g. `"2"`) to filter concepts at that depth only.
- `showQuery`: Logical.
    + `FALSE` (default): returns only the classification table;
    + `TRUE`: returns a list containing the SPARQL query, the resolved scheme URI, and the table itself.
- `knownSchemes`: Optional. A data.frame supplying authoritative mappings of the form Prefix, ConceptScheme, URI. When provided, this overrides automatic discovery. It can be obtained using `classificationList(endpoint)`.
- `preferMappingOnly`: Logical. If `TRUE`, the function never attempts SPARQL discovery and uses only information in `knownSchemes` or `classificationList(endpoint)`. Default: `FALSE`.

### Example 4: Class‑level NACE Rev. 2 in multiple languages
The following example demonstrates how to retrieve level-4 ("class") data for the German, French, and Bulgarian versions of **NACE Rev. 2**. The code is **not executed** during vignette rendering because data availability and response times may vary.

```{r retrieve-nace-multilang, eval=FALSE}
endpoint      <- "CELLAR"
prefix        <- "nace2"
conceptScheme <- "nace2"
level         <- "4"
languages     <- c("de", "fr", "bg")

results <- lapply(languages, function(lang) {
  retrieveClassificationTable(
    endpoint      = endpoint,
    prefix        = prefix,
    conceptScheme = conceptScheme,
    language      = lang,
    level         = level,
    showQuery     = FALSE
  )
})
```

The resulting object is a list of data frames, one per language, each containing the class-level codes and labels for NACE Rev. 2 in the selected language.

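Because `lapply()` returns an unnamed list, it can help to name the elements by language and, if needed, stack them into one long table. The sketch below uses stand-in data frames instead of live results; the column names `Code` and `Label` and the label values are placeholders, not actual package output.

```r
languages <- c("de", "fr", "bg")

# Stand-in for the list returned by lapply(); labels are placeholders.
results <- lapply(languages, function(lang) {
  data.frame(
    Code  = c("01.11", "01.12"),
    Label = paste0("label-", lang, "-", 1:2),
    stringsAsFactors = FALSE
  )
})

# Name each element by its language so it can be accessed as results$de.
names(results) <- languages

# Stack into one long table with an explicit language column.
combined <- do.call(rbind, lapply(names(results), function(lang) {
  cbind(Language = lang, results[[lang]], stringsAsFactors = FALSE)
}))

head(combined)
```
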
### Example 5: FAO classification at group level
The FAO endpoint provides access to a limited subset of international classifications. Availability depends on the endpoint configuration.

The following steps illustrate how a FAO classification would be retrieved. The code is not executed during vignette rendering. The first call, `classificationList("FAO")`, queries the FAO repository and returns metadata describing all published classification schemes (prefix, concept scheme, title, etc.).

```{r, eval=FALSE}
cl_fao <- classificationList("FAO")

knitr::kable(
  head(cl_fao),
  caption = "Classifications available from the FAO endpoint (preview)"
)
```

**Inspect available prefix identifiers**

The `Prefix` field identifies the catalogue or namespace under which each FAO classification is published.

```{r, eval=FALSE}
knitr::kable(head(unique(cl_fao$Prefix)))
```

**Inspect available concept schemes**

The `ConceptScheme` field identifies the underlying classification schemes that can be queried using `retrieveClassificationTable()`.

```{r, eval=FALSE}
knitr::kable(head(unique(cl_fao$ConceptScheme)))
```

**Retrieving a classification table from the FAO endpoint**

The following example illustrates how to retrieve a classification from the `FAO` repository using `retrieveClassificationTable()`. Because `FAO` data availability and response times may vary, this example is shown for documentation purposes and is not executed in the vignette.

```{r retrieve-fao-classification, eval=FALSE}
endpoint      <- "FAO"
prefix        <- "cpc21"
conceptScheme <- "core"

out <- retrieveClassificationTable(
  endpoint      = endpoint,
  prefix        = prefix,
  conceptScheme = conceptScheme,
  language      = "en",
  level         = "2",
  showQuery     = TRUE
)
```

The `FAO` endpoint provides access to selected international and domain-specific classifications maintained by FAO. Not all `CELLAR` classifications are available via `FAO`, and vice versa.

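Since `showQuery = TRUE` returns a list rather than a bare data frame, downstream code may want to handle both shapes uniformly. The sketch below shows one possible approach; `extract_table()` is a hypothetical helper, and the element names in the mock list are placeholders rather than the names used by the package.

```r
# Hypothetical helper: return the data frame whether the result is a plain
# table (showQuery = FALSE) or a list that contains one (showQuery = TRUE).
extract_table <- function(out) {
  if (is.data.frame(out)) return(out)
  is_df <- vapply(out, is.data.frame, logical(1))
  if (!any(is_df)) stop("No data frame found in the result")
  out[[which(is_df)[1]]]
}

# Stand-in for a showQuery = TRUE result; element names are placeholders.
mock_out <- list(
  query = "SELECT ...",
  uri   = "http://example.org/scheme",
  table = data.frame(Code = c("01", "02"), stringsAsFactors = FALSE)
)

tab <- extract_table(mock_out)
tab
```

Looking up the table by type rather than by name keeps the helper independent of the exact element names.
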
### Example 6: Retrieving a classification table from a known data frame of classification tables
Each time it is executed, `retrieveClassificationTable()` attempts to retrieve the list of all classifications available at the selected endpoint, so that it always uses the most up-to-date URI for a given prefix and concept scheme pair. Because this step can be time consuming, it can be skipped entirely by passing a previously retrieved (and stored) classification list, obtained with `classificationList()`, through the `knownSchemes` argument. The following example shows how to use this argument:

```{r, eval=FALSE}
cl_fao <- classificationList("FAO")

endpoint      <- "FAO"
prefix        <- "cpc21"
conceptScheme <- "core"

out <- retrieveClassificationTable(
  endpoint      = endpoint,
  prefix        = prefix,
  conceptScheme = conceptScheme,
  knownSchemes  = cl_fao
)
```

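One way to store the classification list between sessions is a small on-disk cache, sketched below with base R. `cached_list()` is a hypothetical helper, and `fetch_stub()` stands in for a call such as `function() classificationList("FAO")` so that the sketch runs without network access.

```r
# Hypothetical helper: read a cached classification list if present,
# otherwise call `fetch()` and store the result for later sessions.
cached_list <- function(path, fetch) {
  if (file.exists(path)) return(readRDS(path))
  cl <- fetch()
  saveRDS(cl, path)
  cl
}

# Stub standing in for function() classificationList("FAO").
calls <- 0
fetch_stub <- function() {
  calls <<- calls + 1
  data.frame(Prefix = "cpc21", ConceptScheme = "core",
             stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_list(cache_file, fetch_stub)  # calls the stub once
second <- cached_list(cache_file, fetch_stub)  # read back from disk
```

The cached object can then be supplied as `knownSchemes`. A cached list can become stale, so deleting the file from time to time forces a refresh.
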
## Retrieving correspondence tables
The `retrieveCorrespondenceTable()` function retrieves a correspondence (mapping) table between two statistical classifications from a SPARQL endpoint. Its interface is similar to `retrieveClassificationTable()`, with the main difference that correspondence tables are identified using `ID_table` (instead of `conceptScheme`). Correspondence tables are usually provided at the most granular level of the classifications involved.

**Main arguments**

- `endpoint`: Character. The online service to query. Case-insensitive. Supported values are those returned by the internal endpoint registry (e.g., `"CELLAR"`, `"FAO"`).
- `prefix`: Character. Catalogue prefix where the correspondence is published (e.g., `"nace2"`, `"cpa21"`, `"cn2022"`). Use `correspondenceTableList()` to discover valid values.
- `ID_table`: Character. Identifier of the correspondence, typically of the form `"A_B"` such as `"NACE2_CPA21"` or `"CN2022_NACE2"`. Discover identifiers via `correspondenceTableList()`.
- `language`: Character. Preferred label language as a BCP47 code. Defaults to `"en"` (English). Examples: `"fr"`, `"de"`.
- `showQuery`: Logical. If `TRUE`, returns a list with the SPARQL query and the result data frame; otherwise (default) returns just the data frame.

### Example 7: Available correspondence tables
Before retrieving a correspondence table, users need to identify which correspondences are available and how they are referenced at a given SPARQL endpoint. The `correspondenceTableList()` utility serves this purpose. It is analogous to `classificationList()`, but lists correspondence tables instead of classifications.

The following example illustrates how to list correspondence tables available from the `CELLAR` and `FAO` repositories. It is shown for documentation purposes and not executed during vignette rendering to avoid reliance on live external SPARQL endpoints.

```{r, eval = FALSE}
corr_list <- correspondenceTableList("ALL")
names(corr_list)

# Correspondence tables available from CELLAR
knitr::kable(
  head(corr_list$CELLAR, 10),
  caption = "Available correspondence tables from the CELLAR endpoint (preview)"
)

# Correspondence tables available from FAO
knitr::kable(
  head(corr_list$FAO, 10),
  caption = "Available correspondence tables from the FAO endpoint (preview)"
)
```

When executed interactively, this call returns a list whose elements correspond to the selected endpoints (e.g. `CELLAR`, `FAO`). Each element is a data frame describing the available correspondence tables, including their identifiers, associated prefixes, and human-readable labels.

Each correspondence table is identified by:

- **endpoint**: `"CELLAR"` or `"FAO"`
- **prefix**: namespace associated with the source classification
- **ID_table**: unique identifier of the correspondence table

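To search across endpoints, the per-endpoint data frames can be stacked into one table that carries the endpoint name. The sketch below uses a hand-made stand-in for the list; the column names `Prefix` and `ID_table`, the pairing of prefixes with identifiers, and the assumption that both endpoints share the same columns are illustrative only.

```r
# Stand-in for correspondenceTableList("ALL"); columns and rows are
# illustrative and may differ from the real output.
corr_list <- list(
  CELLAR = data.frame(Prefix   = c("nace2", "cn2022"),
                      ID_table = c("NACE2_CPA21", "CN2022_NACE2"),
                      stringsAsFactors = FALSE),
  FAO    = data.frame(Prefix   = "CPC21",
                      ID_table = "CPC21-ISIC4",
                      stringsAsFactors = FALSE)
)

# Add the endpoint name to each table, then stack them.
all_tables <- do.call(rbind, Map(function(df, ep) {
  cbind(Endpoint = ep, df, stringsAsFactors = FALSE)
}, corr_list, names(corr_list)))
rownames(all_tables) <- NULL

# Correspondences whose identifier mentions NACE2.
nace_tables <- all_tables[grepl("NACE2", all_tables$ID_table), ]
nace_tables
```
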
### Inspect available correspondence tables (CELLAR)
The following example illustrates how to inspect the correspondence tables available from the `CELLAR` endpoint.

```{r corr-table-list, eval=FALSE}
# Inspect correspondence tables available from CELLAR
tbl_cellar <- correspondenceTableList("CELLAR")

knitr::kable(
  head(tbl_cellar, 10),
  caption = "Available correspondence tables from the CELLAR endpoint"
)
```

### Example 8: Retrieve a correspondence table from CELLAR
The following example illustrates the retrieval of a correspondence table published by the Publications Office of the European Union via the `CELLAR` endpoint.

Users should note that the availability of correspondence data depends on what is currently exposed by the underlying SPARQL endpoint. Although a correspondence table may be listed by `correspondenceTableList()`, it can legitimately return an empty result when queried.

For some `CELLAR` correspondences (including several PRODCOM-related mappings), `retrieveCorrespondenceTable()` may therefore return a valid but empty data frame, which does not indicate a failure of the retrieval process.

```{r retrieve-prodcom, eval=FALSE}
res <- retrieveCorrespondenceTable(
  endpoint  = "CELLAR",
  prefix    = "prodcom2023",
  ID_table  = "PRODCOM2023_CPA21",
  language  = "en",
  showQuery = FALSE
)

knitr::kable(
  head(res, 10),
  caption = "PRODCOM2023_CPA21 correspondence table from the CELLAR endpoint"
)
```

The following correspondence is more likely to return data when queried:

```{r, eval = FALSE}
res2 <- retrieveCorrespondenceTable(
  endpoint = "CELLAR",
  prefix   = "nace2",
  ID_table = "NACE2_CPA21",
  language = "en"
)

knitr::kable(
  head(res2, 10),
  caption = "NACE2_CPA21 correspondence table from the CELLAR endpoint"
)
```

For transparency and reproducibility, the SPARQL query used for retrieval can also be inspected by setting `showQuery = TRUE`.

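The empty-result case described in this example can be made explicit in a script, so that an unpopulated correspondence is noticed rather than silently passed on. The sketch below is one possible check; `check_not_empty()` is a hypothetical helper and the mock data frame uses placeholder column names.

```r
# Hypothetical helper: warn when a retrieved correspondence has no rows.
check_not_empty <- function(res, id) {
  if (nrow(res) == 0) {
    warning("Correspondence '", id, "' returned no rows; ",
            "it may be listed but not populated at the endpoint.")
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Stand-in for a valid but empty result (placeholder column names).
empty_res <- data.frame(Source = character(0), Target = character(0))

# The warning is suppressed here only to keep the sketch quiet.
ok <- suppressWarnings(check_not_empty(empty_res, "PRODCOM2023_CPA21"))
ok
```
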
### Inspect available correspondence tables (FAO)

The following example illustrates how to inspect the correspondence tables available from the `FAO` endpoint.

```{r, eval = FALSE}
# Inspect correspondence tables available from FAO
tbl_fao <- correspondenceTableList("FAO")
head(tbl_fao)

knitr::kable(
  head(tbl_fao, 10),
  caption = "Available correspondence tables from the FAO endpoint"
)
```

### Example 9: Retrieve a correspondence table from FAO (CPC 2.1 - ISIC Rev. 4)
The following example illustrates the retrieval of a correspondence table published by the Food and Agriculture Organization of the United Nations (FAO) via the `FAO` endpoint.

Users should note that the availability of correspondence data depends on what is currently exposed by the underlying SPARQL endpoint. Although a correspondence table may be listed by `correspondenceTableList()`, it can legitimately return an empty result when queried. In practice, however, correspondence tables exposed by the `FAO` endpoint tend to be more consistently populated than some of those available from `CELLAR`.

The English-language version of the CPC 2.1 - ISIC Rev. 4 correspondence table can be retrieved as follows. This example is not executed during vignette rendering.

```{r retrieve-cpc21-isic4_FAO, eval = FALSE}
Res <- retrieveCorrespondenceTable(
  endpoint = "FAO",
  prefix   = "CPC21",
  ID_table = "CPC21-ISIC4",
  language = "en"
)

knitr::kable(
  head(Res[, 1:5], 10),
  caption = "CPC21-ISIC4 correspondence table from the FAO endpoint"
)
```

### (Optional) Inspect the underlying SPARQL query
For transparency and reproducibility, the SPARQL query used for retrieval can also be inspected by setting `showQuery = TRUE`.

```{r, eval = FALSE}
Res2 <- retrieveCorrespondenceTable(
  endpoint  = "FAO",
  prefix    = "CPC21",
  ID_table  = "CPC21-ISIC4",
  language  = "en",
  showQuery = TRUE
)

# Extract the SPARQL query used
SPARQLquery <- Res2$SPARQL.query
SPARQLquery
```

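For reproducibility, the extracted query can be stored alongside the analysis. The sketch below writes it to a text file with base R; the list is a stand-in in which only the element name `SPARQL.query` is taken from the example above, while the query text and the file name are placeholders.

```r
# Stand-in for a showQuery = TRUE result; the query text is a placeholder.
Res2 <- list(SPARQL.query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")

# Write the query to a file so the retrieval can be repeated later.
query_file <- file.path(tempdir(), "CPC21-ISIC4.rq")
writeLines(Res2$SPARQL.query, query_file)

# Read it back to confirm what was stored.
stored_query <- paste(readLines(query_file), collapse = "\n")
stored_query
```
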
## Summary
The `correspondenceTables` package simplifies access to statistical classifications and correspondence tables published as Linked Open Data (LOD), including those provided by major repositories such as the EU Publications Office (CELLAR) and FAO.

It offers a high-level R interface to:

- identify available classifications and correspondences;
- retrieve classification hierarchies and mapping tables without writing SPARQL queries;
- explore classification structures to select relevant levels;
- ensure reproducibility by exposing the underlying SPARQL queries when needed.

This approach lowers the technical barrier to working with official classification systems, enabling analysts to integrate them seamlessly into their workflows while preserving transparency and reproducibility.