---
title: "Motivation"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Motivation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
if (!requireNamespace("dplyr", quietly = TRUE) || !requireNamespace("tidyr", quietly = TRUE)) {
stop("Please install 'dplyr' and 'tidyr' to run this vignette.")
}
library(dplyr)
library(tidyr)
```
> "A dataset is an identifiable collection of data available for access or download in one or more formats." ISO/IEC 20546
```{r setup}
library(dataset)
```
The `dataset` package is designed to enhance the **semantic richness of R datasets**, particularly those structured as `data.frame` or `tibble` objects. Its goal is to promote reusability and interoperability by enabling tidy datasets to carry rich metadata — from the start.
Through practical iterations, it became evident that the structure of a dataset cannot be separated from its **purpose**. As a result, the `dataset` package has evolved into a family of interrelated tools that support semantic clarity at both the column and dataset level.
## Problem statement
```{r example-ambiguity}
data.frame(
geo = c("LI", "SM"),
CPI = c("0.8", "0.9"),
GNI = c("8976", "9672")
)
```
At first glance, this dataset appears tidy — columns are variables, rows are observations. But is it interpretable?
- `geo`: This might refer to geography — but is "LI" Lichtenstein? What about "SM"? Is this ISO alpha-2? Eurostat’s codes? World Bank’s?
- `CPI`: Is this the Consumer Price Index, or the Corruption Perceptions Index? Something else entirely?
- `GNI`: It could mean Gross National Income or Global Nutrition Index — and even if it's GNI, is it measured in dollars, euros, or something else?
A tidy structure doesn't guarantee **semantic clarity**. Without metadata, every column is open to misinterpretation.
```{r smallcountrydataset}
options(sciphen = 4)
small_country_dataset <- dataset_df(
country_name = defined(
c("AD", "LI"),
concept = "http://data.europa.eu/bna/c_6c2bb82d",
namespace = "https://www.geonames.org/countries/$1/"
),
gdp = defined(
c(3897, 7365),
label = "Gross Domestic Product",
unit = "million dollars",
concept = "http://data.europa.eu/83i/aa/GDP"
),
population = defined(
c(77543, 40015),
label = "Population",
concept = "http://data.europa.eu/bna/c_f2b50efd"
),
dataset_bibentry = dublincore(
creator = person(given = "Jane", family = "Doe"),
title = "Small Country Dataset",
publisher = "Reprex"
)
)
```
```{r percapita}
small_country_dataset$gdp_capita <- defined(
small_country_dataset$gdp * 1e6 / small_country_dataset$population,
unit = "dollar",
label = "GDP Per Capita"
)
```
The **interoperability** and **future reusability** of data depends on the quality and presence of metadata — not just the structure of the data itself. The `dataset` package captures metadata as seamlessly and non-intrusively as possible.
It covers:
- [x] `Descriptive metadata`: Follows DCTERMS and DataCite standards to describe datasets for searchability, citation, and sharing.
- [x] `Structural metadata`: Uses well-defined variable names, consistent units of measure, and encourages globally resolvable identifiers for rows and values.
- [x] `Provenance metadata`: Tracks where the data came from and what happened to it — enabling transparent, reproducible data workflows.
## Limits on reusability
Let's take a look at the limitations of two, tidy datasets from an interoperability and reusability point of view:
```{r gdpexmampel, warning=FALSE, message=FALSE}
library(tibble)
# Dataset D (GDP in billions of USD)
D <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c("United States", "Canada", "United States", "Canada", "United States", "Canada"),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)
# Dataset E (GDP in billions of EUR)
E <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
country_name = c("United States", "France", "United States", "France", "United States", "France"),
GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)
```
These datasets are tidy, and from a technical perspective, we can easily combine them:
```{r ambigousjoin}
full_join(D, E)
```
This join is syntactically valid — but semantically misleading.
- One column contains GDP values in USD, the other in EUR.
- There’s no indication of this difference in the dataset itself.
Anyone reusing the dataset (especially if it’s saved to CSV or shared without code comments) could misinterpret it — perhaps fatally in an analysis.
Even in R, the metadata (e.g. currency) is detached from the values. This problem worsens when datasets are consumed by users who do not read R code or access the original documentation. Worse still, variables like `CPI` might refer to the *Consumer Price Index* or the *Corruption Perceptions Index* — and nothing in the dataset guarantees which one was meant.
## Enforcing Semantics
The `dataset` package introduces `defined()` vectors, which carry metadata like units, variable labels, namespaces, and definitions. When combining such vectors, this metadata is checked for compatibility.
You cannot concatenate GDP vectors that are denominated in different currencies:
Our strictly `defined` vectors check if the variable label, unit, namespace and definition match. If they are missing, the concatenation is possible, but when they are present, they must match.
You cannot concatenate USD and EUR denominated values:
```{r strintc, eval=FALSE}
GDP_D <- defined(c(21000, 2000, 22000, 2100, 23000, 2200), unit = "M_USD")
GDP_E <- defined(c(18000, 2500, 19000, 2600, 20000, 2700), unit = "M_EUR")
c(GDP_D, GDP_E)
```
```{r errormessage}
message("Error in c.haven_labelled_defined(GDP_D, GDP_E) :
c.haven_labelled_defined(x,y): x,y must have no unit or the same unit")
```
Concatenation is only allowed when metadata , in this case, the unit of measure, or more precisely, the monetary unit matches:
```{r matchigunits}
GDP_F <- defined(c(21000, 2000, 22000, 2100, 23000, 2200), unit = "M_EUR")
GDP_E <- defined(c(18000, 2500, 19000, 2600, 20000, 2700), unit = "M_EUR")
c(GDP_F, GDP_E)
```
## Reusability from the Start
Many metadata packages in R aim to enrich datasets **after they’re built** — when preparing them for publication, export, or external documentation. In contrast, the `dataset` package embeds semantics **at the moment of dataset creation**.
This early intervention:
- Adds value during exploratory and analytical work
- Avoids costly ambiguity later
- Makes R-native datasets more robust and shareable
- Enables integration with packages like `frictionless`, `dataspice`, or `rdflib` (the latter two are intended for deep future integration)
This design aligns with the [European Interoperability Framework (EIF)](https://ec.europa.eu/isa2/eif_en/), which outlines four levels of interoperability:
1. Legal
2. Organisational
3. Semantic
4. Technical
While many R packages address the *technical layer* (e.g. through standards-based file formats), and some touch on legal metadata, the `dataset` family places emphasis on the *organisational* and *semantic* layers — capturing meaning, context, and responsibility from the inside out.
### Tidy Data and Graphs
While tidy tabular data and graph-based data may share similar metadata needs, their workflows — and the assumptions they carry — differ considerably.
Carl Boettiger and others have shown how tidy data frames can be represented as RDF triples. In fact, serialising long-format tabular data into RDF is now a common approach for semantic data exchange. However, the **origin** of the data — whether from a graph, tabular database, or API — strongly influences its semantic shape and metadata requirements.
To support this diversity of reuse, the `dataset` package is complemented by two specialised extensions:
### `datacube`: Statistical Data and Metadata Exchange
This planned package focuses on **interoperability with SDMX** (Statistical Data and Metadata eXchange), a standard adopted by statistical offices such as Eurostat and the World Bank.
- Target user: Analysts working with structured tabular indicators and hierarchies
- Goal: Enable analysis and transformation within R, and export to SDMX-compatible formats
- Design: Enforces strict `data.frame` structure, linked concepts, units, and classifications aligned to SDMX
### `wbdataset`: Knowledge Graph Compatibility
This evolving package supports **interoperability with Wikidata and the Wikibase Data Model**, which underpin the world’s largest open knowledge graph.
- Target user: Analysts working with Wikidata or custom Wikibase instances
- Goal: Enable two-way workflows — from SPARQL query to enriched analysis and back to triple-contributions
- Design: Facilitates conversion between tidy data and semantic triples
### `rdflib`: Native Graph Integration
For graph-first use cases, the excellent `rdflib` package provides RDF parsing and serialisation within R. `dataset` is fully compatible and designed to work with `rdflib`, allowing metadata-enriched `dataset_df` objects to serve as staging grounds for RDF generation.
## Our Approach: Metadata from the Start
The `dataset` package is based on a simple principle:
> **Data and metadata should live together.**
This is achieved by using R's powerful (but underused) `attributes()` mechanism to store metadata **directly on vectors and data frames**. Attributes are preserved in `.rds` and `.rda` formats, ensuring metadata travels with the data through most R workflows.
```{r d-enriched}
D_enriched <- dataset_df(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c("United States", "Canada", "United States", "Canada", "United States", "Canada"),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200),
dataset_bibentry = dataset::dublincore(
title = "North American GDP Dataset",
description = "Dataset containing GDP data for North American countries.",
creator = person("Daniel", "Antal"), # Replace with the actual creator
publisher = "Reprex",
dataset_date = Sys.Date(),
subject = "GDP"
)
)
D_enriched
```
Other packages (like `dataspice`) also aim to improve metadata capture. But many of them:
- Store metadata in **separate CSV files**
- Focus on publication-time workflows
- Leave room for detachment or mismatch between data and metadata
In contrast, `dataset` stores metadata **in the object itself**, alongside the data, from the moment of creation. This ensures:
- Tighter integration between metadata and data
- Metadata survives transformations and joins
- Continuous improvement of metadata during analysis
The enriched dataset gives us more information about the organisation of the dataset as a whole, but still suffers from a loose definition of its structural elements, i.e., the rows and the columns. We will come back to this in the following sections.
The `dataset` package provides functions with sensible defaults and structures compliant with international metadata standards. The `dublincore()` and `datacite()` functions create enhanced bibentry objects and offer many functions to change the elements of these objects (i.e., to change the publication year, the names of the contributors, etc.) These functions facilitate adding metadata about the dataset as a whole (i.e., the overall organization of the data, not individual rows or columns). It also establishes a foundation for adding metadata about the internal structure of the dataset, such as the meaning and intended use of rows and columns.
Currently, the `dataset` package's approach to row/column-level metadata differs depending on whether the data is intended for a "datacube" or "wbdataset" (likely referring to a specific data format or use case). This is because datasets reused in tabular or graph forms have distinct semantic requirements, which necessitate different function interfaces even if the underlying metadata needs are similar. Future development may address these differences more uniformly.
### Keep data and metadata together
The `dataset` package shares similar goals with the rOpenSci `dataspice` package, particularly concerning tabular data. `dataspice` focuses on making datasets more discoverable online by adding Schema.org metadata, a lightweight ontology used by web search engines. A key interoperability goal for `dataset` is to capture *all* the metadata that `dataspice` utilizes *throughout* the analytical workflow and store it directly within the R `data.frame`'s attributes.
While `dataspice` encourages users to provide this metadata separately in a CSV file (a convenient and simple approach), this method has two critical weaknesses:
1. **User error:** Users may not complete the CSV file or might enter incorrect information.
2. **Synchronization issues:** The released dataset and its metadata can become detached, lost, or fall out of sync due to later updates.
Knowledge graphs offer a more robust solution by updating data and metadata simultaneously, ensuring consistency and enforcing strict metadata definitions. The `dataset` package aims to promote continuous metadata collection and storage within the R object itself, saving it alongside the data. R's ability to store rich metadata in attributes makes saving in `.rda` or `.rds` files a compelling option, even if these formats aren't universally interoperable.
Inspired by knowledge graphs, `dataset` extends the use of attributes to include graph-like metadata, such as recording the dataset's provenance: who downloaded the original data, who performed manipulations, which R packages ("software agents") were used, and so on.
### From tidy to tidier
Tidy data principles, well-established among R users, promote data reusability through a simple structure: rows represent observations, and columns represent variables with meaningful names. This aligns with the 3NF normal form in relational databases and, as Carl Boettiger illustrates, can also represent graph entries in a 3-column long format (subject–predicate–object). However, two key limitations require going beyond standard tidy principles:
- **Column name ambiguity:** Column names can be difficult to interpret and reuse over time. Richer semantic information about columns is essential for improved interoperability and reusability.
- **Row name limitations:** Row names are local identifiers, hindering workflows involving joins and slices across multiple datasets.
While the `haven` and `labelled` packages in the tidyverse provide enhanced column labels (similar to SPSS and STATA), they are insufficient for complex workflows involving numerous datasets. `dataset` addresses these limitations in two ways:
- The `defined()` class, which inherits from `labelled`, adds a *definition* and a *namespace* to labels, enhancing their meaning and meeting the requirements for linked open data.
- Data provenance: `dataset` records data provenance (who downloaded, manipulated, etc.) in the attributes, using a simplified version of the PROV model and PROV-O ontology. (Further development of this feature is planned for a separate package.)
The `defined()` class also allows extending tidyverse row IDs with a namespace or prefix, enabling conversion to globally unique identifiers (GUIDs), critical when working with data from many sources.
Beyond row and column identification, `dataset` addresses the question of *when* to define a new dataset, a challenge only partially answered in earlier versions and now delegated to the `datacube` package. This question hinges on the concept of a *datacube*, a structural enhancement of tidy data. Datacubes categorize variables into:
- **Attributes:** Metadata constraints.
- **Dimensions:** Variables suitable for summarizing or slicing data.
- **Measures:** Values often requiring a unit of measure.
```{r ambigousjoinwithdplyr}
library(tibble)
# Dataset D (GDP in billions of USD)
D <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c(
"United States", "Canada", "United States",
"Canada", "United States", "Canada"
),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)
# Dataset E (GDP in billions of EUR)
E <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
country_name = c("United States", "France", "United States", "France", "United States", "France"),
GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)
joined_data <- dplyr::full_join(D, E)
joined_data
```
The `datacube` package emphasizes the distinction between these variable types. For example, joining datasets with shared row IDs and columns isn't meaningful without considering units of measure (e.g., euros vs. dollars). Dimensions determine when a new dataset is needed. `dataset`, through the `defined()` class, allows specifying units of measure, while `datacube` handles the broader needs of attributes, measures, and dimensions.
### From tabular to graph data
The rOpenSci `rdflib` package, a wrapper for the Python library of the same name, provides powerful tools for reading and writing graph data. We believe `dataset` and `rdflib` are complementary and should be used together. While there's a small overlap (inspired by an internal `rdflib` function for working with NQuads, a form of RDF), we've avoided making `rdflib` a direct dependency of `dataset`. However, for users working with RDF-annotated graphs, we strongly recommend using `dataset` in conjunction with `rdflib` for importing, exporting, and exchanging data.
First, let us see how we would solve the ambiguity problems without `dataset`, relying only on *tidyverse* and *rdflib*.
```{r two-gdp-datasets}
library(tibble)
# Dataset D (GDP in billions of USD)
D <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "CAN", "USA", "CAN", "USA", "CAN"),
country_name = c(
"United States", "Canada", "United States",
"Canada", "United States", "Canada"
),
GDP = c(21000, 2000, 22000, 2100, 23000, 2200) # GDP in billions of USD
)
# Dataset E (GDP in billions of EUR)
E <- tibble(
year = c(2020, 2020, 2021, 2021, 2022, 2022),
geocode = c("USA", "FRA", "USA", "FRA", "USA", "FRA"),
country_name = c(
"United States", "France", "United States",
"France", "United States", "France"
),
GDP = c(18000, 2500, 19000, 2600, 20000, 2700) # GDP in billions of EUR
)
# Combine datasets
combined <- bind_rows(D, E)
# Convert all to character *before* pivoting
combined_char <- combined %>%
mutate(across(everything(), as.character))
# Rename geocode to subject
combined_subject <- combined_char %>%
rename(subject = geocode)
# Pivot longer, excluding the subject (triples are named, but predicates are not renamed yet)
named_triples <- combined_subject %>%
pivot_longer(
cols = c(year, country_name, GDP),
names_to = "predicate",
values_to = "object"
)
# Replace geocodes with Geonames IDs and add datatype annotations
geonames_mapping <- tibble(
subject = c("USA", "CAN", "FRA"),
geonames_id = c("6252001", "6251999", "2988317")
)
rdf_triples <- named_triples %>%
left_join(geonames_mapping, by = "subject") %>%
mutate(
subject = paste0("geonames:", geonames_id),
predicate = case_when(
predicate == "year" ~ "sdmx:TIME_PERIOD",
predicate == "country_name" ~ "schema:name",
predicate == "GDP" ~ "sdmx:OBS_VALUE",
TRUE ~ predicate
),
object = case_when(
predicate == "sdmx:TIME_PERIOD" ~ paste0("\"", object, "\"^^"),
predicate == "sdmx:OBS_VALUE" ~ paste0("\"", object, "\"^^"),
TRUE ~ object
)
) %>%
select(subject, predicate, object)
print("RDF Triples (subject, predicate, object):")
print(rdf_triples)
```
The `frictionless` package has a somewhat similar approach, but it does not support either a metadata standard or a strict serialisation standard.
Both `rdflib` and `frictionless` provide strong foundations for exporting datasets, but they differ in scope and assumptions. The `rdflib` package utilises the World Wide Web RDF standard, which is also used by SDMX and many open science repositories. It supports multiple serialisation forms, including JSON-LD. By contrast, `frictionless` relies on JSON containers but lacks a formal ontology layer.
Both packages can preserve technical metadata, but neither guides users toward a consistent semantic model. They provide the vehicle, but not the fuel.
The `dataset` package — especially through its downstream extensions like `datacube` and `wbdataset` — aims to fill this gap, offering metadata curation aligned with high levels of organisational and semantic interoperability. `datacube` focuses on SDMX-style statistical datasets, while `wbdataset` helps users adopt the Wikibase data model and workflow.
Unlike `rdflib`, which requires RDF/OWL expertise, and `frictionless`, which delegates semantics to arbitrary field descriptions, `dataset` provides structure and guidance without requiring users to leave the R environment.
### Publishing the data
Let us return to the ISO/IEC 20546 definition used as a motto:
"A dataset is an identifiable collection of data available for access or download in one or more formats."
Datasets should be published with their future reuse in mind. Several R packages support dataset publication, each targeting different use cases and levels of interoperability:
- The [`rdflib`](https://CRAN.R-project.org/package=rdflib) package envisions reuse in fully semantic environments, using RDF as a standard.
- The [`dataspice`](https://docs.ropensci.org/dataspice/) package focuses on enhancing discoverability by adding Schema.org metadata to HTML documentation.
- The [`frictionless`](https://CRAN.R-project.org/package=frictionless) package promotes packaging metadata (in JSON) and data (in CSV) for structured publication.
- The `wbdataset` package (in development) supports publication to Wikibase, which powers Wikidata.
The `dataset` package is designed to be compatible with all of these tools. It enables seamless export to:
- `rdflib` for RDF-based graph publication
- `dataspice` for search engine–optimised publishing
- `wbdataset` for contribution to the Wikibase ecosystem
Although `frictionless` is currently not a direct target, it can be supported if user demand arises.
Wikibase is an especially interesting target. It supports RDF exports and shares much with triplestores conceptually, but its design allows users to work with graph-structured data without requiring deep knowledge of RDF or SPARQL. While the `WikibaseR` package is no longer maintained, we intend to co-develop `wbdataset` and a new `WikibaseR` to restore seamless integration with Wikibase.
Below is a dataset of musical artists from small countries. It demonstrates how `dataset` handles dataset-level metadata (e.g. rights, authorship, provenance) and row-level annotations without requiring external formats or technologies.
```{r smallmusiciandataset1}
small_country_musicians <- data.frame(
qid = c("Q275912", "Q116196078"),
artist_name = defined(
c("Marta Roure", "wavvyboi"),
concept = "https://www.wikidata.org/wiki/Property:P2093"
),
location = defined(
c("Andorra", "Lichtenstein"),
concept = "https://www.wikidata.org/wiki/Property:P276"
),
date_of_birth = defined(
c(as.Date("1981-01-16"), as.Date("1998-04-28")),
concept = "https://www.wikidata.org/wiki/Property:P569"
)
)
small_country_musicians$age <- defined(
2024 - as.integer(
substr(
as.character(small_country_musicians$date_of_birth),
1, 4
)
),
label = "Age in 2024"
)
```
This dataset includes semantic definitions (linked to Wikidata properties), automatically tracks provenance, and can be exported using any of the supported tools. By embedding rich metadata directly in R objects, the `dataset` package ensures that publication-readiness and reuse potential are built in from the start.