
The dataset package extends tidyverse workflows with
lightweight semantic metadata, provenance tracking, and interoperable
dataset structures.
It supports gradual semantic stabilization ranging from lightweight semantic mappings to formally defined variables and semantically enriched datasets suitable for FAIR, machine-readable, and standards-aligned data exchange.
The package draws inspiration from:
The goal is to preserve metadata when reusing statistical and repository datasets, improve interoperability, and make it easy to turn tidy data frames into web-ready, publishable datasets that comply with ISO and W3C standards.
You can install the latest released version of
dataset from CRAN with:
install.packages("dataset")
To install the development version from GitHub, use pak
or remotes:
# install.packages("pak")
pak::pak("dataobservatory-eu/dataset")
# install.packages("remotes")
remotes::install_github("dataobservatory-eu/dataset")
Real-world datasets rarely begin with fully standardized values.
Early in a project, inconsistencies may be easy to spot, such as mixing
AD and Andorra for the same country. As
datasets are combined from multiple sources, however, additional
variants often appear, for example the ISO-3166 alpha-2 code
AD, the country name Andorra, or the ISO-3166
alpha-3 code AND.
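A common base R way to collapse such variants is a named lookup vector. The sketch below is only an illustration: the `lookup` table and the variable names are assumptions made for this example and are not part of the dataset package.

```r
# Illustrative lookup table (an assumption for this sketch):
# names are observed variants, values are the target alpha-2 codes.
lookup <- c(Andorra = "AD", AND = "AD", Liechtenstein = "LI")

raw <- c("AD", "Andorra", "AND", "LI", "Liechtenstein")

# Replace known variants; leave values without a mapping unchanged.
hit <- raw %in% names(lookup)
normalized <- raw
normalized[hit] <- unname(lookup[raw[hit]])

print(normalized)
```

Overwriting values this way keeps no record of which mapping was applied unless `raw` and `lookup` are stored separately.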
The prelabel() constructor provides a lightweight way to
stabilize such values before committing to a formal semantic
definition.
library(dataset)
x <- prelabel(
  c("AD", "Andorra", "AND", "LI", "Liechtenstein"),
  labels = c(
    Andorra = "AD",
    AND = "AD",
    Liechtenstein = "LI"
  )
)
as.character(x)
#> [1] "AD" "AD" "AD" "LI" "LI"
Unlike a formal semantic definition, a prelabelled
vector records provisional mappings that may still evolve during data
integration. The original observational values remain available
alongside the current semantic assumptions:
attr(x, "prelabel")
#>       Andorra           AND Liechtenstein            AD            LI
#>          "AD"          "AD"          "LI"          "AD"          "LI"
When semantic assumptions become sufficiently stable, variables can
be formalized with defined() and combined into a
semantically enriched dataset_df() object:
library(dataset)
df <- dataset_df(
  country = defined(
    c("AD", "LI"),
    label = "Country",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(
    c(3897, 7365),
    label = "GDP",
    unit = "million euros"
  ),
  dataset_bibentry = dublincore(
    title = "GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository"
  )
)
print(df)
#> Doe (2026): GDP Dataset [dataset]
#>   rowid country   gdp
#>   <chr> <chr>   <dbl>
#> 1 obs1  AD       3897
#> 2 obs2  LI       7365
This illustrates the semantic lifecycle supported by the package:
raw values
↓
prelabelled
↓
defined
↓
dataset_df
↓
RDF and FAIR publication
Because semantic assumptions and provenance are preserved explicitly, semantically enriched datasets can be exported as interoperable RDF triples without manually reconstructing metadata at publication time.
Export as RDF triples:
dataset_to_triples(df, format = "nt")
#> [1] "<http://example.com/dataset#obsobs1> <http://example.com/prop/country> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#obsobs2> <http://example.com/prop/country> <https://www.geonames.org/countries/LI/> ."
#> [3] "<http://example.com/dataset#obsobs1> <http://example.com/prop/gdp> \"3897\"^^<xsd:decimal> ."
#> [4] "<http://example.com/dataset#obsobs2> <http://example.com/prop/gdp> \"7365\"^^<xsd:decimal> ."
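To make the structure of these statements easier to read, the sketch below rebuilds the two country triples in base R. It assumes that the `$1` placeholder in `namespace` is replaced by the coded value, as the output above suggests. `make_triple()` and `expand_namespace()` are hypothetical helpers written for this illustration, not functions of the dataset package, and the package's internal implementation may differ.

```r
# Hypothetical helper (not part of the dataset package): builds
# N-Triples statements from subject, predicate and object IRIs.
make_triple <- function(subject, predicate, object) {
  paste0("<", subject, "> <", predicate, "> <", object, "> .")
}

# Hypothetical helper: substitutes each value for the literal "$1"
# placeholder in a namespace template.
expand_namespace <- function(values, namespace) {
  vapply(
    values,
    function(v) sub("$1", v, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

country  <- c("AD", "LI")
row_ids  <- c("obs1", "obs2")

# Subject: dataset IRI plus "obs" plus the row identifier.
subjects <- paste0("http://example.com/dataset#obs", row_ids)
# Object: the coded value expanded into a full IRI.
objects  <- expand_namespace(country, "https://www.geonames.org/countries/$1/")

triples <- make_triple(subjects, "http://example.com/prop/country", objects)
print(triples)
```

Each statement has the form subject, predicate, object, followed by a terminating period, which is the line-based layout of the N-Triples format.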
Retrieve the automatically recorded provenance:
provenance(df)
#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."
#> [4] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2026-06-03T06:29:26Z\"^^<xsd:dateTime> ."
The package does not attempt automatic ontology alignment, entity reconciliation, or rule-based semantic inference. It focuses on preserving semantic assumptions made by the analyst in a transparent and reproducible form.
We welcome contributions and discussion!
Please refer to this package as:
Daniel Antal. (2026). dataset: Create Data Frames that are Easier to Exchange and Reuse (0.4.4). The Comprehensive R Archive Network. https://zenodo.org/records/17621464, DOI: 10.32614/CRAN.package.dataset
See contributors on the website and in the DESCRIPTION file.
This project follows the rOpenSci Code of Conduct. By participating, you are expected to uphold these guidelines.