An Introduction to the dataset Package

Overview

The dataset package enriches R’s native data structures with machine-readable metadata. It allows variables and datasets to carry semantic definitions — such as URIs, labels, units, and provenance — which makes them suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

Unlike most metadata packages that attach metadata after the fact, dataset follows a semantic early-binding approach: metadata is embedded as soon as the data is created.

This vignette provides a high-level introduction. For details on key components, see the following:

Why extend tidy data?

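Hadley Wickham (2014) defines tidy data with three principles: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.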

This structure is ideal for analysis, but it lacks semantic clarity. The gap shows in realistic, less-than-ideal scenarios where an analyst combines several datasets received from various internet services. For example, two datasets might both contain a column named gdp, but one might be in euros and the other in dollars. Without metadata, tools cannot detect this mismatch.
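
The mismatch can be illustrated with a minimal base R sketch. It does not use the dataset package; it only assumes that each vector carries a `unit` attribute. The helper `same_unit()`, the dollar values, and the unit code `CP_MUSD` are hypothetical and exist only for this illustration.

```r
# Two GDP vectors as they might arrive from different sources.
# The second vector's values and its unit code are made up for illustration.
gdp_eur <- structure(c(2355, 2592, 2884), unit = "CP_MEUR")
gdp_usd <- structure(c(2601, 2811, 3120), unit = "CP_MUSD")

# Hypothetical helper: TRUE only when both vectors declare the same unit.
same_unit <- function(x, y) {
  identical(attr(x, "unit"), attr(y, "unit"))
}

same_unit(gdp_eur, gdp_usd)  # FALSE: the units differ, so binding is unsafe
same_unit(gdp_eur, gdp_eur)  # TRUE
```

Once the unit travels with the values, a check like this can run before two columns are combined; with bare numeric vectors there is nothing to compare.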

The dataset package addresses this by allowing you to define variables explicitly, and to store dataset-level metadata within a tidy tibble.

Example: defining semantically rich vectors

Semantically rich vectors are vectors in a data.frame that carry richer semantics than a simple column name: a long-form, human-readable title; a machine- and human-readable variable definition; and, if needed, a reference to an external resource that contains the codebook.

library(dataset)

gdp <- defined(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

geo <- defined(
  rep("AD", 3),
  label = "Geopolitical Entity",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)

gdp
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 2355 2592 2884
geo
#> x: Geopolitical Entity
#> Defined as http://purl.org/linked-data/sdmx/2009/dimension#refArea 
#> [1] "AD" "AD" "AD"

Here, geo is defined as the geopolitical entity http://purl.org/linked-data/sdmx/2009/dimension#refArea, and the namespace tells us that the value AD resolves to Andorra: https://www.geonames.org/countries/AD/. These vectors now carry metadata that you can inspect directly (their label, unit, and concept URI), and this metadata will be preserved even after transformation or storage.
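
How a coded value resolves to a URI through the namespace can be sketched in base R. This is only an illustration of the `$1` placeholder convention seen in the example above, not the package's internal implementation; `resolve_uri()` is a hypothetical helper.

```r
# Namespace template from the geo example; "$1" marks where the code goes.
namespace <- "https://www.geonames.org/countries/$1/"
codes <- c("AD", "AD", "AD")

# Hypothetical helper (not part of the dataset package): substitute each
# code for the literal "$1" placeholder in the template.
resolve_uri <- function(code, template) {
  vapply(
    code,
    function(z) sub("$1", z, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

resolve_uri(codes, namespace)
```

Each element of the result is https://www.geonames.org/countries/AD/, the URI that appears as the object in the exported triples.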

Example: creating a dataset from a metadata-enriched data frame

small_dataset <- dataset_df(
  geo = geo,
  gdp = gdp,
  identifier = c(gdp = "http://example.com/dataset#gdp"),
  dataset_bibentry = dublincore(
    title = "Small GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository",
    subject = "Gross Domestic Product"
  )
)

small_dataset
#> Doe (2025): Small GDP Dataset [dataset]
#>   rowid     geo       gdp       
#>   <defined> <defined> <defined>
#> 1 gdp1      AD        2355     
#> 2 gdp2      AD        2592     
#> 3 gdp3      AD        2884

This dataset not only stores the variables and values, but also includes embedded metadata that supports precise interpretation and repository-level publication.

as_dublincore(small_dataset)
#> Dublin Core Metadata Record
#> --------------------------
#> Title:       Small GDP Dataset
#> Creator(s):  Jane Doe [aut]
#> Contributor(s): :unas
#> Subject(s):  Gross Domestic Product
#> Publisher:   Small Repository
#> Year:        2025
#> Language:    :unas
#> Description: :unas

Exporting to RDF

As Carl Boettiger has shown in the vignettes accompanying the rdflib R package (see: A tidyverse lover’s intro to RDF), tidy datasets can be retrofitted with rich metadata if they are pivoted to a strictly three-column long format.

Our package tries to lower the burden of such retrofitting. With early binding and sensible defaults, users who are not familiar with RDF can serialise both the dataset’s contents and its bibliographic data to this format.
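
To make the retrofitting concrete, the pivot to a three-column long format can be sketched by hand in base R. This is a simplified illustration, not what the package does internally: the data frame `wide` is a plain copy of the example values, and the predicates are bare column names, whereas real RDF would use URIs for subjects and predicates.

```r
# A plain data frame holding the same values as the example dataset.
wide <- data.frame(
  rowid = c("gdp1", "gdp2", "gdp3"),
  geo = rep("AD", 3),
  gdp = c(2355, 2592, 2884),
  stringsAsFactors = FALSE
)

# Every column except the row identifier becomes a predicate.
value_cols <- setdiff(names(wide), "rowid")

# One row per (subject, predicate) pair; all objects are stored as text.
triples_long <- data.frame(
  subject = rep(wide$rowid, times = length(value_cols)),
  predicate = rep(value_cols, each = nrow(wide)),
  object = unlist(lapply(wide[value_cols], as.character), use.names = FALSE),
  stringsAsFactors = FALSE
)

triples_long
```

The result has six rows, one per cell of the two value columns. What the manual pivot loses is exactly what early binding keeps: the unit, the concept URI, and the namespace of each variable.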

You can convert any dataset_df object into a tidy 3-column representation (subject–predicate–object) using dataset_to_triples():

triples <- dataset_to_triples(small_dataset,
  format = "nt"
)
triples
#> [1] "<http://example.com/dataset#gdpgdp1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#gdpgdp2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [3] "<http://example.com/dataset#gdpgdp3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [4] "<http://example.com/dataset#gdpgdp1> <http://data.europa.eu/83i/aa/GDP> \"2355\"^^<xsd:decimal> ."                                        
#> [5] "<http://example.com/dataset#gdpgdp2> <http://data.europa.eu/83i/aa/GDP> \"2592\"^^<xsd:decimal> ."                                        
#> [6] "<http://example.com/dataset#gdpgdp3> <http://data.europa.eu/83i/aa/GDP> \"2884\"^^<xsd:decimal> ."

This three-column format (subject–predicate–object) is compatible with semantic web tooling such as SPARQL endpoints, rdflib, and triple stores.
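
As a rough illustration of that compatibility, N-Triples lines like the ones printed above can be split back into their three parts with base R alone. The regular expression below is a simplification that assumes URI subjects and predicates and one statement per line; it is not a full N-Triples parser, and dedicated RDF tools should be preferred for real data.

```r
# Two statements copied from the output above.
nt <- c(
  "<http://example.com/dataset#gdpgdp1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> .",
  "<http://example.com/dataset#gdpgdp1> <http://data.europa.eu/83i/aa/GDP> \"2355\"^^<xsd:decimal> ."
)

# Simplified pattern: <subject> <predicate> object .
pattern <- "^<([^>]+)> <([^>]+)> (.+) \\.$"
parts <- regmatches(nt, regexec(pattern, nt))

# Element 1 of each match is the whole line; 2 to 4 are the captured groups.
spo <- data.frame(
  subject = vapply(parts, `[`, character(1), 2),
  predicate = vapply(parts, `[`, character(1), 3),
  object = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

spo
```

The object column keeps the serialised form, so a URI stays in angle brackets and a literal keeps its datatype suffix.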

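The dataset-level metadata can be serialised as well. Here describe() writes the description of small_dataset to a temporary N-Triples file:

# fileext must include the leading dot to produce a ".nt" extension
mycon <- tempfile("my_dataset", fileext = ".nt")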
my_description <- describe(x = small_dataset, con = mycon)

# Only three statements are shown:
readLines(mycon)[c(4, 8, 12)]
#> [1] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."                                      
#> [2] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/title> \"Small GDP Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [3] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/type> <http://purl.org/dc/dcmitype/Dataset> ."
# Show two lines of provenance:
provenance(small_dataset)[c(6, 7)]
#> [1] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [2] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-25T22:11:34Z\"^^<xsd:dateTime> ."
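
The shape of the second provenance statement can be reproduced with a short base R sketch. It only shows how a UTC timestamp maps onto the ISO 8601 form used above; it is an assumption-laden illustration, not the package's own implementation, and the fixed time is copied from the printed output.

```r
# A fixed creation time, taken from the output above, so the result is stable.
created <- as.POSIXct("2025-08-25 22:11:34", tz = "UTC")

# ISO 8601 timestamp in UTC, with the literal "T" separator and "Z" suffix.
stamp <- format(created, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Assemble one N-Triples statement: <subject> <predicate> "literal"^^<type> .
statement <- sprintf(
  "<%s> <%s> \"%s\"^^<xsd:dateTime> .",
  "http://example.com/creation",
  "http://www.w3.org/ns/prov#generatedAtTime",
  stamp
)

statement
```

Replacing the fixed time with Sys.time() would stamp the moment of creation instead.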

Summary

The dataset package enriches tidy data by attaching metadata from the start of the workflow. It helps avoid semantic mismatches, supports RDF publication, and meets interoperability standards like SDMX, DataCite, and Dublin Core. Use it when you need:

For deeper examples, see: