Getting started with scholid

scholid is a lightweight, dependency-free (base R only) toolkit for working with scholarly and academic identifiers. It provides small, well-tested helpers to detect, normalize, classify, and extract common identifier strings.

This vignette introduces the interface and typical workflows for mixed, messy identifier data.

Installation

install.packages("scholid")

Interface

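scholid exposes a small set of user-facing functions that operate consistently across identifier types:

  - scholid_types() lists the supported identifier types.
  - is_scholid() checks whether values are valid identifiers of a given type.
  - normalize_scholid() converts wrapped input to canonical form.
  - extract_scholid() harvests identifiers from unstructured text.
  - classify_scholid() returns the best-guess type of canonical values.
  - detect_scholid_type() detects the type of wrapped or messy values.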

These generic helpers dispatch internally to type-specific implementations such as is_doi(), normalize_orcid(), and extract_isbn().

Supported identifier types

scholid::scholid_types()
##  [1] "doi"        "arxiv"      "bibcode"    "openalex"   "swhid"     
##  [6] "ark"        "isni"       "orcid"      "ror"        "rrid"      
## [11] "uniprot"    "refseq"     "sra"        "geo"        "bioproject"
## [16] "assembly"   "isbn"       "issn"       "pmcid"      "pmid"

For per-type formats, validation rules, and classification order, see the How Scholarly Identifiers Are Defined vignette (vignette("scholid_definitions", package = "scholid")), also linked from the package site as About identifiers.

Validate: is_scholid()

is_scholid() checks whether each value is a valid identifier of a specific type. It expects canonical (or near-canonical) input; wrapped forms such as URLs should be normalized first. For checksum-based types (ORCID, ISBN, ISSN), both is_scholid() and normalize_scholid() verify the checksum. is_scholid() is vectorized and preserves missing values: NA input yields NA, not FALSE.

x <- c(
    "10.1000/182",
    "not a doi",
    NA
)
scholid::is_scholid(
    x    = x,
    type = "doi"
)
## [1]  TRUE FALSE    NA
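
To make the checksum step concrete, here is a minimal base R sketch of the ISO 7064 MOD 11-2 check character used by ORCID iDs. The helper name orcid_check_digit() is hypothetical and not part of scholid; the package's internal implementation may differ.

```r
# Hypothetical helper, not part of scholid: compute the expected ORCID check
# character (ISO 7064 MOD 11-2) from the first 15 digits of a single iD.
orcid_check_digit <- function(orcid) {
  chars <- strsplit(gsub("-", "", orcid, fixed = TRUE), "")[[1]]
  stopifnot(length(chars) == 16)
  digits <- as.integer(chars[1:15])
  total <- 0
  for (d in digits) {
    total <- (total + d) * 2
  }
  result <- (12 - (total %% 11)) %% 11
  # A remainder of 10 is written as the letter "X"
  if (result == 10) "X" else as.character(result)
}

orcid <- "0000-0002-1825-0097"
# TRUE when the last character matches the computed check character
orcid_check_digit(orcid) == substring(orcid, nchar(orcid))
```

A well-formed string with a mismatched final character is therefore not a valid ORCID iD, which is why format checks alone are not enough for checksum-based types.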

Normalize: normalize_scholid()

Normalization removes common wrappers and enforces a canonical representation. This is particularly useful when identifiers are stored as URLs or prefixed labels.

x <- c(
  "https://doi.org/10.1000/182.",
  "doi:10.1000/182",
  " 10.1000/182 "
)
scholid::normalize_scholid(
    x    = x, 
    type = "doi"
)
## [1] "10.1000/182" "10.1000/182" "10.1000/182"

For ORCID iDs, normalization removes URL prefixes and enforces hyphenated grouping.

x <- c(
  "https://orcid.org/0000-0002-1825-0097",
  "0000000218250097"
)
scholid::normalize_scholid(
    x    = x,
    type = "orcid"
)
## [1] "0000-0002-1825-0097" "0000-0002-1825-0097"

Normalization is designed to be predictable:

  - NA input stays NA.
  - Invalid inputs typically become NA_character_.
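
As a rough illustration of what "removing common wrappers" involves for DOIs, the base R sketch below strips a resolver URL, a doi: label, surrounding whitespace, and trailing punctuation. The function strip_doi_wrappers() is hypothetical and is not how scholid is implemented; unlike normalize_scholid(), it does not validate the result or turn invalid input into NA_character_.

```r
# Hypothetical helper, not part of scholid: strip common DOI wrappers.
# It only removes wrappers; it does not check that the result is a valid DOI.
strip_doi_wrappers <- function(x) {
  x <- trimws(x)
  # Resolver URL prefix, e.g. https://doi.org/ or http://dx.doi.org/
  x <- sub("^https?://(dx\\.)?doi\\.org/", "", x, ignore.case = TRUE)
  # Label prefix, e.g. doi: or DOI:
  x <- sub("^doi:[[:space:]]*", "", x, ignore.case = TRUE)
  # Trailing sentence punctuation
  sub("[.,;]+$", "", x)
}

strip_doi_wrappers(c(
  "https://doi.org/10.1000/182.",
  "doi:10.1000/182",
  " 10.1000/182 ",
  NA
))
```

Missing values pass through unchanged because trimws() and sub() both preserve NA.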

Extract: extract_scholid()

Extraction is for harvesting identifiers from unstructured text. The result is a list with one element per input element. Each element is a character vector of matches (possibly empty).

txt <- c(
  "See https://doi.org/10.1000/182 and doi:10.5555/12345678.",
  "No identifier here.",
  NA
)
scholid::extract_scholid(
    text = txt,
    type = "doi"
)
## [[1]]
## [1] "10.1000/182"      "10.5555/12345678"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)

The list return type is intentional: a single text string can contain multiple identifiers.
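
When the matches need to go into a table, the list can be flattened into a long format with one row per match. The sketch below uses base R only and hard-codes a list shaped like the output above so that it runs without the package; flatten_matches() is a hypothetical helper, not a scholid function.

```r
# Hypothetical helper, not part of scholid: turn a list of character vectors
# (one element per input text) into a long data frame, one row per match.
flatten_matches <- function(matches) {
  data.frame(
    text_index = rep(seq_along(matches), lengths(matches)),
    identifier = as.character(unlist(matches)),
    stringsAsFactors = FALSE
  )
}

# Same shape as the extract_scholid() result shown above
matches <- list(
  c("10.1000/182", "10.5555/12345678"),
  character(0),
  character(0)
)
flatten_matches(matches)
```

Inputs without matches contribute no rows, and text_index records which input element each identifier came from.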

Classify: classify_scholid()

classify_scholid() returns the best-guess identifier type per element for mixed identifier columns. Classification is based on the set of available is_<type>() checks and the precedence order defined by scholid_types().

x <- c(
  "10.1000/182",
  "0000-0002-1825-0097",
  "PMC12345",
  "2101.00001v2",
  "not an id",
  NA
)
scholid::classify_scholid(x = x)
## [1] "doi"   "orcid" "pmcid" "arxiv" NA      NA
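
The precedence idea can be sketched in base R as a first-match loop over an ordered list of predicates. The patterns below are deliberately simplified toy checks, and classify_first_match() is a hypothetical name; scholid's real is_<type>() checks are stricter and its actual order is the one returned by scholid_types().

```r
# Toy predicates for illustration only; these are NOT scholid's rules.
# List order defines precedence: earlier entries win.
checks <- list(
  doi   = function(x) grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x),
  orcid = function(x) grepl("^([0-9]{4}-){3}[0-9]{3}[0-9X]$", x),
  pmcid = function(x) grepl("^PMC[0-9]+$", x)
)

# Hypothetical helper: assign each value the first type whose check passes.
classify_first_match <- function(x, checks) {
  out <- rep(NA_character_, length(x))
  for (type in names(checks)) {
    # Only fill positions that are still unclassified and not missing
    hit <- is.na(out) & !is.na(x) & checks[[type]](x)
    out[hit] <- type
  }
  out
}

classify_first_match(
  c("10.1000/182", "0000-0002-1825-0097", "PMC12345", "not an id", NA),
  checks
)
```

Values that pass no check, and missing values, stay NA, matching the behavior shown in the classify_scholid() example.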

Normalization + classification in messy data

Many identifiers appear wrapped (URLs, prefixes, trailing punctuation). Classification is strict and expects canonical strings. A common pattern is:

  1. Extract identifiers from text.
  2. Normalize extracted values.
  3. Classify and/or validate.

txt <- "Read https://doi.org/10.1000/182 (and ORCID 0000-0002-1825-0097)."
dois <- scholid::extract_scholid(txt, "doi")[[1]]
orcids <- scholid::extract_scholid(txt, "orcid")[[1]]

dois_n <- scholid::normalize_scholid(dois, "doi")
orcids_n <- scholid::normalize_scholid(orcids, "orcid")

scholid::classify_scholid(c(dois_n, orcids_n))
## [1] "doi"   "orcid"
scholid::is_scholid(dois_n, "doi")
## [1] TRUE
scholid::is_scholid(orcids_n, "orcid")
## [1] TRUE

Detect: detect_scholid_type()

detect_scholid_type() performs best-effort type detection for mixed, messy identifier input. In contrast to classify_scholid(), detection also recognizes common wrapped forms such as URLs and prefixed labels (e.g., doi:, https://orcid.org/, arXiv:, PMID:).

Detection is useful when working with raw data where identifiers may not yet be normalized.

For example, strict classification does not recognize wrapped identifiers:

x <- c(
  "https://doi.org/10.1000/182",
  "ORCID: 0000-0002-1825-0097",
  "arXiv:2101.00001",
  "PMID: 12345",
  "not an id"
)
scholid::classify_scholid(x)
## [1] NA NA NA NA NA

However, they can be detected directly:

scholid::detect_scholid_type(x)
## [1] "doi"   "orcid" "arxiv" "pmid"  NA

Whitespace and minor formatting irregularities are handled conservatively:

scholid::detect_scholid_type(
  c(
    " 0000-0002-1825-0097 ",
    " 10.1000/182 ",
    "ISSN 0317-8471"
  )
)
## [1] "orcid" "doi"   "issn"

detect_scholid_type() does not modify values. Once the identifier type is known, use normalize_scholid() to convert wrapped input to canonical form and is_scholid() to validate already-canonical values. Both apply checksum verification where applicable.

A typical workflow for messy data is:

  1. Detect identifier types.
  2. Normalize by detected type.
  3. Validate canonical identifiers.

This separation keeps detection permissive, normalization focused on canonicalization of wrapped input, and validation available for already-canonical strings.
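
The three steps can be sketched for a mixed vector as follows. normalize_by_type() is a hypothetical helper, not part of scholid. It assumes that normalize_scholid() takes one type per call, as in the examples in this vignette, and so calls it once per detected type. The normalizer is passed as an argument so the helper can be tried with any function of the form f(x, type).

```r
# Hypothetical helper, not part of scholid: normalize each value according to
# the type detected for it. `normalizer` is called once per distinct type.
normalize_by_type <- function(x, types, normalizer = scholid::normalize_scholid) {
  stopifnot(length(x) == length(types))
  out <- rep(NA_character_, length(x))
  for (tp in unique(types[!is.na(types)])) {
    idx <- which(types == tp)
    out[idx] <- normalizer(x[idx], tp)
  }
  out  # values with no detected type stay NA
}

# Usage with scholid, run only when the package is installed
if (requireNamespace("scholid", quietly = TRUE)) {
  raw <- c(
    "https://doi.org/10.1000/182",
    "https://orcid.org/0000-0002-1825-0097",
    "not an id"
  )
  types <- scholid::detect_scholid_type(raw)      # 1. detect
  canonical <- normalize_by_type(raw, types)      # 2. normalize by type
  data.frame(raw, type = types, canonical)        # 3. inspect or validate
}
```

Validation can then be applied per type in the same way, for example by passing a wrapper around is_scholid() whose result is converted to character, or by looping over the detected types directly.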

Design notes

scholid is intentionally small and conservative: