---
title: "Provider etiquette: batching and throttling"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Provider etiquette: batching and throttling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Why provider etiquette matters

`scholidonline` queries live external scholarly registries. These providers are
useful public infrastructure, but they are not unlimited local databases. They
may rate-limit requests, slow down responses, or temporarily refuse access when
many requests arrive in a short time.

For this reason, `scholidonline` tries to access providers efficiently and
politely. Two mechanisms are especially important:

- **Batching**, where multiple identifiers are resolved with one provider
  request when the provider supports it.
- **Throttling**, where the package waits between provider requests when
  needed.

Users usually do not need to manage these details manually. The exported
functions remain vectorized, and the return shape is the same regardless of
whether a provider request was scalar or batched internally.

Prefer vectorized calls such as:

```{r vectorized-calls, eval = TRUE}
scholidonline::id_exists(
  c("31452104", "999999999"),
  type = "pmid",
  provider = "ncbi"
)
```

over manual loops such as:

```{r manual-loops, eval = TRUE}
vapply(
  c("31452104", "999999999"),
  function(x) {
    scholidonline::id_exists(
      x,
      type = "pmid",
      provider = "ncbi"
    )
  },
  logical(1)
)
```

Vectorized calls give the package an opportunity to use provider-supported
batching and avoid unnecessary repeated requests.

Provider etiquette is especially relevant for scripted workflows, large
identifier vectors, repeated checks during development, and automated tests
that query live services. Even when each individual request is valid, many
rapid requests can make a provider temporarily unavailable for the current
session or client.
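
For very large identifier vectors, one useful pattern is to process the input
in moderate chunks while keeping each call vectorized, so provider-supported
batching is still available within each chunk. The sketch below is
illustrative: the file name `pmids.txt` is a placeholder, and the chunk size of
100 is an assumption, not a documented provider limit:

```{r chunked-processing, eval = FALSE}
pmids <- readLines("pmids.txt")  # placeholder: a large vector of PMIDs

# Split into chunks of 100 identifiers; each chunk is still one
# vectorized call, so the package can batch where supported.
chunks <- split(pmids, ceiling(seq_along(pmids) / 100))

results <- unlist(
  lapply(chunks, scholidonline::id_exists, type = "pmid", provider = "ncbi"),
  use.names = FALSE
)
```

Chunking also makes long runs easier to interrupt and resume, since partial
results are produced chunk by chunk.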

# Batching

Batching means that `scholidonline` may resolve multiple identifiers using a
single provider request. This is an internal optimization. It does not change
the public API or the shape of returned objects.

For example, `id_exists()` still returns one logical value per input:

```{r id-exists-batching, eval = TRUE}
scholidonline::id_exists(
  c("31452104", "999999999", NA_character_),
  type = "pmid",
  provider = "ncbi"
)
```

Likewise, `id_metadata()` still returns one row per input identifier:

```{r id-metadata-batching, eval = TRUE}
scholidonline::id_metadata(
  c("31452104", "999999999", NA_character_),
  type = "pmid",
  provider = "ncbi"
)
```

`id_links()` still returns a long data frame of discovered links:

```{r id-links-batching, eval = TRUE}
scholidonline::id_links(
  c("PMC6784763", "PMC999999999", NA_character_),
  type = "pmcid",
  provider = "ncbi"
)
```

And `id_convert()` still returns one converted identifier per input:

```{r id-convert-batching, eval = TRUE}
scholidonline::id_convert(
  c("31469695", "999999999", NA_character_),
  from = "pmid",
  to = "pmcid",
  provider = "ncbi"
)
```

Batching is provider- and operation-specific. Some providers offer clean
multi-identifier endpoints; others do not. `scholidonline` uses batching only
where the provider interface supports reliable mapping back to the original
input identifiers.

For example, batching is used for selected arXiv operations and for selected
NCBI-backed PMID, PMCID, and DOI operations. These include existence checks,
metadata retrieval, linked-identifier lookup, and supported identifier
conversions where the provider response can be mapped safely back to the input
vector.

When batching is not available, the package falls back to scalar provider calls
while preserving the same public return contract. This means users can write
the same vectorized code regardless of whether a provider currently supports a
batch endpoint for that operation.

Batching also helps with provider etiquette because one request for a vector of
identifiers is usually preferable to one request per identifier. For this
reason, vectorized calls should generally be preferred over manual loops.

# Throttling

Throttling means that `scholidonline` may wait before making a provider
request. The first request to a provider usually runs immediately. Later
requests may wait if they occur too soon after the previous request.
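
The waiting behaviour can be pictured as a minimum-interval throttle: record
when the last request was made, and sleep if the next one arrives too soon.
The closure below is a simplified base-R illustration of this idea, not the
package's actual implementation:

```{r throttle-sketch, eval = FALSE}
# Conceptual sketch of a minimum-interval throttle. `make_throttle()`
# returns a function that sleeps just long enough to keep at least
# `min_interval` seconds between consecutive calls.
make_throttle <- function(min_interval) {
  last <- -Inf
  function() {
    now <- as.numeric(Sys.time())
    wait <- min_interval - (now - last)
    if (wait > 0) Sys.sleep(wait)
    last <<- as.numeric(Sys.time())
  }
}

wait_ncbi <- make_throttle(0.34)
wait_ncbi()  # first call returns immediately
wait_ncbi()  # later calls wait until 0.34 s have elapsed
```

The package applies this kind of pacing automatically before provider
requests, so user code does not need to call anything like this directly.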

Package-managed rate limiting is enabled by default:

```{r rate-limit-default, eval = TRUE}
options(scholidonline.rate_limit = TRUE)
```

Users can disable package-managed waiting:

```{r disable-rate-limiting, eval = FALSE}
options(scholidonline.rate_limit = FALSE)
```

Provider-specific intervals can also be adjusted. For example, arXiv access is
intentionally conservative:

```{r arxiv-throttling, eval = TRUE}
options(scholidonline.arxiv.min_interval = 3)
```

NCBI requests use a shorter default interval of 0.34 seconds, which
corresponds to roughly three requests per second:

```{r ncbi-throttling, eval = TRUE}
options(scholidonline.ncbi.min_interval = 0.34)
```

Europe PMC requests can also be controlled separately:

```{r epmc-throttling, eval = TRUE}
options(scholidonline.epmc.min_interval = 1)
```

These options affect future requests in the current R session. They do not
change the meaning of results.

The rate limiter is process-local: it tracks requests made in the current R
session and is not shared across parallel R sessions, background R processes,
or separate machines. If you run highly parallel code, each R process keeps
its own rate-limit state, so the combined request rate across workers can
exceed what a single throttled process would send.
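
If you do parallelize, one conservative option is to lengthen each worker's
interval so the combined request rate stays close to the single-process rate.
A sketch using the base `parallel` package, assuming four workers and the
NCBI interval option shown above:

```{r parallel-throttle, eval = FALSE}
library(parallel)

cl <- makeCluster(4)

# Each worker has its own rate-limit state, so scale the per-worker
# interval by the number of workers to keep the combined rate
# comparable to a single throttled process.
clusterEvalQ(cl, options(scholidonline.ncbi.min_interval = 0.34 * 4))

# ... run parLapply()/clusterMap() work here ...

stopCluster(cl)
```

This keeps aggregate provider load roughly constant regardless of the worker
count, at the cost of slower individual workers.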

A provider failure is not the same as a confirmed absence. In `id_exists()`,
the return values have distinct meanings:

- `TRUE`: the provider returned usable evidence that the identifier exists.
- `FALSE`: the provider returned usable evidence that the identifier does not
  exist.
- `NA`: the identifier could not be checked reliably, for example because it
  could not be normalized, the provider was unavailable, or the provider
  response could not be interpreted safely.

This distinction matters for live services. A temporary rate-limit response,
service outage, malformed response, or network failure should not be treated as
evidence that an identifier does not exist. In such cases, `NA` is the safer
result.
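
When acting on `id_exists()` results, it is safest to branch on all three
values explicitly, for example by keeping identifiers that returned `NA` for a
later retry rather than discarding them:

```{r handle-na-results, eval = FALSE}
ids <- c("31452104", "999999999", "31469695")
res <- scholidonline::id_exists(ids, type = "pmid", provider = "ncbi")

exists_ids <- ids[!is.na(res) & res]   # confirmed to exist
absent_ids <- ids[!is.na(res) & !res]  # confirmed not to exist
unchecked  <- ids[is.na(res)]          # could not be checked; retry
                                       # later, do not treat as absent
```

Note that a plain `ids[res]` subset would be wrong here, because `NA` values
in a logical index produce `NA` entries rather than dropping the unchecked
identifiers.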

For normal use, it is best to keep rate limiting enabled and to prefer
vectorized calls over manual loops. Users who need stricter provider etiquette
can increase the provider-specific intervals. Users who already manage request
pacing externally can disable package-managed waiting with
`options(scholidonline.rate_limit = FALSE)`.