---
title: "Format Backends"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Format Backends}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

vectra reads five formats: its native `.vtr`, CSV, SQLite, Excel, and GeoTIFF. All but Excel can also be written. Each reader returns the same `vectra_node` object. Once a node exists, the engine treats it identically regardless of where the data came from. A filter on a CSV scan produces the same plan tree as a filter on a `.vtr` scan. The same verbs work, the same expressions evaluate, and `collect()` materializes an R data.frame either way.

This matters because format choice becomes a deployment decision, not a code decision. We can prototype a pipeline against a CSV export, convert the data to `.vtr` for production, and the pipeline code stays the same apart from the opening `tbl_*()` call.

Writers mirror the readers. `write_vtr()`, `write_csv()`, `write_sqlite()`, and `write_tiff()` are all S3 generics dispatching on both `vectra_node` and `data.frame`. When the input is a node, writes stream batch-by-batch through the plan tree; the full dataset never needs to fit in memory. This is the core of vectra's format conversion story: pipe a reader into a writer, and data flows from source format to target format in bounded memory.

```{r intro-demo}
library(vectra)

# Write mtcars to .vtr, then read it back lazily
f <- tempfile(fileext = ".vtr")
write_vtr(mtcars, f)

node <- tbl(f)
node
```

The node prints its schema but holds no data. Columns, types, and row count are in the file header; the actual values stay on disk until `collect()`.

## The .vtr format

`.vtr` is vectra's native binary columnar format. It stores data in row groups, where each row group contains all columns for a slice of rows.
The current version (v4) applies a two-stage encoding stack per column per row group: first a logical encoding (dictionary for low-cardinality strings, delta for monotonic integers, or plain pass-through), then byte-level compression via Zstandard when it actually shrinks the data.

This layout gives the scan node several optimization paths that other formats cannot support. Zone-map statistics (min/max per column per row group) let the engine skip entire row groups that cannot match a filter predicate. Column pruning avoids reading columns that the query never references. And hash indexes (`.vtri` sidecar files) enable O(1) row group lookup on equality predicates.

```{r vtr-roundtrip}
f <- tempfile(fileext = ".vtr")
write_vtr(mtcars, f)

tbl(f) |>
  filter(cyl == 6) |>
  select(mpg, cyl, hp) |>
  collect()
```

When we call `write_vtr()` with a `vectra_node` input, the writer streams batches from the upstream plan and writes each batch as one row group. The full result never materializes in memory. For `data.frame` inputs, the data is written directly from R's memory in a single row group.

The `batch_size` parameter on `write_vtr()` controls how many rows accumulate before flushing a row group. Smaller row groups mean more granular zone-map statistics and finer predicate pruning. Larger row groups reduce per-group overhead and compress better.

```{r vtr-batch-size}
f <- tempfile(fileext = ".vtr")
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv, row.names = FALSE)

# Convert CSV to .vtr with 10-row row groups
tbl_csv(csv) |>
  write_vtr(f, batch_size = 10)

# The file now has multiple row groups
tbl(f) |>
  collect() |>
  nrow()
```

`append_vtr()` adds new row groups to an existing file without rewriting it. This is useful for incremental data ingestion where new batches arrive over time.

## CSV

`tbl_csv()` opens a CSV file for streaming reads.
Column types are inferred from the first 1000 rows: numeric-looking columns become doubles, integer-looking columns become integers, and everything else is a string. Gzip-compressed files (`.csv.gz`) are detected and decompressed transparently.

```{r csv-read}
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv, row.names = FALSE)

tbl_csv(csv) |>
  filter(hp > 200) |>
  select(mpg, hp, wt) |>
  collect()
```

The `batch_size` parameter controls how many rows the CSV scanner reads per internal batch. The default (65536) works well for most files. Smaller values reduce peak memory at the cost of more read calls.

`write_csv()` streams from any node. It writes a standard comma-separated file with a header row. There is no gzip output option; if we need compressed output, we convert to `.vtr` instead, which applies Zstandard automatically.

```{r csv-write}
f <- tempfile(fileext = ".vtr")
write_vtr(mtcars, f)

out_csv <- tempfile(fileext = ".csv")
tbl(f) |>
  filter(cyl == 4) |>
  write_csv(out_csv)

# Verify the output
read.csv(out_csv) |> head()
```

CSV has no predicate pushdown or column pruning. Every row and every column is read and parsed, then the engine's filter and project nodes discard what the query does not need. For one-off analyses this is fine. For repeated queries on the same data, converting to `.vtr` once and querying many times is faster.

## SQLite

`tbl_sqlite()` connects to a SQLite database and streams a table through vectra's engine. Column types are inferred from the declared types in the `CREATE TABLE` statement. All filtering, grouping, and aggregation happen in vectra's C engine, not via SQL queries. The SQLite library is used only as a row source.

```{r sqlite-read}
db <- tempfile(fileext = ".sqlite")
write_sqlite(mtcars, db, "cars")

tbl_sqlite(db, "cars") |>
  filter(mpg > 25) |>
  select(mpg, cyl, wt) |>
  collect()
```

The `table` argument names which table to scan. A database can hold many tables; each `tbl_sqlite()` call opens one.
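Because one database file can back several nodes, two tables in the same file can feed a single query. A minimal sketch, assuming a second `labels` lookup table written alongside `cars` (both table names are illustrative):

```r
# Not run: two tables in one SQLite file, joined in vectra's engine
db <- tempfile(fileext = ".sqlite")
write_sqlite(mtcars, db, "cars")
write_sqlite(
  data.frame(cyl = c(4, 6, 8), label = c("small", "mid", "big")),
  db, "labels"
)

# Each tbl_sqlite() call opens one table; the join streams both
tbl_sqlite(db, "cars") |>
  left_join(tbl_sqlite(db, "labels"), by = "cyl") |>
  select(mpg, cyl, label) |>
  collect() |>
  head()
```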
SQLite is a natural choice when data already lives in a database, or when other tools (Python, command-line utilities, web applications) need to read the same file. The format is self-contained, widely supported, and handles concurrent reads well.

`write_sqlite()` streams from any node into a SQLite table. If the table already exists, rows are appended. This makes it straightforward to export vectra query results for consumption by non-R tools.

```{r sqlite-write}
f <- tempfile(fileext = ".vtr")
write_vtr(mtcars, f)

db <- tempfile(fileext = ".sqlite")
tbl(f) |>
  filter(cyl == 8) |>
  write_sqlite(db, "v8_cars")

# Read it back through vectra
tbl_sqlite(db, "v8_cars") |> collect()
```

## Excel

`tbl_xlsx()` reads a sheet from an Excel workbook. It requires the `openxlsx2` package, which is listed in Suggests. The sheet is specified by name or by 1-based index (default: first sheet).

Unlike the other backends, Excel support is read-only. There is no `write_xlsx()` function. Excel's format is complex, and writing it adds little value when CSV or SQLite serve the same interoperability purpose with less overhead.

Under the hood, `tbl_xlsx()` reads the sheet into a data.frame via `openxlsx2` and then converts it to a vectra node. This means the sheet does load into memory during the initial read, but all subsequent operations (filter, mutate, join) are lazy and stream through the C engine.

```r
# Not run (requires openxlsx2)
node <- tbl_xlsx("survey_results.xlsx", sheet = "Q1_2025")

node |>
  filter(score >= 80) |>
  select(respondent, score, region) |>
  collect()
```

For large Excel files that do not fit in memory, export to CSV first and use `tbl_csv()`.

## GeoTIFF

`tbl_tiff()` reads raster data. Each pixel becomes a row with `x` and `y` columns (pixel center coordinates derived from the geotransform) and one column per band (`band1`, `band2`, etc.). NoData values become `NA`.
This tabular representation lets us apply the full verb set to raster data: filter by spatial extent, threshold band values, compute derived indices, join with point observations.

The default `batch_size` is 256 raster rows, much smaller than the CSV/SQLite default. Raster data is dense: a 10,000 x 10,000 single-band TIFF has 100 million pixels, so each raster row of 10,000 pixels already constitutes a meaningful batch.

```r
# Not run (requires a GeoTIFF file)
tbl_tiff("temperature.tif") |>
  filter(band1 > 25, x >= 10, x <= 20) |>
  collect()
```

`write_tiff()` writes query results back to a GeoTIFF. The data must contain `x` and `y` columns and one or more numeric band columns. Grid dimensions and the geotransform are inferred from the coordinate arrays. The `compress` parameter enables DEFLATE compression.

### Integer pixel types

By default, `write_tiff()` writes 64-bit float pixels. The `pixel_type` parameter selects a smaller representation: `"int16"`, `"int32"`, `"uint8"`, `"uint16"`, or `"float32"`. Smaller types compress better and produce much smaller files: a global climate raster stored as `int16` + DEFLATE can be 5-10x smaller than the equivalent `float64`.

```r
# Scale temperature to integer: value * 100, stored as int16
df |>
  mutate(band1 = round(band1 * 100)) |>
  write_tiff("temperature_int16.tif", compress = TRUE, pixel_type = "int16")
```

`NA` values in the input are automatically mapped to the integer nodata value (-32768 for `int16`, 65535 for `uint16`, etc.) and tagged in the GDAL\_NODATA header. The reader converts them back to `NA` on read.

### Embedded metadata

The `metadata` parameter writes a GDAL\_METADATA XML string into TIFF tag 42112. This is useful for recording scale factors, offsets, variable names, or provenance without a sidecar file.
```r
xml <- '<GDALMetadata>
  <Item name="SCALE">0.01</Item>
  <Item name="OFFSET">0</Item>
  <Item name="variable">bio1</Item>
</GDALMetadata>'

write_tiff(df, "bio1.tif", pixel_type = "int16", metadata = xml)

# Read the metadata back
tiff_metadata("bio1.tif")
```

`tiff_metadata()` returns the XML string, or `NA` if the tag is absent.

### Point extraction

`tiff_extract_points()` samples band values at specific `(x, y)` coordinates directly from the TIFF file. Only the strips containing query points are read, making it efficient for sparse lookups on large rasters. No spatial package is needed at runtime.

```r
pts <- data.frame(x = c(10.5, 11.2), y = c(47.1, 47.3))
tiff_extract_points("temperature.tif", pts)
```

This round-trip capability makes vectra useful for raster ETL: read a TIFF, filter or transform it with standard verbs, write a new TIFF. No dependency on spatial packages is needed for the read-process-write loop itself, though `terra::rast(df, type = "xyz")` can convert results to `SpatRaster` objects when spatial operations are needed.

## Streaming conversion pipelines

The practical payoff of uniform node types is format conversion with zero full-dataset materialization. We pipe a reader into a writer, and data flows from source to target one batch at a time. Memory use stays proportional to one batch, not the full dataset.

The simplest case: convert CSV to `.vtr`.

```{r convert-csv-vtr}
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv, row.names = FALSE)

vtr <- tempfile(fileext = ".vtr")
tbl_csv(csv) |> write_vtr(vtr)

tbl(vtr) |> collect() |> head()
```

We can filter during conversion. Only rows passing the predicate are written.

```{r convert-filtered}
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv, row.names = FALSE)

vtr <- tempfile(fileext = ".vtr")
tbl_csv(csv) |>
  filter(mpg > 20) |>
  mutate(kpl = mpg * 0.425144) |>
  write_vtr(vtr)

tbl(vtr) |> collect()
```

Multi-step ETL works the same way. Read from one format, transform, write to another.
```{r etl-pipeline}
# CSV -> filter + transform -> SQLite
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv, row.names = FALSE)

db <- tempfile(fileext = ".sqlite")
tbl_csv(csv) |>
  filter(cyl >= 6) |>
  select(mpg, cyl, hp, wt) |>
  mutate(power_weight = hp / wt) |>
  write_sqlite(db, "powerful_cars")

# SQLite -> VTR
vtr <- tempfile(fileext = ".vtr")
tbl_sqlite(db, "powerful_cars") |> write_vtr(vtr)

tbl(vtr) |> collect()
```

Each arrow in the pipeline is a streaming connection. The CSV scanner reads a batch, the filter discards rows, the project computes new columns, and the SQLite writer inserts the result. Then the next batch flows through. At no point does the full dataset exist in memory. For datasets that dwarf available RAM, this is the only way these conversions can work.

Conversion pipelines also compose with joins. We can read two sources in different formats and join them.

```{r join-across-formats}
f1 <- tempfile(fileext = ".vtr")
f2 <- tempfile(fileext = ".csv")

cars_main <- mtcars[, c("mpg", "cyl", "hp")]
cars_extra <- data.frame(cyl = c(4, 6, 8), label = c("small", "mid", "big"))

write_vtr(cars_main, f1)
write.csv(cars_extra, f2, row.names = FALSE)

tbl(f1) |>
  left_join(tbl_csv(f2), by = "cyl") |>
  collect() |>
  head()
```

## Batch size

The `batch_size` parameter appears on readers and on `write_vtr()`. It controls different things depending on context.

On **readers** (`tbl_csv`, `tbl_sqlite`), batch size sets how many rows the scanner reads per pull from the source. The default 65536 balances memory use against per-call overhead. Larger values read more rows per call, which can be faster for simple scan-heavy queries. Smaller values keep peak memory lower, which matters when individual rows are wide (many columns or long strings).

On **`write_vtr()`**, batch size controls how many rows accumulate before flushing a row group to disk. This directly affects the structure of the output file.
Each row group carries its own zone-map statistics (min and max per column), so more row groups mean finer-grained predicate pushdown. Fewer row groups mean less metadata overhead and better compression ratios, because the compressor sees more data per block.

```{r batch-size-effect}
csv <- tempfile(fileext = ".csv")
big <- data.frame(
  id = seq_len(1000),
  value = rnorm(1000)
)
write.csv(big, csv, row.names = FALSE)

# Small row groups: more granular zone maps
f_small <- tempfile(fileext = ".vtr")
tbl_csv(csv) |> write_vtr(f_small, batch_size = 100)

# Default: single row group for 1000 rows
f_default <- tempfile(fileext = ".vtr")
tbl_csv(csv) |> write_vtr(f_default)

cat("Small batches:", file.size(f_small), "bytes\n")
cat("Default:      ", file.size(f_default), "bytes\n")
```

For `tbl_tiff`, the default batch size is 256 raster rows, reflecting the high per-row density of raster data. Increasing it speeds up reads on files with few bands. Decreasing it helps when many bands produce very wide rows.

There is no batch size parameter on `tbl()` (the `.vtr` reader). Row groups are defined by the file; the scanner reads one row group per `next_batch()` call, respecting whatever batch size was chosen at write time.

## Format comparison

The table below summarizes what each format supports. The right format depends on the workload: `.vtr` for repeated analytical queries, CSV for interchange, SQLite for multi-tool access, Excel for one-off imports, GeoTIFF for raster data.
| Feature | .vtr | CSV | SQLite | Excel | GeoTIFF |
|:--------|:-----|:----|:-------|:------|:--------|
| Streaming read | yes | yes | yes | no | yes |
| Streaming write | yes | yes | yes | -- | yes |
| Predicate pushdown | yes | no | no | no | no |
| Column pruning | yes | no | no | no | no |
| Zone-map skip | yes | no | no | no | no |
| Hash index support | yes | no | no | no | no |
| Compression | zstd | gzip (read) | -- | -- | deflate |
| Size on disk | small | large | medium | medium | variable |
| Random access | by row group | no | by rowid | no | by raster row |
| External tool support | vectra only | universal | wide | wide | GIS tools |

`.vtr` wins on query performance because it is the only format where the scanner can skip data before reading it. Zone maps, column pruning, and hash indexes all reduce I/O. For a dataset queried repeatedly, converting to `.vtr` once pays for itself quickly.

CSV and SQLite have broad tool support. If the data needs to flow to Python, a dashboard, or a command-line tool, these formats avoid lock-in. SQLite adds transactional writes and concurrent read access that flat files lack.

GeoTIFF is specialized. It carries spatial metadata (coordinate reference system, geotransform) that the other formats do not. When the output is a raster, GeoTIFF is the right choice; when the output is a table derived from raster data, `.vtr` or CSV may be more practical.
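That last combination can be sketched end-to-end: derive a table from a raster, then store it as `.vtr` so repeated queries benefit from zone maps and column pruning. A sketch, assuming a hypothetical `climate.tif` whose band stores temperature scaled by 100:

```r
# Not run: raster -> table -> .vtr (climate.tif is a hypothetical input)
vtr <- tempfile(fileext = ".vtr")

tbl_tiff("climate.tif") |>
  mutate(temp_c = band1 * 0.01) |>   # undo the stored scale factor
  select(x, y, temp_c) |>
  write_vtr(vtr)                     # streams pixel batches to row groups

# Later queries scan the .vtr, skipping row groups via zone maps
tbl(vtr) |>
  filter(temp_c > 25) |>
  collect()
```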