---
title: "Vaccination Data with SI-PNI"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vaccination Data with SI-PNI}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The **SI-PNI (Sistema de Informacao do Programa Nacional de Imunizacoes)** is Brazil's national immunization information system, managed by the Ministry of Health. It tracks vaccination doses applied and coverage rates across the country.

The `healthbR` package provides access to SI-PNI data from **two sources**:

| Source | Years | Data type | Granularity | Format |
|--------|-------|-----------|-------------|--------|
| **FTP DATASUS** | 1994--2019 | Aggregated counts | Annual per UF | .DBF files |
| **OpenDataSUS CSV** | 2020--2025 | Individual-level microdata | Monthly national | CSV bulk downloads |

`sipni_data()` automatically routes to the correct source based on the requested year.

## Data sources comparison

| Feature | FTP (1994--2019) | CSV (2020--2025) |
|---------|------------------|------------------|
| Record type | Aggregated (dose counts per municipality/vaccine/age) | Individual (one row per vaccination dose) |
| File types | DPNI (doses) or CPNI (coverage) | Single type (microdata) |
| Variables | 7--12 per type | ~47 per record |
| File size | Small (~100 KB per UF/year) | Large (~1.4 GB ZIP per month, national) |
| Naming | UPPERCASE column names | snake_case column names |

## Getting started

```{r setup}
library(healthbR)
library(dplyr)
```

### Check available years

```{r}
sipni_years()
#> [1] 1994 1995 ... 2024 2025
```

### Module information

```{r}
sipni_info()
```

## FTP path: doses applied (DPNI)

The default type downloads aggregated dose counts (1994--2019):

```{r}
# doses applied in Acre, 2019
ac_doses <- sipni_data(year = 2019, uf = "AC")
ac_doses
```

### Key variables (DPNI)

| Variable | Description |
|----------|-------------|
| ANO | Reference year |
| UF | UF code (IBGE 2 digits) |
| MUNIC | Municipality code (IBGE 6 digits) |
| IMUNO | Immunobiological code |
| DOSE | Dose type (1st, 2nd, booster, etc.) |
| QT_DOSE | Number of doses applied |
| FX_ETARIA | Age group (coded) |

### Using the dictionary

```{r}
# vaccine codes
sipni_dictionary("IMUNO")

# dose types
sipni_dictionary("DOSE")

# age groups
sipni_dictionary("FX_ETARIA")
```

## FTP path: vaccination coverage (CPNI)

The CPNI type provides coverage rates per municipality:

```{r}
# vaccination coverage in Acre, 2019
ac_coverage <- sipni_data(year = 2019, type = "CPNI", uf = "AC")
ac_coverage
```

### Key variables (CPNI)

| Variable | Description |
|----------|-------------|
| ANO | Reference year |
| UF | UF code (IBGE 2 digits) |
| MUNIC | Municipality code (IBGE 6 digits) |
| IMUNO | Immunobiological code |
| QT_DOSE | Number of doses applied |
| POP | Target population |
| COBERT | Vaccination coverage (%) |

## CSV path: individual-level microdata (2020+)

For years 2020 and later, SI-PNI provides individual-level microdata (one row per vaccination dose).
The `type` parameter is ignored for these years:

```{r}
# microdata for Acre, January 2024
ac_micro <- sipni_data(year = 2024, uf = "AC", month = 1)
ac_micro
```

### Key variables (CSV microdata)

| Variable | Description |
|----------|-------------|
| sigla_uf_estabelecimento | UF of the health facility |
| codigo_municipio_estabelecimento | Municipality (IBGE) |
| tipo_sexo_paciente | Sex (M/F) |
| numero_idade_paciente | Patient age |
| nome_raca_cor_paciente | Race/color (descriptive) |
| descricao_vacina | Vaccine name |
| descricao_dose_vacina | Dose description |
| data_vacina | Vaccination date |

### Exploring variables

```{r}
# DPNI variables (FTP)
sipni_variables()

# CPNI variables (FTP)
sipni_variables(type = "CPNI")

# API/CSV variables (2020+)
sipni_variables(type = "API")

# search
sipni_variables(search = "dose")
```

## Month parameter for CSV data

For years >= 2020, each month is a separate ~1.4 GB national CSV file. Use `month` to select specific months:

```{r}
# single month
jan <- sipni_data(year = 2024, uf = "AC", month = 1)

# first quarter
q1 <- sipni_data(year = 2024, uf = "AC", month = 1:3)

# all 12 months (default, downloads ~17 GB total)
full_year <- sipni_data(year = 2024, uf = "AC")
```

For FTP data (1994--2019), the `month` parameter is ignored because FTP files are annual.
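
Before requesting several months, it can help to estimate how much data a call will pull. The sketch below is a back-of-the-envelope calculation only: `estimate_sipni_download_gb()` is a hypothetical helper, not part of `healthbR`, and it assumes the approximate sizes quoted in this vignette (~1.4 GB per national CSV month, ~100 KB per FTP UF/year file).

```{r}
# Hypothetical helper (not part of healthbR): rough download size in GB
# for a sipni_data() request, using the approximate sizes quoted above.
estimate_sipni_download_gb <- function(year, month = 1:12, n_uf = 1) {
  csv_years <- year[year >= 2020]  # served as monthly national CSV files
  ftp_years <- year[year < 2020]   # served as annual per-UF FTP files

  # CSV: one national file of ~1.4 GB per month, whatever UF is requested
  csv_gb <- length(csv_years) * length(month) * 1.4

  # FTP: one file of ~100 KB per UF and year; `month` plays no role here
  ftp_gb <- length(ftp_years) * n_uf * 100 / 1024^2

  csv_gb + ftp_gb
}

estimate_sipni_download_gb(2024, month = 1)    # single month
estimate_sipni_download_gb(2024, month = 1:3)  # first quarter
estimate_sipni_download_gb(2024)               # all 12 months
estimate_sipni_download_gb(2015:2019)          # FTP years only
```

Under these assumptions a full CSV year comes to roughly 17 GB, while five FTP years for one UF stay well under a megabyte, which is why starting with a single month is the safer default for 2020+ data.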
## Example: vaccine doses by immunobiological (FTP)

```{r}
ac_2019 <- sipni_data(year = 2019, uf = "AC")

# decode immunobiological names
imuno_labels <- sipni_dictionary("IMUNO") |>
  select(code, label)

doses_by_vaccine <- ac_2019 |>
  group_by(IMUNO) |>
  summarize(total_doses = sum(as.integer(QT_DOSE), na.rm = TRUE),
            .groups = "drop") |>
  left_join(imuno_labels, by = c("IMUNO" = "code")) |>
  arrange(desc(total_doses))

doses_by_vaccine
```

## Example: coverage trends over time

```{r}
# coverage data for Sao Paulo, 2015-2019
sp_cov <- sipni_data(
  year = 2015:2019,
  type = "CPNI",
  uf = "SP"
)

# average coverage by year
sp_cov |>
  group_by(year) |>
  summarize(
    mean_coverage = mean(as.numeric(COBERT), na.rm = TRUE),
    .groups = "drop"
  )
```

## Example: individual-level analysis (2020+)

```{r}
# COVID-19 vaccinations in Acre, January 2024
ac_jan <- sipni_data(year = 2024, uf = "AC", month = 1)

# vaccines administered
ac_jan |>
  count(descricao_vacina, sort = TRUE)

# doses by sex
ac_jan |>
  count(tipo_sexo_paciente)

# age distribution
ac_jan |>
  mutate(age = as.integer(numero_idade_paciente)) |>
  filter(!is.na(age)) |>
  mutate(age_group = cut(age, breaks = c(0, 5, 12, 18, 30, 60, Inf),
                         right = FALSE)) |>
  count(age_group)
```

## Mixed year requests

When requesting years that span both sources (e.g., 2019 and 2024), `sipni_data()` fetches from FTP and CSV respectively and combines the results. Note that column names and structure differ between sources:

```{r}
# this downloads FTP (2019) + CSV (2024)
mixed <- sipni_data(year = c(2019, 2024), uf = "AC", month = 1)

# columns from FTP (UPPERCASE) and CSV (snake_case) are combined
# with NAs where columns don't overlap
names(mixed)
```

## Download tips

- **FTP files** (1994--2019) are small (~100 KB each) and download quickly.
- **CSV files** (2020+) are large (~1.4 GB per month, national). Start with a single month and UF.
- The first download of a CSV month caches **all 27 UFs**.
  A second request for a different UF from the same month is served instantly from the cache.
- Multiple months are downloaded concurrently when possible.

## Smart type parsing

By default, `sipni_data()` parses columns into appropriate types (for example, `QT_DOSE` becomes an integer). Set `parse = FALSE` to keep the raw character columns:

```{r}
# parsed types (default)
ac <- sipni_data(year = 2019, uf = "AC")
class(ac$QT_DOSE) # integer

# raw character columns
ac_raw <- sipni_data(year = 2019, uf = "AC", parse = FALSE)
```

## Cache management

Downloaded data is cached locally for faster future access:

```{r}
# check cache status
sipni_cache_status()

# clear cache if needed
sipni_clear_cache()
```

If the `arrow` package is installed, data is cached in Parquet format. You can also use lazy evaluation:

```{r}
# lazy query for FTP data (requires arrow)
sipni_lazy <- sipni_data(year = 2019, uf = "AC", lazy = TRUE)

sipni_lazy |>
  filter(QT_DOSE > 0) |>
  select(IMUNO, DOSE, QT_DOSE) |>
  collect()
```

## Additional resources

- OpenDataSUS (`dadosabertos.saude.gov.br`)
- [Census vignette](censo-denominadores.html) for population denominators