---
title: "Getting Started with emburden"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with emburden}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

The **emburden** package provides tools for analyzing household energy burden using the Net Energy Return (Nh) methodology. This vignette walks through the basic workflow for calculating, aggregating, and analyzing energy burden metrics.

## Installation

You can install emburden from GitHub:

```{r installation, eval=FALSE}
# install.packages("devtools")
devtools::install_github("ericscheier/emburden")
```

```{r setup}
library(emburden)
library(dplyr)
```

## What is Energy Burden?

Energy burden is the ratio of household energy spending to gross income:

**Energy Burden (EB) = S / G**

Where:
- **S** = Energy spending (electricity, gas, other fuels)
- **G** = Gross household income

A household spending $3,000 on energy with $50,000 income has a 6% energy burden.
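
The same arithmetic applies to any number of households at once. The following base-R sketch uses made-up values and an illustrative data frame name (`example_households` is not part of the package):

```{r eb-arithmetic-sketch}
# Illustrative values only; not drawn from any dataset
example_households <- data.frame(
  gross_income = c(50000, 25000, 120000),   # G: annual gross income in dollars
  energy_spending = c(3000, 2500, 3600)     # S: annual energy spending in dollars
)

# Energy burden is spending divided by gross income, row by row
example_households$eb <-
  example_households$energy_spending / example_households$gross_income

example_households$eb   # 0.06 0.10 0.03
```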

## Quick Example: Single Household

```{r single-household}
# Calculate energy burden for a single household
gross_income <- 50000
energy_spending <- 3000

# Method 1: Direct energy burden
eb <- energy_burden_func(gross_income, energy_spending)
print(paste("Energy Burden:", scales::percent(eb)))

# Method 2: Via Net Energy Return (mathematically identical)
nh <- ner_func(gross_income, energy_spending)
neb <- 1 / (nh + 1)
print(paste("Net Energy Burden:", scales::percent(neb)))
print(paste("Net Energy Return:", round(nh, 2)))
```

For a single household, both methods give the same result: **6% energy burden**.
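
The agreement follows from a short identity. Assuming Nh is net income after energy spending per dollar of energy spending (a definition inferred here from the `1 / (nh + 1)` conversion above and from the 6% to 15.67 correspondence used later in this vignette), the algebra is:

$$
N_h = \frac{G - S}{S}
\quad\Longrightarrow\quad
\frac{1}{1 + N_h} = \frac{1}{1 + \frac{G - S}{S}} = \frac{S}{G} = \mathrm{EB}
$$

For the household above, $N_h = (50000 - 3000) / 3000 \approx 15.67$ and $1 / (1 + 15.67) \approx 0.06$.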

## Loading Data

The package automatically downloads data from OpenEI on first use:

```{r load-data, eval=FALSE}
# Load census tract data for North Carolina
nc_tracts <- load_census_tract_data(states = "NC")

# Load household cohort data by Area Median Income
nc_ami <- load_cohort_data(dataset = "ami", states = "NC")

# View structure
head(nc_ami)
```

## Calculating Metrics from Cohort Data

When working with pre-aggregated cohort data (total income and spending), calculate metrics from the totals:

```{r cohort-example, eval=FALSE}
# Calculate mean income and spending from totals
nc_data <- nc_ami %>%
  mutate(
    mean_income = total_income / households,
    mean_energy_spending = (total_electricity_spend +
                           coalesce(total_gas_spend, 0) +
                           coalesce(total_other_spend, 0)) / households
  ) %>%
  filter(!is.na(mean_income), !is.na(mean_energy_spending), households > 0) %>%
  mutate(
    eb = energy_burden_func(mean_income, mean_energy_spending),
    nh = ner_func(mean_income, mean_energy_spending),
    neb = neb_func(mean_income, mean_energy_spending)
  )
```
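
To see what the first `mutate()` step does, here is a base-R sketch on a toy table. The column names mirror the example above, the values are made up, and `na_to_zero()` is a hypothetical helper standing in for `coalesce(x, 0)`:

```{r cohort-means-sketch}
# Toy cohort table; values are invented for illustration
toy_cohorts <- data.frame(
  households = c(100, 200),
  total_income = c(4e6, 1.6e7),
  total_electricity_spend = c(150000, 400000),
  total_gas_spend = c(50000, NA),      # NA: no gas spending reported
  total_other_spend = c(NA, 100000)    # NA: no other-fuel spending reported
)

# Replace NA with 0, the base-R equivalent of dplyr::coalesce(x, 0)
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

total_spend <- toy_cohorts$total_electricity_spend +
  na_to_zero(toy_cohorts$total_gas_spend) +
  na_to_zero(toy_cohorts$total_other_spend)

# Per-household means come from dividing the cohort totals by household counts
toy_cohorts$mean_income <- toy_cohorts$total_income / toy_cohorts$households
toy_cohorts$mean_energy_spending <- total_spend / toy_cohorts$households

toy_cohorts[, c("mean_income", "mean_energy_spending")]
```

Without the NA replacement, a single missing fuel column would make the whole row's spending `NA`, and the row would then be dropped by the `filter()` step.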

## Aggregating Energy Burden (Critical!)

**Important**: Energy burden is a ratio, so it **cannot be aggregated with a simple arithmetic mean**!

### The WRONG Way

```{r wrong-aggregation, eval=FALSE}
# ❌ WRONG: Direct averaging of energy burden introduces ~1-5% error
eb_wrong <- weighted.mean(nc_data$eb, nc_data$households)
```

### The CORRECT Way: Via Net Energy Return

```{r correct-aggregation, eval=FALSE}
# ✅ CORRECT Method 1: Aggregate using Nh, then convert to NEB
nh_mean <- weighted.mean(nc_data$nh, nc_data$households)
neb_correct <- 1 / (1 + nh_mean)

# ✅ CORRECT Method 2: Use neb_func() with weights (simpler!)
# neb_func() automatically uses the Nh method internally
neb_correct2 <- neb_func(nc_data$mean_income,
                         nc_data$mean_energy_spending,
                         weights = nc_data$households)

print(paste("Correct NEB (manual Nh):", scales::percent(neb_correct)))
print(paste("Correct NEB (neb_func): ", scales::percent(neb_correct2)))
# Both give identical results!
```

**Why does this work?** The Nh transformation lets us use a simple arithmetic weighted mean instead of a harmonic mean, which makes aggregation both simpler and more intuitive. Calling `neb_func()` with weights does this automatically.
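
The difference is easy to see on a toy example. The sketch below uses two made-up cohorts and assumes Nh = (G - S) / S; the variable names are illustrative and nothing here calls the package:

```{r aggregation-sketch}
# Two made-up cohorts: mean income, mean energy spending, household counts
toy_income <- c(20000, 100000)
toy_spending <- c(2000, 3000)
toy_n <- c(100, 300)

toy_eb <- toy_spending / toy_income                    # per-cohort energy burden
toy_nh <- (toy_income - toy_spending) / toy_spending   # assumed Nh definition

# Direct household-weighted average of the burden ratios
eb_direct <- weighted.mean(toy_eb, toy_n)

# Nh route: arithmetic weighted mean of Nh, then convert back to a burden
neb_via_nh <- 1 / (1 + weighted.mean(toy_nh, toy_n))

# Household-weighted harmonic mean of the burdens, for comparison
eb_harmonic <- sum(toy_n) / sum(toy_n / toy_eb)

round(c(direct = eb_direct, via_nh = neb_via_nh, harmonic = eb_harmonic), 4)
```

With these numbers the direct average is 0.0475, while the Nh route and the weighted harmonic mean both give about 0.0364. The size of the gap depends entirely on the data, so treat these figures as illustration only.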

## Analysis by Income Bracket

```{r by-income, eval=FALSE}
# Method 1: Manual Nh aggregation
nc_by_income <- nc_data %>%
  group_by(income_bracket) %>%
  summarise(
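    # Compute the weighted mean before overwriting `households`: summarise()
    # evaluates in order, so the group total would replace the per-row weights
    nh_mean = weighted.mean(nh, households),
    neb = 1 / (1 + nh_mean),  # Correct aggregation
    households = sum(households),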
    .groups = "drop"
  )

# Method 2: Using neb_func() with weights (simpler!)
nc_by_income2 <- nc_data %>%
  group_by(income_bracket) %>%
  summarise(
    neb = neb_func(mean_income, mean_energy_spending, weights = households),
    households = sum(households),
    .groups = "drop"
  )

print(nc_by_income)
```
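
The grouped aggregation can also be sketched in base R. The table below is made up, the bracket labels are placeholders rather than LEAD categories, and `group_neb()` is a hypothetical helper:

```{r grouped-nh-sketch}
# Made-up cohort rows with precomputed Nh values
toy_groups <- data.frame(
  income_bracket = c("low", "low", "high", "high"),
  households = c(100, 300, 200, 200),
  nh = c(9, 19, 39, 49)
)

# Household-weighted mean Nh within one group, converted to a burden
group_neb <- function(d) 1 / (1 + weighted.mean(d$nh, d$households))

# Split rows by bracket and aggregate each piece separately
neb_by_bracket <- sapply(split(toy_groups, toy_groups$income_bracket), group_neb)
neb_by_bracket
```

The weights must be the per-row household counts. In a `summarise()` call, that means computing the weighted mean before any expression that reassigns `households` to a group total.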

## Identifying High Energy Burden Households

The 6% energy burden threshold is commonly used to identify energy poverty:

```{r high-burden, eval=FALSE}
# 6% energy burden corresponds to Nh = (1 - 0.06) / 0.06 ≈ 15.67
high_burden_threshold <- 15.67

high_burden_households <- sum(nc_data$households[nc_data$nh < high_burden_threshold])
total_households <- sum(nc_data$households)
high_burden_share <- high_burden_households / total_households

print(paste("Households with >6% energy burden:",
            scales::percent(high_burden_share)))
```
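
The 15.67 figure can be derived rather than memorized. This sketch assumes Nh = (G - S) / S, which implies Nh = (1 - EB) / EB; `burden_to_nh()` is a hypothetical helper and the cohort values are made up:

```{r threshold-sketch}
# Convert an energy burden threshold to the equivalent Nh threshold
burden_to_nh <- function(eb) (1 - eb) / eb

nh_cutoff <- burden_to_nh(0.06)
round(nh_cutoff, 2)   # 15.67

# Made-up cohorts: a burden above 6% corresponds to Nh below the cutoff
toy_burden <- c(0.03, 0.06, 0.08, 0.12)
toy_households <- c(400, 300, 200, 100)
toy_nh_values <- burden_to_nh(toy_burden)

# Share of households strictly above the 6% burden threshold
high_share <- sum(toy_households[toy_nh_values < nh_cutoff]) / sum(toy_households)
high_share   # 0.3
```

Note that the direction flips: higher burden means lower Nh. The unrounded cutoff is about 15.6667, so comparing against the rounded 15.67 would also count a cohort sitting at exactly 6% as high burden.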

## Using calculate_weighted_metrics()

For more complex grouped analysis, use the built-in function:

```{r weighted-metrics, eval=FALSE}
results <- calculate_weighted_metrics(
  graph_data = nc_ami,
  group_columns = "income_bracket",
  metric_name = "ner",
  metric_cutoff_level = 15.67,  # 6% burden threshold
  upper_quantile_view = 0.95,
  lower_quantile_view = 0.05
)

# Format for publication
results$formatted_median <- to_percent(results$metric_median)
print(results)
```

## Key Takeaways

1. **For single households**: Both EB and NEB give identical results
2. **For aggregation**: Always use the Nh method to avoid errors
3. **Never**: Directly average energy burden values
4. **Data loading**: Automatic from OpenEI (2018 and 2022 vintages available)
5. **Threshold**: Energy burden above 6% (Nh < 15.67) identifies high-burden households

## Temporal Comparison

The package provides a dedicated function for comparing energy burden across data vintages (2018 vs 2022):

```{r temporal-comparison, eval=FALSE}
# Compare by income bracket
comparison <- compare_energy_burden(
  dataset = "ami",
  states = "NC",
  group_by = "income_bracket"
)

# View results
print(comparison)

# The function automatically:
# - Loads both 2018 and 2022 data
# - Normalizes schema differences (4 vs 6 AMI brackets)
# - Performs proper Nh-based aggregation
# - Calculates changes in energy burden

# Grouping options:
# - "income_bracket": Compare by AMI/FPL brackets (default)
# - "state": Compare multiple states
# - "none": Overall state-level comparison

# Example: State-level comparison
state_comparison <- compare_energy_burden(
  dataset = "ami",
  states = "NC",
  group_by = "none"
)

# Access specific metrics
state_comparison$neb_2018         # 2018 energy burden
state_comparison$neb_2022         # 2022 energy burden
state_comparison$neb_change_pp    # Change in percentage points
state_comparison$neb_change_pct   # Relative change percentage
```

This is much simpler than manually loading and aggregating both vintages!
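
The two change columns answer different questions. The sketch below shows the conventional definitions of a percentage-point change and a relative change on made-up values; it is an assumption that `neb_change_pp` and `neb_change_pct` follow these definitions, so check the function documentation before relying on it:

```{r change-metrics-sketch}
# Made-up burdens for two vintages (proportions, not percentages)
toy_neb_2018 <- 0.040
toy_neb_2022 <- 0.046

# Absolute change in percentage points
toy_change_pp <- (toy_neb_2022 - toy_neb_2018) * 100

# Relative change as a percentage of the 2018 value
toy_change_pct <- (toy_neb_2022 - toy_neb_2018) / toy_neb_2018 * 100

c(change_pp = toy_change_pp, change_pct = toy_change_pct)
```

Here a move from 4.0% to 4.6% is a 0.6 percentage point increase but a 15% relative increase.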

## Analyzing Energy Burden by Housing Characteristics

The LEAD Tool data includes detailed housing characteristics that enable analysis of how building attributes affect energy burden. Four key housing dimension columns are available:

- **TEN**: Housing tenure (1=Owned free/clear, 2=Owned with mortgage, 3=Rented, 4=Occupied without rent)
- **TEN-YBL6**: Tenure crossed with year structure built (6 categories)
- **TEN-BLD**: Tenure crossed with building type (single-family, multi-unit, etc.)
- **TEN-HFL**: Tenure crossed with primary heating fuel type (gas, electric, oil, etc.)

These columns preserve granular housing detail through the data aggregation process, allowing you to analyze energy burden patterns across different housing types.

### Example: Comparing Renters vs Owners by Heating Fuel

```{r housing-analysis, eval=FALSE}
# Load data with housing characteristics
nc_housing <- load_cohort_data(dataset = "ami", states = "NC")

# Analyze energy burden by tenure and heating fuel
housing_analysis <- nc_housing %>%
  filter(!is.na(TEN), !is.na(`TEN-HFL`)) %>%
  mutate(
    mean_income = total_income / households,
    mean_energy_spending = (total_electricity_spend +
                           coalesce(total_gas_spend, 0) +
                           coalesce(total_other_spend, 0)) / households,
    nh = ner_func(mean_income, mean_energy_spending)
  ) %>%
  group_by(TEN, `TEN-HFL`) %>%
  summarise(
    total_households = sum(households),
    nh_mean = weighted.mean(nh, households),
    neb = 1 / (1 + nh_mean),
    .groups = "drop"
  ) %>%
  arrange(desc(neb))

# View the top 10 tenure-heating fuel combinations with highest burden
head(housing_analysis, 10)
```

### Example: Energy Burden by Building Age and Type

```{r building-analysis, eval=FALSE}
# Analyze by building characteristics
building_analysis <- nc_housing %>%
  filter(!is.na(`TEN-YBL6`), !is.na(`TEN-BLD`)) %>%
  mutate(
    mean_income = total_income / households,
    mean_energy_spending = (total_electricity_spend +
                           coalesce(total_gas_spend, 0) +
                           coalesce(total_other_spend, 0)) / households,
    nh = ner_func(mean_income, mean_energy_spending)
  ) %>%
  group_by(`TEN-YBL6`, `TEN-BLD`) %>%
  summarise(
    total_households = sum(households),
    nh_mean = weighted.mean(nh, households),
    neb = 1 / (1 + nh_mean),
    .groups = "drop"
  )

# Identify building age/type combinations with highest burden
high_burden_buildings <- building_analysis %>%
  filter(neb > 0.06) %>%  # Above 6% burden threshold
  arrange(desc(neb))

print(high_burden_buildings)
```

### Key Insights from Housing Analysis

Housing characteristic analysis can reveal:

1. **Tenure effects**: How renters vs owners experience different energy burdens
2. **Heating fuel disparities**: Which fuel types create higher burden (often oil/propane)
3. **Building age impacts**: Whether older buildings show higher burden, for example because of poorer insulation
4. **Structure type patterns**: Multi-family vs single-family burden differences
5. **Vulnerable populations**: Combinations like "renter + old building + expensive fuel" often show extreme burden

This granular analysis helps target energy efficiency interventions to the housing types and populations that need them most.

## Next Steps

- See `vignette("methodology")` for mathematical details
- See `NEB_QUICKSTART.md` for quick reference
- Run example scripts in `analysis/scripts/` directory
- Read full documentation: `?energy_burden_func`, `?ner_func`

## References

- **Paper**: "Net energy metrics reveal striking disparities across United States household energy burdens"
- **LEAD Tool Data**: https://data.openei.org/
- **GitHub**: https://github.com/ericscheier/emburden