statlingua

Lifecycle: experimental

The statlingua R package bridges the gap between complex statistical output and clear, human-readable explanations. Using Large Language Models (LLMs), statlingua translates the dense jargon of statistical models (coefficients, p-values, model fit indices, and more) into straightforward, context-aware natural language.

Whether you’re a student grappling with new statistical concepts, a researcher needing to communicate findings to a broader audience, or a data scientist looking to quickly draft reports, statlingua makes your statistical journey smoother and more accessible.

Why statlingua?

Statistical models are powerful, but their outputs can be intimidating. statlingua empowers you to:

By providing clear and contextualized explanations, statlingua helps you focus on the implications of your findings rather than getting bogged down in technical minutiae.

Supported Models

As of now, statlingua explicitly supports a variety of common statistical models in R, including:

Installation

statlingua is not yet on CRAN, but you can install the development version from GitHub:

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("bgreenwell/statlingua")

You’ll also need to install the ellmer package, which you can obtain from CRAN:

install.packages("ellmer")  # >= 0.2.0
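
If you are unsure whether a suitable ellmer is already installed, a quick check like the following can help. This is only a sketch: the 0.2.0 minimum is taken from the install comment above, and the variable names are illustrative.

```r
# Minimum ellmer version, taken from the install comment above
min_version <- "0.2.0"

# TRUE only if ellmer is installed and at least min_version
ellmer_ok <- requireNamespace("ellmer", quietly = TRUE) &&
  utils::packageVersion("ellmer") >= min_version

if (!ellmer_ok) {
  message("Please install or update ellmer (>= ", min_version, ").")
}
```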

API Key Setup & ellmer Dependency

statlingua doesn’t directly handle API keys or LLM communication. It acts as a sophisticated prompt engineering toolkit that prepares inputs and then passes them to ellmer. The ellmer package is responsible for interfacing with various LLM providers (e.g., OpenAI, Google AI Studio, Anthropic).

Please refer to the ellmer package documentation for detailed instructions on setting up API keys and connecting to your chosen LLM provider.

Once ellmer is installed and has access to an LLM provider, statlingua will seamlessly leverage that connection.
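
Before creating a client, it can be useful to confirm that a key is actually visible to your R session. The sketch below only inspects environment variables and makes no LLM calls. GEMINI_API_KEY is the variable used in the quick example below; the OpenAI and Anthropic names are assumed from common convention, so check the ellmer documentation for your provider.

```r
# Environment variables that hold provider API keys
# (assumed names; confirm them in the ellmer documentation)
provider_keys <- c(
  openai    = "OPENAI_API_KEY",
  anthropic = "ANTHROPIC_API_KEY",
  gemini    = "GEMINI_API_KEY"
)

# TRUE for each provider whose key is set to a non-empty value
key_is_set <- vapply(
  provider_keys,
  function(var) nzchar(Sys.getenv(var)),
  logical(1)
)
print(key_is_set)
```

Storing keys in your user-level .Renviron file rather than in scripts keeps them out of version control.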

Quick Example: Explaining a Linear Model

# Ensure you have an appropriate API key set up first!
# Sys.setenv(GEMINI_API_KEY = "<YOUR_API_KEY_HERE>") 

library(statlingua)

# Fit a polynomial regression model
fm_cars <- lm(dist ~ poly(speed, degree = 2), data = cars)
summary(fm_cars)

# Define some context (highly recommended!)
cars_context <- "
This model analyzes the 'cars' dataset from the 1920s. Variables include:
  * 'dist' - The distance (in feet) taken to stop.
  * 'speed' - The speed of the car (in mph).
We want to understand how speed affects stopping distance in the model.
"

# Establish connection to an LLM provider (in this case, Google Gemini)
client <- ellmer::chat_google_gemini(echo = "none")  # defaults to gemini-2.0-flash

# Get an explanation
explain(
  fm_cars,                 # model for LLM to interpret/explain
  client = client,         # connection to LLM provider
  context = cars_context,  # additional context for LLM to consider
  audience = "student",    # target audience
  verbosity = "detailed",  # level of detail
  style = "markdown"       # output style
)

# Ask a follow-up question
client$chat(
  "How can I construct confidence intervals for each coefficient in the model?"
)

For more examples, including output, see the introductory vignette.

Extending statlingua to Support New Models

One of statlingua’s core strengths is its extensibility. You can add or customize support for new statistical model types by crafting specific prompt components. The system prompt sent to the LLM is dynamically assembled from several markdown files located in the inst/prompts/ directory of the package.
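
To make the idea of assembling a prompt from markdown pieces concrete, here is a minimal sketch. It is not statlingua's internal implementation: assemble_prompt() and the file names are hypothetical, and the files are written to a temporary directory rather than to inst/prompts/.

```r
# Hypothetical helper: read each markdown file and join the pieces,
# separated by blank lines, into a single system prompt string
assemble_prompt <- function(files) {
  pieces <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1)
  )
  paste(pieces, collapse = "\n\n")
}

# Demo with throwaway files in a temporary directory
prompt_dir <- file.path(tempdir(), "prompts_demo")
dir.create(prompt_dir, showWarnings = FALSE)
role_file <- file.path(prompt_dir, "role.md")
instr_file <- file.path(prompt_dir, "instructions.md")
writeLines("You are a helpful statistician.", role_file)
writeLines("Explain the model output step by step.", instr_file)

sys_prompt <- assemble_prompt(c(role_file, instr_file))
cat(sys_prompt, "\n")
```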

The main function explain() uses S3 dispatch. When explain(my_model_object, ...) is called, R looks for a method like explain.class_of_my_model_object(). If not found, explain.default() is used.
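
The dispatch rule described above can be seen with a toy generic. The sketch uses a hypothetical describe() generic, not statlingua's explain(), so it runs in plain R without the package or an LLM connection.

```r
# Hypothetical generic; named describe() to avoid clashing with explain()
describe <- function(object, ...) {
  UseMethod("describe")
}

# Fallback used when no class-specific method exists
describe.default <- function(object, ...) {
  "generic description"
}

# Method chosen for objects whose class vector includes "lm"
describe.lm <- function(object, ...) {
  "linear model description"
}

fit <- lm(dist ~ speed, data = cars)
print(describe(fit))   # dispatches to describe.lm
print(describe(1:10))  # no describe.integer, so describe.default is used
```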

Prompt Directory Structure

The prompts are organized as follows within inst/prompts/:

Example: Adding Support for vglm from the VGAM package

Let’s imagine you want to add dedicated support for vglm (Vector Generalized Linear Models) objects from the VGAM package.

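  1. Create New Prompt Files: You would create a new directory inst/prompts/models/vglm/. Inside this directory, you’d add the model-specific prompt files, such as the instructions.md and role_specific.md files referred to in step 4.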

  2. Implement the S3 Method: Add an S3 method for explain.vglm in an R script (e.g., R/explain_vglm.R):

    #' Explain a vglm object
    #'
    #' @inheritParams explain
    #' @param object A \code{vglm} object.
    #' @export
    explain.vglm <- function(
        object,
        client,
        context = NULL,
        audience = c("novice", "student", "researcher", "manager", "domain_expert"),
        verbosity = c("moderate", "brief", "detailed"),
        style = c("markdown", "html", "json", "text", "latex"),
        ...
      ) {
      audience <- match.arg(audience)
      verbosity <- match.arg(verbosity)
      style <- match.arg(style)
    
      # Use the internal .explain_core helper if it suits,
      # or implement custom logic if vglm needs special handling.
      # .explain_core handles system prompt assembly, user prompt building,
      # and calling the LLM via the client.
      # 'name' should match the directory name in inst/prompts/models/
      # 'model_description' is what's shown to the user in the prompt.
      .explain_core(
        object = object,
        client = client,
        context = context,
        audience = audience,
        verbosity = verbosity,
        style = style,
        name = "vglm", # This tells .assemble_sys_prompt to look in inst/prompts/models/vglm/
        model_description = "Vector Generalized Linear Model (VGLM) from VGAM"
      )
    }

    The summarize.vglm method might also need to be implemented in R/summarize.R if summary(object) for vglm needs special capture or formatting for the LLM. If utils::capture.output(summary(object)) is sufficient, summarize.default might work initially.

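  3. Add to NAMESPACE and Document: Make sure the new method is exported in NAMESPACE (the @export tag in the roxygen2 comments above generates the entry when the documentation is rebuilt) and that the method is documented.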

  4. Testing: Thoroughly test with various vglm examples. You might need to iterate on your instructions.md and role_specific.md to refine the LLM’s explanations.
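
For step 2, one way to judge whether the default capture is good enough is to look at what utils::capture.output(summary(object)) actually returns for your model. The sketch below uses an lm fit as a stand-in so that it runs without VGAM. How summarize.default post-processes this text is not shown here, so treat the collapsing step as an assumption.

```r
# Stand-in model; replace with a fitted vglm object when VGAM is available
fit <- lm(dist ~ speed, data = cars)

# Capture the printed summary as a character vector, one element per line
summary_lines <- utils::capture.output(summary(fit))

# Collapse into a single string that could be embedded in a prompt
summary_text <- paste(summary_lines, collapse = "\n")
cat(summary_text, "\n")
```

If the captured text omits quantities you want the LLM to interpret, that is a sign a dedicated summarize method may be worth writing.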

By following this pattern, statlingua can be systematically extended to cover a vast array of statistical models in R!

Contributing

Contributions are welcome! Please see the GitHub issues for areas where you can help.

License

statlingua is available under the GNU General Public License v3.0 (GNU GPLv3). See the LICENSE.md file for more details.