--- title: "Deaths by Horse-kick and other Hazards in the Prussian Army" author: "Antony Unwin and Bill Venables" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: gda.bib vignette: > %\VignetteIndexEntry{Deaths by Horse-kick and other Hazards in the Prussian Army} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "", echo = FALSE, fig.height = 6, fig.width = 9, out.width = "100%" ) ``` ```{r setup, include=FALSE} library(knitr) library(Horsekicks) library(dplyr) library(tidyr) library(ggplot2) theme_set(theme_bw() + theme(plot.title = element_text(hjust = 0.5), strip.background = element_rect(fill = "antiquewhite"))) ``` ## Introduction In 1898 the Russian statistician Ladislaus von Bortkiewicz (1868-1931) used a dataset on deaths in Prussian Army Corps to illustrate what he called the _Law of Small Numbers_ (@vonBortk:1898). At the time Statistics could only be confidently applied to large datasets where, for example, Gaussian approximations might apply. Von Bortkiewicz wanted to show that methods could be devised to deal with datasets where such approximate methods clearly could _not_ be used as well. He took annual data on deaths due to horse-kicks in 14 Army Corps over 20 years from 1875 to 1894. They had been reported in the official state publication _Preussische Statistik_. At the start of the period a unified Germany had just been created following their victory over France in the Franco-Prussian war of 1870-71. Germany was not involved in a war again until 1914 when the First World War began, although from the 1880s there were colonisation expeditions in Africa and the western Pacific. In more recent times the preference for most intermediate to advanced textbooks has, for good reasons, shifted to larger and larger data sets. Nevertheless the fascination with historically interesting datasets remains even if they are small. One such dataset is the horse-kick data and deserves a fresh look with modern tools, particularly at the level of detail at which it was originally reported and not as a gross summary of about 5 or 6 frequencies. The original data set was of death by horse-kick only, and covered the 20 calendar years from 1875 to 1894. This package offers an extension of the data in two ways: Firstly, it extends the series to 1907, making it for 33 years rather than 20, and, secondly, it provides data in parallel on two further accidental causes of death, by falling from a horse or by drowning. The bigger story is set out in detail in the article (@unwin:2025) which this package is intended to accompany. ## A glimpse at the data The following table shows the first few lines of the data set, called `hkdeaths`. ```{r} kable(head(hkdeaths), align = "r") ``` Most of the column names are self-explanatory, some need further explanation: * __regiments__ is the number of regiments in the __corps__ as given by von Bortkiewicz. The __corps__ sizes are unequal and __regiments__ provides one relative measure of __corps__ size. * __NCOs__ is the number of non-commissioned officer deaths by horse-kick up till 1899. These deaths are included in the __kick__ total. * __vonB_kick__ is the original von Bortkiewicz data. It spans only the first 20 years, and contains a minor transcription error when compared to the original source, _Preussiche Statistik_. To get some idea of what is going on in the data, we can show the change over time of the corps annual aggregate deaths for each of the three causes. ```{r, echo = TRUE} hazards <- hkdeaths |> group_by(year) |> summarise(horse_kicks = sum(kick), falls = sum(fall), drownings = sum(drown), .groups = "drop") hazards_long <- hazards |> pivot_longer(cols = c(horse_kicks, falls, drownings), names_to = "cause", values_to = "deaths") ggplot(hazards_long) + aes(x = year, y = deaths, colour = cause) + geom_line(linewidth = 1.5) + scale_colour_viridis_d() + labs(title = "Annual Deaths from three Causes", x = "Year", y = "Deaths") ``` There is a suprisingly consistent fall in the death rate by drowning over the period, of about 2% per annum: ```{r, echo=TRUE} drown_1 <- glm(drownings ~ I(year - 1891), data = hazards, family = poisson(link = log)) exp(coef(drown_1)) |> rbind() |> as.data.frame(check.names = FALSE) |> kable(digits = 2) ``` ```{r} b <- exp(coef(drown_1)) ``` So the estimated mean deaths by drowning in the middle of the series, 1891, was `r round(b[1], 1)` and the estimated decline is `r round((1 - b[2])*100, 1)`% per annum. It is natural to ask if this decline in drowning death rate is also consistent across corps. This can be studied using a facet plot. ```{r, echo=TRUE} ggplot(hkdeaths) + aes(x = year, y = drown) + geom_line(colour = "#21908CFF", linewidth=1.0) + facet_wrap(~ corps) + geom_smooth(method = "glm", formula = y ~ I(x-1891), method.args = list(family = poisson), linewidth = 1.0, colour = "#440154FF", se = FALSE) + labs(x = "Year", y = "Deaths", title = "Annual Drowning Deaths per Corps") ``` ```{r, eval=FALSE} # ggplot(hkdeaths) + aes(x = year, y = drown) + # geom_line(colour = "#DF536B", linewidth=1.0) + facet_wrap(~ corps) + # geom_smooth(method = "glm", formula = y ~ I(x-1891), method.args = list(family = poisson), # linewidth = 2, colour = "#2297E6") + labs(y = "Deaths by drowning") ``` Perhaps corps XIV is the one exception, showing a slight _increase_ in drowning deaths per annum over the period. All other corps show a decline to some degree. ## References