In this vignette, we discuss how to use robust estimators of location (and scale). The estimators are organized by i) estimating method, 2) population characteristic and 3) type of implemented function.
Bare-bone methods are stripped-down versions of the survey methods in
terms of functionality and informativeness. These functions may serve
users and package developers as building blocks. In particular,
bare-bone functions cannot compute variances. The survey methods are
much more capable and depend—for variance estimation—on the R package
survey
Lumley (2021, 2010).
First, we load the namespace of the robsurvey
package
and attach it to the search path.
The argument quietly = TRUE
suppresses the start-up
message in the call of library("robsurvey")
.
The losdata
are a simple random sample without
replacement of n = 70 patients from a (fictive) hospital population of N
= 2 479 patients in inpatient treatment.Note
2 Next, we attach the data to the search path.
The first 3 rows of the data are:
where
los
length-of-stay in hospital (days)weight
sampling weightfpc
population size (finite population correction)We consider estimating average length-of-stay in hospital (LOS, days).
In order to use the survey methods (not bare-bone
methods), we must attach the survey
package (Lumley, 2010, 2021) to the search
path
and specify a survey or sampling design object
> dn <- svydesign(ids = ~1, fpc = ~fpc, weights = ~weight, data = losdata,
+ calibrate.formula = ~1)
Note. Since version 4.2, the
survey package allows the definition of pre-calibrated
weights (see argument calibrate.formula
of the function
svydesign()
). This vignette uses this functionality (in
some places). If you have installed an earlier version of the
survey
package, this vignette will automatically fall back
to calling svydesign()
without the calibration
specification. See vignette Pre-calibrated
weights of the survey
package to learn more.
The distribution of variable los is skewed to the right (see boxplot), and we see a couple of rather heavy outliers. On the logarithmic scale, the distribution is slightly skewed to the right. The outliers need not be errors. Following Chambers (1986), we distinguish representative outliers from non-representative outliers.
The outliers visible in the boxplot refer to a few individuals who stayed for a long time in inpatient care. Moreover, we assume that these outliers represent patients in the population that are similar in value (i.e., representative outliers).
The outliers tend to inflate the variance of the weighted mean. Although the outliers are not some kind of error, it is beneficiary to use estimators other than the weighted mean to estimate the population average of los. We are interested in robust estimators which—although being biased as estimators of the population mean—will often have a smaller mean square error than the weighted mean; thus, are more efficient.
The following estimation methods are available.
weighted_mean_trimmed()
|
weighted_total_trimmed()
|
In place of the weighted mean, we consider the 5% one-sided trimmed
weighted population mean of los. The lower end of the distribution is
not trimmed (lower bound: LB = 0
). The 5% largest
observations are trimmed (upper bound: UB = 0.95
).
We obtain an estimate of (roughly) 9.3 days. Note that the return value is a scalar.
If a bare-bone method is called with argument
info = TRUE
, the function returns a list, the names of
which are shown below.
The survey methods are:
svymean_trimmed()
|
svytotal_trimmed()
|
As before, we are interested in computing the 5% one-sided trimmed
weighted population mean of los. In contrast to
weighted_mean_trimmed()
, the method
svymean_trimmed()
computes the standard error of the
estimate using the functionality of the survey
package.
The estimated location, variance, and standard error of the estimator
can be extracted from object m
with the following
commands.
The summary()
method summarizes the most important facts
about the estimate. The summary is particularly useful for
M-estimators (see below) but less so for other estimators.
> summary(m)
Weighted trimmed estimator (0, 0.95) of the population mean
mean SE
los 9.324 1.064
Sampling design:
Independent Sampling design
svydesign(ids = ~1, fpc = ~fpc, weights = ~weight, data = losdata,
calibrate.formula = ~1)
Additional utility functions are residuals()
,
fitted()
, and robweights()
. These functions
are mainly relevant for M-estimators.
The bare-bone methods are:
weighted_mean_winsorized()
|
weighted_total_winsorized()
|
weighted_mean_k_winsorized()
|
weighted_total_k_winsorized()
|
We compute the 5% one-sided winsorized weighted population mean of
los. The lower end of the distribution is not winsorized (lower bound:
LB = 0
). The 5% largest observations are winsorized (upper
bound: UB = 0.95
).
The one-sided winsorized estimators can also be specified in absolute terms by winsorizing a fixed number \(k=1,2,\ldots\) of observations. This estimator is called the one-sided k-winsorized mean (and total) and is computed as follows
The survey methods are:
svymean_winsorized()
|
svytotal_winsorized()
|
svymean_k_winsorized()
|
svytotal_k_winsorized()
|
The utility functions coef()
, vcov()
,
SE()
, summary()
, residuals()
,
fitted()
, and robweights()
are available.
Winsorization and trimming act directly on the values of an
estimator. Other estimation methods reduce the sampling weight of
potential outliers instead. A hybrid method of winsorization and weight
reduction to treat influential observations has been proposed by Dalén (1987). An observation \(y_i\) is called influential if its expanded
value, \(w_iy_i\), is exceedingly
large. Let \(c>0\) denote a
winsorization or censoring cutoff value. Dalén’s estimator
"Z2"
and "Z3"
of the population \(y\)-total are given by \(\sum_{i \in s} [w_i
y_i]_{\circ}^c\), where \(\circ\) is a placeholder for
"Z2"
or "Z3"
and \[
\begin{align*}
[w_i y_i]_{Z2}^c =
\begin{cases}
w_i y_i & \text{if} \quad w_i y_i \leq c, \\
c & \text{otherwise},
\end{cases}
&\qquad \text{and} \qquad
[w_i y_i]_{Z3}^c =
\begin{cases}
w_i y_i & \text{if} \quad w_i y_ \leq c, \\
c + (y_i - c/w_i) & \text{otherwise}.
\end{cases}
\end{align*}
\]
Estimator "Z2"
censors the terms \(w_iy_i\) at \(c\). In estimator "Z3"
,
observations \(y_i\) such that \(w_iy_i > c\) contribute to the estimated
total only with \(c\) plus the excess
over the cutoff, \((w_iy_i - c)\). Note
that the excess over the threshold has a weight of 1.0 (Lee, 1995). An estimator of the population \(y\)-mean obtains by dividing the estimator
of the estimated \(y\)-total by the
(estimated) population size.
From a practical point of view, the choice of constant \(c\) in Dalén’s estimators is rather tricky because we cannot only derive \(c\) from a large order statistic, say \(y_{(k)}\), \(k < n\) (like for trimming). Instead, the corresponding weight \(w_{(k)}\) needs to be taken into account.
The bare-bone methods are:
weighted_mean_dalen()
|
weighted_total_dalen()
|
The estimators "Z2"
and "Z3"
can be
specified by the argument type
; by default,
type = "Z2"
. The censoring threshold \(c\) is implemented as argument
censoring
.
The survey methods are:
svymean_dalen()
|
svytotal_dalen()
|
The utility functions coef()
, vcov()
,
SE()
, summary()
, residuals()
,
fitted()
, and robweights()
are available.
The bare-bone methods are:
weighted_mean_huber()
|
weighted_total_huber()
|
weighted_mean_tukey()
|
weighted_total_tukey()
|
The estimators with postfix _huber
and
_tukey
are based on, respectively, the Huber and Tukey
(biweight) \(\psi\)-function.
As losdata is a simple random sample, M-estimators of
type = "rht"
is the method of choice. Here, we compute the
Huber-type robust weighted M-estimator of the mean with
robustness tuning constant \(k=8\).
The function huber2()
is an implementation of the
weighted Huber proposal 2 estimator. It is only available as bare-bone
method.Note 3
The survey methods are:
svymean_huber()
|
svytotal_huber()
|
svymean_tukey()
|
svytotal_tukey()
|
The Huber M-estimator of the mean (and its standard error) can be computed with
The summary()
method summarizes the most important facts
about the M-estimate
> summary(m)
Huber M-estimator (type = rwm) of the population mean
mean SE
los 11.17 1.328
Robustness:
Psi-function: with k = 8
mean of robustness weights: 0.9877
Algorithm performance:
converged in 4 iterations
with residual scale (weighted MAD): 5.93
Sampling design:
Independent Sampling design
svydesign(ids = ~1, fpc = ~fpc, weights = ~weight, data = losdata,
calibrate.formula = ~1)
The estimated scale (weighted MAD) can be extracted with the
scale()
function. Additional utility functions are
coef()
, vcov()
, SE()
,
residuals()
, fitted()
, and
robweights()
. The following figure shows a plot of the
robustness weights against the residuals. We see that large residuals
are downweighted.
An adaptive M-estimator of the total (or mean) is defined by letting the data chose the tuning constant \(k\). Let \(\widehat{T}\) denote the weighted total, and let \(\widehat{T}_k\) be the Huber M-estimator of the weighted total with robustness tuning constant \(k\). Under quite general regularity conditions, the estimated mean square error (MSE) of \(\widehat{T}_k\) can be approximated by (see e.g., Gwet and Rivest, 1992; Hulliger, 1995)
\[\widehat{\mathrm{mse}}\big(\widehat{T}_{k}\big) \approx \mathrm{var}\big(\widehat{T}_{k}\big) +\big(\widehat{T} - \widehat{T}_{k}\big)^2.\]
The minimum estimated risk (MER) estimator (Hulliger, 1995) selects \(k\) such that \(\widehat{\mathrm{mse}}\big(\widehat{T}_{k}\big)\) is minimal (among all candidate estimators). Now, suppose that we have been working on the M-estimator with \(k=8\).
Next, we compute the MER, starting from the current
M-estimate, i.e., object m
.
> mer(m)
Search interval: [1, 10]
Minimum found for k = 10
Rel. efficiency gain: 35%
mean SE
los 11.46 1.478
Hence, the MER is 35% more efficient than the classical estimator as an estimator of the population total.
The weighted quantile (and median) can be computed by
> weighted_quantile(los, weight, probs = c(0.1, 0.9))
10% 90%
3 22
> weighted_median(los, weight)
50%
8
When all weights are equal, weighted_quantile()
is equal
to base::quantile()
with argument
type = 2
.
The normalized weighted median absolute deviations about the weighted median can be computed with
By default, the normalization constant to make the weighted MAD an
unbiased estimator of scale at the Gaussian core model is
constant = 1.482602
. This constant can be changed if
necessary.
The normalized weighted interquartile range can be computed with
By default, the normalization constant to make the weighted IQR an
unbiased estimator of scale at the Gaussian core model is
constant = 0.7413
. This constant can be changed if
necessary.
1 All bare-bone methods can be called with the argument
info = TRUE
to return a list with the following entries:
characteristic (e.g., mean), estimator (e.g., trimmed estimator),
estimate (numerical value), variance (by default: NA
),
robust (list of arguments that specify robustness), residuals (numerical
vector), model (list of data used for estimation), design (by default:
NA
), call.
2 We have constructed the losdata
as a
showcase; though, the LOS measurements are real data that we have taken
from the \(201\) observations in Ruffieux et al. (2000). Our losdata
are
a sample of size \(n = 70\) from the
\(201\) original observations.
3 The function huber2()
is is similar to
MASS::hubers()
(Venables and Ripley,
2002). It differs from the implementation in MASS
in that
it allows for weights and is initialized by the (normalized) weighted
interquartile range (IQR) not the median absolute deviations (MAD).
The paper of Chambers (1986) is the landmark paper about outliers in finite population sampling. Lee (1995) and Beaumont and Rivest (2009) are a good starting point to learn about robustness in finite population sampling.
Trimming and winsorization are discussed in Lee (1995) and Beaumont and Rivest (2009). The variance estimators of the weighted trimmed and winsorized estimators are straightforward adaptions of the classical estimators; see Huber and Ronchetti (2009, Chap. 3.3) or Serfling (1980, Chap. 8). A rigorous treatment in the context of finite population sampling can be found in Shao (1994).
Rao (1971) was among the first to propose weight reduction. Consider a sample of size \(n\), and suppose that the \(i\)th observation is an outlier. He suggested to reduce the outlier’s sampling weight \(w_i\) to one, and redistribute the weight difference \(w_i−1\) among the remaining observations. As a result, observation \(i\) does not represent other values like it. Dalén’s estimator offers a more general notion of weight reducution; see Dalén (1987) and also Chen et al. (2017).
In the context of finite population sampling, M-estimators were first studied by Chambers (1986). He investigated robust methods in the model- or prediction based framework of Royall and Cumberland (1981). Model-assisted estimators were introduced (for ratio estimation) by Gwet and Rivest (1992) and studied by Lee (1995), and Hulliger (1995, 1999, 2005). A recent comprehensive treatment can be found in Beaumont and Rivest (2009).
BEAUMONT, J.-F. AND RIVEST, L.-P. (2009). Dealing with outliers in survey data. In: Sample Surveys: Theory, Methods and Inference ed. by Pfeffermann, D. and Rao, C. R. Volume 29A of Handbook of Statistics, Amsterdam: Elsevier, Chap. 11, 247–280. DOI: 10.1016/S0169-7161(08)00011-4
CHAMBERS, R. (1986). Outlier Robust Finite Population Estimation. Journal of the American Statistical Association 81, 1063–1069, DOI: 10.1080/01621459.1986.10478374.
DALEN, J. (1987). Practical Estimators of a Population Total Which Reduce the Impact of Large Observations. Research report, Statistics Sweden, Stockholm.
CHEN, Q., ELLIOTT, M. R., HAZIZA, D., YANG, Y., GHOSH, M., LITTLE, R. J. A., SEDRANSK, J. AND THOMPSON, M. (2017). Approaches to Improving Survey-Weighted Estimates. Statistical Science 32, 227–248, DOI: 10.1214/17-STS609.
GWET, J.-P. AND RIVEST, L.-P. (1992). Outlier Resistant Alternatives to the Ratio Estimator. Journal of the American Statistical Association 87, 1174–1182, DOI: 10.1080/01621459.1992. 10476275.
HUBER, P. J. AND RONCHETTI, E. (2009). Robust Statistics, New York: John Wiley & Sons, 2nd edition, DOI: 10.1002/9780470434697.
HULLIGER, B. (2006). Horvitz–Thompson Estimators, Robustified. In: Encyclopedia of Statistical Sciences ed. by Kotz, S. Volume 5, Hoboken (NJ): John Wiley & Sons, 2nd edition, DOI: 10.1002/0471667196.ess1066.pub2.
HULLIGER, B. (1999). Simple and robust estimators for sampling. In: Proceedings of the Survey Research Methods Section, American Statistical Association, 54–63, American Statistical Association.
HULLIGER, B. (1995). Outlier Robust Horvitz–Thompson Estimators. Survey Methodology 21, 79–87.
LEE, H. (1995). Outliers in business surveys. In: Business survey methods ed. by Cox, B. G., Binder, D. A., Chinnappa, B. N., Christianson, A., Colledge, M. J. and Kott, P. S. New York: John Wiley & Sons, Chap. 26, 503–526, DOI: 10.1002/9781118150504.ch26.
LUMLEY, T. (2021). survey: analysis of complex survey samples. R package version 4.0, URL https://CRAN.R-project.org/package=survey.
LUMLEY, T. (2010). Complex Surveys: A Guide to Analysis Using R: A Guide to Analysis Using R, Hoboken (NJ): John Wiley & Sons.
RAO, J. N. K. (1971). Some Aspects of Statistical Inference in Problems of Sampling from Finite Populations. In: Foundations of Statistical Inference ed. by Godambe, V. P. and Sprott, D. A. Toronto: Holt, Rinehart, and Winston, 171–202.
ROYALL, R. M. AND CUMBERLAND, W. G. (1981). An Empirical Study of the Ratio Estimator and Estimators of its Variance. Journal of the American Statistical Association 76, 66–82, DOI: 10.2307/2287043.
RUFFIEUX, C., PACCAUD, F. AND MARAZZI, A. (2000). Comparing rules for truncating hospital length of stay. Casemix Quarterly 2, 3–11.
SHAO, J. (1994). L-Statistics in complex survey problems. The Annals of Statistics 22, 946–967, DOI: 10.1214/aos/1176325505.
SERFLING, R. J. (1980). Approximation theorems of mathematical statistics, New York: John Wiley & Sons. DOI: 10.1002/9780470316481.
VENABLES, W. N. AND RIPLEY, B. D. (2002). Modern Applied Statistics with S, New York: Springer-Verlag, 4th edition, DOI: 10.1007/978-0-387-21706-2.