This document gives a description of the functionality of the HardyWeinberg package (Graffelman (2015)); it essentially reproduces and extends the “example session” described in the aformentioned article, which is accessible via the vignette of the package. References for the statistical procedures that are used can be found in the vignette and in the help files of the functions of the package.
We subsequently address:
2.1. Genotype and allele counts
2.2. Testing for Hardy-Weinberg equilibrium
2.2.1. Autosomal markers
2.2.2. X-chromosomal markers
2.3. Special topics
2.3.1. Missing genotype data
2.3.2. Power computation
2.3.3. Population substructure
3.2. Testing HWE with many variants
3.3. Visualising the HWE test results
3.3.1. Ternary diagrams
3.3.2. QQ plots
4.1. Triallelic variants
The package is installed with the instructions
and its vignette can be consulted with
vignette("HardyWeinberg")
.
We use a sample of 1000 individuals genotyped for the MN blood group
locus as an example. We store the genotype counts (298, 489 and 213 for
MM, MN and NN respectively) in a vector x
with named
elements:
Most functions of the HardyWeinberg package will accept the counts in
any order, as long as they are labelled with two letters. If counts are
not labelled, the default order (AA,AB,BB) is assumed. The allele
frequency of the M allele can be calculated using the given order of the
genotypes with af(x)
, though we usually calculate the minor
allele frequency (MAF) with
The function maf
can also be used, with its
option
argument, to extract all sorted allele counts and
their relative frequencies.
Genotype frequencies can be conveniently visualised in a ternary
diagram (in genetics also known as a de Finetti diagram) with function
HWTernaryPlot
:
We conduct several classical tests for Hardy-Weinberg equilibrium, addressing both autosomal and X chromosomal tests.
We use the MN blood group locus to illustrate autosomal procedures.
The classical chi-square test can be carried out using
HWChisq
:
HW.test <- HWChisq(x, verbose = TRUE)
#> Chi-square test with continuity correction for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 = 0.1789563 DF = 1 p-value = 0.6722717 D = -3.69375 f = 0.01488253
This shows that the chi-square statistic has value 0.1790, and that
the corresponding p-value for the test is 0.6723. Using a significance
threshold of five percent, we do not reject HWE for the MN locus. When
verbose
is set to FALSE
(default) the test is
silent, and HW.test
is a list object containing the results
of the test (chi-square statistic, the p-value of the test, half the
deviation from HWE (D
) for the heterozygote (\(D = \frac{1}{2} (f_{AB} - e_{AB}\))), the
minor allele frequency (p
), the inbreeding coefficient
f
and the expected counts under the equilibrium
assumption). By default, HWChisq
applies a continuity
correction. This is not recommended for low minor allele frequencies. In
order to perform a chi-square test without Yates’ continuity correction,
it is necessary to set the cc
parameter to zero:
HW.test <- HWChisq(x, cc = 0, verbose = TRUE)
#> Chi-square test for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 = 0.2214896 DF = 1 p-value = 0.6379073 D = -3.69375 f = 0.01488253
The test with correction gives a smaller \(\chi^2\)-statistic and a larger p-value in comparison with the ordinary \(\chi^2\) test.
The likelihood ratio test (LRT) for HWE can be performed with
HWLratio
:
HW.lrtest <- HWLratio(x, verbose = TRUE)
#> Likelihood ratio test for Hardy-Weinberg equilibrium
#> G2 = 0.2214663 DF = 1 p-value = 0.637925
Note that the \(G^2\)-statistic and the p-value obtained are very close to the chi-square statistic and its p-value.
An exact test for HWE can be performed by using routine
HWExact
:
HW.exacttest <- HWExact(x, verbose = TRUE, pvaluetype = "midp")
#> Haldane Exact test for Hardy-Weinberg equilibrium (autosomal)
#> using MID p-value
#> sample counts: nMM = 298 nMN = 489 nNN = 213
#> H0: HWE (D==0), H1: D <> 0
#> D = -3.69375 p-value = 0.6330965
The exact test leads to the same conclusion, we do not reject HWE
(mid p-value = 0.6331). Both one-sided and two-sided exact tests are
possible by using the argument alternative
, which can be
set to "two.sided"
, "greater"
, or
"less"
. Three different ways of computing the p-value of an
exact test are implemented, and can be specified by the
pvaluetype
argument, which can be set to standard or
selome
(sum equally likely or more extreme) p-value, the
dost
(double one-sided tail probability) p-value, or the
better powered midp
p-value. By default
HWExact
calculates a standard exact p-value, though the mid
p-value is recommended for having better statistical properties. The
exact test is based on a recursive algorithm. For very large samples, R
may give an error message “evaluation nested too deeply: infinite
recursion”. This can usually be resolved by increasing R’s limit on the
number of nested expressions with
options(expressions = 10000)
prior to calling
HWExact
. See ?HWExact
for more information on
this issue.
The permutation test for HWE can be run by using
HWPerm
:
set.seed(123)
#HW.permutationtest <- HWPerm(x, verbose = TRUE)
#Permutation test for Hardy-Weinberg equilibrium
#Observed statistic: 0.2214896 17000 permutations. p-value: 0.6508235
and the number of permutations can be specified via the
nperm
argument. By default the chi-square statistic will be
used as the test statistic, but alternative statistics may be supplied
by the FUN
argument.
If, for some reason, the equilibrium status of a particular marker is
at stake, all frequentist tests can be run to see to what extent they do
agree or disagree. Function HWAlltests
performs all tests
with one call and returns a table of all p-values.
#HW.results <- HWAlltests(x, verbose = TRUE, include.permutation.test = TRUE)
# Statistic p-value
#Chi-square test: 0.2214896 0.6379073
#Chi-square test with continuity correction: 0.1789563 0.6722717
#Likelihood-ratio test: 0.2214663 0.6379250
#Exact test with selome p-value: NA 0.6556635
#Exact test with dost p-value: NA 0.6723356
#Exact test with mid p-value: NA 0.6330965
#Permutation test: 0.2214896 0.6508235
The MN data concern a large sample (\(n =\) 1000) with an intermediate allele frequency (p = 0.4575), for which all test results closely agree. For smaller samples and more extreme allele frequencies, larger differences between the tests are typically observed.
A Bayesian procedure for HWE consists of the calculation of the posterior distribution for Lindley’s disequilibrium parameter \(\alpha\). We calculate and plot this density, representing the equilibrium situation \(\alpha = 0\) by vertical red line.
post.dens <- HWLindley(seq(-1,1,by=0.01),x)
plot(seq(-1,1,by=0.01),post.dens,type="l",xlab=expression(alpha),
ylab=expression(pi(alpha)))
segments(0,0,0,HWLindley(0,x),lty="dotted",col="red")
Alternatively, we calculate a Bayesian 95% credible interval with
The results of the Bayesian analysis neither produce evidence against the equilibrium hypothesis.
We perform HWE tests for X-chromosomal markers, using a vector of five elements containing both male and female genotype counts. Hemizygous male genotype counts should be labelled with a single letter, diploid female counts with two. We consider the X-chromosomal single nucleotide polymorphism (SNP):
HWChisq(SNP1,cc=0,x.linked=TRUE,verbose=TRUE)
#> Chi-square test for Hardy-Weinberg equilibrium (X-chromosomal)
#> Chi2 = 7.624175 DF = 2 p-value = 0.022102 D = NA f = -0.0003817242
When males are excluded from the test we get:
HWChisq(SNP1[3:5],cc=0)
#> Chi-square test for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 = 9.485941e-05 DF = 1 p-value = 0.9922291 D = 0.05990783 f = -0.0003817242
Note that the test including males is significant, whereas the test excluding males is not.
The X-chromosomal likelihood ratio test gives:
HWLratio(SNP1,x.linked=TRUE)
#> Likelihood ratio test for Hardy-Weinberg equilibrium for an X-linked marker
#> G2 = 7.693436 DF = 2 p-value = 0.02134969
and is again very similar to the chi-square test.
The exact test for HWE for an X-chromosomal marker can be performed
by adding the x.linked=TRUE
option:
HWExact(SNP1,x.linked=TRUE)
#> Graffelman-Weir exact test for Hardy-Weinberg equilibrium on the X-chromosome
#> using SELOME p-value
#> Sample probability 5.682963e-05 p-value = 0.02085798
which gives a p-value similar to the \(\chi^2\) test. When the mid p-value is used we obtain
HWExact(SNP1,x.linked=TRUE,pvaluetype="midp")
#> Graffelman-Weir exact test for Hardy-Weinberg equilibrium on the X-chromosome
#> using MID p-value
#> Sample probability 5.682963e-05 p-value = 0.02082957
These exact tests show that the joint null of Hardy-Weinberg proportions and equality of allele frequencies has to be rejected. An exact test using the females only gives again a non-significant result:
The permutation test for X-linked markers gives
#HWPerm(SNP1,x.linked=TRUE)
#Permutation test for Hardy-Weinberg equilibrium of an X-linked marker
#Observed statistic: 7.624175 17000 permutations. p-value: 0.02211765
A summary of all frequentist X-chromosomal tests is obtained by
#HWAlltests(SNP1,x.linked=TRUE,include.permutation.test=TRUE)
# Statistic p-value
#Chi-square test: 7.624175 0.02210200
#Chi-square test with continuity correction: 7.242011 0.02675576
#Likelihood-ratio test: 7.693436 0.02134969
#Exact test with selome p-value: NA 0.02085798
#Exact test with dost p-value: NA NA
#Exact test with mid p-value: NA 0.02082957
#Permutation test: 7.624175 0.02211765
Results of all tests are similar. Finally, we test equality of allele frequencies in males and females with an exact test:
AFtest(SNP1)
#> Fisher Exact test for equality of allele frequencies for males and females.
#>
#> Table of allele counts:
#> A B
#> M 399 205
#> F 774 528
#>
#> Sample of 1255 indivduals with 1906 alleles. p-value = 0.006268363
For this SNP, there is a significant difference in allele frequency between males and females.
A Bayesian test for HWE for variants on the X-chromosome is
implemented in the function HWPosterior
. A Bayesian
analysis of the same SNP is obtained by:
HWPosterior(males=SNP1[1:2],females=SNP1[3:5],x.linked=TRUE)
#> Bayesian test for Hardy-Weinberg equilibrium of X-chromosomal variants.
#>
#> Posterior_Prob log10(Bayes Factor)
#> M0 (HWE): 0.3384 0.1859
#> M1 (f!=0): 0.0138 -1.3774
#> M2 (d!=1): 0.6222 0.6939
#> M3 (f!=0 & d!=1:) 0.0256 -1.1035
and shows that a model with Hardy-Weinberg proportions for females and different allele frequencies for both sexes has the largest posterior probability, and the largest Bayes factor, which is in accordance with the previous frequentist procedures.
We indicate how to test for HWE when there is missing genotype data.
We use the data set Markers
for that purpose.
data(Markers)
Markers[1:12,]
#> SNP1 iG iT SNP2 SNP3
#> 1 TT 641 1037 AA GG
#> 2 GT 1207 957 AC AG
#> 3 TT 1058 1686 AA GG
#> 4 GG 1348 466 CC AA
#> 5 GT 1176 948 AC AG
#> 6 GG 1906 912 CC AA
#> 7 GG 1844 705 CC AA
#> 8 GG 2007 599 CC AA
#> 9 GT 1369 1018 AC AG
#> 10 GG 1936 953 CC AA
#> 11 GG 1952 632 AC AG
#> 12 <NA> 947 920 AC AG
Note that this data is at the level of each individual. Dataframe
Markers
contains one SNP with missings (SNP1
),
the two allele intensities of that SNP (iG
and
iT
) and two covariate markers (SNP2
and
SNP3
). Here, the covariates have no missing values. We
first test SNP1
for HWE using a chi-square test and
ignoring the missing genotypes:
Xt <- table(Markers[,1])
Xv <- as.vector(Xt)
names(Xv) <- names(Xt)
Xv
#> GG GT TT
#> 46 32 20
HW.test <- HWChisq(Xv,cc=0,verbose=TRUE)
#> Chi-square test for Hardy-Weinberg equilibrium (autosomal)
#> Chi2 = 8.67309 DF = 1 p-value = 0.003229431 D = -6.77551 f = 0.297491
This gives a significant result (p-value = 0.0032). If the data can
be assumed to be missing completely at random (MCAR), then we may impute
missings by randomly sampling the observed data. This can be done by
supplying the method = "sample"
argument, and we create 50
imputed data sets (m = 50
).
set.seed(123)
Results <- HWMissing(Markers[,1], m = 50, method = "sample", verbose=TRUE)
#> Test for Hardy-Weinberg equilibrium in the presence of missing values
#> Inbreeding coefficient f = 0.2936
#> 95 % Confidence interval ( 0.1058 , 0.4813 )
#> p-value = 0.0022
#> Relative increase in variance of f due to missings: r = 0.3351
#> Fraction of missing information about f: lambda = 0.2529
As could be expected, the conclusion is the same: there is
significant deviation from HWE (p-value = 0.0022). It will make more
sense to take advantage of variables that are correlated with
SNP1
, and use multiple imputation of the missings of
SNP1
using a multinomial logit model. The multinomial logit
model will be used when we set method = "polyreg"
or leave
the method
argument out, since "polyreg"
is
the default for imputation of factor variables by means of a multinomial
logit model used by R package mice. We test
SNP1
(with missings) for HWE, using a multinomial logit
model to impute SNP1
using information from the allele
intensities iG
and iT
and the neighbouring
markers SNP2
and SNP3
.
set.seed(123)
Results <- HWMissing(Markers[, 1:5], m = 50, verbose = TRUE)
#> Warning: Number of logged events: 1
#> Test for Hardy-Weinberg equilibrium in the presence of missing values
#> Inbreeding coefficient f = 0.0608
#> 95 % Confidence interval ( -0.1061 , 0.2278 )
#> p-value = 0.4751
#> Relative increase in variance of f due to missings: r = 0.0596
#> Fraction of missing information about f: lambda = 0.0564
Note the sharp drop of the inbreeding coefficient, and the missing
data statistics \(\lambda\) and \(r\). The test is now not significant
(p-value = 0.4751). Exact inference for HWE with missings is possible by
setting the argument statistic="exact"
. This gives the
result
set.seed(123)
Results <- HWMissing(Markers[, 1:5], m = 50, statistic = "exact", verbose = TRUE)
#> Warning: Number of logged events: 1
#> Two-sided Exact test for Hardy-Weinberg equilibrium in the presence of missing values
#> p-value = 0.4426941
and a similar p-value is obtained.
Tests for HWE have low power for small samples with a low minor
allele frequency, or samples that deviate only moderately from HWE. It
is therefore important to be able to compute power. The function
HWPower
can be used to compute the power of a test for HWE.
If its disequilibrium argument theta
(the effect size), is
set to 4 (the default value), then the function computes the Type I
error rate for the test; the disequilibrium parameter \(\theta\) is defined as \(f_{AB}^2/(f_{AA} \cdot f_{BB})\).
Function mac
is used to compute the minor allele count.
We compute the power for the data on the MN locus:
x <- c(MM = 298, MN = 489, NN = 213)
n <- sum(x)
nM <- mac(x)
pw4 <- HWPower(n, nM, alpha = 0.05, test = "exact", theta = 4,
pvaluetype = "selome")
print(pw4)
#> [1] 0.04822774
pw8 <- HWPower(n, nM, alpha = 0.05, test = "exact", theta = 8,
pvaluetype = "selome")
print(pw8)
#> [1] 0.9996853
These computations show that for a large sample like this one, the
Type I error rate (0.0482) is very close to the nominal rate, 0.05, and
that the standard exact test has good power (0.9997) for detecting
deviations as large \(\theta=8\), which
is a doubling of the number of heterozygotes with respect to HWE. Type I
error rate and power for the chi-square test can be calculated by
setting test="chisq"
. With the allele frequency of this
sample (0.5425), \(\theta=8\) amounts
to an inbreeding coefficient of -0.1698
When a sample consists of individuals stemming from different source
populations characterised by different allele frequencies, tests for HWE
often end up significant, indicating in fact that the assumption of a
homogeneous allele frequency is not satisfied. It is recommended to test
for HWE in a stratified manner for each source population, if the
provenance of the individuals is known. Asymptotic procedures to test
HWE across multiple samples genotyped for the same biallelic
marker have been developed and are available in function
HWStrata
.
We use the Glyoxalasa polymorphism to illustrate a test across strata.
data("Glyoxalase")
Glyoxalase <- as.matrix(Glyoxalase)
HWStrata(Glyoxalase)
#> Olson's asymptotic test for HWE across strata for a single biallelic marker.
#> T2 = 0.003521695 p-value = 0.9526782
This test does not reject HWE for the entire sample. When stratified
testing is applied using HWExactStats
, a significant
deviation is found for one population.
Whenever a sample is known to consist of individuals of two (or more)
subpopulations, the relative contributions of the subpopulations to the
mixed sample can be estimated by the EM algorithm, if the genotype or
allele frequencies of the source populations are known. This is
implemented for a biallelic SNP with individuals from two subpopulations
in function HWEM
.
We estimate the contributions of two known populations to a sample of
genotype frequencies that is a mix of these two populations. The
genotype frequencies of the two source populations are stored in vectors
g1
and g2
, and the mixed sample in
x
.
g1 <- c(AA=0.034, AB=0.330, BB=0.636)
g2 <- c(AA=0.349, AB=0.493, BB=0.158)
x <- c(AA=0.270, AB=0.453, BB=0.277)
Their contributions are estimated using HWEM
G <- cbind(g1,g2)
contributions <- HWEM(x,G=G)
contributions
#> Group1 Group2
#> 0.2511847 0.7488153
About 25 percent is estimated to stem for the first population, and 75 percent from the second. If only the allele frequencies of the source populations are known, contributions can be estimated by supplying the allele frequencies and assuming Hardy-Weinberg proportions:
Autosomal tests for HWE assume equality of allele frequencies in the
sexes. When sex is taken into account, several scenarios are possible,
according to whether males of females genotypes satisfy the assumptions
of equal allele frequencies and HWE or not. The function
HWPosterior
can be used to perform Bayesian model selection
using the posterior probability of each scenario. We consider an example
using a SNP of the JPT sample taken from the 1000G project, using their
male and female genotype counts.
data(JPTsnps)
JPTsnps[1,]
#> AA AB BB AA AB BB
#> 46 10 0 40 8 0
Results <- HWPosterior(males=JPTsnps[1,1:3],
females=JPTsnps[1,4:6],
x.linked=FALSE,precision=0.05)
#> M_11 M_12 M_13 M_14 M_15 M_21 M_22 M_23 M_24 M_25
#> 0.5571 0.0463 0.0349 0.2384 0.0075 0.0620 0.0211 0.0226 0.0025 0.0077
#> Best fitting M_11 0.5570735
The results show that for this variant, equality of allele
frequencies in the sexes and HWP for both sexes (model \(M_{11}\)) is the model with the largest
probability. For more accurate results, higher precision of posterior
probabilities can be obtained by specifying
precision=0.005
, at the expense of increasing the
computation time.
Alternatively, the same variant can be analysed by calculating Akaike’s information criterion (AIC) for each scenario. This is achieved by
data(JPTsnps)
AICs <- HWAIC(JPTsnps[1,1:3],JPTsnps[1,4:6])
#> Best fitting M_11 99.54001
AICs
#> M_11 M_12 M_13 M_14 M_15 M_21 M_22 M_23
#> 99.54001 100.81297 100.55911 99.83219 101.83219 101.51680 102.78852 102.53483
#> M_24 M_25
#> 101.83219 103.80656
In this case, the AIC criterion identifies the same \(M_{11}\) model as the best fitting model.
We use the data set CEUchr22
to illustrate biallelic
procedures with multiple SNPs. The dataset CEUchr22
contains 10,000 SNPs sampled at random from chromosome 22 of the CEU
population from the 1000 genomes project.
data("CEUchr22")
CEUchr22[1:5,1:5]
#> SNP1 SNP2 SNP3 SNP4 SNP5
#> NA06984 0 0 0 0 0
#> NA06985 0 0 0 0 0
#> NA06986 0 0 0 0 0
#> NA06989 0 0 0 0 0
#> NA06994 0 0 0 0 0
This data is in (0,1,2) format and we first construct a matrix with the three genotype counts for each SNP.
Z <- MakeCounts(CEUchr22)
Z <- Z[,1:3]
head(Z)
#> AA AB BB
#> SNP1 99 0 0
#> SNP2 99 0 0
#> SNP3 99 0 0
#> SNP4 99 0 0
#> SNP5 99 0 0
#> SNP6 99 0 0
We use a ternary diagram with a colour ramp for the frequency of the genotype count patterns.
This shows that SNPs that are monomorphic for the reference allele (A) make up about 82% of the database. We redo this plot filtering out SNPs with a minor allele frequency (MAF) below 5%:
This reveals that SNPs with a zero count for the BB homozygote and a varying low count for AB are relatively more common. We study the distribution of the MAF with a histogram, excluding the monomorphics:
This shows the typical pattern of more frequent low MAF
polymorphisms. We can also use function maf
to extract the
minor allele count for each SNP, and represent these in a barplot:
This shows, as expected, that SNPs with just one, two or three copies of the minor allele are most common.
The aforementioned functions HWChisq
,
HWLratio
, HWExact
, HWPerm
all
test a single biallelic marker for HWE. If the genotype counts AA, AB,
BB are collected in a three-column matrix, with each row representing a
marker then large sets of markers can be tested most efficiently with
the functions HWChisqStats
for the chi-square test, and
with HWExactStats
for the exact tests. These routines
return the p-values or test statistics for each marker. These functions
have fewer options but are computationally better optimized. Both
functions allow for X-linked markers via the x.linked
argument. Exact tests that rely on exhaustive enumeration are slow in R,
and HWExactStats
uses by default faster C++ code generously
shared by Christopher Chang. The same C++ code is used in the current
version (2.0) of the PLINK software (https://www.cog-genomics.org/plink/2.0/). We apply these
functions to the previously obtained genotype counts of the
CEUChr22
data, using only polymorphic SNPs:
Zpoly <- Z[!is.mono(Z),]
npoly <- nrow(Zpoly)
chisq.pvalues <- HWChisqStats(Zpoly,pvalues=TRUE)
exact.pvalues <- HWExactStats(Zpoly,midp=TRUE)
bonferronithreshold <- 0.05/npoly
sum(chisq.pvalues < bonferronithreshold)
#> [1] 24
sum(exact.pvalues < bonferronithreshold)
#> [1] 9
If a (conservative) Bonferroni correction is applied, some highly significant SNPs are detected. The exact test detects fewer, as it is more conservative. A set of low MAF SNPs is significant in a chi-square test, but not in the exact test.
Genetic association studies, genome-wide association studies in
particular, use many genetic markers. In this context graphics such as
ternary plots, log-ratio plots and QQ plots become particularly useful,
because they can reveal whether HWE is a reasonable assumption for the
whole data set. We begin to explore the CEUChr22
SNPs by
making a ternary plot.
Zu <- UniqueGenotypeCounts(Z)
#> 10000 rows in X
#> 549 unique rows in X
Zu <- Zu[,1:3]
HWTernaryPlot(Zu,region=1,pch=1)
The curves around the Hardy-Weinberg parabola delimit the acceptance
region for a chi-square test for HWE, with a default significance
threshold of five percent. A number of SNPs is seen to have significant
deviation from HWP, either for having an excess or a lack of
heterozygotes. One SNP is seen to consist almost entirely of
heterozygotes. The acceptance region of the exact test can be shown by
setting region=7
.
The exact test acceptance region is wiggly due to the discrete nature of the exact test. This region shows fewer significant markers, notably towards the homozygote vertices, and illustrates that the exact test is more conservative than the chi-square test.
For large databases of SNPs, drawing the ternary plot can be time
consuming. Usually the matrix with genotype counts contains several rows
with the same counts. The ternary plot can be constructed faster by
plotting only the unique rows of the count matrix. Function
UniqueGenotypeCounts
, illustrated above, extracts the
unique rows of the count matrix and also counts their frequency.
When many statistical tests are performed, the distribution of the
obtained p-values is of interest. For tests based on continuous test
statistics, under the null hypothesis the distribution of the p-values
is expected to be uniform. We exclude monomorphic variants using the
logical function is.mono
. We first use qqunif
to compare the chi-square p-values of the HWE test against a uniform
distribution (panel A). We next simulate markers under HWE with
HWData
, matching the simulated markers in sample size and
allele frequency distribution to the observed data. This is achieved by
setting argument p
of HWData
equal to the
allele frequencies of the observed data, where the latter are computed
with function af
. The corresponding QQ-plot of the
simulated p-values is shown in panel B. For the empirical data, the
observed p-value distribution strongly deviates from the uniform
distribution, as well as from the expected pattern under HWE.
Given that genotype data is discrete, often with low counts, the null
distribution of the p-values is in fact not uniform. We therefore build
a new QQ plot of exact p-values against p-values that are obtained from
sampling the true null a few times (five times in the code below) using
HWQqplot
, doing this both for the observed data (panel C)
and for data sampled from the Levene-Haldane equilibrium distribution,
conditional on the observed minor allele counts (panel D). The pattern
for the empirical data differs strongly form the expected p-value
distribution as shown in panel D. There are many more significant
markers than expected under the HWE assumption.
data("CEUchr22")
Z <- MakeCounts(CEUchr22)
Z <- Z[,1:3]
Z <- Z[!is.mono(Z),]
alfreq <- af(Z)
alcount <- maf(Z,option=3)[,1]
chisq.pvals <- HWChisqStats(Z,pvalues=TRUE)
set.seed(123)
Z.sim.chi <- HWData(nm=nrow(Z),n=99,p=alfreq)
Z.sim.chi <- Z.sim.chi[!is.mono(Z.sim.chi),]
chisq.pvals.sim <- HWChisqStats(Z.sim.chi,pvalues=TRUE)
set.seed(123)
Z.sim.exa <- HWData(nm=nrow(Z),n=99,nA=alcount,conditional = TRUE)
Z.sim.exa <- Z.sim.exa[!is.mono(Z.sim.exa),]
opar <- par(mfrow=c(2,2),mar=c(3,3,2,0)+0.5,mgp=c(2,1,0))
par(mfg=c(1,1))
qqunif(chisq.pvals,logplot = TRUE,
plotline = 1,main="A: observed QQ uniform")
par(mfg=c(1,2))
qqunif(chisq.pvals.sim,logplot = TRUE,xylim=15,
main="B: simulated QQ uniform")
par(mfg=c(2,1))
set.seed(123)
HWQqplot(Z,nsim=5,logplot=TRUE,main="C: observed QQ HWE null")
par(mfg=c(2,2))
set.seed(123)
HWQqplot(Z.sim.exa,nsim=5,logplot=TRUE,main="D: simulated QQ HWE null")
The function HWData
allows for the simulation of genetic
markers under equilibrium and disequilibrium conditions. This enables us
to create simulated data sets that match the observed data set in sample
size and allele frequency distribution, as shown in the previous
section. The comparison of graphics and statistics for observed and
simulated datasets is helpful when assessing the extent of HWE for a
large set of markers. We simulate \(m=100\) markers for \(n=100\) individuals by taking random
samples from a multinomial distribution with \(\theta_{AA} = p^2, \hspace{1mm} \theta_{AB} = 2pq,
\hspace{1mm}\) and \(\theta_{BB} =
q^2\). This is done by routine HWData
, which can
generate data sets that stem from a population that is either in or out
of Hardy-Weinberg equilibrium. Routine HWData
can generate
data that are in exact equilibrium
(exactequilibrium = TRUE
) or that are generated from a
multinomial distribution (default). The markers generated by
HWData
are independent (there is no linkage
disequilibrium). HWData
returns a matrix of genotype
counts, which are converted to genotypic compositions (i.e., the
relative genotype frequencies) if argument counts
is set to
FALSE
. Routine HWData
can simulate genotype
counts under several conditions. A fixed allele frequency can be
specified by setting p
to a scalar or vector with the
desired allele frequencies and specifying conditional=TRUE
.
Sampling is then according to Levene-Haldane’s exact distribution. If
conditional
is FALSE
, the given vector
p
of allele frequencies will be used in sampling from the
multinomial distribution. If p
is not specified,
p
will be drawn from a uniform distribution, and genotypes
are drawn from a multinomial distribution with probabilities \(p^2, 2pq\) and \(q^2\) for AA, AB and BB respectively. It is
also possible to generate data under disequilibrium, by specifying a
vector of inbreeding coefficients f
. The use of
HWData
is illustrated below by simulating several data
sets. Each simulated dataset is plotted in a ternary diagram below in
order to show the effect of the different simulation options. We
subsequently simulate 100 markers under HWE with allele frequency 0.5
(X1
), 100 markers under HWE with a random uniform allele
frequency (X2
), 100 markers under inbreeding (\(f = 0.5\)) with allele frequency 0.5
(X3
), 100 markers under inbreeding (\(f=0.5\)) with a random uniform allele
frequency (X4
), 100 markers with fixed allele frequencies
of 0.2, 0.4, 0.6 and 0.8 (25 each, X5
) and 100 markers in
exact equilibrium with a random uniform allele frequency
(X6
).
set.seed(123)
n <- 100
m <- 100
X1 <- HWData(m, n, p = rep(0.5, m))
X2 <- HWData(m, n)
X3 <- HWData(m, n, p = rep(0.5, m), f = rep(0.5, m))
X4 <- HWData(m, n, f = rep(0.5, m))
X5 <- HWData(m, n, p = rep(c(0.2, 0.4, 0.6, 0.8), 25), conditional = TRUE)
X6 <- HWData(m, n, exactequilibrium = TRUE)
opar <- par(mfrow = c(3, 2),mar = c(1, 0, 3, 0) + 0.1)
par(mfg = c(1, 1))
HWTernaryPlot(X1, main = "(a)")
par(mfg = c(1, 2))
HWTernaryPlot(X2, main = "(b)")
par(mfg = c(2, 1))
HWTernaryPlot(X3, main = "(c)")
par(mfg = c(2, 2))
HWTernaryPlot(X4, main = "(d)")
par(mfg = c(3, 1))
HWTernaryPlot(X5, main = "(e)")
par(mfg = c(3, 2))
HWTernaryPlot(X6, main = "(f)")
In practice, SNPs mostly have a skewed distribution of the minor
allele frequency (MAF), such that low MAF variants are much more common.
SNPs with such an allele frequency distribution can be simulated with
HWData
by setting the parameters of the beta distribution
shape1
and shape2
to 1 and 10
respectively:
The HardyWeinberg package has incorporated functions for dealing with multiallelic markers, geared towards the analysis of microsatellite or Short Tandem Repeat (STR) datasets. Special functions for triallelic and multiallelic variants are illustrated below.
The classical three-allelic ABO locus can be tested for equilibrium
with an iterative algorithm implemented in HWABO
, as shown
for the sample (A=182,B=60,AB=17,OO=176) below.
x <- c(fA=182,fB=60,nAB=17,nOO=176)
al.fre <- HWABO(x)
#> Iteration history:
#> pA pB pO ll
#> 0 0.3333333 0.33333333 0.3333333 -194.706389
#> 1 0.2984674 0.11149425 0.5900383 -13.488684
#> 2 0.2709650 0.09445916 0.6345758 -9.196185
#> 3 0.2655411 0.09328308 0.6411759 -9.099231
#> 4 0.2646231 0.09318236 0.6421945 -9.096756
#> 5 0.2644732 0.09317075 0.6423560 -9.096691
#> 6 0.2644490 0.09316911 0.6423819 -9.096690
#> fA fB nAB nOO
#> Observed 182.000 60.00000 17.0000 176.000
#> Expected 178.212 55.84582 21.4351 179.507
#> X2 = 1.375706 p-value = 0.2408339
Allele frequencies, initially set to being equally frequent, converge in six iterations to their final values. A Chi-square test with one degree of freedom indicates equilibrium can not be rejected.
A general triallelic locus can be tested for equilibrium with an
exact test by HWTriExact
as shown below, supplying the
genotype counts as a six element named vector.
x <- c(AA=20,AB=31,AC=26,BB=15,BC=12,CC=0)
#results <- HWTriExact(x)
#Tri-allelic Exact test for HWE (autosomal).
#Allele counts: A = 38 B = 73 C = 97
#sum probabilities all outcomes 1
#probability of the sample 0.0001122091
#p-value = 0.03370688
The output gives the probability of the observed sample, and the
exact test p-value. For this example, the null hypothesis of equilibrium
proportions is rejected at a significance level of five percent.
HWTriExact
uses a complete enumeration algorithm programmed
in R, which can be slow, depending on the genotype counts of the
particular sample. A faster analysis for triallelics is to use a network
algorithm. For the data at hand, the exact test based on the network
algorithm is carried out by with HWNetwork
x <- c(AA=20,AB=31,AC=26,BB=15,BC=12,CC=0)
x <- toTriangular(x)
x
#> A B C
#> A 20 0 0
#> B 31 15 0
#> C 26 12 0
m <- c(A=0,B=0,C=0)
results <- HWNetwork(ma=m,fe=x)
#> Network algorithm for HWE Exact test with multiple alleles
#> 3 alleles detected.
#> 0 males and 104 females
#> Allele counts:
#> A B C
#> Males 0 0 0
#> Females 97 73 38
#> All 97 73 38
#> Probability of the sample: 0.0001122091
#> p-value: 0.03370688
We note that HWNetwork
allows for the X chromosomal
variants, and requires the specification of male and female genotype
counts (arguments ma
and fe
). To do an
autosomal test as shown above, hemizygous male counts should be set to
zero, and the female genotype counts should be set to contain the summed
autosomal counts of males and females. Second, note that the
fe
argument is required to be a lower triangular matrix,
and for this reason the counts are first reorganised in this format with
toTriangular
. The p-value is exactly the same as before.
For markers with more alleles, a permutation test will generally be
faster. We run the permutation test for the triallelic autosomal locus
analysed above
set.seed(123)
x <- c(AA=20,AB=31,AC=26,BB=15,BC=12,CC=0)
x <- toTriangular(x)
#results <- HWPerm.mult(x)
#Permutation test for Hardy-Weinberg equilibrium (autosomal).
#3 alleles detected.
#Observed statistic: 0.0001122091 17000 permutations. p-value: 0.03405882
Note that this gives a similar, but not identical p-value, in
comparison with HWTriExact
above. As this works with
two-character named vectors this currently allows the permutation test
to be used for variants with well over 52 alleles.
An X chromosomal variant is tested for HWE by supplying separate vectors for males and females as shown below:
males <- c(A=1,B=21,C=34)
females <- c(AA=0,AB=1,AC=0,BB=8,BC=24,CC=15)
results <- HWTriExact(females,males)
#> Tri-allelic Exact test for HWE and EAF (X-chromosomal)
#> Allele counts: na = 2 nb = 62 nc = 88
#> Sample contains: 56 males and 48 females
#> sum probabilities all outcomes 1
#> probability of the sample 0.005343291
#> p-value = 0.8309187
and this can also be done with the network algorithm by
males <- c(A=1,B=21,C=34)
females <- toTriangular(c(AA=0,AB=1,AC=0,BB=8,BC=24,CC=15))
results <- HWNetwork(ma=males,fe=females)
#> Network algorithm for HWE Exact test with multiple alleles
#> 3 alleles detected.
#> 56 males and 48 females
#> Allele counts:
#> C B A
#> Males 34 21 1
#> Females 54 41 1
#> All 88 62 2
#> Probability of the sample: 0.005343291
#> p-value: 0.8309187
For genetic markers with multiple alleles such as microsatellites
(STRs), the different alleles are often separated in different columns.
The example below uses autosomal microsatellite data from the US
National Institute of Standards and Technology (NIST), stored in
dataframe NistSTRs
, containing 29 STRs for 361 individuals.
The two alleles of a marker are coded in successive columns. The
corresponding alleles are often coded as integers, and we use function
AllelesToTriangular
to obtain a lower triangular matrix
with the autosomal genotype counts. This matrix is used as input for
HWPerm.mult
that runs the permutation test.
data("NistSTRs")
NistSTRs[1:5,1:5]
#> CSF1PO-1 CSF1PO-2 D10S1248-1 D10S1248-2 D12S391-1
#> BC11352 11 13 13 15 17.0
#> GC03394 10 12 15 16 19.0
#> GT36864 10 12 15 17 17.3
#> GT36866 11 11 13 16 18.0
#> GT36875 11 11 13 14 18.0
n <- nrow(NistSTRs)
p <- ncol(NistSTRs)/2
n
#> [1] 361
p
#> [1] 29
A1 <- NistSTRs[,1]
A2 <- NistSTRs[,2]
GenotypeCounts <- AllelesToTriangular(A1,A2)
print(GenotypeCounts)
#> A10 A11 A12 A13 A14 A8 A9
#> A10 17 0 0 0 0 0 0
#> A11 47 35 0 0 0 0 0
#> A12 61 78 44 0 0 0 0
#> A13 12 20 26 0 0 0 0
#> A14 1 1 4 1 0 0 0
#> A8 1 2 1 0 0 0 0
#> A9 3 5 2 0 0 0 0
set.seed(123)
#out <- HWPerm.mult(GenotypeCounts)
#Permutation test for Hardy-Weinberg equilibrium (autosomal).
#7 alleles detected.
#Observed statistic: 2.290724e-11 17000 permutations. p-value: 0.8644706
For this seven-allelic STR, there is no evidence against HWE.
Function HWStr
is a wrapper function allowing to test a set
of STRs coded in the two-column format by a permutation or chisquare
test. All STRs in the dataframe NistSTRs
can be tested
with
#Results <- HWStr(NistSTRs,test="permutation")
#29 STRs detected.
#> Results
# STR N Nt MinorAF MajorAF Ho He Hp pval
#1 CSF1PO-1 361 7 0.0055 0.3601 0.7341 0.7194 1.4016 0.8614
#2 D10S1248-1 361 9 0.0014 0.3075 0.7645 0.7586 1.5552 0.6488
#3 D12S391-1 361 16 0.0014 0.1717 0.8975 0.8909 2.3686 0.8955
#4 D13S317-1 361 8 0.0014 0.3255 0.7618 0.7837 1.7102 0.1049
#5 D16S539-1 361 7 0.0180 0.3144 0.7645 0.7600 1.5933 0.1641
#6 D18S51-1 361 15 0.0014 0.1704 0.8560 0.8758 2.2139 0.1926
#7 D19S433-1 361 15 0.0014 0.3615 0.7673 0.7694 1.7624 0.9667
#8 D1S1656-1 361 15 0.0028 0.1496 0.9252 0.8992 2.4073 0.8699
#9 D21S11-1 361 16 0.0014 0.2825 0.8227 0.8288 1.9946 0.6156
#10 D22S1045-1 361 8 0.0055 0.3823 0.7535 0.7220 1.4823 0.9785
#11 D2S1338-1 361 12 0.0014 0.1856 0.8698 0.8814 2.2464 0.5066
#12 D2S441-1 361 11 0.0014 0.3435 0.7867 0.7693 1.6734 0.7125
#13 D3S1358-1 361 9 0.0014 0.2729 0.7562 0.7900 1.6437 0.3789
#14 D5S818-1 361 9 0.0014 0.3878 0.7064 0.6977 1.3939 0.7615
#15 D6S1043-1 361 14 0.0014 0.2964 0.7978 0.8232 2.0059 0.0220
#16 D7S820-1 361 9 0.0014 0.2562 0.8310 0.8161 1.7925 0.5045
#17 D8S1179-1 361 10 0.0014 0.3296 0.7839 0.8072 1.8386 0.8606
#18 F13A01-1 361 12 0.0014 0.3490 0.7590 0.7353 1.5471 0.8793
#19 F13B-1 361 6 0.0055 0.3892 0.7008 0.7175 1.3790 0.8276
#20 FESFPS-1 361 7 0.0014 0.4114 0.6731 0.6930 1.3085 0.8506
#21 FGA-1 361 14 0.0014 0.2050 0.8670 0.8594 2.1163 0.0528
#22 LPL-1 361 8 0.0014 0.4224 0.7175 0.6947 1.3386 0.9721
#23 Penta_C-1 361 10 0.0014 0.3947 0.7701 0.7523 1.6035 0.0248
#24 Penta_D-1 361 13 0.0014 0.2327 0.8587 0.8247 1.8925 0.1464
#25 Penta_E-1 361 19 0.0014 0.1994 0.8920 0.8910 2.4391 0.4811
#26 SE33-1 361 39 0.0014 0.0942 0.9501 0.9483 3.1472 0.2506
#27 TH01-1 361 8 0.0014 0.3449 0.7424 0.7646 1.5616 0.2379
#28 TPOX-1 361 8 0.0014 0.5249 0.6537 0.6404 1.2572 0.5545
#29 vWA-1 361 10 0.0014 0.2839 0.8061 0.8076 1.7577 0.4501
The p-values of the permutation test in the last column show that at a usual five percent significance level, HWP would be rejected for Penta_C and D6S1043 and that FGA-1 is borderline.