Say you are 1.80 m tall. How much of that is due to your genetics? Heritability studies try to measure the proportion of the variance in height observed in a population that is explained by genetic variation. The same question can be asked about any quantitative (continuous) trait: political opinions, IQ, sexuality, etc.

GCTA is a simple mixed linear model used to estimate a specific, narrow-sense (additive) heritability. The first part of the estimation procedure consists in quantifying the genetic similarity of each pair of individuals. In these notes I write down the details of this procedure that I wish the GCTA papers had spelled out. These notes are not self-contained.

They’re mostly based on:

Things I’ve learned:

The inbreeding coefficient

The diagonal terms of the naive (first) estimator of the GRM have expectation \(1+F\) where \(F\) is the inbreeding coefficient. What is \(F\)? There are different definitions of inbreeding coefficients, so it can be confusing. Here we are talking about a population-level inbreeding coefficient for a single gene. (The inbreeding coefficient can then be averaged over all genes. There are alternative definitions of inbreeding at the individual-level, considering many genes in a single individual.)

Let \(X\) be a random variable modelling the allele count of a given SNP. For a diploid individual and a SNP with 2 alleles A and a, it can take values 0 (for aa), 1 (for aA and Aa) and 2 (for AA). We consider two different probability distributions for this variable, encoding different assumptions about mating behavior: under the null model (random mating), the two alleles are drawn independently with \(P(A) = p\), so the probability of a heterozygote is \(h = 2p(1-p)\); under the alternative model, we simply write \(f_1 = P(X=1)\) and \(f_2 = P(X=2)\).

\(F\) measures the deviation of the heterozygote frequency predicted by the alternative model (\(f_1\)) from that predicted by the null model (\(h\)), normalized by \(h\). That is, \(F = \frac{h - f_1}{h}\). This normalization by \(h\) seems slightly arbitrary; one of the estimators presented below can help us understand it.
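
For example, with made-up numbers: if \(p = 0.3\), then \(h = 2 \times 0.3 \times 0.7 = 0.42\); if the observed heterozygote frequency is \(f_1 = 0.35\), then \(F = \frac{0.42 - 0.35}{0.42} \approx 0.17\).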

(Expected - observed) / expected?

It is also commonly defined as \(\frac{E - O}{E}\), where \(E\) stands for the expected heterozygosity under random mating (our \(h\)) and \(O\) for the observed heterozygosity (our \(f_1\)).

A useful assumption

The expectations and variances of the estimators are calculated using one important assumption: that the marginal probability \(p_i\) of allele A at SNP \(i\) is known. In reality, these probabilities are not known, but they are well estimated by the empirical frequencies. So for a given SNP \(i\), \(p_i \approx \hat{p_i} = \frac{1}{2n} \sum_j x_{ij}\) (where \(n\) is the number of individuals). Why can we assume so?

  1. \(p_i\) is estimated on all individuals (usually, a large number).
  2. (not sure about that) when estimating such a probability from count data, the absolute variance is actually largest when the probability is close to 0.5 (for a binomial variable it is \(np(1-p)\)), but the relative error on \(\hat{p}\) is smallest there and blows up as \(p\) goes to 0 or 1. Genotyped SNPs should be chosen to be relatively informative (= high entropy, i.e. not too rare)? Then the estimates of the allele probabilities should have a relatively low relative error, compared to a randomly chosen SNP? (A small numerical sketch follows below.)
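
To make the first point concrete, here is a minimal sketch (toy simulated data; the variable names and sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, size=1000)              # true allele frequencies of 1000 SNPs
X = rng.binomial(2, true_p[:, None], size=(1000, 500))   # genotypes of 500 individuals, in {0, 1, 2}

# Empirical allele frequencies: p_hat_i = (1 / 2n) * sum_j x_ij
p_hat = X.mean(axis=1) / 2

# With a few hundred individuals the estimates are already close to the truth,
# which is why the derivations below treat p as known.
print(np.abs(p_hat - true_p).mean())
```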

Estimator based on the diagonal of the SNP-derived GRM

The diagonal term of the GRM \(A_{jj}\) is the average over SNPs (\(i\)) of \(A_{ijj} = \frac{(X_{ij} - 2p_i)²}{h_i}\) (\(i\) is the SNP index, \(j\) the individual index). Let’s omit the indices.

It is presented as being “based on the variance of additive genetic values”, because it is a variance term (we assumed above that \(E[X]=2p\) is known) and “additive” because \(X\) is a sum of allele counts (one per gamete).
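
Here is a minimal sketch of this diagonal term on toy data (names and sizes are mine; \(p\) is treated as known, as in the derivations below):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=1000)              # known allele frequencies of 1000 SNPs
X = rng.binomial(2, p[:, None], size=(1000, 500))   # SNPs x individuals, simulated without inbreeding
h = 2 * p * (1 - p)                                 # h_i = 2 p_i (1 - p_i)

# Diagonal of the GRM: A_jj = average over SNPs i of (x_ij - 2 p_i)^2 / h_i
A_diag = (((X - 2 * p[:, None]) ** 2) / h[:, None]).mean(axis=0)

print(A_diag.mean())   # close to 1 + F = 1, since the data were simulated without inbreeding
```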

Expectation

\[\begin{align} E[A_{ijj}] &= E[\frac{(X - 2p)²}{h}] \\ &= \frac{1}{h} E[(X - 2p)²] &&\text{because $h$ fixed and known}\\ &= \frac{1}{h} E[{X}² - 4p X + 4p²] \\ &= \frac{1}{h} (E[{X}²] - 4p E[X] + 4p²) \\ \end{align}\]

By definition, \(E[X] = f_1 + 2 f_2\) and \(E[X²] = f_1 + 4 f_2\) (since \(X²\) takes value 1 with probability \(f_1\) and 4 with probability \(f_2\)).

Moreover, by assumption the parameter \(p\) is estimated without error, using \(2 p = E[X]\). So we have \(2 p = f_1 + 2f_2\). We want to express the expectation as a function of \(f_1\) and \(h\), so we eliminate all \(f_2\) terms using \(f_2 = p - f_1 / 2\).

\[\begin{align} E[A_{ijj}] &= \frac{1}{h} ((f_1 + 4 f_2) - 4p (f_1 + 2f_2) + 4 p²) \\ &= \frac{1}{h} ((f_1 + 4 p - 2 f_1) - 4p (f_1 + 2 p - f_1) + 4p²) \\ &= \frac{1}{h} ((-f_1 + 4 p) - 4p (2 p) + 4p²) \\ &= \frac{1}{h} ((-f_1 + 4 p) - 4 p²) \\ &= \frac{1}{h} (4p(1 - p) - f_1) \\ &= \frac{2 h - f_1}{h} \\ &= 1 + \frac{h - f_1}{h} \\ &= 1 + F \\ \end{align}\]

So \(A_{ijj} - 1 = \frac{(X_{ij} - 2p_i)^2 - h_i}{h_i}\) is an estimator of \(F\).
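
We can check this on a toy simulation (made-up parameter values; genotypes are simulated so that with probability \(F\) the two alleles are identical copies of a single draw, otherwise two independent draws):

```python
import numpy as np

rng = np.random.default_rng(1)
p, F_true, n = 0.3, 0.10, 200_000
h = 2 * p * (1 - p)

# Simulate one SNP for n individuals with inbreeding coefficient F_true
inbred = rng.random(n) < F_true                 # is the individual "inbred" at this SNP?
b1 = rng.random(n) < p                          # first allele
b2 = np.where(inbred, b1, rng.random(n) < p)    # second allele: a copy of the first, or an independent draw
X = b1.astype(int) + b2.astype(int)             # allele counts in {0, 1, 2}

# First estimator of F: diagonal GRM term minus 1
F1 = ((X - 2 * p) ** 2 - h) / h
print(F1.mean())   # should be close to F_true = 0.10
```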

Estimator based on SNP homozygosity

Plug-in \(\hat{F}\)

We can estimate \(F = \frac{h - f_1}{h}\) by plugging in estimates of its two ingredients: \(h = 2p(1-p)\) is known (since \(p\) is assumed known), and \(f_1 = E[\mathbf{1}_{x=1}(X)]\) can be estimated from a single observation by the indicator \(\mathbf{1}_{x=1}(X)\), which equals \((2-X)X\) on \(\{0, 1, 2\}\).

So \(\frac{h - \mathbf{1}_{x=1}(X)}{h} = \frac{h - (2-X)X}{h}\) is an estimator of \(F\).
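
On the same toy simulation as above (same \(X\) and \(h\)):

```python
# Second estimator of F (heterozygosity-based), on the simulated genotypes from the previous snippet
F2 = (h - (2 - X) * X) / h
print(F2.mean())   # also close to F_true
```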

Inbreeding coefficient seen as a prior probability in a latent variable model

Another way to obtain this is presented in Li & Horvitz. Consider these 2 models:

  1. No inbreeding: the two alleles of an individual are drawn independently, each being A with probability \(p_0\).
  2. Inbreeding as a latent variable \(Y\): with some prior probability \(q(Y=1)\) the individual is inbred and its two alleles are identical copies of a single draw (A with probability \(p_0'\)); with probability \(q(Y=0) = 1 - q(Y=1)\) the two alleles are drawn independently, each being A with probability \(p_0'\).

To go forward in the derivation of the estimator, we want to express the parameter \(p_0\) as a function of the other \(p_0'\) (or vice versa). I’m not sure what it means for the true parameters to be functions of each other… However, we can estimate both \(p_0\) and \(p_0'\) using the same estimator. Indeed, for the first model, \(E[X] = 2 p_0\), so we can estimate \(p_0\) using \(\hat{p_0} = \bar{X}/2\) where \(\bar{X}\) is the empirical mean. For the second model, the parameter \(p_0'\) can be estimated in the same way, since:

\[\begin{align} E[X] &= E[E[X|Y]] \\ &= q(Y=0) E[X|Y=0] + q(Y=1) E[X|Y=1] \\ &= q(Y=0) (2 p_0') + q(Y=1) (2 p_0') \\ &= 2 p_0' \\ \end{align}\]

Now if we use \(\hat{p_0}\) as an estimator both for \(p_0\) and \(p_0'\) and express the likelihoods of a heterozygote under the two models using this estimate, we get:

\[\begin{align} P_1(X=1) &= 2 \hat{p_0} (1 - \hat{p_0}) = \hat{h} \\ P_2(X=1) &= q(Y=0) \times 2 \hat{p_0} (1 - \hat{p_0}) = (1 - q(Y=1)) \hat{h} && \text{an inbred individual cannot be heterozygous}\\ \end{align}\]

Then, taking the first model’s prediction as the “expected” heterozygosity and the second model’s prediction as the “observed” one in the definition \(F = \frac{E - O}{E}\):

\[\begin{align} F &= \frac{\hat{h} - (1 - q(Y=1)) \hat{h}}{\hat{h}} = q(Y=1) \\ \end{align}\]

(and replacing the second model’s prediction by the observed indicator \(\mathbf{1}_{x=1}(X)\) recovers the estimator above).

We see that the inbreeding coefficient \(F\) can also be defined as some prior probability of inbreeding \(F = q(Y=1)\)!

Taking a step back, what did we just do here? We have two models that differ slightly. We have parameters that can be estimated easily (\(p_0\) and \(p_0'\)), and a parameter, \(F\), that is harder to estimate. We write down some probabilities according to these 2 models. It doesn’t need to be the probability of the same event for the 2 models! But here it’s convenient to use the predictions of \(X=1\), because they differ only by a factor \(1-F\). We then plug in the estimators we already have. The two estimators don’t even need to be the same (here they happen to be), as long as they’re related by a mathematical function (they’re just functions of the sample), and as long as the parameter of interest appears in one of them. Then we can massage the expression to find an estimator for the parameter of interest… Maybe this is a well-known method, but I had never seen it.

Estimator based on the correlation between uniting gametes

Let \(B_1\) and \(B_2\) be random variables representing the alleles of the male and female gametes. By definition, \(X\) is the allele count, that is \(X = B_1 + B_2\). The two gametes fuse to produce the zygote. \(B_1=1\) if the male gamete contains allele A, else it is 0; and similarly for \(B_2\). Let’s first compute the covariance between them:

\[\begin{align} Cov(B_1, B_2) &= E[(B_1 - p)(B_2 - p)] \\ &= E[B_1 B_2] - pE[B_1 + B_2] + p^2 \\ &= E[B_1 B_2] - pE[X] + p^2 \\ \end{align}\]

Because of the way we have coded the alleles as integers in \(B_1\), \(B_2\) and \(X\), we have that \(B_1 B_2 = 1\) if and only if \(X=2\). Again, we can write this indicator as a polynomial: \(\mathbf{1}_{x=2}(x) = x(x-1) / 2\) for all \(x \in \{0, 1, 2\}\). So an estimator of the covariance is simply

\[\begin{align} \hat{Cov}(B_1, B_2) &= X(X-1)/2 - pX + p^2 \\ \end{align}\]

To obtain the correlation, we divide by the product of the standard deviations of \(B_1\) and \(B_2\). We have \(Var[B_1] = E[B_1²] - E[B_1]² = p - p² = p(1-p) = h/2\), and similarly for \(Var[B_2]\). So the denominator is \(\sqrt{Var[B_1] Var[B_2]} = h/2\).

So the estimator for the correlation is:

\[\begin{align} \hat{Corr}(B_1, B_2) &= \frac{X(X-1)/2 - pX + p^2}{h/2} \\ &= \frac{X(X-1) - 2pX + 2p^2}{h} \\ &= \frac{X^2 - X(1 + 2p) + 2p^2}{h} \\ \end{align}\]
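
Continuing the same toy simulation:

```python
# Third estimator of F (correlation between uniting gametes), same simulated X, p and h as above
F3 = (X**2 - X * (1 + 2 * p) + 2 * p**2) / h
print(F3.mean())   # again close to F_true
```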

But also it is the average of the two other estimators
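
Indeed, all three estimators share the same denominator \(h\), and averaging the numerators of the first two gives exactly the numerator of the third:

\[\begin{align} \frac{1}{2}\left[\left((X - 2p)² - h\right) + \left(h - (2-X)X\right)\right] &= \frac{1}{2}\left[X² - 4pX + 4p² - 2X + X²\right] \\ &= X² - X(1 + 2p) + 2p² \\ \end{align}\]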

I was surprised by that. The variance of an estimator of \(F\) is a function of the true parameter \(p\). The first and second estimators have the same variance in terms of \(p\): to simplify, assuming that \(F\) is 0 (because we expect it to be small anyway), both variances are equal to \((1-h)/h\). The variance is 1 at \(p=0.5\), but blows up as \(p\) gets close to 0 or 1. So the two estimators behave roughly the same over the whole range of \(p\); it is not the case that one of them especially shines for low \(p\), for example.

Yet when we average them, we get a variance of exactly 1 everywhere. That’s because they are anti-correlated roughly below \(p \approx 0.21\) and above \(p \approx 0.79\) (where \(3h - 1 < 0\)), precisely the regions where each individual estimator has a very high variance (see the covariance calculation below).

Decomposition in indicator functions

On \(x \in \{0, 1, 2\}\), we have:

\[\begin{align} \mathbf{1}_{x=0}(x) &= x²/2 - 3x/2 + 1 \\ \mathbf{1}_{x=1}(x) &= -x² + 2x = (2-x)x \\ \mathbf{1}_{x=2}(x) &= x²/2 - x/2 = x(x-1)/2 \\ \end{align}\]

The numerators of the estimators are polynomials of degree 2, like those indicator functions. So these numerators can be written as a linear combination of the indicator functions: \(a \mathbf{1}_{x=0}(x) + b \mathbf{1}_{x=1}(x) + c \mathbf{1}_{x=2}(x) + d\).

The 2nd estimator’s numerator is just \(h - \mathbf{1}_{x=1}\), so \(a=c=0\), \(b=-1\) and the baseline is \(d=h\).

It is not surprising that \(b\) is negative: observing \(X=1\) means we have observed 2 different alleles, which is impossible if the two alleles are copies of one another (the inbred case). So the estimate of \(F\) should decrease compared to the baseline level \(d\).

Moreover, the second estimate is the same whether \(X=0\) or \(X=2\). It does not use all the available information, so it is expected that it doesn’t perform amazingly?

What about the 1st estimator? Can this decomposition help us understand how it works intuitively? Does the first estimator use information about \(X=0\) and \(X=2\)?

Let’s decompose the numerator of the first estimator into a linear combination of the indicator functions. This numerator is \((X-2p)² - h = X² × (\color{red}{1}) + X × (\color{blue}{-4p}) + (4p² - h)\). We decompose it into:

\[\begin{align} \mathrm{numerator} &= a \mathbf{1}_{x=0}(X) + b \mathbf{1}_{x=1}(X) + c \mathbf{1}_{x=2}(X) + d \\ &= a (X²/2 - 3X/2 + 1) + b (-X² + 2X) + c (X²/2 - X/2) + d \\ &= X² × (\color{red}{a/2 - b + c/2}) + X × (\color{blue}{-3a/2 + 2b - c/2}) + 1 × (a + d) \\ \end{align}\]

Identifying the two expressions coefficient by coefficient (for \(X²\), \(X\) and the constant), we get only 3 equations for 4 unknowns \(a, b, c, d\). However, to simplify the interpretation of the coefficients, we can set \(a = -c\): we want the observations \(X=0\) and \(X=2\) to push in opposite directions with respect to some baseline level \(d\).

Then, matching the \(X²\) coefficients, we obtain \(\color{red}{a/2 - b - a/2 = 1}\), so \(b = -1\). This is the same as for the second estimator.

Furthermore, using the factors of \(X\), we get \(\color{blue}{-3a/2 + 2b + a/2 = -4p}\), that is \(-a + 2b = -4p\), so \(a = 2b + 4p = -2 + 4p = 2(2p - 1)\) and \(c = -a = 2(1 - 2p)\). Finally, matching the constant terms gives \(a + d = 4p² - h\), so \(d = 4p² - h - a = 2 - 3h\).

Why does it make sense that \(c = 2(1 - 2p)\)? When \(p≈0\), \(c≈2\). Thus, observing \(X=2\) bumps the numerator of \(\hat{F}\) by around \(2\). Why? When \(p\) is low, observing \(X=2\) is very unlikely, but even more so if the alleles are sampled independently (without inbreeding). So this is very likely due to inbreeding, where the allele is sampled once and copied. Similarly, when \(p\) is close to \(1\), observing \(X=0\) increases our estimation of inbreeding.
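
We can check this decomposition symbolically (a small sketch using sympy; the variable names are mine):

```python
import sympy as sp

x, p = sp.symbols("x p")
h = 2 * p * (1 - p)

# Indicator functions written as degree-2 polynomials on {0, 1, 2}
ind0 = x**2 / 2 - 3 * x / 2 + 1
ind1 = -x**2 + 2 * x
ind2 = x**2 / 2 - x / 2

a, b, c, d = 2 * (2 * p - 1), -1, 2 * (1 - 2 * p), 2 - 3 * h

# Numerator of the first estimator vs. its decomposition
numerator = (x - 2 * p) ** 2 - h
decomposition = a * ind0 + b * ind1 + c * ind2 + d
print(sp.simplify(sp.expand(numerator - decomposition)))   # -> 0
```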

Covariance of the first 2 estimators

It is a complete mess to calculate the covariance between the estimators using the polynomial notation… but with the decomposition into indicator functions everything is much simpler, since the products of distinct indicators vanish.

\[\begin{align} Cov(\hat{F_1}, \hat{F_2}) &= E[\hat{F_1} \hat{F_2}] - E[\hat{F_1}]E[\hat{F_2}] \\ &= E[\hat{F_1} \hat{F_2}] - F² \\ \end{align}\]

\[\begin{align} E[\hat{F_1} \hat{F_2}] &= \frac{1}{h²} E[(h - \mathbf{1}_{x=1}(X))(a \mathbf{1}_{x=0}(X) - \mathbf{1}_{x=1}(X) - a \mathbf{1}_{x=2}(X) + d)] \\ &= \frac{1}{h²} [h [a f_0 - f_1 - a f_2 + d] - E[\mathbf{1}_{x=1}(X) (a \mathbf{1}_{x=0}(X) - \mathbf{1}_{x=1}(X) - a \mathbf{1}_{x=2}(X) + d)]] \\ &= \frac{1}{h²} [h [a f_0 - f_1 - a f_2 + d] - E[- \mathbf{1}_{x=1}(X) + \mathbf{1}_{x=1}(X) d]] && \text{$\mathbf{1}_{x=1}(X) \mathbf{1}_{x=2}(X) = 0$, ...}\\ &= \frac{1}{h²} [h [a (1 - f_1 - f_2) - f_1 - a f_2 + d] - E[- \mathbf{1}_{x=1}(X) + \mathbf{1}_{x=1}(X) d]] \\ &= \frac{1}{h²} [h [a - f_1 - af_1 - 2af_2 + d] - (- f_1 + f_1 d)] \\ &= \frac{1}{h²} [h [a - f_1 - a(2p) + d] + f_1(1 - d)] \\ &= \frac{1}{h²} [h [4p - 2 - f_1 - 8p² + 4p -3h + 2] + f_1(3h - 1)] && \text{$d=-3h+2$, $a=4p-2$}\\ &= \frac{1}{h²} [h [4p - 2 - f_1 - 8p² + 4p -6p + 6p² + 2] + f_1(3h - 1)] \\ &= \frac{1}{h²} [h [-2p² + 2p - f_1] + f_1(3h - 1)] \\ &= \frac{1}{h²} [h (h - f_1) + f_1(3h - 1)] \\ &= \frac{h-f_1}{h} + \frac{1}{h²}(f_1(3h - 1)) \\ &= F + \frac{1}{h²}(f_1(3h - 1)) \\ &= F + \frac{3h - 1}{h}\frac{f_1}{h} \\ &= F + \frac{3h - 1}{h} + \frac{3h - 1}{h}(\frac{f_1}{h} - 1) \\ &= F + \frac{3h - 1}{h} - \frac{3h - 1}{h}F \\ &= F(1 - \frac{3h - 1}{h}) + \frac{3h - 1}{h}\\ &= F\frac{1 - 2h}{h} + \frac{3h - 1}{h}\\ \end{align}\]

So we recover the paper’s statement that

\[\begin{align} Cov(\hat{F_1}, \hat{F_2}) &= \frac{3h - 1}{h} + F\frac{1 - 2h}{h} - F² \\ \end{align}\]
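
As a sanity check of this formula (a small symbolic sketch with sympy; the variable names are mine), we can compute \(E[\hat{F_1}\hat{F_2}]\) directly by summing over \(x \in \{0, 1, 2\}\) with probabilities \(f_0, f_1, f_2\), using \(2p = f_1 + 2 f_2\):

```python
import sympy as sp

f1, f2 = sp.symbols("f1 f2", positive=True)
f0 = 1 - f1 - f2
p = (f1 + 2 * f2) / 2                  # allele frequency, from E[X] = 2p
h = 2 * p * (1 - p)
F = (h - f1) / h

def F1(x):   # first estimator: ((x - 2p)^2 - h) / h
    return ((x - 2 * p) ** 2 - h) / h

def F2(x):   # second estimator: (h - 1_{x=1}) / h, with 1_{x=1}(x) = (2 - x) x
    return (h - (2 - x) * x) / h

probs = {0: f0, 1: f1, 2: f2}
E_F1F2 = sum(probs[x] * F1(x) * F2(x) for x in (0, 1, 2))
cov = E_F1F2 - F**2
claimed = (3 * h - 1) / h + F * (1 - 2 * h) / h - F**2
print(sp.simplify(cov - claimed))      # -> 0
```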

We can set \(F=0\) to simplify: the covariance reduces to \(\frac{3h-1}{h}\), which becomes very negative (it goes to \(-\infty\)) as \(p\) goes to 0 or 1. This is why the average estimator keeps a low variance in exactly the regions where each individual estimator has a very high variance.
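
As a quick check, at \(F=0\), using the variances \((1-h)/h\) from above and this covariance:

\[\begin{align} Var\left[\frac{\hat{F_1} + \hat{F_2}}{2}\right] &= \frac{1}{4}\left[\frac{1-h}{h} + \frac{1-h}{h} + 2\,\frac{3h-1}{h}\right] \\ &= \frac{1}{4} \frac{2 - 2h + 6h - 2}{h} \\ &= 1 \\ \end{align}\]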