Integrating External Controls by Regression Calibration for Genome-Wide Association Study

Lirong Zhu; Shijia Yan; Xuewei Cao; Shuanglin Zhang; Qiuying Sha

doi:10.20944/preprints202312.1184.v1

Submitted: 14 December 2023

Posted: 15 December 2023

Abstract

Genome-wide association studies (GWAS) have successfully revealed many disease-associated genetic variants. For a case-control study, adequate power of an association test can be achieved with a large sample size, although genotyping large samples is expensive. A cost-effective strategy to boost power is to integrate external control samples with publicly available genotyped data. However, the naïve integration of external controls may inflate the type I error rates if the systematic differences (batch effect) between studies are ignored, such as differences in sequencing platforms, genotype calling procedures, population stratification, and so forth. To account for the batch effect, we propose an approach that integrates External Controls into the Association Test by Regression Calibration (iECAT-RC) in case-control association studies. Extensive simulation studies show that iECAT-RC not only controls type I error rates but also boosts statistical power in all models. We also apply iECAT-RC to the UK Biobank data for M72 Fibroblastic disorders by considering genotype calling as the batch effect. Four SNPs associated with Fibroblastic disorders are detected by iECAT-RC and the other two comparison methods. However, our method has a higher probability of identifying these significant SNPs in the scenario of an unbalanced case-control association study.

Keywords: Genome-wide association test; case-control study; batch effect; data integration

Subject: Computer Science and Mathematics - Probability and Statistics

Introduction

Genome-wide association studies (GWAS) play a major role in associating specific genetic variants with common diseases and complex traits [1,2,3]. Researchers may have limited access to the genetic information of individuals with specific traits, and large-scale genetic studies can be expensive and resource-intensive [4]. Thus, with a small sample size in GWAS, the association test could have low power, and the possibility of false-positive findings may increase, especially for infrequent variants (i.e., MAF < 5%) [5,6].

The rapid development of sequencing technologies has promoted substantial advancement in GWAS, particularly in obtaining comprehensive genetic information from limited samples [7,8]. The integration of sequenced samples provides a great opportunity for identifying novel genetic associations and increasing the statistical power of single-variant association tests [9]. Nevertheless, the challenges associated with integrating sequenced samples arise from various factors, such as the utilization of diverse sequencing platforms, variations in genotype calling procedures, the presence of population stratification, and so forth [10]. In a single study, by incorporating sequenced samples from other studies as an external control sample, the power of single-variant tests can be significantly increased without incurring additional sequencing costs. However, the systematic differences (batch effect) between studies could inflate the type I error rates and increase the possibility of false-positive findings in association studies [11].

Several methods have been proposed recently to address the systematic differences between genotyped data of internal and external sources using likelihood-based methods [12]. Liu and Leal proposed a method, SEQCHIP, to correct bias when integrating genotype data in rare variant association studies [13]. Derkach et al. proposed another method that substitutes the genotype calls by their expected values given the observed sequence data to account for differential read depths between studies [14]. Motivated by Derkach et al., Chen and Lin proposed regression calibration (RC) methods to account for differential sequencing errors between cases and controls [15]. Although these methods are powerful, computing genotype probabilities and storing sequence read data can be challenging and expensive for large-scale studies. ProxECAT therefore incorporates external controls to estimate the enrichment of rare variants using allele counts in case-control analysis [16]. However, not using the internal control samples potentially limits the power of the association test. iECAT allows the incorporation of external controls in single-variant association tests [11]. In iECAT, the batch effect between internal and external studies is assessed by comparing the odds ratio estimates of alleles obtained using the internal control samples with those obtained using the combined control samples from the internal and external studies. An empirical Bayesian-type shrinkage estimator is then constructed based on the degree of difference between the odds ratios in the single-variant test. This method has been demonstrated to control type I error rates and to improve the power of the association test. However, it cannot adjust for covariates such as age, gender, and so on [11]. Building on iECAT, Li and Lee proposed a novel score-based test, which constructs a shrinkage score statistic using exclusively internal samples and external control samples, allowing for covariate adjustment [17]. However, the power increase of this method in association testing by integrating external controls is limited for extremely unbalanced case-control studies.

In this study, we present a novel approach that integrates External Controls into Association Tests by Regression Calibration (iECAT-RC) to incorporate external control samples in case-control association studies. The objective of this research is to boost the statistical power of the single-variant association test by integrating external controls with the adjustment of batch effects. We propose an approach that adjusts the genotypes of an external control sample to approximate the same distribution as the genotypes in the internal control sample through regression calibration. Furthermore, we apply the Saddlepoint approximation [18] and efficient resampling [19] methods to control type I error rates with imbalanced case-control and low minor allele count (MAC) scenarios, respectively.

Materials and Methods

Consider a phenotype with case and control states. We code a case as 1 and a control as 0. Assume that the internal study has sample size $n^{I}$ with $n_{0}^{I}$ controls and $n_{1}^{I}$ cases, where $n_{0}^{I} + n_{1}^{I} = n^{I}$; the external study has $n_{0}^{E}$ controls. For the $i^{th}$ subject, let $y_{i} = 0 / 1$ be the dichotomous phenotype. Denote $G_{1}, G_{2}, \dots, G_{n_{0}^{I}}$; $G_{n_{0}^{I} + 1}, G_{n_{0}^{I} + 2}, \dots, G_{n^{I}}$; and $g_{1}, g_{2}, \dots, g_{n_{0}^{E}}$ as the genotypes of the internal control sample, the internal case sample, and the external control sample at a genetic variant, respectively, with each genotype indicating the number of copies of the minor allele carried by the subject at that genetic variant. Let $X_{i}^{I}$ denote the first $p$ principal components of the internal genotypes, and $X_{i}^{E}$ the first $p$ principal components of the external genotypes, for the $i^{th}$ subject.

Motivated by the novel method iECAT-Score [20], we propose a new method that integrates external controls into association tests to boost the statistical power. Our proposed method involves three steps: Step 1, adjusting the genotypes of external controls using regression calibration; Step 2, conducting the single-variant association test; and Step 3, calibrating the single-variant test using the Saddlepoint approximation (SPA) [18] and efficient resampling (ER) [19] methods, particularly to address scenarios of case-control imbalance and low minor allele count (MAC). By following these three steps, the iECAT-RC method reduces the impact of batch effects and improves the power of association testing.

Step 1. Adjusting the genotypes of external controls by regression calibration

To adjust the genotype of external control samples for the batch effect, we propose to use the following procedure:

1). Without loss of generality, we assume $n_{0}^{E} \geq n_{0}^{I}$. We randomly choose $n_{0}^{I}$ individuals with genotypes $g_{k 1}, \dots, g_{k n_{0}^{I}}$ from the external control samples.

2). We assume a linear regression model

$$G_{i} = \beta_{0}^{(k)} + \beta_{1}^{(k)} g_{k i} + \alpha_{I}^{(k)} X_{i}^{I} + \alpha_{E}^{(k)} X_{k i}^{E}$$

for $i = 1, \dots, n_{0}^{I}$, and let $\hat{\beta}^{(k)} = (\hat{\beta}_{0}^{(k)}, \hat{\beta}_{1}^{(k)}, \hat{\alpha}_{I}^{(k)}, \hat{\alpha}_{E}^{(k)})^{T}$ be the least squares estimate of $\beta^{(k)} = (\beta_{0}^{(k)}, \beta_{1}^{(k)}, \alpha_{I}^{(k)}, \alpha_{E}^{(k)})^{T}$.

3). We repeat 1) and 2) $K$ times. We obtain $\hat{\beta}^{(1)}, \dots, \hat{\beta}^{(K)}$ and calculate the average value $\hat{\beta} = (\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\alpha}_{I}, \hat{\alpha}_{E})^{T} = \sum_{k = 1}^{K} \hat{\beta}^{(k)} / K$. Let

$$G_{n^{I} + i} = \hat{\beta}_{0} + \hat{\beta}_{1} g_{i} + \hat{\alpha}_{I} X_{i}^{I} + \hat{\alpha}_{E} X_{i}^{E}$$

for $i = 1, \dots, n_{0}^{I}$. These continuous values are then mapped back to genotype counts. When $G_{n^{I} + i} < a_{0}$, we let $G_{n^{I} + i}$ take 0, where $a_{0}$ is determined such that the frequency of 0 in the internal control genotypes equals the frequency of 0 in $G_{n^{I} + i}$ for $i = 1, \dots, n_{0}^{I}$. When $a_{0} \leq G_{n^{I} + i} < a_{1}$, we let $G_{n^{I} + i}$ take 1, where $a_{1}$ is determined such that the frequency of 1 in the internal control genotypes equals the frequency of 1 in $G_{n^{I} + i}$ for $i = 1, \dots, n_{0}^{I}$. When $G_{n^{I} + i} \geq a_{1}$, we let $G_{n^{I} + i}$ take 2.

We repeat the above procedure until we obtain $G_{n^{I} + i}$ for $i = 1, \dots, n_{0}^{E}$. Then we perform the association test based on the internal case-control data and the external control data with genotypes $G_{1}, G_{2}, \dots, G_{n_{0}^{I}}, G_{n_{0}^{I} + 1}, G_{n_{0}^{I} + 2}, \dots, G_{n^{I}}, G_{n^{I} + 1}, \dots, G_{n^{I} + n_{0}^{E}}$.
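The calibration procedure of Step 1 can be illustrated with a minimal Python sketch using only the standard library. It is not the authors' implementation. It assumes a single principal component per study ($p = 1$), pairs the $i^{th}$ internal control with the $i^{th}$ randomly drawn external control, and realizes the thresholds $a_{0}$ and $a_{1}$ by ranking the calibrated values (ties broken arbitrarily). All function names and the toy data are hypothetical.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve_linear(XtX, Xty)

def averaged_coefficients(G_int, X_int, g_ext, X_ext, K, rng):
    """Items 1)-3): average the least squares fit over K random external subsets."""
    n0 = len(G_int)
    total = [0.0] * 4
    for _ in range(K):
        idx = rng.sample(range(len(g_ext)), n0)  # item 1): random external subset
        rows = [[1.0, g_ext[j], X_int[i], X_ext[j]] for i, j in enumerate(idx)]
        beta_k = ols(rows, G_int)                # item 2): regress G_i on g_ki and PCs
        total = [t + b for t, b in zip(total, beta_k)]
    return [t / K for t in total]                # item 3): average over the K fits

def discretize(values, G_int):
    """Map calibrated values to 0/1/2 so that the genotype frequencies match
    those of the internal controls (rank-based version of a_0 and a_1)."""
    n = len(values)
    n_zero = round(n * G_int.count(0) / len(G_int))
    n_one = round(n * G_int.count(1) / len(G_int))
    order = sorted(range(n), key=lambda i: values[i])
    out = [2] * n
    for rank, i in enumerate(order):
        if rank < n_zero:
            out[i] = 0
        elif rank < n_zero + n_one:
            out[i] = 1
    return out

# Toy data (hypothetical): 60 internal controls, 150 external controls.
rng = random.Random(1)
n0_int, n0_ext = 60, 150
G_int = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n0_int)]
X_int = [rng.gauss(0, 1) for _ in range(n0_int)]
g_ext = [sum(rng.random() < 0.4 for _ in range(2)) for _ in range(n0_ext)]
X_ext = [rng.gauss(0, 1) for _ in range(n0_ext)]

beta_hat = averaged_coefficients(G_int, X_int, g_ext, X_ext, K=10, rng=rng)
# Calibrate the first batch of n0_int external controls.
raw = [beta_hat[0] + beta_hat[1] * g_ext[i] + beta_hat[2] * X_int[i]
       + beta_hat[3] * X_ext[i] for i in range(n0_int)]
adjusted = discretize(raw, G_int)
```

In this sketch the remaining external controls would be processed in further batches of size $n_{0}^{I}$ in the same way, which is one reading of the repetition described above.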

Step 2. Single-variant association test

We combine the internal samples and the external control samples with the adjusted genotypes. Let $G = (G_{1}, G_{2}, \dots, G_{n})^{T}$ be the vector of genotypes at a variant for $n$ subjects, where $n = n^{I} + n_{0}^{E}$. Assume that there are $p$ covariates. We relate the phenotype $Y_{i}$ to the covariates $Z_{i}$ and the genotype $G_{i}$ using the logistic regression model

$$\mathrm{logit} [P (Y_{i} = 1 \mid Z_{i}, G_{i})] = Z_{i}^{T} \alpha + G_{i} \beta,$$

where the phenotype $Y_{i}$ follows a Bernoulli distribution. In this equation, $\alpha$ is a $p \times 1$ vector of coefficients for the $p$ covariates including the intercept, and $\beta$ is the genotype effect at the variant. Assessing whether an association exists between the phenotype and the genotype at a variant is equivalent to testing $H_{0} : \beta = 0$.

Let $\mu = \{\mu_{i}\} = \{P (Y_{i} = 1 \mid Z_{i})\}$ and let $\hat{\mu}_{i}$ be the maximum-likelihood estimate of $\mu_{i}$ under $H_{0}$. In the score test, the score is given by $S = \tilde{G}^{T} (Y - \hat{\mu})$, where $Y = (Y_{1}, \dots, Y_{n})^{T}$, $\tilde{G} = \{\tilde{G}_{i}\} = G - Z (Z^{T} V Z)^{- 1} Z^{T} V G$ is the covariate-adjusted genotype vector, and $V = \mathrm{diag} \{\hat{\mu}_{i} (1 - \hat{\mu}_{i})\}$ [2]. Under the null hypothesis of no genetic effect, $E (S) = 0$ and $\mathrm{Var} (S) = \sum_{i = 1}^{n} \tilde{G}_{i}^{2} \hat{\mu}_{i} (1 - \hat{\mu}_{i})$. Then the score test statistic $T_{Score} = S^{2} / \mathrm{Var} (S)$ asymptotically follows the chi-square distribution with 1 degree of freedom, and the p-value can be obtained as $p = P (\chi_{1}^{2} > S^{2} / \mathrm{Var} (S))$.
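As an illustration of the score test, the following Python sketch evaluates $S$, $\mathrm{Var}(S)$, $T_{Score}$, and the asymptotic p-value in the simplest special case where the only covariate is the intercept. That restriction is an assumption made here to keep the example short: it gives $\hat{\mu}_{i} = \bar{Y}$ for every subject and $\tilde{G}_{i} = G_{i} - \bar{G}$. With further covariates, $\hat{\mu}$ would instead come from a fitted null logistic regression. The function name is hypothetical.

```python
import math

def score_test_intercept_only(G, Y):
    """Score test of H0: beta = 0 when Z contains only the intercept."""
    n = len(Y)
    mu_hat = sum(Y) / n                      # MLE of P(Y = 1) under H0
    g_bar = sum(G) / n
    G_tilde = [g - g_bar for g in G]         # covariate-adjusted genotypes
    S = sum(gt * (y - mu_hat) for gt, y in zip(G_tilde, Y))
    var_S = sum(gt * gt for gt in G_tilde) * mu_hat * (1.0 - mu_hat)
    T = S * S / var_S
    # For one degree of freedom, P(chi2_1 > T) = erfc(sqrt(T / 2)).
    p_value = math.erfc(math.sqrt(T / 2.0))
    return S, var_S, T, p_value

# Toy data: eight subjects, three of them cases.
G = [0, 1, 2, 0, 1, 2, 0, 0]
Y = [0, 0, 1, 0, 1, 1, 0, 0]
S, var_S, T, p_value = score_test_intercept_only(G, Y)
```

For these toy data, $S = 2.75$ and $\mathrm{Var}(S) = 5.5 \times 0.375 \times 0.625 = 1.2890625$, so $T_{Score} = 968 / 165 \approx 5.87$.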

Step 3. Calibrating single-variant test using SPA and ER methods

The single-variant score test approximates the null distribution of the test statistic by a normal distribution. The variance estimate based on this asymptotic approximation behaves well for common variants and balanced case-control studies. When the allele frequency is extremely low, resulting in a low MAC, or when the case-control ratio is unbalanced, the underlying distribution of the test statistic can be highly skewed. In such cases, the traditional asymptotic score test performs poorly, with conservative or anticonservative results [21,22].

To account for scenarios with an unbalanced case-control ratio, we apply the SPA method to obtain the p-value when the score lies far from its mean of zero [18]. When the MAC is low ($\mathrm{MAC} < 10$) in the internal sample, the combined sample, or the external sample, we apply the ER method to obtain the p-values [19].

1). SPA method

SPA is an improvement over the normal approximation, which only uses the mean and variance to approximate the underlying distribution. SPA uses the entire cumulant-generating function (CGF). Given the score test statistic $S = \sum_{i = 1}^{n} \tilde{G}_{i} (Y_{i} - \hat{\mu}_{i})$, the estimated CGF of $S$ is

$$K (t) = \log (E_{H_{0}} (e^{t S})) = \sum_{i = 1}^{n} \log (1 - \hat{\mu}_{i} + \hat{\mu}_{i} e^{\tilde{G}_{i} t}) - t \sum_{i = 1}^{n} \tilde{G}_{i} \hat{\mu}_{i}.$$

According to the SPA method, the distribution of $S$ can be estimated by

$$\Pr (S < s) \approx \tilde{F} (s) = \Phi \left\{ w + \frac{1}{w} \log \left( \frac{v}{w} \right) \right\},$$

where $w = \mathrm{sgn} (\hat{t}) \sqrt{2 (\hat{t} s - K (\hat{t}))}$, $v = \hat{t} \sqrt{K'' (\hat{t})}$, $K' (t)$ and $K'' (t)$ are the estimated first- and second-order derivatives of $K$, $\hat{t}$ is the solution to the equation $K' (\hat{t}) = s$, and $\Phi$ is the cumulative distribution function of the standard normal distribution. The p-value can be obtained using the R package SPAtest.

2). ER method

The ER method is used for rare variant association tests with a binary trait. Given phenotypes $Y$, genotypes $G$, and covariates $Z$, the p-value of the ER method is defined as

$$\Pr (Q \geq \hat{Q} \mid Y, G, Z) = \sum_{d = 0}^{m} \Pr (Q \geq \hat{Q} \mid D = d, Y, G, Z) \Pr (D = d \mid Y, G, Z),$$

where $\hat{Q}$ is the score test statistic from the original phenotype, $m$ is the number of individuals carrying a minor allele, and $D$ is the number of cases among the $m$ individuals carrying a minor allele. The p-value can be obtained using the R package SKAT.

Simulations

To evaluate the performance of the proposed method iECAT-RC with respect to type I error rates and power, we carry out simulation studies under a series of scenarios. We generate the binary phenotypes with cases and controls from a logistic regression model:

$$\mathrm{logit} [P (Y = 1 \mid Z, G)] = \alpha_{0} + 0.5 Z_{1} + 0.5 Z_{2} + \beta G + \varepsilon,$$

where $Z_{1}$ is a continuous covariate generated from the standard normal distribution; $Z_{2}$ is a binary covariate taking values $0$ and $1$ with probability $0.5$; $\alpha_{0}$ is chosen such that the disease prevalence is $0.05$; $G$ is the genotype at a variant generated from a binomial distribution $\mathrm{Bin} (2, \mathrm{MAF})$; $\beta$ is the effect size of the variant; and $\varepsilon$ follows a standard normal distribution. The $\mathrm{MAF}$ is sampled from the empirical Mini-Exome genotype data provided by the GAW17, which includes 24,487 variants in 3205 genes, as introduced in Sha et al. [2].
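The data-generating model above can be sketched in Python as follows. This is a hedged illustration rather than the simulation code used in the study: the intercept $\alpha_{0}$ is found by bisection so that the average case probability in the generated population is close to $0.05$, which is one way to read "chosen such that the disease prevalence is $0.05$". The MAF is passed in as a fixed number instead of being sampled from the GAW17 data, and the function names are hypothetical.

```python
import math
import random

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_population(n, maf, beta, prevalence=0.05, seed=0):
    """Generate covariates, genotypes and binary phenotypes for n subjects."""
    rng = random.Random(seed)
    Z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]             # continuous covariate
    Z2 = [1 if rng.random() < 0.5 else 0 for _ in range(n)]  # binary covariate
    G = [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]  # Bin(2, MAF)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    lin = [0.5 * z1 + 0.5 * z2 + beta * g + e
           for z1, z2, g, e in zip(Z1, Z2, G, eps)]

    # Bisection on alpha_0: the mean case probability increases with alpha_0.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        mean_prob = sum(expit(mid + l) for l in lin) / n
        if mean_prob < prevalence:
            lo = mid
        else:
            hi = mid
    alpha0 = 0.5 * (lo + hi)

    Y = [1 if rng.random() < expit(alpha0 + l) else 0 for l in lin]
    return alpha0, Z1, Z2, G, Y

# Toy run under the null hypothesis (beta = 0) with a hypothetical MAF of 0.1.
alpha0, Z1, Z2, G, Y = simulate_population(n=5000, maf=0.1, beta=0.0, seed=42)
observed_prevalence = sum(Y) / len(Y)
```

Internal cases and controls at a target ratio would then be drawn from such a population by sampling subjects with $Y = 1$ and $Y = 0$ separately.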

To mimic the batch effect between internal and external control studies, we first define the differential variant size (DVS), that is, the proportion of the variants subject to different MAFs between the internal and external control samples. For such variants, we set the MAFs of the external controls according to the following two scenarios to mimic the level of batch effect: (1) randomly generated from $\mathrm{Uniform} (0.1 q, 4 q)$ and (2) $2 q$, where $q$ is the MAF of the corresponding variant in the internal sample. Subsequently, we consider different numbers of cases and controls in the internal sample and different numbers of external controls. We set the following three ratios between the internal cases, internal controls, and external controls $(n_{1}^{I} : n_{0}^{I} : n_{0}^{E})$: (1) $5000 : 5000 : 10000$, (2) $6667 : 3333 : 10000$, and (3) $500 : 5000 : 10000$. Thus, we consider a total of six models. Model 1: the ratio $(n_{1}^{I} : n_{0}^{I} : n_{0}^{E})$ is $5000 : 5000 : 10000$ and the MAF of the external sample is $2 q$; Model 2: the ratio is $6667 : 3333 : 10000$