1. Introduction
The identification of gene-environment (G×E) interactions for complex traits has been a longstanding challenge in genetic studies. Ottman (1996) defined G×E interaction as “different effect of a genotype on disease risk in persons with different environmental exposures." Traditionally, G×E interactions have been studied using a single environmental factor, as incorporating multiple environmental factors exponentially increases model complexity. This can lead to biased estimates and large standard errors, a phenomenon known as the curse of dimensionality. Moreover, an increasing number of epidemiological studies have shown that disease risk can be influenced by simultaneous exposure to multiple environmental factors (Carpenter et al., 2002; Sexton and Hattis, 2007). However, little is known about how multiple environmental factors, when considered collectively, interact with genetic factors to affect disease outcomes. Investigating this area could provide valuable insights and inform strategies for future disease prevention.
Guan et al. (2023) proposed a single-index varying-coefficients model of the form
to address non-linear gene-environment interactions. However, this model is limited to continuous phenotypes. In this paper, we generalize the model to handle binary phenotype data, given that many complex human diseases are binary traits in nature. Specifically, we propose a generalized single-index varying-coefficients model (gSIVCM):
where
represents the unknown gene effect of
, modeled as a non-parametric function modulated by its loading index
. Here,
X represents
q-dimensional environmental exposures, and
captures the joint effect of
q environmental factors. The effect of these factors on the response is modeled through the
function. In this model, we allow the effect of
on the risk of
Y to vary across multiple
X through the
function.
If , then has no effect on Y. If , where is a constant, the effect of on Y is constant, and there is no interaction between and the mixture of X. If , we conclude that interacts with the environmental mixture, and the form of the interaction effect is captured by the non-parametric function .
The unique structure of gSIVCM enables us to capture how a mixture of multiple environmental factors interacts with genetic factors. Moreover, model (
1) is highly flexible and can accommodate a wide range of models. For example, when
and
, it reduces to a generalized varying-coefficients model; when
and
, it becomes a standard generalized single-index model.
When both the number of gene variables and the number of environmental variables
X are large, the complexity of model (
1) introduces unique challenges for variable selection, particularly due to the nonlinear, non-parametric structure of the function
and its unknown loading parameter
. Guan et al. (2023) recently proposed a three-step variable selection approach for single-index varying-coefficients model (SIVCM), which classifies the non-parametric gene effect into varying, constant, and zero effects, while also identifying non-zero loading parameters. In this paper, we extend this variable selection approach to gSIVCM. Instead of penalizing the squared error loss as in SIVCM, we implement a penalized log-likelihood method tailored to our model.
Variable selection has been a central topic in modern statistical research. The core idea is to add a penalty term to the optimization function. Different choices of penalty functions yield estimators with varying properties. Fan and Li (2001) introduced three key criteria for a penalized estimator: sparsity, unbiasedness, and continuity. They also characterized the oracle property, which ensures that the model performs as well as if the true sub-model were known in advance. This has become a benchmark for evaluating new penalized estimators. Examples of penalized functions include Bridge regression (Frank and Friedman, 1993), the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), Adaptive LASSO (Zou, 2006), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li, 2001), and Minimax Concave Penalty (MCP) (Zhang, 2010). While LASSO is well-regarded for its simple formulation and efficient algorithm (e.g., LARS), it does not possess the oracle property. In contrast, Adaptive LASSO, SCAD, and MCP all satisfy the oracle property. For our model, we chose the MCP penalty due to its desirable theoretical and practical properties.
The rest of the paper is organized as follows:
Section 2 introduces the proposed variable selection method, including the formulation of the penalized log-likelihood, the iterative optimization procedure, and the selection of tuning parameters and initial values for
.
Section 3 discusses the asymptotic properties of the proposed method. In
Section 4, we evaluate the performance of our method through several finite-sample simulations. Finally, in
Section 5, we apply our method to a Type 2 diabetes dataset, followed by conclusion and discussion.
2. Statistical Method
Throughout the paper, superscript T is used to denote matrix transpose, is used to denote norm, is used to denote the natural logarithm of a. For the sake of simplicity, we use constant and non-zero constant interchangeably.
2.1. Model Setup
The generalized single-index varying-coefficients model for the binary case is set up as follows:
where
represents the binary response variable, and
n denotes the sample size. The set
comprises
unknown continuous functions.
contains environmental variables, and
represents the loading parameters of the model.
where
and
is a continuous or discrete vector of length
n for
. Consequently, for
,
represents the effect of
on
Y on a log-odds scale, while
serves as the intercept term.
For simplicity of notation, we denote
. Thus, model (
1) can be rewritten as:
2.2. Estimation Method
Our goal is to estimate and select unknown functions and unknown loading parameter . For the sake of identifiability, we assume , , and that cannot take the form .
We first approximate the non-parametric function with B-spline basis functions. Without loss of generality, we assume , denote K as the number of internal knots and h as the degree of the B-spline basis function. For instance, indicates linear splines, represents quadratic splines, and represents cubic splines. From standard B-spline theory, let be internal knots satisfying , and let be the left-closed, right-open interval for , and the closed interval . Let be a collection of functions f defined on satisfying: (i) the restriction of f to is a polynomial of degree h or less for ; (ii) f is times continuously differentiable on .
Let
. Then, by Schumaker (1981), we have a normalized B-spline basis function
for
. There exists a linear transformation matrix
such that:
where each component of
is a function of
u. Then, for
, we approximate
by:
where
and
. With the B-spline approximation,
in (
2) can be approximated by
where:
Hence, the selection of the non-parametric function
is transformed to the selection of its B-spline coefficients
. Note that the transformation
allows us to separate the constant effect of
from its varying effect on the response. More specifically:
If , then is a varying effect predictor.
If and , then is a constant effect predictor.
If and , then has no effect on the response.
Given the binary response, we adopt the penalized log-likelihood approach, and the log-likelihood function is defined as:
where
is the
ith subject of
. The penalized log-likelihood objective function is then defined as:
where
are penalty functions for
,
, and
, respectively. The indicator function
equals 1 if the condition in the parentheses is satisfied, and 0 otherwise.
Note that the construction of the penalty term in function (
4) reflects an “order" in selecting the effect of
: first, the model determines whether
has a varying effect; if not, it then decides whether
has a non-zero constant effect or no effect at all. Furthermore, we do not penalize the intercept term
or
due to model constraints.
For the penalty function, we use the MCP penalty proposed by Zhang (2010):
with regularization parameters
and
.
2.3. The Estimation Step
To optimize function (
4), we follow the approach proposed by Guan et al. (2023) and adopt a three-step iterative method.
Step 1: Given a preliminary estimator of
, denoted by
, we obtain the first-step estimation of
, denoted by
, via a group penalized regression:
where
Step 1 classifies
into two categories: varying (V) or non-varying (NV). Specifically,
if
, and
if
.
Step 2: In Step 2, we refine the selection of for functions in the non-varying category identified in Step 1. Specifically, we select the non-zero constant effects and classify the non-parametric functions into constant (C) and zero (0). This is achieved by penalizing only when for . No penalty is applied to . Additionally, is excluded from the model when , i.e., .
The Step 2 estimator
is obtained via penalized regression:
where
and
is the
ith element of
, defined as:
After Steps 1 and 2, we obtain the estimator based on and classify for into V, C, or 0. The next step is to estimate the loading parameter given .
Step 3: The estimator
is obtained via penalized regression:
where
and
is the
ith element of
, defined as:
Finally, set and iterate Steps 1 to 3 until convergence.
Remark: For Steps 1 and 2, we use the block coordinate descent algorithm for group penalties. For Step 3, we employ the local quadratic approximation (LQA) algorithm proposed by Fan and Li (2001). For further details, readers are referred to the Appendix. This iterative approach requires selecting tuning parameters , the order h, and the number of internal knots K for the B-spline approximation, as well as an appropriate initial value for .
2.4. Selection of Tuning Parameters
2.4.1. Selection of Tuning Parameters
We propose using the Bayesian Information Criterion (BIC) (Schwarz, 1978) to select the tuning parameters.
Step 1: We select
as the minimizer of
where
is the minimizer of
defined above,
is chosen as the estimator from the previous iteration, and
is the total number of non-zero coefficients when
is the penalized parameter.
Step 2: We select
as the minimizer of
where
is the minimizer of
defined above,
is chosen as the estimator from the previous iteration, and
is the total number of non-zero coefficients when
is the penalized parameter.
Step 3: We select
as the minimizer of
where
is the minimizer of
,
is the minimizer of the B-spline coefficients from Step 2, and
is the total number of non-zero
coefficients when
is the penalized parameter.
The parameters are searched over a grid of exponentially decreasing values, with a minimum value of and the maximum value set such that all penalized estimators are zero. We use 100 tuning parameters in the search.
2.4.2. Selection of Order h and Number of Internal Knots K
Since h represents the order of the B-spline basis function, higher degrees introduce more complex interactions and collinearity between environmental factors and genetic predictors. We search for the optimal h in the set . For K, only when (where n is the sample size, r is the smoothness of , and ), the selection approach achieves oracle properties. We therefore search for the optimal K in the neighborhood of , denoted by . In our simulations, .
We fit the following intercept-only model using the B-spline approximation:
Denote its estimator as
and
. Let
. The optimal
K and
h are selected as the values minimizing
For the iterative algorithm proposed above, we require a reasonable initial value for
(denoted by
) to begin. We obtain
by fitting model (
5) with the selected
K and
h.
2.5. Theoretical Properties
We study the properties of the penalized likelihood estimator. Let and represent the true values of and , respectively, and let denote the true value of the B-spline coefficients . Assume without loss of generality that for , for , is varying for , non-zero constant for , and zero for . The following theorem establishes the consistency of the penalized least square estimators.
Theorem 1. Assume that the regularity conditions (A1)–(A7) in the Appendix hold, and that the number of knots satisfies . Then:
-
(i)
;
-
(ii)
,
where , r is defined in condition (A2) of the supplemental material, and denotes the first derivative of the penalty function . Furthermore, under additional regularity conditions outlined in the Appendix, we can show that the estimator possesses sparsity.
Theorem 2. Assume the regularity conditions (A1)–(A7) in the Appendix hold and . Let . If and as , then with probability approaching 1:
- (i)
for ;
- (ii)
for , where is some non-zero constant;
- (iii)
for .
Theorems 1 and 2 indicate that our penalized likelihood estimator is consistent and possesses oracle properties. The proofs are given in the Appendix.
3. Simulation
We evaluated the performance of our model via finite-sample simulations. The performance is assessed in the following ways: (1) classification accuracy of , denoted as the oracle percentage; (2) IMSE of the estimated m-function; (3) selection accuracy of ; and (4) estimation accuracy of (MSE). A total of R simulations is conducted for all cases.
The oracle percentage of is defined as the proportion of correct classifications across R simulations. For instance, if is a varying function and it is classified as varying in g simulations, then the oracle percentage for is .
The IMSE of
is given by
where
is the number of points for estimating the MSE of the predicted function;
and
are estimators of the B-spline coefficients for the
r-th simulation using the proposed approach;
is the estimator of the loading parameter
for the
r-th simulation; and
corresponds to the
quantile within the range of
. In our simulations,
is set to 100.
The oracle percentage of is defined as the proportion of correct selections of across R simulations. For example, if and is selected as non-zero in g simulations, then the oracle percentage for is .
The MSE of
is computed as
where
is the estimator for
in the
r-th simulation. This represents the average MSE for
.
The simulation data was generated based on the model (
1). The environmental variable
X was drawn from a
distribution. For the loading parameter
, we set
and all other
were set to zero. We evaluated the performance of the proposed approach with both continuous (e.g., gene expressions) and discrete (e.g., SNPs) predictors
G.
3.1. Continuous G
In the continuous case, the non-parametric functions
were set as:
The predictors
G were generated from
distribution. We conducted
simulations to evaluate the model’s performance under
,
and
.
Table 1 presents the selection and estimation accuracy for
with continuous predictors. Across all cases, the selection accuracy was close to
for varying, constant, and zero-effect coefficients. The IMSE of the proposed model was in the order of
or
for varying and constant effect predictors. As the model dimension
p increased (from 50 to 100), a slight increase in the model IMSE was observed. Conversely, as the sample size
n increased (from 1000 to 2000), both the model IMSE and the oracle IMSE decreased, aligning with the asymptotic properties of the proposed model. These results suggest that the proposed variable selection approach performs well in both selection and estimation accuracy for the non-parametric functions
.
Table 2 presents the selection and estimation accuracy for the loading parameter
. In all scenarios, the selection accuracy for non-zero loading parameters (
) was nearly
, with MSE values in the order of
to
. For zero loading parameters (
), the selection accuracy was approximately
when
. As the sample size increased to
, the oracle percentages improved to
, with MSE values in the order of
to
. Overall, the MSEs improve as the sample size increases, and the model MSE is very close to the oracle MSE, indicating strong estimation performance for the loading parameters.
3.2. Discrete G
We extended our evaluation to examine the performance of the proposed model with discrete predictors, G. Single nucleotide polymorphism (SNP) data is one of the most commonly used types of genetic data. SNPs take values of 0, 1, and 2, representing the genotypes aa, Aa, and AA, respectively. Additionally, SNPs exhibit a wide range of minor allele frequencies (MAF), making it crucial for our simulations to incorporate these characteristics.
To reflect these properties,
G was simulated using the following probability distribution function:
where
denotes the
jth predictor for the
ith subject, with
and
, and
is the MAF of the minor allele A.
The gene effect functions,
, were set in
Table 3.
In this setup, predictors
exhibit varying and constant effects with
. The purpose of setting varying MAFs is to check the selection and estimation performance under different MAFs. For zero-effect predictors
, their
values are uniformly distributed within the range
. The environmental variables
X was generated from a unif
distribution. Finally,
Y was generated according to the model specified in (
1). We evaluated the performance of the proposed model through
simulations, considering
,
, and
.
Table 4 presents the selection and estimation accuracy of the non-parametric function
for discrete
G. We observed that the sample size
n and MAF (
) of
were the primary factors influencing the performance of the proposed model. To better visualize their impact, we present
Figure 1.
As the sample size increased (from 1000 to 2000), the model’s performance improved significantly. For instance, the oracle percentages for increased from approximately to nearly , and the corresponding IMSE decreased substantially. These results align with the asymptotic theory of the proposed model. Conversely, as the MAF for decreased (from 0.5 to 0.1), we observed a decline in performance, reflected in both oracle percentages and model IMSE. For example, in the case where , the oracle percentages for (), (), and () were approximately , , and , respectively. The IMSE increased correspondingly, from 0.4 () to 0.5 () and then to 1.3 (). This is a common phenomenon in genetic studies as smaller MAFs provide less data information to estimate the corresponding function, leading to poor estiamtion and selection performance.
Table 5 demonstrates the selection and estimation results for the loading parameter
. We observed that the sample size
n was the primary factor influencing model performance. When the sample size was large (
), the oracle percentage for non-zero loading covariates (
) was
, and the oracle percentage for zero loading covariates (
) was approximately
. The MSE for
was in the order of
to
. In contrast, when the sample size was relatively small (
), the oracle percentage for non-zero loading covariates remained at
, but the oracle percentage for zero loading parameters decreased to around
. Comparing the cases where
and
, we saw a slight reduction in selection accuracy for zero loading parameters when
. This is expected, as model performance typically declines with increased model complexity. As the sample size increased to 2000, this difference in performance was not significant.
Based on simulation results with both continuous and discrete G variables, we observed the following key findings about the proposed model. First, the model demonstrates robust performance with large sample sizes ( or 2000). Second, when , the false positive rate for the loading parameter was around 5%. This may reflect an inherent limitation of the LQA algorithm that cannot shrink zero parameters to exactly zero when the sample is small. Third, the model exhibits superior performance for SNP variants with larger MAFs (e.g., or 0.5) compared to SNP variants with smaller MAF (). This enhanced performance likely stems from the more data information content provided by SNPs with higher MAFs. Similar phenomenon is commonly observed in other genetic association studies. Overall, the simulation results demonstrate that the method performs reasonably well under finite sample conditions.
4. A Case Study
We demonstrated the applicability of our proposed model using a type 2 diabetes dataset containing genotypes (SNPs), environmental factors, and the phenotype (presence of type 2 diabetes). This dataset comprises two nested case-control cohort studies: the Nurses’ Health Study (NHS) and the Health Professionals Follow-Up Study (HPFS) from the Gene-Environment Association Studies Consortium (GENVEA). Detailed descriptions of these cohorts can be found in Colditz and Hankinson (2005) and Rimm et al. (1991). Initially, the dataset included 3,391 females (NHS) and 2,599 males (HPFS).
After data cleaning, which involved removing subjects with mismatched genotypes and phenotypes, SNPs with more than missing data, SNPs with MAF, and SNPs deviating from Hardy-Weinberg equilibrium (p-value ), the final dataset included 5,865 subjects (2,494 males and 3,371 females), with 2,733 cases and 3,132 controls, and 655,002 SNPs. The dataset also contained 12 continuous environmental factors, such as height, weight, age, and alcohol consumption. We fit a marginal logistic regression model for all 12 factors and, selected five environmental factors for the analysis: total physical activity (, denoted as act), BMI (), alcohol intake (, denoted as alcohol), heme iron intake (, denoted as heme), and glycemic load (, denoted as gl).
Based on SNP locations, we mapped all SNPs to known genes and selected genes containing more than 30 SNPs, resulting in a total of 2,178 genes. For each gene, we applied the proposed variable selection approach to identify significant SNPs and their effects. To ensure identifiability, the first element of the loading parameter was constrained to be a non-zero positive value. We fit the proposed model five times for each gene, varying the environmental factor used as the first element each time. An SNP was deemed significant if it was identified as either varying or constant across all environmental factor orders, indicating a strong signal.
In total, our model identified 13 varying-effect SNPs and 26 constant-effect SNPs. Here, we present one of the selected varying-effect SNPs as an example. Please refer to
Table A1 and
Table A2 in the Appendix for the complete list of selected varying- and constant-effect SNPs. Previous studies (Sale et al., 2007; Grant et al., 2006) have suggested that the gene
TCF7L2 is associated with type 2 diabetes across multiple populations. Specifically, Sale et al. (2007) reported a strong association between type 2 diabetes and SNPs rs7903146 and rs7901695 within this gene. Consistent with these findings, our model identified SNP rs7901695 as a constant-effect predictor, indicating no interaction with the five environmental factors (Note that SNP rs7903146 was not observed in this dataset).
Figure 2 shows plots of the marginal total environmental effect and the interaction effect of the SNP rs6537663 in gene
TCF7L2, with heme iron intake as the first loading covariate. As the index increases, we observed that the marginal effect initially decreases, then increases, and subsequently exhibits a rapid decrease as the total effect of the five environmental factors increases. For the interaction effect, it fluctuates around 0 as the total effect of the five environmental variables increases, indicating that the SNP is unresponsive (or insensitive) to changes in these variables. However, as the index
continues to increase, the SNP reacts to environmental changes, with a dramatic increase in risk for type 2 diabetes beyond a certain threshold. This estimated effect suggests that the genetic sensitivity of the SNP to the total effect of the five environmental variables follows a threshold model. This finding has practical implications: most individuals tolerate daily environmental changes, including dietary variations, without adverse effects. However, when such changes exceed a certain limit, the risk of disease may increase.
Table 6 presents the selection and estimation results of the loading parameters
, with heme iron intake as the first loading covariate. The model selects all loading parameters except for alcohol consumption (alcohol). Notably, we observed that body mass index (BMI) has the largest effect, which aligns with practical knowledge, as BMI is positively associated with type 2 diabetes and is a well-established risk factor for the disease. Notably, the sign of the loading parameters aligns with known associations in the literature. For example, Hu et al. (1999) reported that higher physical activity is linked to a lower risk of type 2 diabetes, while Field et al. (2001) demonstrated that high BMI increases the risk of type 2 diabetes. Additionally, Rajpathak et al. (2009) found that high heme iron intake is associated with an increased risk of type 2 diabetes, and Salmerón et al. (1997) showed that a high glycemic load is linked to a higher risk of type 2 diabetes. The signs of the estimated loading coefficients are consistent with these established findings.
5. Discussion
G×E interactions have been extensively studied in the literature, leading to the development of numerous statistical models. In this paper, we propose a three-step iterative variable selection approach for the generalized single-index varying-coefficients model (gSIVCM) with a binary response. Our goal is to identify varying, constant, and zero-effect genes, as well as to select non-zero environmental factors that interact with varying-effect genes. Biologically, our approach is attractive as it provides a novel perspective on G×E interactions. The flexibility of our model allows for the detection of non-linear interactions, making it particularly suitable when gene effects are non-linearly influenced by simultaneous exposure to multiple environmental factors. Statistically, gSIVCM reduces the dimensionality of the model by treating multiple indices X as a single index, which significantly alleviates the curse of dimensionality when multiple environmental factors interact with gene effects.
Our work builds upon the framework introduced by Guan et al. (2023) with a three-step variable selection approach for SIVCM with continuous responses. We extended our previous work to binary responses, broadening its applicability to a wider range of biological studies. For continuous responses, researchers are often interested in specific quantiles of the response, such as birth weight, rather than the mean. It is a natural progression to extend our variable selection approach to quantile regression settings, which we plan to explore in future studies.
Among the listed genes in
Table A1 in the appendix,
ABCA1 and
NTRK2 show strong evidence of association with type 2 diabetes (T2D), with
ABCA1 influencing lipid metabolism and insulin secretion (Kruit et al., 2012; Singh et al., 2022) and
NTRK2 regulating energy balance and glucose metabolism via BDNF signaling (Xu et al., 2019). Genes such as
GALNT2,
PTK2B, and
RBM15-AS1 are linked to metabolic pathways or inflammation, indirectly influencing T2D risk (Weissglas-Volkov et al., 2011; Wang et al., 2021; Wang et al., 2023). Other genes, including
LARS2,
UNC5C, and
SCAI, have limited or emerging evidence, often tied to mitochondrial dysfunction, apoptosis, or cellular stress, highlighting the need for further investigation into their roles in T2D pathogenesis (van Berge et al., 2019; Morrison et al., 2020; Smith et al., 2019). For genes listed in
Table A2,
TCF7L2 stands out as a well-established genetic risk factor for T2D. Variants in
TCF7L2 are strongly associated with impaired insulin secretion and glucose metabolism, contributing significantly to T2D susceptibility (Zhou et al., 2017). Another important gene is
GALNT2, which has been implicated in lipid and glucose metabolism; polymorphisms in this gene are linked to alterations in glycemic traits and may indirectly influence T2D risk (Weissglas-Volkov et al., 2011). For the other genes, though no strong evidence directly linking them to T2D has been identified in the current literature, they may still play roles in metabolic or cellular pathways relevant to diabetes pathophysiology, warranting further investigation.
In this work, we demonstrated the model’s applicability through a gene-based analysis. Our method can also be extended to pathway-based analysis. In the human genome, pathways typically include a diverse array of genes, with each gene containing hundreds to thousands of SNPs. A pathway-based SNP-level analysis can be conducted by modeling SNPs within a pathway as genetic variables. Alternatively, a pathway-based gene-level analysis can be performed by summarizing the genomic information of each gene into a few principal components (PCs) using principal component analysis (PCA) (Jolliffe, 2002) or sparse principal component analysis (sPCA) (Zou et al., 2006). The proposed method can then be applied to select significant PCs within a pathway, facilitating the identification of key interactions between genes and environmental mixtures. By accounting for the genomic structure, this pathway-based approach has the potential to yield more interpretable and biologically meaningful results.
Acknowledgments
This work was supported in part by the National Institutes of Health (NIH) under award number R21HG010073 (to Y. Cui). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Funding support for the GWAS of Gene and Environment Initiatives in Type 2 Diabetes was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01HG004399). The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/gap/throughthedbGaPaccessionnumberphs000091.v2.p1.
Appendix A
Appendix A.1. Computational Algorithm
Step 1 and Step 2 follow directly from the group coordinate descent algorithm shown in the main text, and we omit the details here.
Step 3:
Since this is a penalized likelihood problem, we adopt the local quadratic approximation (LQA) approach proposed by Fan and Li (2001). Let
denote the most recent value of
. By applying Taylor expansion at
, we obtain:
where
and
and
Let
We update
as follows:
The updated value of
is then obtained through standardization:
Finally, we iterate this process until convergence.
Appendix A.2. Proofs of Theorems
Appendix A.2.1. Notations
Let the space of Lipschitz continuous functions for any fixed constant c be denoted as . Define as the space of p-th order smooth functions.
Appendix A.2.2. Some Regularity Conditions
(A1) The density function of the random variable is bounded away from 0 on , where is the compact support of X. There exists a constant such that .
(A2) For , the non-parametric function with .
(A3) .
(A4) The matrix is positive definite, and each element of .
(A5) Define
For
and
, it holds that
as
.
(A7) Let be internal points of , where and . Define , , , and . There exists a constant such that and .
Appendix A.2.3. Proof of Theorem 1
Let , and we have , hence the restriction and is equivalent to . To show the consistency of , it is enough to show the consistency of .
Define , , , and , where and correspond to the B-spline coefficients . Similarly, , where corresponds to .
To establish the consistency of
and
, we need to show that
, ∃ a sufficiently large
C, such that:
where
and
is the log-likelihood function defined above. If (
A1) holds, it implies that with probability at least
, there exists a local maximum within the ball
. Hence, there exists a local maximizer such that
.
Since
for
,
for
, and
for
, we have:
Applying Taylor expansion at
and simplifying the bounds for
, we derive that:
where
is the Fisher information matrix. By standard arguments of likelihood theory,
is of the order
,
is of the order
, and for sufficiently large
C,
dominates
uniformly in
. Furthermore, we have
Since , we conclude that is dominated by .
For
and
, we have:
and
Similarly, and are dominated by . Thus, by choosing a sufficiently large C, we obtain . Therefore, the consistency of the penalized least squares estimator is proven.
Appendix A.2.4. Proof of Theorem 2
For (i), for ease of notation, let
, where
and
. Since
, it follows that
for sufficiently large
n. By Theorem 1, it suffices to show that for
:
and for
, there exists a small constant
such that as
, with probability approaching 1, for
, we have:
We start with:
Expanding
at
using a Taylor expansion, we obtain:
where
lies between
and
. After simplification, we derive:
Since and , the sign of is completely determined by . Thus, we conclude for .
For (ii) & (iii), by applying similar arguments as in part (i), it follows that with probability approaching 1,
for
and
for
. Using
and the fact that
, we have:
where
is a constant, and:
This completes the proof.
Appendix A.3. Additional Real Data Analysis Results
Table A1 lists all the varying-effect SNPs selected, along with the gene IDs to which they were mapped. Similarly,
Table A2 provides a summary of all the constant-effect SNPs selected, along with their corresponding gene IDs.
Table A1.
List of SNPs with varying effects.
Table A1.
List of SNPs with varying effects.
| Gene ID |
Gene Symbol |
SNP ID |
| GeneID:440600 |
RBM15-AS1 |
rs6537663 |
| GeneID:2590 |
GALNT2 |
rs6666516 |
| GeneID:729993 |
SHISA9 |
rs1015431 |
| GeneID:54768 |
HYDIN |
rs4788621 |
| GeneID:117532 |
TMC2 |
rs7509377 |
| GeneID:758 |
MPPED1 |
rs5766384 |
| GeneID:23395 |
LARS2 |
rs4311249 |
| GeneID:647107 |
LINC01192 |
rs2404825 |
| GeneID:8633 |
UNC5C |
rs3775049 |
| GeneID:2185 |
PTK2B |
rs6557991 |
| GeneID:4915 |
NTRK2 |
rs6559870 |
| GeneID:19 |
ABCA1 |
rs4742969 |
| GeneID:286205 |
vSCAI |
rs2416996 |
Table A2.
List of SNPs with constant effect.
Table A2.
List of SNPs with constant effect.
| Gene ID |
Gene Symbol |
SNP ID |
| GeneID:114827 |
FHAD1 |
rs3815792 |
| GeneID:2899 |
GRIK3 |
rs12118788 |
| GeneID:260425 |
MAGI3 |
rs11102660 |
| GeneID:9857 |
CEP350 |
rs2293990 |
| GeneID:2590 |
GALNT2 |
rs9308482 |
| GeneID:6934 |
TCF7L2 |
rs7901695 |
| GeneID:55742 |
PARVA |
rs7101596 |
| GeneID:867 |
CBL |
rs4489755 |
| GeneID:10867 |
TSPAN9 |
rs740771 |
| GeneID:57494 |
RIMKLB |
rs11047510 |
| GeneID:196385 |
DNAH10 |
rs11058132 |
| GeneID:64328 |
XPO4 |
rs1961415 |
| GeneID:23348 |
DOCK9 |
rs7326971 |
| GeneID:23348 |
DOCK9 |
rs7991210 |
| GeneID:57099 |
AVEN |
rs16962542 |
| GeneID:11060 |
WWP2 |
rs16970994 |
| GeneID:25780 |
RASGRP3 |
rs6708570 |
| GeneID:100505498 |
LOC100505498 |
rs6730602 |
| GeneID:117532 |
TMC2 |
rs11696526 |
| GeneID:29780 |
PARVB |
rs5765571 |
| GeneID:25814 |
ATXN10 |
rs713999 |
| GeneID:9620 |
CELSR1 |
rs11090812 |
| GeneID:23429 |
RYBP |
rs17009630 |
| GeneID:80254 |
CEP63 |
rs11710699 |
| GeneID:8633 |
UNC5C |
rs10516957 |
| GeneID:157680 |
VPS13B |
rs1788161 |
References
- Carpenter, D. O., Arcaro, K., & Spink, D. C. (2002). Understanding the human health effects of chemical mixtures. Environmental Health Perspectives, 110(Suppl 1), 25. [CrossRef]
- Colditz, G. A., & Hankinson, S. E. (2005). The Nurses’ Health Study: lifestyle and health among women. Nature Reviews Cancer, 5(5), 388-396. [CrossRef]
- Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456), 1348-1360. [CrossRef]
- Field, A. E., Coakley, E. H., Must, A., Spadano, J. L., Laird, N., Dietz, W. H., Rimm, E., & Colditz, G. A. (2001). Impact of overweight on the risk of developing common chronic diseases during a 10-year period. Archives of Internal Medicine, 161(13), 1581–1586. [CrossRef]
- Frank, L. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109-135. [CrossRef]
- Hu, F. B., Sigal, R. J., Rich-Edwards, J. W., Colditz, G. A., Solomon, C. G., Willett, W. C., & Manson, J. E. (1999). Walking compared with vigorous physical activity and risk of type 2 diabetes in women: a prospective study. Journal of the American Medical Association, 282(15), 1433–1439. [CrossRef]
- Grant, S. F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., ... & Styrkarsdottir, U. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics, 38(3), 320-323. [CrossRef]
- Guan, S., Zhao, M., Cui, Y. (2023). Variable selection for single-index varying-coefficients models with applications to synergistic G×E interactions. Electronic Journal of Statistics, 17(1): 823–857. [CrossRef]
- Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.
- Kruit, J. K., Wijesekara, N., Westwell-Roper, C., Bruce, J., Zhao, M., Kahn, S. E., & Hayden, M. R. (2012). Loss of ATP-binding cassette transporter A1 in mice impairs glucose tolerance and insulin secretion in vivo. Diabetes, 61(12), 3166-3177.
- Morrison, J. L., Gallas, P. R., Zheng, W., Mohtashami, S., & Wong, T. (2020). UNC5C and its relevance to beta-cell apoptosis. Endocrinology, 161(7), bqaa064.
- Ottman, R. (1996). Gene-environment interaction: definitions and study designs. Preventive medicine, 25(6), 764. [CrossRef]
- Rajpathak, S. N., Crandall, J. P., Wylie-Rosett, J., Kabat, G. C., Rohan, T. E., & Hu, F. B. (2009). The role of iron in type 2 diabetes in humans. Biochimica et Biophysica Acta (BBA) - General Subjects, 1790(7), 671–681. [CrossRef]
- Rimm, E. B., Giovannucci, E. L., Willett, W. C., Colditz, G. A., Ascherio, A., Rosner, B., & Stampfer, M. J. (1991). Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet, 338(8765), 464-468. [CrossRef]
- Sale, M. M., Smith, S. G., Mychaleckyj, J. C., Keene, K. L., Langefeld, C. D., Leak, T. S., ... & Freedman, B. I. (2007). Variants of the transcription factor 7-like 2 (TCF7L2) gene are associated with type 2 diabetes in an African-American population enriched for nephropathy. Diabetes, 56(10), 2638-2642. [CrossRef]
- Salmerón, J., Ascherio, A., Rimm, E. B., Colditz, G. A., Spiegelman, D., Jenkins, D. J., Stampfer, M. J., Wing, A. L., & Willett, W. C. (1997). Dietary fiber, glycemic load, and risk of NIDDM in men. Diabetes Care, 20(4), 545–550. [CrossRef]
- Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6(2), 461-464.
- Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press.
- Sexton, K., & Hattis, D. (2007). Assessing cumulative health risks from exposure to environmental mixtures-three fundamental questions. Environmental Health Perspectives, 825-832. [CrossRef]
- Singh, J., Kumar, V., Aneja, A., Gupta, S., & Dhingra, S. (2022). Genetic polymorphisms in ABCA1 (rs2230806 and rs1800977) and LIPC (rs2070895) genes and their association with the risk of type 2 diabetes: A case-control study. International Journal of Diabetes in Developing Countries, 42(2), 227-235. [CrossRef]
- Smith, A. R., Jones, K. M., Patel, D., Thomas, C., & Reid, L. (2019). The role of SCAI in cellular stress pathways. Journal of Cell Science, 132(17), jcs231829.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.
- van Berge, L., Keating, D. J., Greber, B., O’Neill, C., & Lang, J. (2019). Mitochondrial dysfunction in metabolic syndrome. Frontiers in Endocrinology, 10, 607.
- Wang, Y., Wang, J., He, Z., Tang, X., Zheng, Y., & Liu, H. (2021). Inflammatory pathways in T2D: A focus on PTK2B. Immunometabolism, 3(1), e210006.
- Wang, X., Zhang, J., Li, Y., Wang, Z., Sun, Q., Zhou, H., & Fan, Y. (2023). RBM15 regulates hepatic insulin sensitivity in GDM offspring. Molecular Medicine, 29(1), 45.
- Weissglas-Volkov, D., Aguilar-Salinas, C. A., Nikkola, E., Tusie-Luna, T., Cruz-Bautista, I., Arellano-Campos, O., & Pajukanta, P. (2011). GALNT2 polymorphisms and their role in lipid and glucose metabolism. Human Molecular Genetics, 20(9), 1632-1640.
- Xu, B., Li, X., Nian, K., Li, J., Li, M., Zheng, W., & Han, X. (2019). BDNF and NTRK2 in metabolic regulation. Nature Medicine, 25(5), 697-707.
- Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2), 894-942. [CrossRef]
- Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418-1429. [CrossRef]
- Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2), 265-286. [CrossRef]
|
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).