Preprint
Article

This version is not peer-reviewed.

Variable Selection for Generalized Single-index Varying-coefficient Models with Applications to Synergistic G×E Interactions

A peer-reviewed article of this preprint also exists.

Submitted:

29 December 2024

Posted:

03 January 2025

You are already at the latest version

Abstract
Complex diseases such as type 2 diabetes are influenced by both environmental and genetic risk factors, leading to a growing interest in identifying gene-environment (G×E) interactions. Guan et al. (2023) proposed a three-step variable selection method for single-index varying-coefficients models. This method selects varying and constant effect genetic predictors, as well as non-zero loading parameters, to identify genetic factors that interact linearly or nonlinearly with a mixture of environmental factors to influence disease risk. In this paper, we extend this approach to a binary response setting given that many complex human diseases are binary traits. We also establish the oracle property for our variable selection method, demonstrating that it performs as well as if the correct sub-model were known in advance. Additionally, we assess the performance of our method through finite sample simulations with both continuous and discrete gene variables. Finally, we apply our approach to a type 2 diabetes dataset, identifying potential genetic factors that interact with a combination of environmental variables, both linearly and nonlinearly, to influence the risk of developing type 2 diabetes.
Keywords: 
;  ;  ;  ;  

1. Introduction

The identification of gene-environment (G×E) interactions for complex traits has been a longstanding challenge in genetic studies. Ottman (1996) defined G×E interaction as “different effect of a genotype on disease risk in persons with different environmental exposures." Traditionally, G×E interactions have been studied using a single environmental factor, as incorporating multiple environmental factors exponentially increases model complexity. This can lead to biased estimates and large standard errors, a phenomenon known as the curse of dimensionality. Moreover, an increasing number of epidemiological studies have shown that disease risk can be influenced by simultaneous exposure to multiple environmental factors (Carpenter et al., 2002; Sexton and Hattis, 2007). However, little is known about how multiple environmental factors, when considered collectively, interact with genetic factors to affect disease outcomes. Investigating this area could provide valuable insights and inform strategies for future disease prevention.
Guan et al. (2023) proposed a single-index varying-coefficients model of the form Y = k = 0 p m k ( X β ) G k + ϵ to address non-linear gene-environment interactions. However, this model is limited to continuous phenotypes. In this paper, we generalize the model to handle binary phenotype data, given that many complex human diseases are binary traits in nature. Specifically, we propose a generalized single-index varying-coefficients model (gSIVCM):
log P ( Y = 1 | X , G ) P ( Y = 0 | X , G ) = k = 0 p m k ( X β ) G k
where m k ( . ) , k = 0 , 1 , , p represents the unknown gene effect of G k , modeled as a non-parametric function modulated by its loading index X β . Here, X represents q-dimensional environmental exposures, and X β captures the joint effect of q environmental factors. The effect of these factors on the response is modeled through the m ( · ) function. In this model, we allow the effect of G k on the risk of Y to vary across multiple X through the m ( · ) function.
If m ( u ) = 0 , then G k has no effect on Y. If m ( u ) = α , where α is a constant, the effect of G k on Y is constant, and there is no interaction between G k and the mixture of X. If m ( u ) α , we conclude that G k interacts with the environmental mixture, and the form of the interaction effect is captured by the non-parametric function m ( · ) .
The unique structure of gSIVCM enables us to capture how a mixture of multiple environmental factors interacts with genetic factors. Moreover, model (1) is highly flexible and can accommodate a wide range of models. For example, when q = 1 and β = 1 , it reduces to a generalized varying-coefficients model; when p = 1 and G = 1 , it becomes a standard generalized single-index model.
When both the number of gene variables and the number of environmental variables X are large, the complexity of model (1) introduces unique challenges for variable selection, particularly due to the nonlinear, non-parametric structure of the function m k ( . ) and its unknown loading parameter β . Guan et al. (2023) recently proposed a three-step variable selection approach for single-index varying-coefficients model (SIVCM), which classifies the non-parametric gene effect into varying, constant, and zero effects, while also identifying non-zero loading parameters. In this paper, we extend this variable selection approach to gSIVCM. Instead of penalizing the squared error loss as in SIVCM, we implement a penalized log-likelihood method tailored to our model.
Variable selection has been a central topic in modern statistical research. The core idea is to add a penalty term to the optimization function. Different choices of penalty functions yield estimators with varying properties. Fan and Li (2001) introduced three key criteria for a penalized estimator: sparsity, unbiasedness, and continuity. They also characterized the oracle property, which ensures that the model performs as well as if the true sub-model were known in advance. This has become a benchmark for evaluating new penalized estimators. Examples of penalized functions include Bridge regression (Frank and Friedman, 1993), the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), Adaptive LASSO (Zou, 2006), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li, 2001), and Minimax Concave Penalty (MCP) (Zhang, 2010). While LASSO is well-regarded for its simple formulation and efficient algorithm (e.g., LARS), it does not possess the oracle property. In contrast, Adaptive LASSO, SCAD, and MCP all satisfy the oracle property. For our model, we chose the MCP penalty due to its desirable theoretical and practical properties.
The rest of the paper is organized as follows: Section 2 introduces the proposed variable selection method, including the formulation of the penalized log-likelihood, the iterative optimization procedure, and the selection of tuning parameters and initial values for β . Section 3 discusses the asymptotic properties of the proposed method. In Section 4, we evaluate the performance of our method through several finite-sample simulations. Finally, in Section 5, we apply our method to a Type 2 diabetes dataset, followed by conclusion and discussion.

2. Statistical Method

Throughout the paper, superscript T is used to denote matrix transpose, | | . | | p is used to denote L p norm, log ( a ) is used to denote the natural logarithm of a. For the sake of simplicity, we use constant and non-zero constant interchangeably.

2.1. Model Setup

The generalized single-index varying-coefficients model for the binary case is set up as follows:
log ( P ( Y = 1 | X , G ) P ( Y = 0 | X , G ) ) = k = 0 p m k ( X β ) G k
where Y n × 1 = ( Y 1 , Y 2 , , Y n ) T represents the binary response variable, and n denotes the sample size. The set { m k ( . ) } k = 0 , 1 , , p comprises p + 1 unknown continuous functions. X n × q = ( X 1 , X 2 , , X q ) contains environmental variables, and β q × 1 = ( β 1 , , β q ) T represents the loading parameters of the model. G n × ( p + 1 ) = ( G 0 , G 1 , , G p ) where G 0 = ( 1 , , 1 ) T and G k = ( G 1 k , G 2 k , , G n k ) T is a continuous or discrete vector of length n for k = 1 , 2 , p . Consequently, for k 0 , m k ( X β ) represents the effect of G k on Y on a log-odds scale, while m 0 ( X β ) serves as the intercept term.
For simplicity of notation, we denote μ = k = 0 p m k ( X β ) G k . Thus, model (1) can be rewritten as:
log ( P ( Y = 1 | X , G ) P ( Y = 0 | X , G ) ) = μ

2.2. Estimation Method

Our goal is to estimate and select unknown functions { m k ( . ) } k = 0 , 1 , , p and unknown loading parameter β = ( β 1 , , β q ) T . For the sake of identifiability, we assume | | β | | 2 = 1 , β 1 > 0 , and that m k ( . ) cannot take the form m j ( u ) = α T u β T u + γ T u + c .
We first approximate the non-parametric function m k ( u ) with B-spline basis functions. Without loss of generality, we assume u [ 0 , 1 ] , denote K as the number of internal knots and h as the degree of the B-spline basis function. For instance, h = 1 indicates linear splines, h = 2 represents quadratic splines, and h = 3 represents cubic splines. From standard B-spline theory, let u 1 , u 2 , , u K be internal knots satisfying 0 = u 0 u 1 < u 2 < < u K < u K + 1 = 1 , and let I n t be the left-closed, right-open interval [ u t 1 , u t ) for 1 t K , and I n K + 1 the closed interval [ u K , u K + 1 ] . Let F be a collection of functions f defined on [ 0 , 1 ] satisfying: (i) the restriction of f to I n t is a polynomial of degree h or less for 1 t K + 1 ; (ii) f is h 1 times continuously differentiable on [ 0 , 1 ] .
Let L = K + h + 1 . Then, by Schumaker (1981), we have a normalized B-spline basis function B ˜ ( u ) = ( B 1 ˜ ( u ) , B 2 ˜ ( u ) , , B L ˜ ( u ) ) for F . There exists a linear transformation matrix Π such that:
Π B ˜ ( u ) = ( 1 , B ¯ ( u ) ) = ( 1 , B 2 ( u ) , B 3 ( u ) , , B L ( u ) ) = B ( u ) ,
where each component of B ¯ ( u ) is a function of u. Then, for 0 k p , we approximate m k ( u ) by:
m k ( u ) ( 1 , B 2 ( u ) , B 3 ( u ) , , B L ( u ) ) · ( γ k 1 , γ k 2 , , γ k L ) T = B ( u ) γ k = γ k 1 + B ¯ ( u ) γ k * ,
where γ k * = ( γ k 2 , γ k 3 , , γ k L ) T and γ k = ( γ k 1 , γ k * T ) T . With the B-spline approximation, μ in (2) can be approximated by μ B where:
μ B = k = 0 p [ γ k 1 + B ¯ ( X β ) γ k * ] G k
Hence, the selection of the non-parametric function m k ( · ) is transformed to the selection of its B-spline coefficients γ = { γ k 1 , γ k * } k = 0 , 1 , , p . Note that the transformation Π allows us to separate the constant effect of G k from its varying effect on the response. More specifically:
  • If | | γ k * | | 2 0 , then G k is a varying effect predictor.
  • If | | γ k * | | 2 = 0 and | γ k 1 | 0 , then G k is a constant effect predictor.
  • If | | γ k * | | 2 = 0 and | γ k 1 | = 0 , then G k has no effect on the response.
Given the binary response, we adopt the penalized log-likelihood approach, and the log-likelihood function is defined as:
l ( γ , β ) = i = 1 n Y i μ i B log ( 1 + e μ i B ) ,
where μ i B is the ith subject of μ B . The penalized log-likelihood objective function is then defined as:
M ( β , γ ) = i = 1 n Y i μ i B log ( 1 + e μ i B ) n k = 1 p p λ 1 ( | | γ k * | | 2 ) n k = 1 p p λ 2 ( | γ k 1 | ) I ( | | γ k * | | 2 = 0 ) n d = 2 q p λ 3 ( | β d | ) ,
where p λ 1 ( . ) , p λ 2 ( . ) , p λ 3 ( . ) are penalty functions for γ k * , γ k 1 , and β , respectively. The indicator function I ( . ) equals 1 if the condition in the parentheses is satisfied, and 0 otherwise.
Note that the construction of the penalty term in function (4) reflects an “order" in selecting the effect of G k : first, the model determines whether G k has a varying effect; if not, it then decides whether G k has a non-zero constant effect or no effect at all. Furthermore, we do not penalize the intercept term m 0 ( u ) or β 1 due to model constraints.
For the penalty function, we use the MCP penalty proposed by Zhang (2010):
p ( x , λ ) = λ 0 x 1 s τ λ + d s ,
with regularization parameters τ > 0 and λ > 0 .

2.3. The Estimation Step

To optimize function (4), we follow the approach proposed by Guan et al. (2023) and adopt a three-step iterative method.
Step 1: Given a preliminary estimator of β , denoted by β ^ ( 0 ) , we obtain the first-step estimation of γ , denoted by γ ^ ( 1 ) = { γ ^ k 1 ( 1 ) , γ ^ k * ( 1 ) T } k = 0 , 1 , , p T , via a group penalized regression:
γ ^ ( 1 ) = max γ M 1 ( γ | λ 1 , β ^ ( 0 ) ) ,
where
M 1 ( γ | λ 1 , β ^ ( 0 ) ) = i = 1 n Y i μ i B log ( 1 + e μ i B ) n k = 1 p p λ 1 ( | | γ k * | | 2 ) .
Step 1 classifies m k ( . ) , k = 1 , , p into two categories: varying (V) or non-varying (NV). Specifically, m k ( . ) V if | | γ ^ k * ( 1 ) | | 2 > 0 , and m k ( . ) N V if | | γ ^ k * ( 1 ) | | 2 = 0 .
Step 2: In Step 2, we refine the selection of γ k 1 for functions in the non-varying category identified in Step 1. Specifically, we select the non-zero constant effects and classify the non-parametric functions into constant (C) and zero (0). This is achieved by penalizing γ k 1 only when | | γ ^ k * ( 1 ) | | 2 = 0 for k = 1 , 2 , , p . No penalty is applied to γ 01 . Additionally, γ k * is excluded from the model when | | γ ^ k * ( 1 ) | | 2 = 0 , i.e., γ ^ k * ( 2 ) = 0 .
The Step 2 estimator γ ^ ( 2 ) = { ( γ ^ k 1 ( 2 ) , γ ^ k * ( 2 ) ) k V , ( γ ^ k 1 ( 2 ) ) k N V } is obtained via penalized regression:
γ ^ ( 2 ) = max γ M 2 ( γ | λ 2 , β ( 0 ) , γ ^ ( 1 ) ) ,
where
M 2 ( γ | λ 2 , β ( 0 ) , γ ^ ( 1 ) ) = i = 1 n Y i μ i B ( 2 ) log ( 1 + e μ i B ( 2 ) ) n k = 1 p p λ 2 ( | γ k 1 ( 2 ) | ) I ( | | γ ^ k * ( 1 ) | | 2 = 0 ) ,
and μ i B ( 2 ) is the ith element of μ B ( 2 ) , defined as:
μ B ( 2 ) = k V [ γ k 1 ( 2 ) + B ¯ ( X β ( 0 ) ) γ k * ( 2 ) ] G k + k N V γ k 1 ( 2 ) G k .
After Steps 1 and 2, we obtain the estimator γ ^ based on β ^ ( 0 ) and classify m k ( . ) for k = 1 , , p into V, C, or 0. The next step is to estimate the loading parameter β given γ ^ ( 2 ) .
Step 3: The estimator β ^ is obtained via penalized regression:
β ^ = max | | β | | 2 = 1 M 3 ( β | λ 3 , γ ^ ( 2 ) ) ,
where
M 3 ( β | λ 3 , γ ^ ( 2 ) ) = i = 1 n Y i μ i B ( 3 ) log ( 1 + e μ i B ( 3 ) ) n d = 2 q p λ 3 ( | β d | ) ,
and μ i B ( 3 ) is the ith element of μ B ( 3 ) , defined as:
μ B ( 3 ) = k = 0 p [ γ ^ k 1 ( 2 ) + B ¯ ( X β ) γ ^ k * ( 2 ) ] G k .
Finally, set β ^ ( 0 ) = β ^ and iterate Steps 1 to 3 until convergence.
Remark: For Steps 1 and 2, we use the block coordinate descent algorithm for group penalties. For Step 3, we employ the local quadratic approximation (LQA) algorithm proposed by Fan and Li (2001). For further details, readers are referred to the Appendix. This iterative approach requires selecting tuning parameters λ 1 , λ 2 , λ 3 , the order h, and the number of internal knots K for the B-spline approximation, as well as an appropriate initial value for β .

2.4. Selection of Tuning Parameters

2.4.1. Selection of Tuning Parameters λ 1 , λ 2 , λ 3

We propose using the Bayesian Information Criterion (BIC) (Schwarz, 1978) to select the tuning parameters.
Step 1: We select λ 1 as the minimizer of
B I C ( λ 1 ) = 2 l ( γ ^ λ 1 ( 1 ) , β ^ ( 0 ) ) + log ( n ) · d f λ 1 ,
where γ ^ λ 1 ( 1 ) is the minimizer of M 1 ( γ | λ 1 , β ^ ( 0 ) ) defined above, β ^ ( 0 ) is chosen as the estimator from the previous iteration, and d f λ 1 is the total number of non-zero coefficients when λ 1 is the penalized parameter.
Step 2: We select λ 2 as the minimizer of
B I C ( λ 2 ) = 2 l ( γ ^ λ 2 ( 2 ) , β ^ ( 0 ) ) + log ( n ) · d f λ 2 ,
where γ ^ λ 2 ( 2 ) is the minimizer of M 2 ( γ | λ 2 , β ( 0 ) , γ ^ ( 1 ) ) defined above, β ^ ( 0 ) is chosen as the estimator from the previous iteration, and d f λ 2 is the total number of non-zero coefficients when λ 2 is the penalized parameter.
Step 3: We select λ 3 as the minimizer of
B I C ( λ 3 ) = 2 l ( γ ^ ( 2 ) , β ^ λ 3 ) + log ( n ) · d f λ 3 ,
where β ^ λ 3 is the minimizer of M 3 ( β | λ 3 , γ ^ ( 2 ) ) , γ ^ ( 2 ) is the minimizer of the B-spline coefficients from Step 2, and d f λ 3 is the total number of non-zero β coefficients when λ 3 is the penalized parameter.
The parameters λ 1 , λ 2 , λ 3 are searched over a grid of exponentially decreasing values, with a minimum value of 1 × 10 3 and the maximum value set such that all penalized estimators are zero. We use 100 tuning parameters in the search.

2.4.2. Selection of Order h and Number of Internal Knots K

Since h represents the order of the B-spline basis function, higher degrees introduce more complex interactions and collinearity between environmental factors and genetic predictors. We search for the optimal h in the set h { 2 , 3 , 4 } . For K, only when K = O p ( n 1 2 r + 1 ) (where n is the sample size, r is the smoothness of m k ( . ) , and r > 2 ), the selection approach achieves oracle properties. We therefore search for the optimal K in the neighborhood of n 1 2 r + 1 , denoted by K . In our simulations, K = { 2 , 3 , 4 , 5 } .
We fit the following intercept-only model using the B-spline approximation:
log P ( Y = 1 | X , G ) P ( Y = 0 | X , G ) = m 0 ( X β ) .
Denote its estimator as ( γ ^ 01 , γ ^ 0 * ) and β ^ . Let m ^ 0 ( X β ^ ) = γ ^ 01 + B ¯ ( X β ^ ) γ ^ 0 * . The optimal K and h are selected as the values minimizing
log Y T m ^ 0 ( X β ^ ) log ( 1 + e m ^ 0 ( X β ^ ) ) + log ( n ) n ( K + h + 1 ) .
For the iterative algorithm proposed above, we require a reasonable initial value for β (denoted by β i n i t i a l ) to begin. We obtain β i n i t i a l by fitting model (5) with the selected K and h.

2.5. Theoretical Properties

We study the properties of the penalized likelihood estimator. Let β 0 and m k 0 ( . ) , k = 0 , 1 , , p represent the true values of β and m k ( . ) , k = 0 , 1 , , p , respectively, and let γ 0 denote the true value of the B-spline coefficients γ . Assume without loss of generality that β l 0 0 for l = 1 , , s , β l 0 = 0 for l = s + 1 , , q , m k 0 ( . ) is varying for k = 0 , 1 , , v , non-zero constant for k = v + 1 , , c , and zero for k = c + 1 , , p . The following theorem establishes the consistency of the penalized least square estimators.
Theorem 1. 
Assume that the regularity conditions (A1)–(A7) in the Appendix hold, and that the number of knots satisfies K = O p ( n 1 / ( 2 r + 1 ) ) . Then:
(i) 
| | β ^ β 0 | | = O p ( n r / ( 2 r + 1 ) + a n ) ;
(ii) 
| | m ^ k ( . ) m k 0 ( . ) | | = O p ( n r / ( 2 r + 1 ) + a n ) , k = 1 , , q ,
where a n = max k , l { p λ 1 ( | | γ k * 0 | | 2 ) , p λ 2 ( | | γ k 1 0 | | 2 ) , p λ 3 ( | β l 0 | ) } , γ k * 0 0 , β l 0 0 , k = 0 , , p , l = 1 , , q , r is defined in condition (A2) of the supplemental material, and p λ ( · ) denotes the first derivative of the penalty function p λ ( · ) . Furthermore, under additional regularity conditions outlined in the Appendix, we can show that the estimator possesses sparsity.
Theorem 2. 
Assume the regularity conditions (A1)–(A7) in the Appendix hold and K = O p ( n 1 / ( 2 r + 1 ) ) . Let λ m a X = max { λ 1 , λ 2 , λ 3 } , λ m i n = min { λ 1 , λ 2 , λ 3 } . If λ m a X 0 and n r / ( 2 r + 1 ) λ m i n as n , then with probability approaching 1:
(i)
β ^ j = 0 for j = s + 1 , , q ;
(ii)
m ^ k ( . ) = c k for k = v + 1 , , c , where c k is some non-zero constant;
(iii)
m ^ k ( . ) = 0 for k = c + 1 , , p .
Theorems 1 and 2 indicate that our penalized likelihood estimator is consistent and possesses oracle properties. The proofs are given in the Appendix.

3. Simulation

We evaluated the performance of our model via finite-sample simulations. The performance is assessed in the following ways: (1) classification accuracy of m ( · ) , denoted as the oracle percentage; (2) IMSE of the estimated m-function; (3) selection accuracy of β ; and (4) estimation accuracy of β (MSE). A total of R simulations is conducted for all cases.
The oracle percentage of m ( . ) is defined as the proportion of correct classifications across R simulations. For instance, if m k ( . ) is a varying function and it is classified as varying in g simulations, then the oracle percentage for m k ( . ) is g R × 100 % .
The IMSE of m k ( . ) is given by
IMSE = 1 R r = 1 R 1 n grid j = 1 n grid γ ^ k 1 ( r ) + B ¯ ( u j ) γ ^ k * ( r ) m k ( u j ) 2 ,
where n grid is the number of points for estimating the MSE of the predicted function; γ ^ k * ( r ) and γ ^ k 1 ( r ) are estimators of the B-spline coefficients for the r-th simulation using the proposed approach; β ^ ( r ) is the estimator of the loading parameter β for the r-th simulation; and u j corresponds to the j / n grid × 100 % quantile within the range of X β ^ ( r ) . In our simulations, n grid is set to 100.
The oracle percentage of β is defined as the proportion of correct selections of β across R simulations. For example, if β d 0 and β d is selected as non-zero in g simulations, then the oracle percentage for β d is g R × 100 % .
The MSE of β d is computed as
MSE = 1 R r = 1 R ( β ^ d ( r ) β d ) 2 ,
where β ^ d ( r ) is the estimator for β d in the r-th simulation. This represents the average MSE for β d .
The simulation data was generated based on the model (1). The environmental variable X was drawn from a u n i f ( 0 , 1 ) distribution. For the loading parameter β = ( β 1 , β 2 , , β q ) T , we set β 1 = β 2 = 1 2 and all other β j were set to zero. We evaluated the performance of the proposed approach with both continuous (e.g., gene expressions) and discrete (e.g., SNPs) predictors G.

3.1. Continuous G

In the continuous case, the non-parametric functions m k ( u ) were set as:
m 0 ( u ) = 2 sin ( 2 π u ) , m 1 ( u ) = 2 cos ( π u ) + 2 , m 2 ( u ) = sin ( 2 π u ) + cos ( π u ) + 1 , m 3 ( u ) = 2 , m 4 ( u ) = 2 . 5 , m k ( u ) = 0 for k = 5 , , p .
The predictors G were generated from N ( 0 , 1 ) distribution. We conducted R = 1 , 000 simulations to evaluate the model’s performance under p = 50 , 100 , q = 5 and n = 1000 , 2000 .
Table 1 presents the selection and estimation accuracy for m k ( · ) with continuous predictors. Across all cases, the selection accuracy was close to 100 % for varying, constant, and zero-effect coefficients. The IMSE of the proposed model was in the order of 1 or 2 for varying and constant effect predictors. As the model dimension p increased (from 50 to 100), a slight increase in the model IMSE was observed. Conversely, as the sample size n increased (from 1000 to 2000), both the model IMSE and the oracle IMSE decreased, aligning with the asymptotic properties of the proposed model. These results suggest that the proposed variable selection approach performs well in both selection and estimation accuracy for the non-parametric functions m k ( · ) .
Table 2 presents the selection and estimation accuracy for the loading parameter β . In all scenarios, the selection accuracy for non-zero loading parameters ( β 1 , β 2 ) was nearly 100 % , with MSE values in the order of 2 to 4 . For zero loading parameters ( β 3 , β 4 , β 5 ), the selection accuracy was approximately 97 % when n = 1000 . As the sample size increased to n = 2000 , the oracle percentages improved to 99 % , with MSE values in the order of 4 to 5 . Overall, the MSEs improve as the sample size increases, and the model MSE is very close to the oracle MSE, indicating strong estimation performance for the loading parameters.

3.2. Discrete G

We extended our evaluation to examine the performance of the proposed model with discrete predictors, G. Single nucleotide polymorphism (SNP) data is one of the most commonly used types of genetic data. SNPs take values of 0, 1, and 2, representing the genotypes aa, Aa, and AA, respectively. Additionally, SNPs exhibit a wide range of minor allele frequencies (MAF), making it crucial for our simulations to incorporate these characteristics.
To reflect these properties, G was simulated using the following probability distribution function:
P ( G i j = 0 ) = ( 1 p A ) 2 , P ( G i j = 1 ) = 2 · p A · ( 1 p A ) , P ( G i j = 2 ) = p A 2 ,
where G i j denotes the jth predictor for the ith subject, with i = 1 , , n and j = 1 , , p , and p A is the MAF of the minor allele A.
The gene effect functions, m k ( u ) , were set in Table 3.
In this setup, predictors G k exhibit varying and constant effects with p A = 0.1 , 0.3 , 0.5 . The purpose of setting varying MAFs is to check the selection and estimation performance under different MAFs. For zero-effect predictors G k , their p A values are uniformly distributed within the range ( 0.05 , 0.5 ) . The environmental variables X was generated from a unif ( 0 , 1 ) distribution. Finally, Y was generated according to the model specified in (1). We evaluated the performance of the proposed model through 1 , 000 simulations, considering p = 50 , 100 , n = 500 , 1000 , and q = 5 .
Table 4 presents the selection and estimation accuracy of the non-parametric function m k ( · ) for discrete G. We observed that the sample size n and MAF ( P A ) of G k were the primary factors influencing the performance of the proposed model. To better visualize their impact, we present Figure 1.
As the sample size increased (from 1000 to 2000), the model’s performance improved significantly. For instance, the oracle percentages for m 1 ( · ) , , m 4 ( · ) increased from approximately 80 % to nearly 100 % , and the corresponding IMSE decreased substantially. These results align with the asymptotic theory of the proposed model. Conversely, as the MAF for G k decreased (from 0.5 to 0.1), we observed a decline in performance, reflected in both oracle percentages and model IMSE. For example, in the case where n = 1000 , the oracle percentages for { m 1 ( · ) , m 2 ( · ) } ( P A = 0 . 5 ), { m 3 ( · ) , m 4 ( · ) } ( P A = 0 . 3 ), and { m 5 ( · ) , m 6 ( · ) } ( P A = 0 . 1 ) were approximately 85 % , 80 % , and 23 % , respectively. The IMSE increased correspondingly, from 0.4 ( P A = 0 . 5 ) to 0.5 ( P A = 0 . 3 ) and then to 1.3 ( P A = 0 . 1 ). This is a common phenomenon in genetic studies as smaller MAFs provide less data information to estimate the corresponding function, leading to poor estiamtion and selection performance.
Table 5 demonstrates the selection and estimation results for the loading parameter β . We observed that the sample size n was the primary factor influencing model performance. When the sample size was large ( n = 2000 ), the oracle percentage for non-zero loading covariates ( β 1 , β 2 ) was 100 % , and the oracle percentage for zero loading covariates ( β 3 , β 4 , β 5 ) was approximately 99 % . The MSE for β was in the order of 10 3 to 10 5 . In contrast, when the sample size was relatively small ( n = 1000 ), the oracle percentage for non-zero loading covariates remained at 100 % , but the oracle percentage for zero loading parameters decreased to around 95 % . Comparing the cases where p = 50 and p = 100 , we saw a slight reduction in selection accuracy for zero loading parameters when n = 1000 . This is expected, as model performance typically declines with increased model complexity. As the sample size increased to 2000, this difference in performance was not significant.
Based on simulation results with both continuous and discrete G variables, we observed the following key findings about the proposed model. First, the model demonstrates robust performance with large sample sizes ( n = 1000 or 2000). Second, when n = 1000 , the false positive rate for the loading parameter β was around 5%. This may reflect an inherent limitation of the LQA algorithm that cannot shrink zero parameters to exactly zero when the sample is small. Third, the model exhibits superior performance for SNP variants with larger MAFs (e.g., P A = 0 . 3 or 0.5) compared to SNP variants with smaller MAF ( P A = 0 . 1 ). This enhanced performance likely stems from the more data information content provided by SNPs with higher MAFs. Similar phenomenon is commonly observed in other genetic association studies. Overall, the simulation results demonstrate that the method performs reasonably well under finite sample conditions.

4. A Case Study

We demonstrated the applicability of our proposed model using a type 2 diabetes dataset containing genotypes (SNPs), environmental factors, and the phenotype (presence of type 2 diabetes). This dataset comprises two nested case-control cohort studies: the Nurses’ Health Study (NHS) and the Health Professionals Follow-Up Study (HPFS) from the Gene-Environment Association Studies Consortium (GENVEA). Detailed descriptions of these cohorts can be found in Colditz and Hankinson (2005) and Rimm et al. (1991). Initially, the dataset included 3,391 females (NHS) and 2,599 males (HPFS).
After data cleaning, which involved removing subjects with mismatched genotypes and phenotypes, SNPs with more than 10 % missing data, SNPs with MAF < 0 . 05 , and SNPs deviating from Hardy-Weinberg equilibrium (p-value < 0 . 001 ), the final dataset included 5,865 subjects (2,494 males and 3,371 females), with 2,733 cases and 3,132 controls, and 655,002 SNPs. The dataset also contained 12 continuous environmental factors, such as height, weight, age, and alcohol consumption. We fit a marginal logistic regression model for all 12 factors and, selected five environmental factors for the analysis: total physical activity ( X 1 , denoted as act), BMI ( X 2 ), alcohol intake ( X 3 , denoted as alcohol), heme iron intake ( X 4 , denoted as heme), and glycemic load ( X 5 , denoted as gl).
Based on SNP locations, we mapped all SNPs to known genes and selected genes containing more than 30 SNPs, resulting in a total of 2,178 genes. For each gene, we applied the proposed variable selection approach to identify significant SNPs and their effects. To ensure identifiability, the first element of the loading parameter β was constrained to be a non-zero positive value. We fit the proposed model five times for each gene, varying the environmental factor used as the first element each time. An SNP was deemed significant if it was identified as either varying or constant across all environmental factor orders, indicating a strong signal.
In total, our model identified 13 varying-effect SNPs and 26 constant-effect SNPs. Here, we present one of the selected varying-effect SNPs as an example. Please refer to Table A1 and Table A2 in the Appendix for the complete list of selected varying- and constant-effect SNPs. Previous studies (Sale et al., 2007; Grant et al., 2006) have suggested that the gene TCF7L2 is associated with type 2 diabetes across multiple populations. Specifically, Sale et al. (2007) reported a strong association between type 2 diabetes and SNPs rs7903146 and rs7901695 within this gene. Consistent with these findings, our model identified SNP rs7901695 as a constant-effect predictor, indicating no interaction with the five environmental factors (Note that SNP rs7903146 was not observed in this dataset).
Figure 2 shows plots of the marginal total environmental effect and the interaction effect of the SNP rs6537663 in gene TCF7L2, with heme iron intake as the first loading covariate. As the index increases, we observed that the marginal effect initially decreases, then increases, and subsequently exhibits a rapid decrease as the total effect of the five environmental factors increases. For the interaction effect, it fluctuates around 0 as the total effect of the five environmental variables increases, indicating that the SNP is unresponsive (or insensitive) to changes in these variables. However, as the index X β continues to increase, the SNP reacts to environmental changes, with a dramatic increase in risk for type 2 diabetes beyond a certain threshold. This estimated effect suggests that the genetic sensitivity of the SNP to the total effect of the five environmental variables follows a threshold model. This finding has practical implications: most individuals tolerate daily environmental changes, including dietary variations, without adverse effects. However, when such changes exceed a certain limit, the risk of disease may increase.
Table 6 presents the selection and estimation results of the loading parameters β , with heme iron intake as the first loading covariate. The model selects all loading parameters except for alcohol consumption (alcohol). Notably, we observed that body mass index (BMI) has the largest effect, which aligns with practical knowledge, as BMI is positively associated with type 2 diabetes and is a well-established risk factor for the disease. Notably, the sign of the loading parameters aligns with known associations in the literature. For example, Hu et al. (1999) reported that higher physical activity is linked to a lower risk of type 2 diabetes, while Field et al. (2001) demonstrated that high BMI increases the risk of type 2 diabetes. Additionally, Rajpathak et al. (2009) found that high heme iron intake is associated with an increased risk of type 2 diabetes, and Salmerón et al. (1997) showed that a high glycemic load is linked to a higher risk of type 2 diabetes. The signs of the estimated loading coefficients are consistent with these established findings.

5. Discussion

G×E interactions have been extensively studied in the literature, leading to the development of numerous statistical models. In this paper, we propose a three-step iterative variable selection approach for the generalized single-index varying-coefficients model (gSIVCM) with a binary response. Our goal is to identify varying, constant, and zero-effect genes, as well as to select non-zero environmental factors that interact with varying-effect genes. Biologically, our approach is attractive as it provides a novel perspective on G×E interactions. The flexibility of our model allows for the detection of non-linear interactions, making it particularly suitable when gene effects are non-linearly influenced by simultaneous exposure to multiple environmental factors. Statistically, gSIVCM reduces the dimensionality of the model by treating multiple indices X as a single index, which significantly alleviates the curse of dimensionality when multiple environmental factors interact with gene effects.
Our work builds upon the framework introduced by Guan et al. (2023) with a three-step variable selection approach for SIVCM with continuous responses. We extended our previous work to binary responses, broadening its applicability to a wider range of biological studies. For continuous responses, researchers are often interested in specific quantiles of the response, such as birth weight, rather than the mean. It is a natural progression to extend our variable selection approach to quantile regression settings, which we plan to explore in future studies.
Among the listed genes in Table A1 in the appendix, ABCA1 and NTRK2 show strong evidence of association with type 2 diabetes (T2D), with ABCA1 influencing lipid metabolism and insulin secretion (Kruit et al., 2012; Singh et al., 2022) and NTRK2 regulating energy balance and glucose metabolism via BDNF signaling (Xu et al., 2019). Genes such as GALNT2, PTK2B, and RBM15-AS1 are linked to metabolic pathways or inflammation, indirectly influencing T2D risk (Weissglas-Volkov et al., 2011; Wang et al., 2021; Wang et al., 2023). Other genes, including LARS2, UNC5C, and SCAI, have limited or emerging evidence, often tied to mitochondrial dysfunction, apoptosis, or cellular stress, highlighting the need for further investigation into their roles in T2D pathogenesis (van Berge et al., 2019; Morrison et al., 2020; Smith et al., 2019). For genes listed in Table A2, TCF7L2 stands out as a well-established genetic risk factor for T2D. Variants in TCF7L2 are strongly associated with impaired insulin secretion and glucose metabolism, contributing significantly to T2D susceptibility (Zhou et al., 2017). Another important gene is GALNT2, which has been implicated in lipid and glucose metabolism; polymorphisms in this gene are linked to alterations in glycemic traits and may indirectly influence T2D risk (Weissglas-Volkov et al., 2011). For the other genes, though no strong evidence directly linking them to T2D has been identified in the current literature, they may still play roles in metabolic or cellular pathways relevant to diabetes pathophysiology, warranting further investigation.
In this work, we demonstrated the model’s applicability through a gene-based analysis. Our method can also be extended to pathway-based analysis. In the human genome, pathways typically include a diverse array of genes, with each gene containing hundreds to thousands of SNPs. A pathway-based SNP-level analysis can be conducted by modeling SNPs within a pathway as genetic variables. Alternatively, a pathway-based gene-level analysis can be performed by summarizing the genomic information of each gene into a few principal components (PCs) using principal component analysis (PCA) (Jolliffe, 2002) or sparse principal component analysis (sPCA) (Zou et al., 2006). The proposed method can then be applied to select significant PCs within a pathway, facilitating the identification of key interactions between genes and environmental mixtures. By accounting for the genomic structure, this pathway-based approach has the potential to yield more interpretable and biologically meaningful results.

Acknowledgments

This work was supported in part by the National Institutes of Health (NIH) under award number R21HG010073 (to Y. Cui). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Funding support for the GWAS of Gene and Environment Initiatives in Type 2 Diabetes was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01HG004399). The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/gap/throughthedbGaPaccessionnumberphs000091.v2.p1.

Appendix A

Appendix A.1. Computational Algorithm

Step 1 and Step 2 follow directly from the group coordinate descent algorithm shown in the main text, and we omit the details here.
Step 3:
β ^ = max | | β | | 2 = 1 M 3 ( β | λ 3 , γ ^ ( 2 ) ) = max | | β | | 2 = 1 l ( γ ^ ( 2 ) , β ) n d = 2 q p λ 3 ( | β d | )
Since this is a penalized likelihood problem, we adopt the local quadratic approximation (LQA) approach proposed by Fan and Li (2001). Let β ˜ denote the most recent value of β . By applying Taylor expansion at β ˜ , we obtain:
l ( γ ^ ( 2 ) , β ) l ( γ ^ ( 2 ) , β ˜ ) + l ( γ ^ ( 2 ) , β ˜ ) T ( β β ˜ ) + 1 2 ( β β ˜ ) T 2 l ( γ ^ ( 2 ) , β ˜ ) ( β β ˜ )
where
l ( γ ^ ( 2 ) , β ˜ ) = ( l ( γ ^ ( 2 ) , β ) β 1 , , l ( γ ^ ( 2 ) , β ) β q ) T | β = β ˜ is the gradient
2 l ( γ ^ ( 2 ) , β ˜ ) = 2 l ( γ ^ ( 2 ) , β ) β j β l 1 j , l q | β = β ˜ is the hessian matrix ,
and
l ( γ ^ ( 2 ) , β ) β j = ( Y 1 1 + e μ B ( 3 ) ) T μ B ( 3 ) β j
2 l ( γ ^ ( 2 ) , β ) β j β l = ( Y 1 1 + e μ B ( 3 ) ) T μ B ( 3 ) β j β l e μ B ( 3 ) ( 1 + e μ B ( 3 ) ) 2 T μ B ( 3 ) β j · μ B ( 3 ) β l ,
and
μ B ( 3 ) = k = 0 p [ γ ^ k 1 ( 2 ) + B ¯ ( X β ) γ ^ k * ( 2 ) ] · G k
μ B ( 3 ) β j = k = 0 p B ¯ ( X β ) γ ^ k * ( 2 ) · X j · G k
μ B ( 3 ) β j β l = k = 0 p B ¯ ( X β ) γ ^ k * ( 2 ) · X j · X l · G k
Let
Σ λ 3 ( β ˜ ) = d i a G ( 0 , p λ 3 ( β 2 ^ ) | β 2 ^ | , , p λ 3 ( β q ^ ) | β q ^ | )
We update β as follows:
β * = β ˜ 2 l ( γ ^ ( 2 ) , β ˜ ) + n Σ λ 3 ( β ˜ ) 1 l ( γ ^ ( 2 ) , β ˜ ) + n Σ λ 3 ( β ˜ ) β ˜ .
The updated value of β is then obtained through standardization:
β updated = sign ( β 1 * ) β * | | β * | | 2 .
Finally, we iterate this process until convergence.

Appendix A.2. Proofs of Theorems

Appendix A.2.1. Notations

Let the space of Lipschitz continuous functions for any fixed constant c be denoted as L i p ( [ a , b ] , c ) = { f : | f ( x 1 ) f ( x 2 ) | c | x 1 x 2 | , x 1 , x 2 [ a , b ] } . Define C ( p ) [ a , b ] = { f : f ( p ) C [ a , b ] } as the space of p-th order smooth functions.

Appendix A.2.2. Some Regularity Conditions

(A1) The density function f U ( β ) ( . ) of the random variable U ( β ) = X β is bounded away from 0 on Ω = { X β , X X } , where X is the compact support of X. There exists a constant c 1 such that f U ( β ) ( . ) L i p ( [ a , b ] , c 1 ) .
(A2) For k = 0 , 1 , , p , the non-parametric function m k ( . ) C ( r ) with r 2 .
(A3) E ( | | G | | 6 ) < .
(A4) The matrix M ( u ) = E ( G G T | X β = u ) is positive definite, and each element of M ( u ) L i p ( [ a , b ] , c 4 ) .
(A5) Define
b n = max k , l { p λ 1 ( | | γ k * 0 | | 2 ) , p λ 2 ( | γ k 1 0 | ) , p λ 3 ( | β d 0 | ) : γ k * 0 0 , γ k 1 0 0 , β l 0 0 } .
For k = 1 , , p and d = 2 , , q , it holds that b n 0 as n .
(A6)
lim inf n lim inf | | γ k * | | 2 0 + 1 λ 1 | p λ 1 ( | | γ k * | | 2 ) | > 0 for k = v + 1 , , p ,
lim inf n lim inf | γ k 1 | 0 + 1 λ 2 | p λ 2 ( | γ k 1 | ) | > 0 for k = c + 1 , , p ,
lim inf n lim inf | β d | 0 + 1 λ 3 | p λ 3 ( | β d | ) | > 0 for d = s + 1 , , q .
(A7) Let c 1 , , c K be internal points of [ a , b ] , where a = inf { u : u Ω } and b = sup { u : u Ω } . Define c 0 = a , c K + 1 = b , h i = c i c i 1 , and h = max { h i } . There exists a constant C 7 such that h min { h i } < C 7 and max { h i + 1 h i } = o ( K 1 ) .

Appendix A.2.3. Proof of Theorem 1

Let ϕ = ( β 2 , , β q ) T , and we have β = ( 1 | | ϕ | | 2 2 , ϕ T ) T , hence the restriction | | β | | = 1 and β 1 > 0 is equivalent to | | ϕ | | 2 < 1 . To show the consistency of β ^ , it is enough to show the consistency of ϕ ^ .
Define α n = n r / ( 2 r + 1 ) + a n , γ = γ 0 + α n τ 1 , ϕ = ϕ 0 + α n τ 2 , and τ = ( τ 1 , τ 2 ) , where τ 1 = ( τ 01 , τ 0 * , , τ p 1 , τ p * ) and { τ k 1 , τ k * } correspond to the B-spline coefficients γ k 1 , γ k * . Similarly, τ 2 = ( τ 1 ϕ , , τ q 1 ϕ ) , where τ l ϕ corresponds to ϕ l .
To establish the consistency of γ ^ and ϕ ^ , we need to show that ϵ > 0 , ∃ a sufficiently large C, such that:
P sup | | τ | | = C { M ( γ , ϕ ) } < M ( γ 0 , ϕ 0 ) 1 ϵ ,
where
M ( γ , ϕ ) = l ( γ , ϕ ) n k = 1 p p λ 1 ( | | γ k * | | 2 ) n k = 1 p p λ 2 ( | γ k 1 | ) I ( | | γ k * | | 2 = 0 ) n l = 1 q 1 p λ 3 ( | ϕ l | ) ,
and l ( γ , ϕ ) is the log-likelihood function defined above. If (A1) holds, it implies that with probability at least 1 ϵ , there exists a local maximum within the ball { ( γ 0 , ϕ 0 ) + α n τ : | | τ | | C } . Hence, there exists a local maximizer such that | | ( γ ^ , ϕ ^ ) ( γ 0 , ϕ 0 ) | | = O p ( α n ) .
Define
D n ( τ ) = 1 K { M ( γ , ϕ ) M ( γ 0 , ϕ 0 ) } = 1 K { M ( γ 0 + α n τ 1 , ϕ 0 + α n τ 2 ) M ( γ 0 , ϕ 0 ) } = 1 K { l ( γ 0 + α n τ 1 , ϕ 0 + α n τ 2 ) l ( γ 0 , ϕ 0 ) n k = 1 p p λ 1 ( | | γ k * 0 + α n τ k * | | 2 ) p λ 1 ( | | γ k * 0 | | 2 ) n k = 1 p p λ 2 ( | γ k 1 0 + α n τ k 1 | ) I ( | | γ k * 0 + α n τ k * | | 2 = 0 ) p λ 2 ( | γ k 1 0 | ) I ( | | γ k * 0 | | 2 = 0 ) n j = 1 q 1 p λ 3 ( | ϕ j 0 + α n τ j ϕ | ) p λ 3 ( | ϕ j 0 | ) } .
Since p λ 1 ( | | γ k * 0 | | 2 ) = 0 for k = v + 1 , , p , p λ 3 ( | ϕ j 0 | ) = 0 for j = s + 1 , , q 1 , and I ( | | γ k * 0 | | 2 = 0 ) = 0 for k = 1 , , v , we have:
D n ( τ ) 1 K { l ( γ 0 + α n τ 1 , ϕ 0 + α n τ 2 ) l ( γ 0 , ϕ 0 ) n k = 1 v [ p λ 1 ( | | γ k * 0 + α n τ k * | | 2 ) p λ 1 ( | | γ k * 0 | | 2 ) ] n k = v + 1 p p λ 2 ( | γ k 1 0 + α n τ k 1 | ) p λ 2 ( | γ k 1 0 | ) n j = 1 s 1 [ p λ 3 ( | ϕ j 0 + α n τ j ϕ | ) p λ 3 ( | ϕ j 0 | ) ] } .
Applying Taylor expansion at ( γ 0 , ϕ 0 ) and simplifying the bounds for D n ( τ ) , we derive that:
| | ( γ ^ , ϕ ^ ) ( γ 0 , ϕ 0 ) | | = O p ( α n ) .
where I ( γ 0 , ϕ 0 ) is the Fisher information matrix. By standard arguments of likelihood theory, S 1 is of the order O p ( 1 + n r / ( 2 r + 1 ) α n ) | | τ | | , S 2 is of the order O p ( 1 + 2 n r / ( 2 r + 1 ) α n ) | | τ | | 2 , and for sufficiently large C, S 2 dominates S 1 uniformly in | | τ | | = C . Furthermore, we have
S 3 n K α n a n k = 1 v γ k * 0 | | γ k * 0 | | 2 τ k * T + n K α n 2 max k { p λ 1 ( | | γ k * 0 | | 2 ) } k = 1 v τ k * τ k * T n K α n 2 v | | τ | | + n K α n 2 | | τ | | 2 max k { p λ 1 ( | | γ k * 0 | | 2 ) } .
Since max k { p λ 1 ( | | γ k * 0 | | 2 ) } 0 , we conclude that S 3 is dominated by S 2 .
For S 4 and S 5 , we have:
S 4 α n a n n K k = v + 1 p τ k 1 + n K α n 2 max k { p λ 2 ( | γ k 1 0 | ) } k = v + 1 p ( τ k 1 ) 2 n K α n 2 | | τ | | + n K α n 2 | | τ | | 2 max k { p λ 2 ( | γ k 1 0 | ) } ,
and
S 5 α n a n n K l = 1 s 1 τ l ϕ + n K α n 2 max l { p λ 3 ( | ϕ l 0 | ) } l = 1 s 1 ( τ l ϕ ) 2 n K α n 2 | | τ | | + n K α n 2 | | τ | | 2 max l { p λ 3 ( | ϕ l 0 | ) } .
Similarly, S 4 and S 5 are dominated by S 2 . Thus, by choosing a sufficiently large C, we obtain | | ( γ ^ , ϕ ^ ) ( γ 0 , ϕ 0 ) | | = O p ( α n ) . Therefore, the consistency of the penalized least squares estimator ( γ ^ , ϕ ^ ) is proven.

Appendix A.2.4. Proof of Theorem 2

For (i), for ease of notation, let ϕ = ( ϕ n z , ϕ z ) , where ϕ n z = ( ϕ 1 , , ϕ s 1 ) and ϕ z = ( ϕ s , , ϕ q 1 ) . Since λ max 0 , it follows that a n = 0 for sufficiently large n. By Theorem 1, it suffices to show that for ϕ n z :
| | ϕ l ϕ l 0 | | 2 = O p ( n r / ( 2 r + 1 ) ) , l = 1 , , s 1 ,
and for ϕ z , there exists a small constant ϵ = C n r / ( 2 r + 1 ) such that as n , with probability approaching 1, for l = s , , q 1 , we have:
M ( γ , ϕ ) ϕ l < 0 when 0 < ϕ l < ϵ , and M ( γ , ϕ ) ϕ l > 0 when ϵ < ϕ l < 0 .
We start with:
M ( γ , ϕ ) ϕ l = l ( γ , ϕ ) ϕ l n p λ 3 ( | ϕ l | ) sign ( ϕ l ) .
Expanding l ( γ , ϕ ) ϕ l at ϕ 0 using a Taylor expansion, we obtain:
M ( γ , ϕ ) ϕ l = l ( γ , ϕ 0 ) ϕ l + k = 1 q 1 2 l ( γ , ϕ 0 ) ϕ l ϕ k ( ϕ k ϕ k 0 ) + k = 1 q 1 j = 1 q 1 3 l ( γ , ϕ * ) ϕ l ϕ k ϕ j ( ϕ k ϕ k 0 ) ( ϕ j ϕ j 0 ) n p λ 3 ( | ϕ l | ) sign ( ϕ l ) ,
where ϕ * lies between ϕ 0 and ϕ . After simplification, we derive:
M ( γ , ϕ ) ϕ l = n λ 3 1 λ 3 p λ 3 ( | ϕ l | ) sign ( ϕ l ) + O p 1 λ 3 n r / ( 2 r + 1 ) .
Since lim n lim inf ϕ l 0 1 λ 3 p λ 3 ( | ϕ l | ) > 0 and 1 λ 3 n r / ( 2 r + 1 ) 0 , the sign of M ( γ , ϕ ) ϕ l is completely determined by sign ( ϕ l ) . Thus, we conclude β ^ j = 0 for j = s + 1 , , q 1 .
For (ii) & (iii), by applying similar arguments as in part (i), it follows that with probability approaching 1, γ ^ k * = 0 for k = v + 1 , , p and γ ^ k 1 = 0 for k = c + 1 , , p . Using sup u B ( u ) = O ( 1 ) and the fact that m ^ k ( u ) = γ ^ k 0 + B ¯ ( X β ^ ) γ ^ k * , we have:
m ^ k ( u ) = c k , k = v + 1 , , c ,
where c k is a constant, and:
m ^ k ( u ) = 0 , k = c + 1 , , p .
This completes the proof.

Appendix A.3. Additional Real Data Analysis Results

Table A1 lists all the varying-effect SNPs selected, along with the gene IDs to which they were mapped. Similarly, Table A2 provides a summary of all the constant-effect SNPs selected, along with their corresponding gene IDs.
Table A1. List of SNPs with varying effects.
Table A1. List of SNPs with varying effects.
Gene ID Gene Symbol SNP ID
GeneID:440600 RBM15-AS1 rs6537663
GeneID:2590 GALNT2 rs6666516
GeneID:729993 SHISA9 rs1015431
GeneID:54768 HYDIN rs4788621
GeneID:117532 TMC2 rs7509377
GeneID:758 MPPED1 rs5766384
GeneID:23395 LARS2 rs4311249
GeneID:647107 LINC01192 rs2404825
GeneID:8633 UNC5C rs3775049
GeneID:2185 PTK2B rs6557991
GeneID:4915 NTRK2 rs6559870
GeneID:19 ABCA1 rs4742969
GeneID:286205 vSCAI rs2416996
Table A2. List of SNPs with constant effect.
Table A2. List of SNPs with constant effect.
Gene ID Gene Symbol SNP ID
GeneID:114827 FHAD1 rs3815792
GeneID:2899 GRIK3 rs12118788
GeneID:260425 MAGI3 rs11102660
GeneID:9857 CEP350 rs2293990
GeneID:2590 GALNT2 rs9308482
GeneID:6934 TCF7L2 rs7901695
GeneID:55742 PARVA rs7101596
GeneID:867 CBL rs4489755
GeneID:10867 TSPAN9 rs740771
GeneID:57494 RIMKLB rs11047510
GeneID:196385 DNAH10 rs11058132
GeneID:64328 XPO4 rs1961415
GeneID:23348 DOCK9 rs7326971
GeneID:23348 DOCK9 rs7991210
GeneID:57099 AVEN rs16962542
GeneID:11060 WWP2 rs16970994
GeneID:25780 RASGRP3 rs6708570
GeneID:100505498 LOC100505498 rs6730602
GeneID:117532 TMC2 rs11696526
GeneID:29780 PARVB rs5765571
GeneID:25814 ATXN10 rs713999
GeneID:9620 CELSR1 rs11090812
GeneID:23429 RYBP rs17009630
GeneID:80254 CEP63 rs11710699
GeneID:8633 UNC5C rs10516957
GeneID:157680 VPS13B rs1788161

References

  1. Carpenter, D. O., Arcaro, K., & Spink, D. C. (2002). Understanding the human health effects of chemical mixtures. Environmental Health Perspectives, 110(Suppl 1), 25. [CrossRef]
  2. Colditz, G. A., & Hankinson, S. E. (2005). The Nurses’ Health Study: lifestyle and health among women. Nature Reviews Cancer, 5(5), 388-396. [CrossRef]
  3. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456), 1348-1360. [CrossRef]
  4. Field, A. E., Coakley, E. H., Must, A., Spadano, J. L., Laird, N., Dietz, W. H., Rimm, E., & Colditz, G. A. (2001). Impact of overweight on the risk of developing common chronic diseases during a 10-year period. Archives of Internal Medicine, 161(13), 1581–1586. [CrossRef]
  5. Frank, L. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109-135. [CrossRef]
  6. Hu, F. B., Sigal, R. J., Rich-Edwards, J. W., Colditz, G. A., Solomon, C. G., Willett, W. C., & Manson, J. E. (1999). Walking compared with vigorous physical activity and risk of type 2 diabetes in women: a prospective study. Journal of the American Medical Association, 282(15), 1433–1439. [CrossRef]
  7. Grant, S. F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., ... & Styrkarsdottir, U. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics, 38(3), 320-323. [CrossRef]
  8. Guan, S., Zhao, M., Cui, Y. (2023). Variable selection for single-index varying-coefficients models with applications to synergistic G×E interactions. Electronic Journal of Statistics, 17(1): 823–857. [CrossRef]
  9. Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.
  10. Kruit, J. K., Wijesekara, N., Westwell-Roper, C., Bruce, J., Zhao, M., Kahn, S. E., & Hayden, M. R. (2012). Loss of ATP-binding cassette transporter A1 in mice impairs glucose tolerance and insulin secretion in vivo. Diabetes, 61(12), 3166-3177.
  11. Morrison, J. L., Gallas, P. R., Zheng, W., Mohtashami, S., & Wong, T. (2020). UNC5C and its relevance to beta-cell apoptosis. Endocrinology, 161(7), bqaa064.
  12. Ottman, R. (1996). Gene-environment interaction: definitions and study designs. Preventive medicine, 25(6), 764. [CrossRef]
  13. Rajpathak, S. N., Crandall, J. P., Wylie-Rosett, J., Kabat, G. C., Rohan, T. E., & Hu, F. B. (2009). The role of iron in type 2 diabetes in humans. Biochimica et Biophysica Acta (BBA) - General Subjects, 1790(7), 671–681. [CrossRef]
  14. Rimm, E. B., Giovannucci, E. L., Willett, W. C., Colditz, G. A., Ascherio, A., Rosner, B., & Stampfer, M. J. (1991). Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet, 338(8765), 464-468. [CrossRef]
  15. Sale, M. M., Smith, S. G., Mychaleckyj, J. C., Keene, K. L., Langefeld, C. D., Leak, T. S., ... & Freedman, B. I. (2007). Variants of the transcription factor 7-like 2 (TCF7L2) gene are associated with type 2 diabetes in an African-American population enriched for nephropathy. Diabetes, 56(10), 2638-2642. [CrossRef]
  16. Salmerón, J., Ascherio, A., Rimm, E. B., Colditz, G. A., Spiegelman, D., Jenkins, D. J., Stampfer, M. J., Wing, A. L., & Willett, W. C. (1997). Dietary fiber, glycemic load, and risk of NIDDM in men. Diabetes Care, 20(4), 545–550. [CrossRef]
  17. Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6(2), 461-464.
  18. Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press.
  19. Sexton, K., & Hattis, D. (2007). Assessing cumulative health risks from exposure to environmental mixtures-three fundamental questions. Environmental Health Perspectives, 825-832. [CrossRef]
  20. Singh, J., Kumar, V., Aneja, A., Gupta, S., & Dhingra, S. (2022). Genetic polymorphisms in ABCA1 (rs2230806 and rs1800977) and LIPC (rs2070895) genes and their association with the risk of type 2 diabetes: A case-control study. International Journal of Diabetes in Developing Countries, 42(2), 227-235. [CrossRef]
  21. Smith, A. R., Jones, K. M., Patel, D., Thomas, C., & Reid, L. (2019). The role of SCAI in cellular stress pathways. Journal of Cell Science, 132(17), jcs231829.
  22. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.
  23. van Berge, L., Keating, D. J., Greber, B., O’Neill, C., & Lang, J. (2019). Mitochondrial dysfunction in metabolic syndrome. Frontiers in Endocrinology, 10, 607.
  24. Wang, Y., Wang, J., He, Z., Tang, X., Zheng, Y., & Liu, H. (2021). Inflammatory pathways in T2D: A focus on PTK2B. Immunometabolism, 3(1), e210006.
  25. Wang, X., Zhang, J., Li, Y., Wang, Z., Sun, Q., Zhou, H., & Fan, Y. (2023). RBM15 regulates hepatic insulin sensitivity in GDM offspring. Molecular Medicine, 29(1), 45.
  26. Weissglas-Volkov, D., Aguilar-Salinas, C. A., Nikkola, E., Tusie-Luna, T., Cruz-Bautista, I., Arellano-Campos, O., & Pajukanta, P. (2011). GALNT2 polymorphisms and their role in lipid and glucose metabolism. Human Molecular Genetics, 20(9), 1632-1640.
  27. Xu, B., Li, X., Nian, K., Li, J., Li, M., Zheng, W., & Han, X. (2019). BDNF and NTRK2 in metabolic regulation. Nature Medicine, 25(5), 697-707.
  28. Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2), 894-942. [CrossRef]
  29. Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418-1429. [CrossRef]
  30. Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2), 265-286. [CrossRef]
Figure 1. Selection and estimation accuracy of m k ( · ) for discrete G
Figure 1. Selection and estimation accuracy of m k ( · ) for discrete G
Preprints 144456 g001
Figure 2. Plot of varying effects on a log-odds scale for SNP rs6537663 in Gene TCF7L2.
Figure 2. Plot of varying effects on a log-odds scale for SNP rs6537663 in Gene TCF7L2.
Preprints 144456 g002
Table 1. Selection and estimation accuracy of function m k ( . ) for continuous predictors.
Table 1. Selection and estimation accuracy of function m k ( . ) for continuous predictors.
n m ( · ) p = 50 p = 100
Oracle % IMSEModel IMSEOracle Oracle % IMSEModel IMSEOracle
n = 1000 m 0 ( . ) 100.0% 1.50E-01 1.47E-01 100.0% 1.48E-01 1.31E-01
m 1 ( . ) 99.2% 1.44E-01 2.14E-01 99.7% 1.48E-01 1.66E-01
m 2 ( . ) 99.4% 1.34E-01 1.61E-01 99.7% 1.37E-01 1.43E-01
m 3 ( . ) 100.0% 3.95E-02 3.85E-02 100.0% 4.19E-02 3.33E-02
m 4 ( . ) 99.9% 5.58E-02 5.53E-02 100.0% 5.93E-02 4.86E-02
Zero 99.1% 1.03E-03 0 99.0% 1.16E-03 0
n = 2000 m 0 ( . ) 100.0% 6.85E-02 6.88E-02 100.0% 7.00E-02 7.05E-02
m 1 ( . ) 100.0% 5.17E-02 6.00E-02 100.0% 5.43E-02 6.34E-02
m 2 ( . ) 100.0% 5.46E-02 5.99E-02 100.0% 5.40E-02 5.88E-02
m 3 ( . ) 100.0% 1.40E-02 1.46E-02 100.0% 1.46E-02 1.48E-02
m 4 ( . ) 100.0% 1.79E-02 1.93E-02 100.0% 1.89E-02 1.87E-02
Zero 99.4% 3.33E-04 0 99.4% 3.14E-04 0
Table 2. Selection and estimation accuracy of the loading parameters β for continuous predictors.
Table 2. Selection and estimation accuracy of the loading parameters β for continuous predictors.
n β p = 50 p = 100
Oracle % MSEModel MSEOracle Oracle % MSEModel MSEOracle
n = 1000 β 1 100.0% 7.26E-04 5.91E-04 100.0% 6.75E-04 5.75E-04
β 2 100.0% 1.36E-02 1.11E-02 100.0% 3.28E-03 5.78E-04
β 3 97.3% 2.34E-04 0 96.7% 6.22E-04 0
β 4 97.5% 1.90E-04 0 96.7% 2.72E-04 0
β 5 96.8% 9.01E-04 0 96.6% 2.95E-04 0
n = 2000 β 1 100.0% 2.76E-04 2.68E-04 100.0% 2.77E-04 2.75E-04
β 2 100.0% 2.71E-04 2.66E-04 100.0% 2.76E-04 2.74E-04
β 3 99.1% 5.49E-05 0 99.6% 1.57E-05 0
β 4 99.6% 2.68E-05 0 99.4% 3.16E-05 0
β 5 99.3% 4.58E-05 0 99.4% 2.22E-05 0
Table 3. Function m k ( u ) and the MAF of the associated SNP.
Table 3. Function m k ( u ) and the MAF of the associated SNP.
Function MAF of G k
m 0 ( u ) = 2 s i n ( 2 π u ) -
m 1 ( u ) = 2 c o s ( π u ) + 2 0.5
m 2 ( u ) = s i n ( 2 π u ) + c o s ( π u ) + 1 0.5
m 3 ( u ) = 2 c o s ( π u ) + 2 0.3
m 4 ( u ) = s i n ( 2 π u ) + c o s ( π u ) + 1 0.3
m 5 ( u ) = 2 c o s ( π u ) + 2 0.1
m 6 ( u ) = s i n ( 2 π u ) + c o s ( π u ) + 1 0.1
m 7 ( u ) = 2 0.5
m 8 ( u ) = 2 0.3
m 9 ( u ) = 2 0.1
m k ( u ) = 0 , k > 9 unif(0.05, 0.5)
Table 4. Selection and estimation accuracy of m k ( · ) for discrete predictors.
Table 4. Selection and estimation accuracy of m k ( · ) for discrete predictors.
n m ( · ) p = 50 p = 100
Oracle % IMSEModel IMSEOracle Oracle % IMSEModel IMSEOracle
n = 1000 m 0 ( . ) 100.0% 1.63E-01 2.28E-01 100.0% 1.75E-01 2.29E-01
m 1 ( . ) 83.4% 4.29E-01 5.61E-01 83.1% 4.44E-01 6.08E-01
m 2 ( . ) 87.7% 3.82E-01 4.38E-01 87.9% 3.69E-01 4.50E-01
m 3 ( . ) 76.0% 5.43E-01 5.87E-01 75.5% 5.50E-01 6.17E-01
m 4 ( . ) 83.3% 4.37E-01 4.19E-01 79.6% 5.11E-01 4.50E-01
m 5 ( . ) 23.2% 1.35E+00 1.05E+00 22.1% 1.45E+00 1.18E+00
m 6 ( . ) 25.7% 1.27E+00 8.93E-01 23.7% 1.31E+00 9.22E-01
m 7 ( . ) 100.0% 5.09E-02 6.22E-02 99.9% 5.05E-02 5.86E-02
m 8 ( . ) 99.9% 6.08E-02 7.50E-02 99.9% 5.83E-02 6.75E-02
m 9 ( . ) 100.0% 1.05E-01 1.24E-01 99.9% 1.13E-01 1.21E-01
Zero 99.0% 2.74E-03 0 99.2% 2.15E-03 0
n = 2000 m 0 ( . ) 100.0% 7.47E-02 8.03E-02 100.0% 7.42E-02 8.35E-02
m 1 ( . ) 100.0% 9.46E-02 1.38E-01 99.9% 1.02E-01 1.49E-01
m 2 ( . ) 100.0% 9.52E-02 1.21E-01 99.9% 9.47E-02 1.21E-01
m 3 ( . ) 100.0% 1.07E-01 1.55E-01 99.8% 1.08E-01 1.50E-01
m 4 ( . ) 99.9% 1.05E-01 1.37E-01 99.8% 1.05E-01 1.30E-01
m 5 ( . ) 77.5% 4.84E-01 2.99E-01 75.1% 5.15E-01 3.08E-01
m 6 ( . ) 77.5% 4.93E-01 2.63E-01 73.5% 5.32E-01 2.53E-01
m 7 ( . ) 100.0% 1.89E-02 2.11E-02 100.0% 1.75E-02 1.97E-02
m 8 ( . ) 100.0% 2.17E-02 2.39E-02 100.0% 2.08E-02 2.33E-02
m 9 ( . ) 100.0% 4.47E-02 4.95E-02 100.0% 4.24E-02 4.56E-02
Zero 99.3% 9.15E-04 0 99.5% 6.81E-04 0
Table 5. Selection and estimation accuracy of the loading parameters β for discrete predictors.
Table 5. Selection and estimation accuracy of the loading parameters β for discrete predictors.
n β p = 50 p = 100
Oracle % MSEModel MSEOracle Oracle % MSEModel MSEOracle
n = 1000 β 1 100.0% 6.64E-04 7.28E-04 100.0% 5.84E-04 5.13E-04
β 2 100.0% 7.37E-03 7.32E-03 100.0% 2.76E-03 3.27E-03
β 3 95.2% 3.64E-04 0 95.2% 3.73E-04 0
β 4 97.0% 1.21E-04 0 96.1% 4.18E-04 0
β 5 96.1% 2.04E-04 0 94.7% 5.46E-04 0
n = 2000 β 1 100.0% 2.34E-04 2.22E-04 100.0% 2.20E-04 2.12E-04
β 2 100.0% 2.31E-04 2.20E-04 100.0% 2.30E-03 2.37E-03
β 3 98.9% 5.00E-05 0 98.9% 3.24E-05 0
β 4 98.7% 5.44E-05 0 98.9% 4.49E-05 0
β 5 98.9% 4.37E-05 0 99.0% 5.07E-05 0
Table 6. Estimation and selection result for β .
Table 6. Estimation and selection result for β .
act bmi alcohol heme gl
-0.1832 0.9544 0 0.2157 0.0947
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2025 MDPI (Basel, Switzerland) unless otherwise stated