Variable Selection for Generalized Single-index Varying-coefficient Models with Applications to Synergistic G×E Interactions

Shunjie Guan; Xu Liu; Yuehua Cui

doi:10.20944/preprints202412.2644.v1

Submitted:

29 December 2024

Posted:

03 January 2025

You are already at the latest version

Abstract

Complex diseases such as type 2 diabetes are influenced by both environmental and genetic risk factors, leading to a growing interest in identifying gene-environment (G×E) interactions. Guan et al. (2023) proposed a three-step variable selection method for single-index varying-coefficients models. This method selects varying and constant effect genetic predictors, as well as non-zero loading parameters, to identify genetic factors that interact linearly or nonlinearly with a mixture of environmental factors to influence disease risk. In this paper, we extend this approach to a binary response setting given that many complex human diseases are binary traits. We also establish the oracle property for our variable selection method, demonstrating that it performs as well as if the correct sub-model were known in advance. Additionally, we assess the performance of our method through finite sample simulations with both continuous and discrete gene variables. Finally, we apply our approach to a type 2 diabetes dataset, identifying potential genetic factors that interact with a combination of environmental variables, both linearly and nonlinearly, to influence the risk of developing type 2 diabetes.

Keywords:

Gene-environment interaction (G×E)

;

Generalized single-index varying-coefficient models (gSIVCM)

;

Mixture of exposures

;

Nonlinear G×E

;

Synergistic G×E

Subject:

Computer Science and Mathematics - Probability and Statistics

1. Introduction

The identification of gene-environment (G×E) interactions for complex traits has been a longstanding challenge in genetic studies. Ottman (1996) defined G×E interaction as “different effect of a genotype on disease risk in persons with different environmental exposures." Traditionally, G×E interactions have been studied using a single environmental factor, as incorporating multiple environmental factors exponentially increases model complexity. This can lead to biased estimates and large standard errors, a phenomenon known as the curse of dimensionality. Moreover, an increasing number of epidemiological studies have shown that disease risk can be influenced by simultaneous exposure to multiple environmental factors (Carpenter et al., 2002; Sexton and Hattis, 2007). However, little is known about how multiple environmental factors, when considered collectively, interact with genetic factors to affect disease outcomes. Investigating this area could provide valuable insights and inform strategies for future disease prevention.

Guan et al. (2023) proposed a single-index varying-coefficients model of the form

Y = \sum_{k = 0}^{p} m_{k} (X β) G_{k} + ϵ

to address non-linear gene-environment interactions. However, this model is limited to continuous phenotypes. In this paper, we generalize the model to handle binary phenotype data, given that many complex human diseases are binary traits in nature. Specifically, we propose a generalized single-index varying-coefficients model (gSIVCM):

log (\frac{P (Y = 1 | X, G)}{P (Y = 0 | X, G)}) = \sum_{k = 0}^{p} m_{k} (X β) G_{k}

(1)

where

m_{k} (.), k = 0, 1, \dots, p

represents the unknown gene effect of

G_{k}

, modeled as a non-parametric function modulated by its loading index

X β

. Here, X represents q-dimensional environmental exposures, and

X β

captures the joint effect of q environmental factors. The effect of these factors on the response is modeled through the

m (\cdot)

function. In this model, we allow the effect of

G_{k}

on the risk of Y to vary across multiple X through the

m (\cdot)

function.

If

m (u) = 0

, then

G_{k}

has no effect on Y. If

m (u) = α

, where

α

is a constant, the effect of

G_{k}

on Y is constant, and there is no interaction between

G_{k}

and the mixture of X. If

m (u) \neq α

, we conclude that

G_{k}

interacts with the environmental mixture, and the form of the interaction effect is captured by the non-parametric function

m (\cdot)

.

The unique structure of gSIVCM enables us to capture how a mixture of multiple environmental factors interacts with genetic factors. Moreover, model (1) is highly flexible and can accommodate a wide range of models. For example, when

q = 1

and

β = 1

, it reduces to a generalized varying-coefficients model; when

p = 1

and

G = 1

, it becomes a standard generalized single-index model.

When both the number of gene variables and the number of environmental variables X are large, the complexity of model (1) introduces unique challenges for variable selection, particularly due to the nonlinear, non-parametric structure of the function

m_{k} (.)

and its unknown loading parameter

β

. Guan et al. (2023) recently proposed a three-step variable selection approach for single-index varying-coefficients model (SIVCM), which classifies the non-parametric gene effect into varying, constant, and zero effects, while also identifying non-zero loading parameters. In this paper, we extend this variable selection approach to gSIVCM. Instead of penalizing the squared error loss as in SIVCM, we implement a penalized log-likelihood method tailored to our model.

Variable selection has been a central topic in modern statistical research. The core idea is to add a penalty term to the optimization function. Different choices of penalty functions yield estimators with varying properties. Fan and Li (2001) introduced three key criteria for a penalized estimator: sparsity, unbiasedness, and continuity. They also characterized the oracle property, which ensures that the model performs as well as if the true sub-model were known in advance. This has become a benchmark for evaluating new penalized estimators. Examples of penalized functions include Bridge regression (Frank and Friedman, 1993), the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), Adaptive LASSO (Zou, 2006), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li, 2001), and Minimax Concave Penalty (MCP) (Zhang, 2010). While LASSO is well-regarded for its simple formulation and efficient algorithm (e.g., LARS), it does not possess the oracle property. In contrast, Adaptive LASSO, SCAD, and MCP all satisfy the oracle property. For our model, we chose the MCP penalty due to its desirable theoretical and practical properties.

The rest of the paper is organized as follows: Section 2 introduces the proposed variable selection method, including the formulation of the penalized log-likelihood, the iterative optimization procedure, and the selection of tuning parameters and initial values for

β

. Section 3 discusses the asymptotic properties of the proposed method. In Section 4, we evaluate the performance of our method through several finite-sample simulations. Finally, in Section 5, we apply our method to a Type 2 diabetes dataset, followed by conclusion and discussion.

2. Statistical Method

Throughout the paper, superscript T is used to denote matrix transpose,

{| | . | |}_{p}

is used to denote

L_{p}

norm,

log (a)

is used to denote the natural logarithm of a. For the sake of simplicity, we use constant and non-zero constant interchangeably.

2.1. Model Setup

The generalized single-index varying-coefficients model for the binary case is set up as follows:

log (\frac{P (Y = 1 | X, G)}{P (Y = 0 | X, G)}) = \sum_{k = 0}^{p} m_{k} (X β) G_{k}

where

Y_{n \times 1} = {(Y_{1}, Y_{2}, \dots, Y_{n})}^{T}

represents the binary response variable, and n denotes the sample size. The set

{m_{k} (.)}_{k = 0, 1, \dots, p}

comprises

p + 1

unknown continuous functions.

X_{n \times q} = (X_{1}, X_{2}, \dots, X_{q})

contains environmental variables, and

β_{q \times 1} = {(β_{1}, \dots, β_{q})}^{T}

represents the loading parameters of the model.

G_{n \times (p + 1)} = (G_{0}, G_{1}, \dots, G_{p})

where

G_{0} = {(1, \dots, 1)}^{T}

and

G_{k} = {(G_{1 k}, G_{2 k}, \dots, G_{n k})}^{T}

is a continuous or discrete vector of length n for

k = 1, 2, \dots p

. Consequently, for

k \neq 0

,

m_{k} (X β)

represents the effect of

G_{k}

on Y on a log-odds scale, while

m_{0} (X β)

serves as the intercept term.

For simplicity of notation, we denote

μ = \sum_{k = 0}^{p} m_{k} (X β) G_{k}

. Thus, model (1) can be rewritten as:

log (\frac{P (Y = 1 | X, G)}{P (Y = 0 | X, G)}) = μ

(2)

2.2. Estimation Method

Our goal is to estimate and select unknown functions

{m_{k} (.)}_{k = 0, 1, \dots, p}

and unknown loading parameter

β = {(β_{1}, \dots, β_{q})}^{T}

. For the sake of identifiability, we assume

{| | β | |}_{2} = 1

,

β_{1} > 0

, and that

m_{k} (.)

cannot take the form

m_{j} (u) = α^{T} u β^{T} u + γ^{T} u + c

.

We first approximate the non-parametric function

m_{k} (u)

with B-spline basis functions. Without loss of generality, we assume

u \in [0, 1]

, denote K as the number of internal knots and h as the degree of the B-spline basis function. For instance,

h = 1

indicates linear splines,

h = 2

represents quadratic splines, and

h = 3

represents cubic splines. From standard B-spline theory, let

u_{1}, u_{2}, \dots, u_{K}

be internal knots satisfying

0 = u_{0} \leq u_{1} < u_{2} < \dots < u_{K} < u_{K + 1} = 1

, and let

I_{n_{t}}

be the left-closed, right-open interval

[u_{t - 1}, u_{t})

for

1 \leq t \leq K

, and

I_{n_{K + 1}}

the closed interval

[u_{K}, u_{K + 1}]

. Let

F

be a collection of functions f defined on

[0, 1]

satisfying: (i) the restriction of f to

I_{n_{t}}

is a polynomial of degree h or less for

1 \leq t \leq K + 1

; (ii) f is

h - 1

times continuously differentiable on

[0, 1]

.

Let

L = K + h + 1

. Then, by Schumaker (1981), we have a normalized B-spline basis function

\tilde{B} (u) = (\tilde{B_{1}} (u), \tilde{B_{2}} (u), \dots, \tilde{B_{L}} (u))

for

F

. There exists a linear transformation matrix

Π

such that:

Π \tilde{B} (u) = (1, \bar{B} (u)) = (1, B_{2} (u), B_{3} (u), \dots, B_{L} (u)) = B (u),

where each component of

\bar{B} (u)

is a function of u. Then, for

0 \leq k \leq p

, we approximate

m_{k} (u)

by:

m_{k} (u) \approx (1, B_{2} (u), B_{3} (u), \dots, B_{L} (u)) \cdot {(γ_{k 1}, γ_{k 2}, \dots, γ_{k L})}^{T} = B (u) γ_{k} = γ_{k 1} + \bar{B} (u) γ_{k *},

(3)

where

γ_{k *} = {(γ_{k 2}, γ_{k 3}, \dots, γ_{k L})}^{T}

and

γ_{k} = {(γ_{k 1}, γ_{k *}^{T})}^{T}

. With the B-spline approximation,

μ

in (2) can be approximated by

μ^{B}

where:

μ^{B} = \sum_{k = 0}^{p} [γ_{k 1} + \bar{B} (X β) γ_{k *}] G_{k}

Hence, the selection of the non-parametric function

m_{k} (\cdot)

is transformed to the selection of its B-spline coefficients

γ = {γ_{k 1}, γ_{k *}}_{k = 0, 1, \dots, p}

. Note that the transformation

Π

allows us to separate the constant effect of

G_{k}

from its varying effect on the response. More specifically:

If $| | γ_{k *} {| |}_{2} \neq 0$ , then $G_{k}$ is a varying effect predictor.
If $| | γ_{k *} {| |}_{2} = 0$ and $| γ_{k 1} | \neq 0$ , then $G_{k}$ is a constant effect predictor.
If $| | γ_{k *} {| |}_{2} = 0$ and $| γ_{k 1} | = 0$ , then $G_{k}$ has no effect on the response.

Given the binary response, we adopt the penalized log-likelihood approach, and the log-likelihood function is defined as:

l (γ, β) = \sum_{i = 1}^{n} (Y_{i} μ_{i}^{B} - log (1 + e^{μ_{i}^{B}})),

where

μ_{i}^{B}

is the ith subject of

μ^{B}

. The penalized log-likelihood objective function is then defined as:

\begin{matrix} M (β, γ) = & \sum_{i = 1}^{n} (Y_{i} μ_{i}^{B} - log (1 + e^{μ_{i}^{B}})) \\ - n \sum_{k = 1}^{p} p_{λ_{1}} (| | γ_{k *} {| |}_{2}) - n \sum_{k = 1}^{p} p_{λ_{2}} (| γ_{k 1} |) I (| | γ_{k *} {| |}_{2} = 0) - n \sum_{d = 2}^{q} p_{λ_{3}} (| β_{d} |), \end{matrix}

(4)

where

p_{λ_{1}} (.), p_{λ_{2}} (.), p_{λ_{3}} (.)

are penalty functions for

γ_{k *}

,

γ_{k 1}

, and

β

, respectively. The indicator function

I (.)

equals 1 if the condition in the parentheses is satisfied, and 0 otherwise.

Note that the construction of the penalty term in function (4) reflects an “order" in selecting the effect of

G_{k}

: first, the model determines whether

G_{k}

has a varying effect; if not, it then decides whether

G_{k}

has a non-zero constant effect or no effect at all. Furthermore, we do not penalize the intercept term

m_{0} (u)

or

β_{1}

due to model constraints.

For the penalty function, we use the MCP penalty proposed by Zhang (2010):

p (x, λ) = λ \int_{0}^{x} {(1 - \frac{s}{τ λ})}_{+} d s,

with regularization parameters

τ > 0

and

λ > 0

.

2.3. The Estimation Step

To optimize function (4), we follow the approach proposed by Guan et al. (2023) and adopt a three-step iterative method.

Step 1: Given a preliminary estimator of

β

, denoted by

{\hat{β}}^{(0)}

, we obtain the first-step estimation of

γ

, denoted by

{\hat{γ}}^{(1)} = {{\hat{γ}}_{k 1}^{(1)}, {\hat{γ}}_{k *}^{(1) T}}_{k = 0, 1, \dots, p}^{T}

, via a group penalized regression:

{\hat{γ}}^{(1)} = max_{γ} M_{1} (γ | λ_{1}, {\hat{β}}^{(0)}),

where

M_{1} (γ | λ_{1}, {\hat{β}}^{(0)}) = \sum_{i = 1}^{n} (Y_{i} μ_{i}^{B} - log (1 + e^{μ_{i}^{B}})) - n \sum_{k = 1}^{p} p_{λ_{1}} (| | γ_{k *} {| |}_{2}) .

Step 1 classifies

m_{k} (.), k = 1, \dots, p

into two categories: varying (V) or non-varying (NV). Specifically,

m_{k} (.) \in V

if

| | {\hat{γ}}_{k *}^{(1)} {| |}_{2} > 0

, and

m_{k} (.) \in N V

if

| | {\hat{γ}}_{k *}^{(1)} {| |}_{2} = 0

.

Step 2: In Step 2, we refine the selection of

γ_{k 1}

for functions in the non-varying category identified in Step 1. Specifically, we select the non-zero constant effects and classify the non-parametric functions into constant (C) and zero (0). This is achieved by penalizing

γ_{k 1}

only when

| | {\hat{γ}}_{k *}^{(1)} {| |}_{2} = 0

for

k = 1, 2, \dots, p

. No penalty is applied to

γ_{01}

. Additionally,

γ_{k *}

is excluded from the model when

| | {\hat{γ}}_{k *}^{(1)} {| |}_{2} = 0

, i.e.,

{\hat{γ}}_{k *}^{(2)} = 0

.

The Step 2 estimator

{\hat{γ}}^{(2)} = {{({\hat{γ}}_{k 1}^{(2)}, {\hat{γ}}_{k *}^{(2)})}_{k \in V}, {({\hat{γ}}_{k 1}^{(2)})}_{k \in N V}}

is obtained via penalized regression:

{\hat{γ}}^{(2)} = max_{γ} M_{2} (γ | λ_{2}, β^{(0)}, {\hat{γ}}^{(1)}),

where

M_{2} (γ | λ_{2}, β^{(0)}, {\hat{γ}}^{(1)}) = \sum_{i = 1}^{n} (Y_{i} μ_{i}^{B^{(2)}} - log (1 + e^{μ_{i}^{B^{(2)}}})) - n \sum_{k = 1}^{p} p_{λ_{2}} (| γ_{k 1}^{(2)} |) I (| | {\hat{γ}}_{k *}^{(1)} {| |}_{2} = 0),

and

μ_{i}^{B^{(2)}}

is the ith element of

μ^{B^{(2)}}

, defined as:

μ^{B^{(2)}} = \sum_{k \in V} [γ_{k 1}^{(2)} + \bar{B} (X β^{(0)}) γ_{k *}^{(2)}] G_{k} + \sum_{k \in N V} γ_{k 1}^{(2)} G_{k} .

After Steps 1 and 2, we obtain the estimator

\hat{γ}

based on

{\hat{β}}^{(0)}

and classify

m_{k} (.)

for

k = 1, \dots, p

into V, C, or 0. The next step is to estimate the loading parameter

β

given

{\hat{γ}}^{(2)}

.

Step 3: The estimator

\hat{β}

is obtained via penalized regression:

\hat{β} = max_{{| | β | |}_{2} = 1} M_{3} (β | λ_{3}, {\hat{γ}}^{(2)}),

where

M_{3} (β | λ_{3}, {\hat{γ}}^{(2)}) = \sum_{i = 1}^{n} (Y_{i} μ_{i}^{B^{(3)}} - log (1 + e^{μ_{i}^{B^{(3)}}})) - n \sum_{d = 2}^{q} p_{λ_{3}} (| β_{d} |),

and

μ_{i}^{B^{(3)}}

is the ith element of

μ^{B^{(3)}}

, defined as:

μ^{B^{(3)}} = \sum_{k = 0}^{p} [{\hat{γ}}_{k 1}^{(2)} + \bar{B} (X β) {\hat{γ}}_{k *}^{(2)}] G_{k} .

Finally, set

{\hat{β}}^{(0)} = \hat{β}

and iterate Steps 1 to 3 until convergence.

Remark: For Steps 1 and 2, we use the block coordinate descent algorithm for group penalties. For Step 3, we employ the local quadratic approximation (LQA) algorithm proposed by Fan and Li (2001). For further details, readers are referred to the Appendix. This iterative approach requires selecting tuning parameters

λ_{1}, λ_{2}, λ_{3}

, the order h, and the number of internal knots K for the B-spline approximation, as well as an appropriate initial value for

β

.

2.4. Selection of Tuning Parameters

2.4.1. Selection of Tuning Parameters $λ_{1}, λ_{2}, λ_{3}$

We propose using the Bayesian Information Criterion (BIC) (Schwarz, 1978) to select the tuning parameters.

Step 1: We select

λ_{1}

as the minimizer of

B I C (λ_{1}) = - 2 l ({\hat{γ}}_{λ_{1}}^{(1)}, {\hat{β}}^{(0)}) + log (n) \cdot d f_{λ_{1}},

where

{\hat{γ}}_{λ_{1}}^{(1)}

is the minimizer of

M_{1} (γ | λ_{1}, {\hat{β}}^{(0)})

defined above,

{\hat{β}}^{(0)}

is chosen as the estimator from the previous iteration, and

d f_{λ_{1}}

is the total number of non-zero coefficients when

λ_{1}

is the penalized parameter.

Step 2: We select

λ_{2}

as the minimizer of

B I C (λ_{2}) = - 2 l ({\hat{γ}}_{λ_{2}}^{(2)}, {\hat{β}}^{(0)}) + log (n) \cdot d f_{λ_{2}},

where

{\hat{γ}}_{λ_{2}}^{(2)}

is the minimizer of

M_{2} (γ | λ_{2}, β^{(0)}, {\hat{γ}}^{(1)})

defined above,

{\hat{β}}^{(0)}

is chosen as the estimator from the previous iteration, and

d f_{λ_{2}}

is the total number of non-zero coefficients when

λ_{2}

is the penalized parameter.

Step 3: We select

λ_{3}

as the minimizer of

B I C (λ_{3}) = - 2 l ({\hat{γ}}^{(2)}, {\hat{β}}_{λ_{3}}) + log (n) \cdot d f_{λ_{3}},

where

{\hat{β}}_{λ_{3}}

is the minimizer of

M_{3} (β | λ_{3}, {\hat{γ}}^{(2)})

,

{\hat{γ}}^{(2)}

is the minimizer of the B-spline coefficients from Step 2, and

d f_{λ_{3}}

is the total number of non-zero

β

coefficients when

λ_{3}

is the penalized parameter.

The parameters

λ_{1}, λ_{2}, λ_{3}

are searched over a grid of exponentially decreasing values, with a minimum value of

1 \times 10^{- 3}

and the maximum value set such that all penalized estimators are zero. We use 100 tuning parameters in the search.

2.4.2. Selection of Order h and Number of Internal Knots K

Since h represents the order of the B-spline basis function, higher degrees introduce more complex interactions and collinearity between environmental factors and genetic predictors. We search for the optimal h in the set

h \in {2, 3, 4}

. For K, only when

K = O_{p} (n^{\frac{1}{2 r + 1}})

(where n is the sample size, r is the smoothness of

m_{k} (.)

, and

r > 2

), the selection approach achieves oracle properties. We therefore search for the optimal K in the neighborhood of

n^{\frac{1}{2 r + 1}}

, denoted by

K

. In our simulations,

K = {2, 3, 4, 5}

.

We fit the following intercept-only model using the B-spline approximation:

log (\frac{P (Y = 1 | X, G)}{P (Y = 0 | X, G)}) = m_{0} (X β) .

(5)

Denote its estimator as

({\hat{γ}}_{01}, {\hat{γ}}_{0 *})

and

\hat{β}

. Let

{\hat{m}}_{0} (X \hat{β}) = {\hat{γ}}_{01} + \bar{B} (X \hat{β}) {\hat{γ}}_{0 *}

. The optimal K and h are selected as the values minimizing

log (Y^{T} {\hat{m}}_{0} (X \hat{β}) - log (1 + e^{{\hat{m}}_{0} (X \hat{β})})) + \frac{log (n)}{n} (K + h + 1) .

For the iterative algorithm proposed above, we require a reasonable initial value for

β

(denoted by

β^{i n i t i a l}

) to begin. We obtain

β^{i n i t i a l}

by fitting model (5) with the selected K and h.

2.5. Theoretical Properties

We study the properties of the penalized likelihood estimator. Let

β^{0}

and

m_{k}^{0} (.), k = 0, 1, \dots, p

represent the true values of

β

and

m_{k} (.), k = 0, 1, \dots, p

, respectively, and let

γ^{0}

denote the true value of the B-spline coefficients

γ

. Assume without loss of generality that

β_{l}^{0} \neq 0

for

l = 1, \dots, s

,

β_{l}^{0} = 0

for

l = s + 1, \dots, q

,

m_{k}^{0} (.)

is varying for

k = 0, 1, \dots, v

, non-zero constant for

k = v + 1, \dots, c

, and zero for

k = c + 1, \dots, p

. The following theorem establishes the consistency of the penalized least square estimators.

Theorem 1.

Assume that the regularity conditions (A1)–(A7) in the Appendix hold, and that the number of knots satisfies

K = O_{p} (n^{1 / (2 r + 1)})

. Then:

(i): $| | \hat{β} - β^{0} | | = O_{p} (n^{- r / (2 r + 1)} + a_{n})$ ;
(ii): $| | {\hat{m}}_{k} (.) - m_{k}^{0} (.) | | = O_{p} (n^{- r / (2 r + 1)} + a_{n}), k = 1, \dots, q$ ,

where

a_{n} = {max}_{k, l} {p_{λ_{1}}^{'} (| | γ_{k *}^{0} {| |}_{2}), p_{λ_{2}}^{'} (| | γ_{k 1}^{0} {| |}_{2}), p_{λ_{3}}^{'} (| β_{l}^{0} |)}, γ_{k *}^{0} \neq 0, β_{l}^{0} \neq 0, k = 0, \dots, p, l = 1, \dots, q

, r is defined in condition (A2) of the supplemental material, and

p_{λ}^{'} (\cdot)

denotes the first derivative of the penalty function

p_{λ} (\cdot)

. Furthermore, under additional regularity conditions outlined in the Appendix, we can show that the estimator possesses sparsity.

Theorem 2.

Assume the regularity conditions (A1)–(A7) in the Appendix hold and

K = O_{p} (n^{1 / (2 r + 1)})

. Let

λ_{m a X} = max {λ_{1}, λ_{2}, λ_{3}}, λ_{m i n} = min {λ_{1}, λ_{2}, λ_{3}}

. If

λ_{m a X} \to 0

and

n^{r / (2 r + 1)} λ_{m i n} \to \infty

as

n \to \infty

, then with probability approaching 1:

(i): ${\hat{β}}_{j} = 0$ for $j = s + 1, \dots, q$ ;
(ii): ${\hat{m}}_{k} (.) = c_{k}$ for $k = v + 1, \dots, c$ , where $c_{k}$ is some non-zero constant;
(iii): ${\hat{m}}_{k} (.) = 0$ for $k = c + 1, \dots, p$ .

Theorems 1 and 2 indicate that our penalized likelihood estimator is consistent and possesses oracle properties. The proofs are given in the Appendix.

3. Simulation

We evaluated the performance of our model via finite-sample simulations. The performance is assessed in the following ways: (1) classification accuracy of

m (\cdot)

, denoted as the oracle percentage; (2) IMSE of the estimated m-function; (3) selection accuracy of

β

; and (4) estimation accuracy of

β

(MSE). A total of R simulations is conducted for all cases.

The oracle percentage of

m (.)

is defined as the proportion of correct classifications across R simulations. For instance, if

m_{k} (.)

is a varying function and it is classified as varying in g simulations, then the oracle percentage for

m_{k} (.)

is

\frac{g}{R} \times 100 %

.

The IMSE of

m_{k} (.)

is given by

IMSE = \frac{1}{R} \sum_{r = 1}^{R} [\frac{1}{n_{grid}} \sum_{j = 1}^{n_{grid}} {({\hat{γ}}_{k 1}^{(r)} + \bar{B} (u_{j}) {\hat{γ}}_{k *}^{(r)} - m_{k} (u_{j}))}^{2}],

where

n_{grid}

is the number of points for estimating the MSE of the predicted function;

{\hat{γ}}_{k *}^{(r)}

and

{\hat{γ}}_{k 1}^{(r)}

are estimators of the B-spline coefficients for the r-th simulation using the proposed approach;

{\hat{β}}^{(r)}

is the estimator of the loading parameter

β

for the r-th simulation; and

u_{j}

corresponds to the

j / n_{grid} \times 100 %

quantile within the range of

X {\hat{β}}^{(r)}

. In our simulations,

n_{grid}

is set to 100.

The oracle percentage of

β

is defined as the proportion of correct selections of

β

across R simulations. For example, if

β_{d} \neq 0

and

β_{d}

is selected as non-zero in g simulations, then the oracle percentage for

β_{d}

is

\frac{g}{R} \times 100 %

.

The MSE of

β_{d}

is computed as

MSE = \frac{1}{R} \sum_{r = 1}^{R} {({\hat{β}}_{d}^{(r)} - β_{d})}^{2},

where

{\hat{β}}_{d}^{(r)}

is the estimator for

β_{d}

in the r-th simulation. This represents the average MSE for

β_{d}

.

The simulation data was generated based on the model (1). The environmental variable X was drawn from a

u n i f (0, 1)

distribution. For the loading parameter

β = {(β_{1}, β_{2}, \dots, β_{q})}^{T}

, we set

β_{1} = β_{2} = \frac{1}{\sqrt{2}}

and all other

β_{j}

were set to zero. We evaluated the performance of the proposed approach with both continuous (e.g., gene expressions) and discrete (e.g., SNPs) predictors G.

3.1. Continuous G

In the continuous case, the non-parametric functions

m_{k} (u)

were set as:

\begin{matrix} m_{0} (u) & = 2 sin (2 π u), & m_{1} (u) & = 2 cos (π u) + 2, & m_{2} (u) & = sin (2 π u) + cos (π u) + 1, \\ m_{3} (u) & = 2, & m_{4} (u) & = 2.5, & m_{k} (u) & = 0 for k = 5, \dots, p . \end{matrix}

The predictors G were generated from

N (0, 1)

distribution. We conducted

R = 1, 000

simulations to evaluate the model’s performance under

p = 50, 100

,

q = 5

and

n = 1000, 2000

.

Table 1 presents the selection and estimation accuracy for

m_{k} (\cdot)

with continuous predictors. Across all cases, the selection accuracy was close to

100 %

for varying, constant, and zero-effect coefficients. The IMSE of the proposed model was in the order of

- 1

or

- 2

for varying and constant effect predictors. As the model dimension p increased (from 50 to 100), a slight increase in the model IMSE was observed. Conversely, as the sample size n increased (from 1000 to 2000), both the model IMSE and the oracle IMSE decreased, aligning with the asymptotic properties of the proposed model. These results suggest that the proposed variable selection approach performs well in both selection and estimation accuracy for the non-parametric functions

m_{k} (\cdot)

.

Table 2 presents the selection and estimation accuracy for the loading parameter

β

. In all scenarios, the selection accuracy for non-zero loading parameters (

β_{1}, β_{2}

) was nearly

100 %

, with MSE values in the order of

- 2

to

- 4

. For zero loading parameters (

β_{3}, β_{4}, β_{5}

), the selection accuracy was approximately

97 %

when

n = 1000

. As the sample size increased to

n = 2000

, the oracle percentages improved to

99 %

, with MSE values in the order of

- 4

to

- 5

. Overall, the MSEs improve as the sample size increases, and the model MSE is very close to the oracle MSE, indicating strong estimation performance for the loading parameters.

3.2. Discrete G

We extended our evaluation to examine the performance of the proposed model with discrete predictors, G. Single nucleotide polymorphism (SNP) data is one of the most commonly used types of genetic data. SNPs take values of 0, 1, and 2, representing the genotypes aa, Aa, and AA, respectively. Additionally, SNPs exhibit a wide range of minor allele frequencies (MAF), making it crucial for our simulations to incorporate these characteristics.

To reflect these properties, G was simulated using the following probability distribution function:

P (G_{i j} = 0) = {(1 - p_{A})}^{2}, P (G_{i j} = 1) = 2 \cdot p_{A} \cdot (1 - p_{A}), P (G_{i j} = 2) = p_{A}^{2},

where

G_{i j}

denotes the jth predictor for the ith subject, with

i = 1, \dots, n

and

j = 1, \dots, p

, and

p_{A}

is the MAF of the minor allele A.

The gene effect functions,

m_{k} (u)

, were set in Table 3.

In this setup, predictors

G_{k}

exhibit varying and constant effects with

p_{A} = 0.1, 0.3, 0.5

. The purpose of setting varying MAFs is to check the selection and estimation performance under different MAFs. For zero-effect predictors

G_{k}

, their

p_{A}

values are uniformly distributed within the range

(0.05, 0.5)

. The environmental variables X was generated from a unif

(0, 1)

distribution. Finally, Y was generated according to the model specified in (1). We evaluated the performance of the proposed model through

1, 000

simulations, considering

p = 50, 100

,

n = 500, 1000

, and

q = 5

.

Table 4 presents the selection and estimation accuracy of the non-parametric function

m_{k} (\cdot)

for discrete G. We observed that the sample size n and MAF (

P_{A}

) of

G_{k}

were the primary factors influencing the performance of the proposed model. To better visualize their impact, we present Figure 1.

As the sample size increased (from 1000 to 2000), the model’s performance improved significantly. For instance, the oracle percentages for

m_{1} (\cdot), \dots, m_{4} (\cdot)

increased from approximately

80 %

to nearly

100 %

, and the corresponding IMSE decreased substantially. These results align with the asymptotic theory of the proposed model. Conversely, as the MAF for

G_{k}

decreased (from 0.5 to 0.1), we observed a decline in performance, reflected in both oracle percentages and model IMSE. For example, in the case where

n = 1000

, the oracle percentages for

{m_{1} (\cdot), m_{2} (\cdot)}

(

P_{A} = 0.5

),

{m_{3} (\cdot), m_{4} (\cdot)}

(

P_{A} = 0.3

), and

{m_{5} (\cdot), m_{6} (\cdot)}

(

P_{A} = 0.1

) were approximately

85 %

,

80 %

, and

23 %

, respectively. The IMSE increased correspondingly, from 0.4 (

P_{A} = 0.5

) to 0.5 (

P_{A} = 0.3

) and then to 1.3 (

P_{A} = 0.1

). This is a common phenomenon in genetic studies as smaller MAFs provide less data information to estimate the corresponding function, leading to poor estiamtion and selection performance.

Table 5 demonstrates the selection and estimation results for the loading parameter

β

. We observed that the sample size n was the primary factor influencing model performance. When the sample size was large (

n = 2000

), the oracle percentage for non-zero loading covariates (

β_{1}, β_{2}

) was

100 %

, and the oracle percentage for zero loading covariates (

β_{3}, β_{4}, β_{5}

) was approximately

99 %

. The MSE for

β

was in the order of

10^{- 3}

to

10^{- 5}

. In contrast, when the sample size was relatively small (

n = 1000

), the oracle percentage for non-zero loading covariates remained at

100 %

, but the oracle percentage for zero loading parameters decreased to around

95 %

. Comparing the cases where

p = 50

and

p = 100

, we saw a slight reduction in selection accuracy for zero loading parameters when

n = 1000

. This is expected, as model performance typically declines with increased model complexity. As the sample size increased to 2000, this difference in performance was not significant.

Based on simulation results with both continuous and discrete G variables, we observed the following key findings about the proposed model. First, the model demonstrates robust performance with large sample sizes (

n = 1000

or 2000). Second, when

n = 1000

, the false positive rate for the loading parameter

β

was around 5%. This may reflect an inherent limitation of the LQA algorithm that cannot shrink zero parameters to exactly zero when the sample is small. Third, the model exhibits superior performance for SNP variants with larger MAFs (e.g.,

P_{A} = 0.3

or 0.5) compared to SNP variants with smaller MAF (

P_{A} = 0.1

). This enhanced performance likely stems from the more data information content provided by SNPs with higher MAFs. Similar phenomenon is commonly observed in other genetic association studies. Overall, the simulation results demonstrate that the method performs reasonably well under finite sample conditions.

4. A Case Study

We demonstrated the applicability of our proposed model using a type 2 diabetes dataset containing genotypes (SNPs), environmental factors, and the phenotype (presence of type 2 diabetes). This dataset comprises two nested case-control cohort studies: the Nurses’ Health Study (NHS) and the Health Professionals Follow-Up Study (HPFS) from the Gene-Environment Association Studies Consortium (GENVEA). Detailed descriptions of these cohorts can be found in Colditz and Hankinson (2005) and Rimm et al. (1991). Initially, the dataset included 3,391 females (NHS) and 2,599 males (HPFS).

After data cleaning, which involved removing subjects with mismatched genotypes and phenotypes, SNPs with more than

10 %

missing data, SNPs with MAF

< 0.05

, and SNPs deviating from Hardy-Weinberg equilibrium (p-value

< 0.001

), the final dataset included 5,865 subjects (2,494 males and 3,371 females), with 2,733 cases and 3,132 controls, and 655,002 SNPs. The dataset also contained 12 continuous environmental factors, such as height, weight, age, and alcohol consumption. We fit a marginal logistic regression model for all 12 factors and, selected five environmental factors for the analysis: total physical activity (

X_{1}

, denoted as act), BMI (

X_{2}

), alcohol intake (

X_{3}

, denoted as alcohol), heme iron intake (

X_{4}

, denoted as heme), and glycemic load (

X_{5}

, denoted as gl).

Based on SNP locations, we mapped all SNPs to known genes and selected genes containing more than 30 SNPs, resulting in a total of 2,178 genes. For each gene, we applied the proposed variable selection approach to identify significant SNPs and their effects. To ensure identifiability, the first element of the loading parameter

β

was constrained to be a non-zero positive value. We fit the proposed model five times for each gene, varying the environmental factor used as the first element each time. An SNP was deemed significant if it was identified as either varying or constant across all environmental factor orders, indicating a strong signal.

In total, our model identified 13 varying-effect SNPs and 26 constant-effect SNPs. Here, we present one of the selected varying-effect SNPs as an example. Please refer to Table A1 and Table A2 in the Appendix for the complete list of selected varying- and constant-effect SNPs. Previous studies (Sale et al., 2007; Grant et al., 2006) have suggested that the gene TCF7L2 is associated with type 2 diabetes across multiple populations. Specifically, Sale et al. (2007) reported a strong association between type 2 diabetes and SNPs rs7903146 and rs7901695 within this gene. Consistent with these findings, our model identified SNP rs7901695 as a constant-effect predictor, indicating no interaction with the five environmental factors (Note that SNP rs7903146 was not observed in this dataset).

Figure 2 shows plots of the marginal total environmental effect and the interaction effect of the SNP rs6537663 in gene TCF7L2, with heme iron intake as the first loading covariate. As the index increases, we observed that the marginal effect initially decreases, then increases, and subsequently exhibits a rapid decrease as the total effect of the five environmental factors increases. For the interaction effect, it fluctuates around 0 as the total effect of the five environmental variables increases, indicating that the SNP is unresponsive (or insensitive) to changes in these variables. However, as the index

X β

continues to increase, the SNP reacts to environmental changes, with a dramatic increase in risk for type 2 diabetes beyond a certain threshold. This estimated effect suggests that the genetic sensitivity of the SNP to the total effect of the five environmental variables follows a threshold model. This finding has practical implications: most individuals tolerate daily environmental changes, including dietary variations, without adverse effects. However, when such changes exceed a certain limit, the risk of disease may increase.

Table 6 presents the selection and estimation results of the loading parameters

β

, with heme iron intake as the first loading covariate. The model selects all loading parameters except for alcohol consumption (alcohol). Notably, we observed that body mass index (BMI) has the largest effect, which aligns with practical knowledge, as BMI is positively associated with type 2 diabetes and is a well-established risk factor for the disease. Notably, the sign of the loading parameters aligns with known associations in the literature. For example, Hu et al. (1999) reported that higher physical activity is linked to a lower risk of type 2 diabetes, while Field et al. (2001) demonstrated that high BMI increases the risk of type 2 diabetes. Additionally, Rajpathak et al. (2009) found that high heme iron intake is associated with an increased risk of type 2 diabetes, and Salmerón et al. (1997) showed that a high glycemic load is linked to a higher risk of type 2 diabetes. The signs of the estimated loading coefficients are consistent with these established findings.

5. Discussion

G×E interactions have been extensively studied in the literature, leading to the development of numerous statistical models. In this paper, we propose a three-step iterative variable selection approach for the generalized single-index varying-coefficients model (gSIVCM) with a binary response. Our goal is to identify varying, constant, and zero-effect genes, as well as to select non-zero environmental factors that interact with varying-effect genes. Biologically, our approach is attractive as it provides a novel perspective on G×E interactions. The flexibility of our model allows for the detection of non-linear interactions, making it particularly suitable when gene effects are non-linearly influenced by simultaneous exposure to multiple environmental factors. Statistically, gSIVCM reduces the dimensionality of the model by treating multiple indices X as a single index, which significantly alleviates the curse of dimensionality when multiple environmental factors interact with gene effects.

Our work builds upon the framework introduced by Guan et al. (2023) with a three-step variable selection approach for SIVCM with continuous responses. We extended our previous work to binary responses, broadening its applicability to a wider range of biological studies. For continuous responses, researchers are often interested in specific quantiles of the response, such as birth weight, rather than the mean. It is a natural progression to extend our variable selection approach to quantile regression settings, which we plan to explore in future studies.

Among the listed genes in Table A1 in the appendix, ABCA1 and NTRK2 show strong evidence of association with type 2 diabetes (T2D), with ABCA1 influencing lipid metabolism and insulin secretion (Kruit et al., 2012; Singh et al., 2022) and NTRK2 regulating energy balance and glucose metabolism via BDNF signaling (Xu et al., 2019). Genes such as GALNT2, PTK2B, and RBM15-AS1 are linked to metabolic pathways or inflammation, indirectly influencing T2D risk (Weissglas-Volkov et al., 2011; Wang et al., 2021; Wang et al., 2023). Other genes, including LARS2, UNC5C, and SCAI, have limited or emerging evidence, often tied to mitochondrial dysfunction, apoptosis, or cellular stress, highlighting the need for further investigation into their roles in T2D pathogenesis (van Berge et al., 2019; Morrison et al., 2020; Smith et al., 2019). For genes listed in Table A2, TCF7L2 stands out as a well-established genetic risk factor for T2D. Variants in TCF7L2 are strongly associated with impaired insulin secretion and glucose metabolism, contributing significantly to T2D susceptibility (Zhou et al., 2017). Another important gene is GALNT2, which has been implicated in lipid and glucose metabolism; polymorphisms in this gene are linked to alterations in glycemic traits and may indirectly influence T2D risk (Weissglas-Volkov et al., 2011). For the other genes, though no strong evidence directly linking them to T2D has been identified in the current literature, they may still play roles in metabolic or cellular pathways relevant to diabetes pathophysiology, warranting further investigation.

In this work, we demonstrated the model’s applicability through a gene-based analysis. Our method can also be extended to pathway-based analysis. In the human genome, pathways typically include a diverse array of genes, with each gene containing hundreds to thousands of SNPs. A pathway-based SNP-level analysis can be conducted by modeling SNPs within a pathway as genetic variables. Alternatively, a pathway-based gene-level analysis can be performed by summarizing the genomic information of each gene into a few principal components (PCs) using principal component analysis (PCA) (Jolliffe, 2002) or sparse principal component analysis (sPCA) (Zou et al., 2006). The proposed method can then be applied to select significant PCs within a pathway, facilitating the identification of key interactions between genes and environmental mixtures. By accounting for the genomic structure, this pathway-based approach has the potential to yield more interpretable and biologically meaningful results.

Acknowledgments

This work was supported in part by the National Institutes of Health (NIH) under award number R21HG010073 (to Y. Cui). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Funding support for the GWAS of Gene and Environment Initiatives in Type 2 Diabetes was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01HG004399). The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/gap/throughthedbGaPaccessionnumberphs000091.v2.p1.

Appendix A

Appendix A.1. Computational Algorithm

Step 1 and Step 2 follow directly from the group coordinate descent algorithm shown in the main text, and we omit the details here.

Step 3:

\hat{β} = max_{{| | β | |}_{2} = 1} M_{3} (β | λ_{3}, {\hat{γ}}^{(2)}) = max_{{| | β | |}_{2} = 1} (l ({\hat{γ}}^{(2)}, β) - n \sum_{d = 2}^{q} p_{λ_{3}} (| β_{d} |))

Since this is a penalized likelihood problem, we adopt the local quadratic approximation (LQA) approach proposed by Fan and Li (2001). Let

\tilde{β}

denote the most recent value of

β

. By applying Taylor expansion at

\tilde{β}

, we obtain:

l ({\hat{γ}}^{(2)}, β) \approx l ({\hat{γ}}^{(2)}, \tilde{β}) + ▽ l {({\hat{γ}}^{(2)}, \tilde{β})}^{T} (β - \tilde{β}) + \frac{1}{2} {(β - \tilde{β})}^{T} ▽^{2} l ({\hat{γ}}^{(2)}, \tilde{β}) (β - \tilde{β})

where

▽ l ({\hat{γ}}^{(2)}, \tilde{β}) = {(\frac{\partial l ({\hat{γ}}^{(2)}, β)}{\partial β_{1}}, \dots, \frac{\partial l ({\hat{γ}}^{(2)}, β)}{\partial β_{q}})}^{T} |_{β = \tilde{β}} is the gradient

▽^{2} l ({\hat{γ}}^{(2)}, \tilde{β}) = {[\frac{\partial^{2} l ({\hat{γ}}^{(2)}, β)}{\partial β_{j} \partial β_{l}}]}_{1 \leq j, l \leq q} |_{β = \tilde{β}} is the hessian matrix,

and

\frac{\partial l ({\hat{γ}}^{(2)}, β)}{\partial β_{j}} = {(Y - \frac{1}{1 + e^{- μ^{B^{(3)}}}})}^{T} \frac{\partial μ^{B^{(3)}}}{\partial β_{j}}

\frac{\partial^{2} l ({\hat{γ}}^{(2)}, β)}{\partial β_{j} \partial β_{l}} = {(Y - \frac{1}{1 + e^{- μ^{B^{(3)}}}})}^{T} \frac{\partial μ^{B^{(3)}}}{\partial β_{j} \partial β_{l}} - {(\frac{e^{- μ^{B^{(3)}}}}{{(1 + e^{- μ^{B^{(3)}}})}^{2}})}^{T} \frac{\partial μ^{B^{(3)}}}{\partial_{β_{j}}} \cdot \frac{\partial μ^{B^{(3)}}}{\partial_{β_{l}}},

and

μ^{B^{(3)}} = \sum_{k = 0}^{p} [{\hat{γ}}_{k 1}^{(2)} + \bar{B} (X β) {\hat{γ}}_{k *}^{(2)}] \cdot G_{k}

\frac{\partial μ^{B^{(3)}}}{\partial β_{j}} = \sum_{k = 0}^{p} ({\bar{B}}^{'} (X β) {\hat{γ}}_{k *}^{(2)} \cdot X_{j} \cdot G_{k})

\frac{\partial μ^{B^{(3)}}}{\partial β_{j} \partial β_{l}} = \sum_{k = 0}^{p} ({\bar{B}}^{''} (X β) {\hat{γ}}_{k *}^{(2)} \cdot X_{j} \cdot X_{l} \cdot G_{k})

Let

Σ_{λ_{3}} (\tilde{β}) = d i a G (0, \frac{p_{λ_{3}}^{'} (\hat{β_{2}})}{| \hat{β_{2}} |}, \dots, \frac{p_{λ_{3}}^{'} (\hat{β_{q}})}{| \hat{β_{q}} |})

We update

β

as follows:

β^{*} = \tilde{β} - {[▽^{2} l ({\hat{γ}}^{(2)}, \tilde{β}) + n Σ_{λ_{3}} (\tilde{β})]}^{- 1} [▽ l ({\hat{γ}}^{(2)}, \tilde{β}) + n Σ_{λ_{3}} (\tilde{β}) \tilde{β}] .

The updated value of

β

is then obtained through standardization:

β^{updated} = sign (β_{1}^{*}) \frac{β^{*}}{| | β^{*} {| |}_{2}} .

Finally, we iterate this process until convergence.

Appendix A.2. Proofs of Theorems

Appendix A.2.1. Notations

Let the space of Lipschitz continuous functions for any fixed constant c be denoted as

L i p ([a, b], c) = {f : | f (x_{1}) - f (x_{2}) | \leq c | x_{1} - x_{2} |, \forall x_{1}, x_{2} \in [a, b]}

. Define

C^{(p)} [a, b] = {f : f^{(p)} \in C [a, b]}

as the space of p-th order smooth functions.

Appendix A.2.2. Some Regularity Conditions

(A1) The density function

f_{U (β)} (.)

of the random variable

U (β) = X β

is bounded away from 0 on

Ω = {X β, X \in X}

, where

X

is the compact support of X. There exists a constant

c_{1}

such that

f_{U (β)} (.) \in L i p ([a, b], c_{1})

.

(A2) For

k = 0, 1, \dots, p

, the non-parametric function

m_{k} (.) \in C^{(r)}

with

r \geq 2

.

(A3)

E (| | G | |^{6}) < \infty

.

(A4) The matrix

M (u) = E (G G^{T} | X β = u)

is positive definite, and each element of

M (u) \in L i p ([a, b], c_{4})

.

(A5) Define

b_{n} = max_{k, l} {p_{λ_{1}}^{''} (| | γ_{k *}^{0} {| |}_{2}), p_{λ_{2}}^{''} (| γ_{k 1}^{0} |), p_{λ_{3}}^{''} (| β_{d}^{0} |) : γ_{k *}^{0} \neq 0, γ_{k 1}^{0} \neq 0, β_{l}^{0} \neq 0} .

For

k = 1, \dots, p

and

d = 2, \dots, q

, it holds that

b_{n} \to 0

as

n \to \infty

.

(A6)

\underset{n \to \infty}{lim inf} \underset{| | γ_{k *} {| |}_{2} \to 0^{+}}{lim inf} \frac{1}{λ_{1}} | p_{λ_{1}}^{'} (| | γ_{k *} {| |}_{2}) | > 0 for k = v + 1, \dots, p,

\underset{n \to \infty}{lim inf} \underset{| γ_{k 1} | \to 0^{+}}{lim inf} \frac{1}{λ_{2}} | p_{λ_{2}}^{'} (| γ_{k 1} |) | > 0 for k = c + 1, \dots, p,

\underset{n \to \infty}{lim inf} \underset{| β_{d} | \to 0^{+}}{lim inf} \frac{1}{λ_{3}} | p_{λ_{3}}^{'} (| β_{d} |) | > 0 for d = s + 1, \dots, q .

(A7) Let

c_{1}, \dots, c_{K}

be internal points of

[a, b]

, where

a = inf {u : u \in Ω}

and

b = sup {u : u \in Ω}

. Define

c_{0} = a

,

c_{K + 1} = b

,

h_{i} = c_{i} - c_{i - 1}

, and

h = max {h_{i}}

. There exists a constant

C_{7}

such that

\frac{h}{min {h_{i}}} < C_{7}

and

max {h_{i + 1} h_{i}} = o (K^{- 1})

.

Appendix A.2.3. Proof of Theorem 1

Let

ϕ = {(β_{2}, \dots, β_{q})}^{T}

, and we have

β = {(\sqrt{1 - {| | ϕ | |}_{2}^{2}}, ϕ^{T})}^{T}

, hence the restriction

| | β | | = 1

and

β_{1} > 0

is equivalent to

{| | ϕ | |}_{2} < 1

. To show the consistency of

\hat{β}

, it is enough to show the consistency of

\hat{ϕ}

.

Define

α_{n} = n^{- r / (2 r + 1)} + a_{n}

,

γ = γ^{0} + α_{n} τ_{1}

,

ϕ = ϕ^{0} + α_{n} τ_{2}

, and

τ = (τ_{1}, τ_{2})

, where

τ_{1} = (τ_{01}, τ_{0 *}, \dots, τ_{p 1}, τ_{p *})

and

{τ_{k 1}, τ_{k *}}

correspond to the B-spline coefficients

γ_{k 1}, γ_{k *}

. Similarly,

τ_{2} = (τ_{1}^{ϕ}, \dots, τ_{q - 1}^{ϕ})

, where

τ_{l}^{ϕ}

corresponds to

ϕ_{l}

.

To establish the consistency of

\hat{γ}

and

\hat{ϕ}

, we need to show that

\forall ϵ > 0

, ∃ a sufficiently large C, such that:

P (sup_{| | τ | | = C} {M (γ, ϕ)} < M (γ^{0}, ϕ^{0})) \geq 1 - ϵ,

(A1)

where

M (γ, ϕ) = l (γ, ϕ) - n \sum_{k = 1}^{p} p_{λ_{1}} (| | γ_{k *} {| |}_{2}) - n \sum_{k = 1}^{p} p_{λ_{2}} (| γ_{k 1} |) I (| | γ_{k *} {| |}_{2} = 0) - n \sum_{l = 1}^{q - 1} p_{λ_{3}} (| ϕ_{l} |),

and

l (γ, ϕ)

is the log-likelihood function defined above. If (A1) holds, it implies that with probability at least

1 - ϵ

, there exists a local maximum within the ball

{(γ^{0}, ϕ^{0}) + α_{n} τ : | | τ | | \leq C}

. Hence, there exists a local maximizer such that

| | (\hat{γ}, \hat{ϕ}) - (γ^{0}, ϕ^{0}) | | = O_{p} (α_{n})

.

Define

\begin{matrix} D_{n} (τ) & = \frac{1}{K} {M (γ, ϕ) - M (γ^{0}, ϕ^{0})} = \frac{1}{K} {M (γ^{0} + α_{n} τ_{1}, ϕ^{0} + α_{n} τ_{2}) - M (γ^{0}, ϕ^{0})} \\ = \frac{1}{K} {[l (γ^{0} + α_{n} τ_{1}, ϕ^{0} + α_{n} τ_{2}) - l (γ^{0}, ϕ^{0})] - n \sum_{k = 1}^{p} [p_{λ_{1}} (| | γ_{k *}^{0} + α_{n} τ_{k *} {| |}_{2}) - p_{λ_{1}} (| | γ_{k *}^{0} {| |}_{2})] \\ - n \sum_{k = 1}^{p} [p_{λ_{2}} (| γ_{k 1}^{0} + α_{n} τ_{k 1} |) I (| | γ_{k *}^{0} + α_{n} τ_{k *} {| |}_{2} = 0) - p_{λ_{2}} (| γ_{k 1}^{0} |) I (| | γ_{k *}^{0} {| |}_{2} = 0)] \\ - n \sum_{j = 1}^{q - 1} [p_{λ_{3}} (| ϕ_{j}^{0} + α_{n} τ_{j}^{ϕ} |) - p_{λ_{3}} (| ϕ_{j}^{0} |)]} . \end{matrix}

Since

p_{λ_{1}} (| | γ_{k *}^{0} {| |}_{2}) = 0

for

k = v + 1, \dots, p

,

p_{λ_{3}} (| ϕ_{j}^{0} |) = 0

for

j = s + 1, \dots, q - 1

, and

I (| | γ_{k *}^{0} | |_{2} = 0) = 0

for

k = 1, \dots, v

, we have:

\begin{matrix} D_{n} (τ) & \leq \frac{1}{K} {l (γ^{0} + α_{n} τ_{1}, ϕ^{0} + α_{n} τ_{2}) - l (γ^{0}, ϕ^{0}) \\ - n \sum_{k = 1}^{v} [p_{λ_{1}} (| | γ_{k *}^{0} + α_{n} τ_{k *} {| |}_{2}) - p_{λ_{1}} (| | γ_{k *}^{0} {| |}_{2})] \\ - n \sum_{k = v + 1}^{p} [p_{λ_{2}} (| γ_{k 1}^{0} + α_{n} τ_{k 1} |) - p_{λ_{2}} (| γ_{k 1}^{0} |)] \\ - n \sum_{j = 1}^{s - 1} [p_{λ_{3}} (| ϕ_{j}^{0} + α_{n} τ_{j}^{ϕ} |) - p_{λ_{3}} (| ϕ_{j}^{0} |)]} . \end{matrix}

Applying Taylor expansion at

(γ^{0}, ϕ^{0})

and simplifying the bounds for

D_{n} (τ)

, we derive that:

| | (\hat{γ}, \hat{ϕ}) - (γ^{0}, ϕ^{0}) | | = O_{p} (α_{n}) .

where

I (γ^{0}, ϕ^{0})

is the Fisher information matrix. By standard arguments of likelihood theory,

S_{1}

is of the order

O_{p} (1 + n^{r / (2 r + 1)} α_{n}) | | τ | |

,

S_{2}

is of the order

O_{p} (1 + 2 n^{r / (2 r + 1)} α_{n}) {| | τ | |}^{2}

, and for sufficiently large C,

S_{2}

dominates

S_{1}

uniformly in

| | τ | | = C

. Furthermore, we have

\begin{matrix} S_{3} & \leq \frac{n}{K} α_{n} a_{n} \sum_{k = 1}^{v} \frac{γ_{k *}^{0}}{| | γ_{k *}^{0} {| |}_{2}} τ_{k *}^{T} + \frac{n}{K} α_{n}^{2} max_{k} {p_{λ_{1}}^{''} (| | γ_{k *}^{0} {| |}_{2})} \sum_{k = 1}^{v} τ_{k *} τ_{k *}^{T} \\ \leq \frac{n}{K} α_{n}^{2} \sqrt{v} | | τ | | + \frac{n}{K} α_{n}^{2} {| | τ | |}^{2} max_{k} {p_{λ_{1}}^{''} (| | γ_{k *}^{0} {| |}_{2})} . \end{matrix}

Since

{max}_{k} {p_{λ_{1}}^{''} (| | γ_{k *}^{0} {| |}_{2})} \to 0

, we conclude that

S_{3}

is dominated by

S_{2}

.

For

S_{4}

and

S_{5}

, we have:

\begin{matrix} S_{4} & \leq & α_{n} a_{n} \frac{n}{K} \sum_{k = v + 1}^{p} τ_{k 1} + \frac{n}{K} α_{n}^{2} max_{k} {p_{λ_{2}}^{''} (| γ_{k 1}^{0} |)} \sum_{k = v + 1}^{p} {(τ_{k 1})}^{2} \\ \leq & \frac{n}{K} α_{n}^{2} | | τ | | + \frac{n}{K} α_{n}^{2} {| | τ | |}^{2} max_{k} {p_{λ_{2}}^{''} (| γ_{k 1}^{0} |)}, \end{matrix}

and

\begin{matrix} S_{5} & \leq & α_{n} a_{n} \frac{n}{K} \sum_{l = 1}^{s - 1} τ_{l}^{ϕ} + \frac{n}{K} α_{n}^{2} max_{l} {p_{λ_{3}}^{''} (| ϕ_{l}^{0} |)} \sum_{l = 1}^{s - 1} {(τ_{l}^{ϕ})}^{2} \\ \leq & \frac{n}{K} α_{n}^{2} | | τ | | + \frac{n}{K} α_{n}^{2} {| | τ | |}^{2} max_{l} {p_{λ_{3}}^{''} (| ϕ_{l}^{0} |)} . \end{matrix}

Similarly,

S_{4}

and

S_{5}

are dominated by

S_{2}

. Thus, by choosing a sufficiently large C, we obtain

| | (\hat{γ}, \hat{ϕ}) - (γ^{0}, ϕ^{0}) | | = O_{p} (α_{n})

. Therefore, the consistency of the penalized least squares estimator

(\hat{γ}, \hat{ϕ})

is proven.

Appendix A.2.4. Proof of Theorem 2

For (i), for ease of notation, let

ϕ = (ϕ^{n z}, ϕ^{z})

, where

ϕ^{n z} = (ϕ_{1}, \dots, ϕ_{s - 1})

and

ϕ^{z} = (ϕ_{s}, \dots, ϕ_{q - 1})

. Since

λ_{\max} \to 0

, it follows that

a_{n} = 0

for sufficiently large n. By Theorem 1, it suffices to show that for

ϕ^{n z}

:

| | ϕ_{l} - ϕ_{l}^{0} {| |}_{2} = O_{p} (n^{- r / (2 r + 1)}), l = 1, \dots, s - 1,

and for

ϕ^{z}

, there exists a small constant

ϵ = C n^{- r / (2 r + 1)}

such that as

n \to \infty

, with probability approaching 1, for

l = s, \dots, q - 1

, we have:

\frac{\partial M (γ, ϕ)}{\partial ϕ_{l}} < 0 when 0 < ϕ_{l} < ϵ, and \frac{\partial M (γ, ϕ)}{\partial ϕ_{l}} > 0 when - ϵ < ϕ_{l} < 0 .

We start with:

\frac{\partial M (γ, ϕ)}{\partial ϕ_{l}} = \frac{\partial l (γ, ϕ)}{\partial ϕ_{l}} - n p_{λ_{3}}^{'} (| ϕ_{l} |) sign (ϕ_{l}) .

Expanding

\frac{\partial l (γ, ϕ)}{\partial ϕ_{l}}

at

ϕ^{0}

using a Taylor expansion, we obtain:

\begin{matrix} \frac{\partial M (γ, ϕ)}{\partial ϕ_{l}} & = \frac{\partial l (γ, ϕ^{0})}{\partial ϕ_{l}} + \sum_{k = 1}^{q - 1} \frac{\partial^{2} l (γ, ϕ^{0})}{\partial ϕ_{l} \partial ϕ_{k}} (ϕ_{k} - ϕ_{k}^{0}) \\ + \sum_{k = 1}^{q - 1} \sum_{j = 1}^{q - 1} \frac{\partial^{3} l (γ, ϕ^{*})}{\partial ϕ_{l} \partial ϕ_{k} \partial ϕ_{j}} (ϕ_{k} - ϕ_{k}^{0}) (ϕ_{j} - ϕ_{j}^{0}) - n p_{λ_{3}}^{'} (| ϕ_{l} |) sign (ϕ_{l}), \end{matrix}

where

ϕ^{*}

lies between

ϕ^{0}

and

ϕ

. After simplification, we derive:

\frac{\partial M (γ, ϕ)}{\partial ϕ_{l}} = n λ_{3} \{- \frac{1}{λ_{3}} p_{λ_{3}}^{'} (| ϕ_{l} |) sign (ϕ_{l}) + O_{p} (\frac{1}{λ_{3}} n^{- r / (2 r + 1)})\} .

Since

{lim}_{n \to \infty} {lim inf}_{ϕ_{l} \to 0} \frac{1}{λ_{3}} p_{λ_{3}}^{'} (| ϕ_{l} |) > 0

and

\frac{1}{λ_{3}} n^{- r / (2 r + 1)} \to 0

, the sign of

\frac{\partial M (γ, ϕ)}{\partial ϕ_{l}}

is completely determined by

sign (ϕ_{l})

. Thus, we conclude

{\hat{β}}_{j} = 0

for

j = s + 1, \dots, q - 1

.

For (ii) & (iii), by applying similar arguments as in part (i), it follows that with probability approaching 1,

{\hat{γ}}_{k *} = 0

for

k = v + 1, \dots, p

and

{\hat{γ}}_{k 1} = 0

for

k = c + 1, \dots, p

. Using

{sup}_{u} B (u) = O (1)

and the fact that

{\hat{m}}_{k} (u) = {\hat{γ}}_{k 0} + \bar{B} (X \hat{β}) {\hat{γ}}_{k *}

, we have:

{\hat{m}}_{k} (u) = c_{k}, k = v + 1, \dots, c,

where

c_{k}

is a constant, and:

{\hat{m}}_{k} (u) = 0, k = c + 1, \dots, p .

This completes the proof.

Appendix A.3. Additional Real Data Analysis Results

Table A1 lists all the varying-effect SNPs selected, along with the gene IDs to which they were mapped. Similarly, Table A2 provides a summary of all the constant-effect SNPs selected, along with their corresponding gene IDs.

Table A1. List of SNPs with varying effects.

Gene ID	Gene Symbol	SNP ID
GeneID:440600	RBM15-AS1	rs6537663
GeneID:2590	GALNT2	rs6666516
GeneID:729993	SHISA9	rs1015431
GeneID:54768	HYDIN	rs4788621
GeneID:117532	TMC2	rs7509377
GeneID:758	MPPED1	rs5766384
GeneID:23395	LARS2	rs4311249
GeneID:647107	LINC01192	rs2404825
GeneID:8633	UNC5C	rs3775049
GeneID:2185	PTK2B	rs6557991
GeneID:4915	NTRK2	rs6559870
GeneID:19	ABCA1	rs4742969
GeneID:286205	vSCAI	rs2416996

Table A2. List of SNPs with constant effect.

Gene ID	Gene Symbol	SNP ID
GeneID:114827	FHAD1	rs3815792
GeneID:2899	GRIK3	rs12118788
GeneID:260425	MAGI3	rs11102660
GeneID:9857	CEP350	rs2293990
GeneID:2590	GALNT2	rs9308482
GeneID:6934	TCF7L2	rs7901695
GeneID:55742	PARVA	rs7101596
GeneID:867	CBL	rs4489755
GeneID:10867	TSPAN9	rs740771
GeneID:57494	RIMKLB	rs11047510
GeneID:196385	DNAH10	rs11058132
GeneID:64328	XPO4	rs1961415
GeneID:23348	DOCK9	rs7326971
GeneID:23348	DOCK9	rs7991210
GeneID:57099	AVEN	rs16962542
GeneID:11060	WWP2	rs16970994
GeneID:25780	RASGRP3	rs6708570
GeneID:100505498	LOC100505498	rs6730602
GeneID:117532	TMC2	rs11696526
GeneID:29780	PARVB	rs5765571
GeneID:25814	ATXN10	rs713999
GeneID:9620	CELSR1	rs11090812
GeneID:23429	RYBP	rs17009630
GeneID:80254	CEP63	rs11710699
GeneID:8633	UNC5C	rs10516957
GeneID:157680	VPS13B	rs1788161

References

Carpenter, D. O., Arcaro, K., & Spink, D. C. (2002). Understanding the human health effects of chemical mixtures. Environmental Health Perspectives, 110(Suppl 1), 25. [CrossRef]
Colditz, G. A., & Hankinson, S. E. (2005). The Nurses’ Health Study: lifestyle and health among women. Nature Reviews Cancer, 5(5), 388-396. [CrossRef]
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456), 1348-1360. [CrossRef]
Field, A. E., Coakley, E. H., Must, A., Spadano, J. L., Laird, N., Dietz, W. H., Rimm, E., & Colditz, G. A. (2001). Impact of overweight on the risk of developing common chronic diseases during a 10-year period. Archives of Internal Medicine, 161(13), 1581–1586. [CrossRef]
Frank, L. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109-135. [CrossRef]
Hu, F. B., Sigal, R. J., Rich-Edwards, J. W., Colditz, G. A., Solomon, C. G., Willett, W. C., & Manson, J. E. (1999). Walking compared with vigorous physical activity and risk of type 2 diabetes in women: a prospective study. Journal of the American Medical Association, 282(15), 1433–1439. [CrossRef]
Grant, S. F., Thorleifsson, G., Reynisdottir, I., Benediktsson, R., Manolescu, A., Sainz, J., ... & Styrkarsdottir, U. (2006). Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics, 38(3), 320-323. [CrossRef]
Guan, S., Zhao, M., Cui, Y. (2023). Variable selection for single-index varying-coefficients models with applications to synergistic G×E interactions. Electronic Journal of Statistics, 17(1): 823–857. [CrossRef]
Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.
Kruit, J. K., Wijesekara, N., Westwell-Roper, C., Bruce, J., Zhao, M., Kahn, S. E., & Hayden, M. R. (2012). Loss of ATP-binding cassette transporter A1 in mice impairs glucose tolerance and insulin secretion in vivo. Diabetes, 61(12), 3166-3177.
Morrison, J. L., Gallas, P. R., Zheng, W., Mohtashami, S., & Wong, T. (2020). UNC5C and its relevance to beta-cell apoptosis. Endocrinology, 161(7), bqaa064.
Ottman, R. (1996). Gene-environment interaction: definitions and study designs. Preventive medicine, 25(6), 764. [CrossRef]
Rajpathak, S. N., Crandall, J. P., Wylie-Rosett, J., Kabat, G. C., Rohan, T. E., & Hu, F. B. (2009). The role of iron in type 2 diabetes in humans. Biochimica et Biophysica Acta (BBA) - General Subjects, 1790(7), 671–681. [CrossRef]
Rimm, E. B., Giovannucci, E. L., Willett, W. C., Colditz, G. A., Ascherio, A., Rosner, B., & Stampfer, M. J. (1991). Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet, 338(8765), 464-468. [CrossRef]
Sale, M. M., Smith, S. G., Mychaleckyj, J. C., Keene, K. L., Langefeld, C. D., Leak, T. S., ... & Freedman, B. I. (2007). Variants of the transcription factor 7-like 2 (TCF7L2) gene are associated with type 2 diabetes in an African-American population enriched for nephropathy. Diabetes, 56(10), 2638-2642. [CrossRef]
Salmerón, J., Ascherio, A., Rimm, E. B., Colditz, G. A., Spiegelman, D., Jenkins, D. J., Stampfer, M. J., Wing, A. L., & Willett, W. C. (1997). Dietary fiber, glycemic load, and risk of NIDDM in men. Diabetes Care, 20(4), 545–550. [CrossRef]
Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6(2), 461-464.
Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press.
Sexton, K., & Hattis, D. (2007). Assessing cumulative health risks from exposure to environmental mixtures-three fundamental questions. Environmental Health Perspectives, 825-832. [CrossRef]
Singh, J., Kumar, V., Aneja, A., Gupta, S., & Dhingra, S. (2022). Genetic polymorphisms in ABCA1 (rs2230806 and rs1800977) and LIPC (rs2070895) genes and their association with the risk of type 2 diabetes: A case-control study. International Journal of Diabetes in Developing Countries, 42(2), 227-235. [CrossRef]
Smith, A. R., Jones, K. M., Patel, D., Thomas, C., & Reid, L. (2019). The role of SCAI in cellular stress pathways. Journal of Cell Science, 132(17), jcs231829.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.
van Berge, L., Keating, D. J., Greber, B., O’Neill, C., & Lang, J. (2019). Mitochondrial dysfunction in metabolic syndrome. Frontiers in Endocrinology, 10, 607.
Wang, Y., Wang, J., He, Z., Tang, X., Zheng, Y., & Liu, H. (2021). Inflammatory pathways in T2D: A focus on PTK2B. Immunometabolism, 3(1), e210006.
Wang, X., Zhang, J., Li, Y., Wang, Z., Sun, Q., Zhou, H., & Fan, Y. (2023). RBM15 regulates hepatic insulin sensitivity in GDM offspring. Molecular Medicine, 29(1), 45.
Weissglas-Volkov, D., Aguilar-Salinas, C. A., Nikkola, E., Tusie-Luna, T., Cruz-Bautista, I., Arellano-Campos, O., & Pajukanta, P. (2011). GALNT2 polymorphisms and their role in lipid and glucose metabolism. Human Molecular Genetics, 20(9), 1632-1640.
Xu, B., Li, X., Nian, K., Li, J., Li, M., Zheng, W., & Han, X. (2019). BDNF and NTRK2 in metabolic regulation. Nature Medicine, 25(5), 697-707.
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2), 894-942. [CrossRef]
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418-1429. [CrossRef]
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2), 265-286. [CrossRef]

Figure 1. Selection and estimation accuracy of

m_{k} (\cdot)

for discrete G

Figure 1. Selection and estimation accuracy of

m_{k} (\cdot)

for discrete G

Figure 2. Plot of varying effects on a log-odds scale for SNP rs6537663 in Gene TCF7L2.

Table 1. Selection and estimation accuracy of function

m_{k} (.)

for continuous predictors.

Table 1. Selection and estimation accuracy of function

m_{k} (.)

for continuous predictors.

n	$m (\cdot)$	$p = 50$			$p = 100$
n	$m (\cdot)$	Oracle %	IMSE_Model	IMSE_Oracle	Oracle %	IMSE_Model	IMSE_Oracle
$n = 1000$	$m_{0} (.)$	100.0%	1.50E-01	1.47E-01	100.0%	1.48E-01	1.31E-01
	$m_{1} (.)$	99.2%	1.44E-01	2.14E-01	99.7%	1.48E-01	1.66E-01
	$m_{2} (.)$	99.4%	1.34E-01	1.61E-01	99.7%	1.37E-01	1.43E-01
	$m_{3} (.)$	100.0%	3.95E-02	3.85E-02	100.0%	4.19E-02	3.33E-02
	$m_{4} (.)$	99.9%	5.58E-02	5.53E-02	100.0%	5.93E-02	4.86E-02
	Zero	99.1%	1.03E-03	0	99.0%	1.16E-03	0
$n = 2000$	$m_{0} (.)$	100.0%	6.85E-02	6.88E-02	100.0%	7.00E-02	7.05E-02
	$m_{1} (.)$	100.0%	5.17E-02	6.00E-02	100.0%	5.43E-02	6.34E-02
	$m_{2} (.)$	100.0%	5.46E-02	5.99E-02	100.0%	5.40E-02	5.88E-02
	$m_{3} (.)$	100.0%	1.40E-02	1.46E-02	100.0%	1.46E-02	1.48E-02
	$m_{4} (.)$	100.0%	1.79E-02	1.93E-02	100.0%	1.89E-02	1.87E-02
	Zero	99.4%	3.33E-04	0	99.4%	3.14E-04	0

Table 2. Selection and estimation accuracy of the loading parameters

β

for continuous predictors.

Table 2. Selection and estimation accuracy of the loading parameters

β

for continuous predictors.

n	$β$	$p = 50$			$p = 100$
n	$β$	Oracle %	MSE_Model	MSE_Oracle	Oracle %	MSE_Model	MSE_Oracle
$n = 1000$	$β_{1}$	100.0%	7.26E-04	5.91E-04	100.0%	6.75E-04	5.75E-04
	$β_{2}$	100.0%	1.36E-02	1.11E-02	100.0%	3.28E-03	5.78E-04
	$β_{3}$	97.3%	2.34E-04	0	96.7%	6.22E-04	0
	$β_{4}$	97.5%	1.90E-04	0	96.7%	2.72E-04	0
	$β_{5}$	96.8%	9.01E-04	0	96.6%	2.95E-04	0
$n = 2000$	$β_{1}$	100.0%	2.76E-04	2.68E-04	100.0%	2.77E-04	2.75E-04
	$β_{2}$	100.0%	2.71E-04	2.66E-04	100.0%	2.76E-04	2.74E-04
	$β_{3}$	99.1%	5.49E-05	0	99.6%	1.57E-05	0
	$β_{4}$	99.6%	2.68E-05	0	99.4%	3.16E-05	0
	$β_{5}$	99.3%	4.58E-05	0	99.4%	2.22E-05	0

Table 3. Function

m_{k} (u)

and the MAF of the associated SNP.

Table 3. Function

m_{k} (u)

and the MAF of the associated SNP.

Function	MAF of $G_{k}$
$m_{0} (u) = 2 s i n (2 π u)$	-
$m_{1} (u) = 2 c o s (π u) + 2$	0.5
$m_{2} (u) = s i n (2 π u) + c o s (π u) + 1$	0.5
$m_{3} (u) = 2 c o s (π u) + 2$	0.3
$m_{4} (u) = s i n (2 π u) + c o s (π u) + 1$	0.3
$m_{5} (u) = 2 c o s (π u) + 2$	0.1
$m_{6} (u) = s i n (2 π u) + c o s (π u) + 1$	0.1
$m_{7} (u) = 2$	0.5
$m_{8} (u) = 2$	0.3
$m_{9} (u) = 2$	0.1
$m_{k} (u) = 0, k > 9$	unif(0.05, 0.5)

Table 4. Selection and estimation accuracy of

m_{k} (\cdot)

for discrete predictors.

Table 4. Selection and estimation accuracy of

m_{k} (\cdot)

for discrete predictors.

n	$m (\cdot)$	$p = 50$			$p = 100$
n	$m (\cdot)$	Oracle %	IMSE_Model	IMSE_Oracle	Oracle %	IMSE_Model	IMSE_Oracle
$n = 1000$	$m_{0} (.)$	100.0%	1.63E-01	2.28E-01	100.0%	1.75E-01	2.29E-01
	$m_{1} (.)$	83.4%	4.29E-01	5.61E-01	83.1%	4.44E-01	6.08E-01
	$m_{2} (.)$	87.7%	3.82E-01	4.38E-01	87.9%	3.69E-01	4.50E-01
	$m_{3} (.)$	76.0%	5.43E-01	5.87E-01	75.5%	5.50E-01	6.17E-01
	$m_{4} (.)$	83.3%	4.37E-01	4.19E-01	79.6%	5.11E-01	4.50E-01
	$m_{5} (.)$	23.2%	1.35E+00	1.05E+00	22.1%	1.45E+00	1.18E+00
	$m_{6} (.)$	25.7%	1.27E+00	8.93E-01	23.7%	1.31E+00	9.22E-01
	$m_{7} (.)$	100.0%	5.09E-02	6.22E-02	99.9%	5.05E-02	5.86E-02
	$m_{8} (.)$	99.9%	6.08E-02	7.50E-02	99.9%	5.83E-02	6.75E-02
	$m_{9} (.)$	100.0%	1.05E-01	1.24E-01	99.9%	1.13E-01	1.21E-01
	Zero	99.0%	2.74E-03	0	99.2%	2.15E-03	0
$n = 2000$	$m_{0} (.)$	100.0%	7.47E-02	8.03E-02	100.0%	7.42E-02	8.35E-02
	$m_{1} (.)$	100.0%	9.46E-02	1.38E-01	99.9%	1.02E-01	1.49E-01
	$m_{2} (.)$	100.0%	9.52E-02	1.21E-01	99.9%	9.47E-02	1.21E-01
	$m_{3} (.)$	100.0%	1.07E-01	1.55E-01	99.8%	1.08E-01	1.50E-01
	$m_{4} (.)$	99.9%	1.05E-01	1.37E-01	99.8%	1.05E-01	1.30E-01
	$m_{5} (.)$	77.5%	4.84E-01	2.99E-01	75.1%	5.15E-01	3.08E-01
	$m_{6} (.)$	77.5%	4.93E-01	2.63E-01	73.5%	5.32E-01	2.53E-01
	$m_{7} (.)$	100.0%	1.89E-02	2.11E-02	100.0%	1.75E-02	1.97E-02
	$m_{8} (.)$	100.0%	2.17E-02	2.39E-02	100.0%	2.08E-02	2.33E-02
	$m_{9} (.)$	100.0%	4.47E-02	4.95E-02	100.0%	4.24E-02	4.56E-02
	Zero	99.3%	9.15E-04	0	99.5%	6.81E-04	0

Table 5. Selection and estimation accuracy of the loading parameters

β

for discrete predictors.

Table 5. Selection and estimation accuracy of the loading parameters

β

for discrete predictors.

n	$β$	$p = 50$			$p = 100$
n	$β$	Oracle %	MSE_Model	MSE_Oracle	Oracle %	MSE_Model	MSE_Oracle
$n = 1000$	$β_{1}$	100.0%	6.64E-04	7.28E-04	100.0%	5.84E-04	5.13E-04
	$β_{2}$	100.0%	7.37E-03	7.32E-03	100.0%	2.76E-03	3.27E-03
	$β_{3}$	95.2%	3.64E-04	0	95.2%	3.73E-04	0
	$β_{4}$	97.0%	1.21E-04	0	96.1%	4.18E-04	0
	$β_{5}$	96.1%	2.04E-04	0	94.7%	5.46E-04	0
n = 2000	$β_{1}$	100.0%	2.34E-04	2.22E-04	100.0%	2.20E-04	2.12E-04
	$β_{2}$	100.0%	2.31E-04	2.20E-04	100.0%	2.30E-03	2.37E-03
	$β_{3}$	98.9%	5.00E-05	0	98.9%	3.24E-05	0
	$β_{4}$	98.7%	5.44E-05	0	98.9%	4.49E-05	0
	$β_{5}$	98.9%	4.37E-05	0	99.0%	5.07E-05	0

Table 6. Estimation and selection result for

β

.

Table 6. Estimation and selection result for

β

.

act	bmi	alcohol	heme	gl
-0.1832	0.9544	0	0.2157	0.0947

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Variable Selection for Generalized Single-index Varying-coefficient Models with Applications to Synergistic G×E Interactions

Abstract

Keywords:

Subject:

1. Introduction

2. Statistical Method

2.1. Model Setup

2.2. Estimation Method

2.3. The Estimation Step

2.4. Selection of Tuning Parameters

2.4.1. Selection of Tuning Parameters $λ_{1}, λ_{2}, λ_{3}$

2.4.2. Selection of Order h and Number of Internal Knots K

2.5. Theoretical Properties

3. Simulation

3.1. Continuous G

3.2. Discrete G

4. A Case Study

5. Discussion

Acknowledgments

Appendix A

Appendix A.1. Computational Algorithm

Appendix A.2. Proofs of Theorems

Appendix A.2.1. Notations

Appendix A.2.2. Some Regularity Conditions

Appendix A.2.3. Proof of Theorem 1

Appendix A.2.4. Proof of Theorem 2

Appendix A.3. Additional Real Data Analysis Results

References

MDPI Initiatives

Important Links

Subscribe

Variable Selection for Generalized Single-index Varying-coefficient Models with Applications to Synergistic G×E Interactions

Abstract

Keywords:

Subject:

1. Introduction

2. Statistical Method

2.1. Model Setup

2.2. Estimation Method

2.3. The Estimation Step

2.4. Selection of Tuning Parameters

2.4.1. Selection of Tuning Parameters λ 1 , λ 2 , λ 3

2.4.2. Selection of Order h and Number of Internal Knots K

2.5. Theoretical Properties

3. Simulation

3.1. Continuous G

3.2. Discrete G

4. A Case Study

5. Discussion

Acknowledgments

Appendix A

Appendix A.1. Computational Algorithm

Appendix A.2. Proofs of Theorems

Appendix A.2.1. Notations

Appendix A.2.2. Some Regularity Conditions

Appendix A.2.3. Proof of Theorem 1

Appendix A.2.4. Proof of Theorem 2

Appendix A.3. Additional Real Data Analysis Results

References

MDPI Initiatives

Important Links

Subscribe

2.4.1. Selection of Tuning Parameters $λ_{1}, λ_{2}, λ_{3}$