1. Introduction
Measurement error data arise inevitably in applications and have raised significant concerns in various fields including biology, medicine, epidemiology, economics, finance, and remote sensing. A wealth of research has accumulated on classical low-dimensional measurement error regression models under various assumptions. Numerous studies focus on parameter estimation for low-dimensional measurement error regression models, with the primary techniques listed below: (1) corrected regression estimation methods [1]; (2) simulation-extrapolation (SIMEX) estimation methods [2,3]; (3) deconvolution methods [4]; (4) corrected empirical likelihood methods [5,6]. For more detailed discussions of other estimation and hypothesis testing methods for classical low-dimensional measurement error models, please refer to [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29], as well as the monographs [30,31,32,33,34,35].
As one of the most popular research fields in statistics, high-dimensional regression has been widely used in areas including genetics, economics, medical imaging, meteorology, and sensor networks. Over the past two decades, many high-dimensional regression methods have been proposed, such as the Lasso [36], the smoothly clipped absolute deviation (SCAD) penalty [37], the Elastic Net [38], the Adaptive Lasso [39], the Dantzig Selector [40], the smooth integration of counting and absolute deviation (SICA) penalty [41], and the minimax concave penalty (MCP) [42], among many others. These methods estimate the regression coefficients while simultaneously achieving variable selection by adding penalties to the objective function; please refer to the reviews [43,44,45], as well as the monographs [46,47,48].
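As a concrete illustration of this penalized estimation idea, the following minimal Python sketch fits a Lasso to simulated sparse data. The use of scikit-learn here is our own choice for illustration and is not tied to any of the works cited above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5            # high-dimensional setting: p > n, sparse truth
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0              # only the first s coefficients are nonzero
y = X @ beta_true + 0.5 * rng.standard_normal(n)

fit = Lasso(alpha=0.1).fit(X, y)        # alpha is the regularization parameter
selected = np.flatnonzero(fit.coef_)    # indices with nonzero estimated coefficients
print("selected variables:", selected)
```

Estimation and variable selection happen in one step: coefficients shrunk exactly to zero by the penalty are the de-selected variables.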
For variable screening in ultrahigh-dimensional regression models, where the dimension $p$ and the sample size $n$ satisfy $\log p = O(n^{\xi})$ for some $\xi \in (0,1)$, Fan and Lv [49] proposed the sure independence screening (SIS) method, a pioneering method in this field. For estimation and variable selection in ultrahigh-dimensional regression models, it is suggested to apply the SIS method for variable screening first. Then, based on the variables retained in the first step, regularization methods with penalties can be used to estimate the regression coefficients and identify the significant variables simultaneously. Owing to the operability and effectiveness of the SIS method in applications, numerous works have extended it; see [50,51,52,53,54,55,56,57,58,59].
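To make the screening idea concrete, here is a minimal sketch of SIS-style marginal screening in the spirit of [49] (our own illustration, not the authors' implementation): rank the covariates by absolute marginal correlation with the response and retain the top $d$.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank covariates by |marginal correlation| with y; keep the top d."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # componentwise marginal correlations between each column of X and y
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(np.abs(corr))[::-1][:d]

# A typical choice in [49] is d of order n / log(n), e.g.:
# keep = sis_screen(X, y, d=int(n / np.log(n)))
```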
However, most of the aforementioned theories and applications for high-dimensional regression models focus on clean data. In the era of big data, researchers frequently collect high-dimensional data with measurement errors; typical instances include gene expression data [61] and sensor network data [60]. The imprecise measurements result from poorly managed and defective data collection processes, as well as from imprecise measuring instruments. It is well known that ignoring the influence of measurement errors leads to biased estimators and erroneous conclusions. Therefore, developing statistical inference methods for high-dimensional measurement error regression models has drawn a lot of interest.
Based on the types of measurement errors, research on high-dimensional measurement error regression models can be divided into three categories: covariates containing measurement errors; response variables containing measurement errors; and both covariates and response variables containing measurement errors. In this paper, we mainly focus on the category in which the covariates contain measurement errors. When the dimension $p$ is larger than the sample size $n$, parameter estimation is challenging because the bias correction renders the penalized objective function nonconvex, which in turn makes it impossible to guarantee the optimal solution of the optimization problem. We utilize the following linear regression model to illustrate this problem:
$$y = X\beta^{*} + \varepsilon, \qquad (1)$$
where $y \in \mathbb{R}^{n}$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the fixed design matrix with $p \gg n$, $\beta^{*} \in \mathbb{R}^{p}$ is the sparse regression coefficient vector with only $s$ nonzero components, and the model error vector $\varepsilon \in \mathbb{R}^{n}$ is assumed to be independent of $X$. In order to obtain a sparse estimator of the true regression coefficient vector $\beta^{*}$, we can minimize the following penalized least squares objective function
$$\frac{1}{2n}\|y - X\beta\|_{2}^{2} + \sum_{j=1}^{p} p_{\lambda}(|\beta_{j}|), \qquad (2)$$
which is equivalent to minimizing
$$\frac{1}{2}\beta^{\top}\widehat{\Sigma}\beta - \widehat{\rho}^{\top}\beta + \sum_{j=1}^{p} p_{\lambda}(|\beta_{j}|), \qquad (3)$$
where $\widehat{\Sigma} = X^{\top}X/n$, $\widehat{\rho} = X^{\top}y/n$, and $p_{\lambda}(\cdot)$ is a penalty function with regularization parameter $\lambda$. If the covariate matrix $X$ can be precisely measured, the penalized objective functions (2) and (3) are convex for a convex penalty such as the $\ell_{1}$ norm. Thus, we can obtain a sparse estimator of $\beta^{*}$ by minimizing the penalized objective function (2) or (3).
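The equivalence of (2) and (3) holds because the two objectives differ only by the constant $\|y\|_{2}^{2}/(2n)$, which does not depend on $\beta$. The following minimal sketch (our own illustration, using the Lasso penalty $p_{\lambda}(|\beta_{j}|) = \lambda|\beta_{j}|$) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 20, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)    # an arbitrary candidate coefficient vector

Sigma_hat = X.T @ X / n          # \hat{Sigma} in (3)
rho_hat = X.T @ y / n            # \hat{rho} in (3)
penalty = lam * np.abs(beta).sum()

obj2 = 0.5 / n * np.sum((y - X @ beta) ** 2) + penalty           # objective (2)
obj3 = 0.5 * beta @ Sigma_hat @ beta - rho_hat @ beta + penalty  # objective (3)
assert np.isclose(obj2 - obj3, 0.5 / n * np.sum(y ** 2))         # differ by a constant
```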
However, it is common in practice that the covariate matrix $X$ cannot be accurately observed. Let $W$ be the observed covariate matrix with additive measurement errors satisfying $W = X + A$, where $A$ is the matrix of measurement errors; each row of $A$ follows a sub-Gaussian distribution with mean zero and covariance matrix $\Sigma_{A}$, and $A$ is assumed to be independent of $X$ and $\varepsilon$. To reduce the influence of the measurement errors, Loh and Wainwright [62] proposed to replace $\widehat{\Sigma}$ and $\widehat{\rho}$ in the penalized objective function (3) by their consistent estimators $\widehat{\Sigma}_{W} = W^{\top}W/n - \Sigma_{A}$ and $\widehat{\rho}_{W} = W^{\top}y/n$, respectively. Then the sparse estimator of $\beta^{*}$ is obtained by minimizing the following penalized objective function
$$\frac{1}{2}\beta^{\top}\widehat{\Sigma}_{W}\beta - \widehat{\rho}_{W}^{\top}\beta + \sum_{j=1}^{p} p_{\lambda}(|\beta_{j}|). \qquad (4)$$
Note that when the dimension $p$ is fixed or smaller than the sample size $n$, it can be guaranteed that $\widehat{\Sigma}_{W}$ is a positive definite or positive semi-definite matrix, which ensures that the penalized objective function (4) remains convex. Thus, the global optimal solution of $\beta^{*}$ can be obtained by minimizing the penalized objective function (4).
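The plug-in construction in (4) is straightforward to code. Below is a minimal sketch (our illustration of the surrogate idea in [62]; we assume, as is common in this literature, that the measurement error covariance $\Sigma_{A}$ is known):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma_a = 200, 50, 0.3       # here p < n, so the surrogate stays well behaved
X = rng.standard_normal((n, p))
y = X[:, 0] + rng.standard_normal(n)
Sigma_A = sigma_a ** 2 * np.eye(p)             # known measurement error covariance
W = X + sigma_a * rng.standard_normal((n, p))  # observed error-prone covariates

Sigma_hat = W.T @ W / n - Sigma_A  # corrected surrogate for X'X / n
rho_hat = W.T @ y / n              # surrogate for X'y / n (errors independent of y)

print(np.linalg.eigvalsh(Sigma_hat).min())  # typically > 0 when p < n, so (4) is convex
```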
However, for high-dimensional or ultrahigh-dimensional regression models, i.e., $p > n$ or $\log p = O(n^{\xi})$, there are two key problems: (i) the penalized objective function (4) is no longer convex and is unbounded from below, because the corrected estimator $\widehat{\Sigma}_{W}$ is no longer positive semi-definite (since $W^{\top}W/n$ has rank at most $n < p$); this makes it impossible to obtain an estimator of $\beta^{*}$ by directly minimizing the penalized objective function (4); (ii) in order to construct an objective function similar to that of the standard Lasso and solve the corresponding optimization problem using the R packages "glmnet" or "lars", it is necessary to factor a positive semi-definite surrogate of $\widehat{\Sigma}_{W}$ by the Cholesky decomposition and obtain substitutes for the response vector and the covariate matrix. However, this process causes error accumulation and makes it challenging to guarantee valid theoretical results; see the detailed discussions in [63,64].
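The following minimal sketch (our own demonstration) makes problem (i) explicit: the smallest eigenvalue of $\widehat{\Sigma}_{W}$ is negative when $p > n$, and moving along its eigenvector drives the quadratic part of (4) to $-\infty$, so no minimizer exists.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma_a = 50, 200, 0.3                  # now p > n
W = rng.standard_normal((n, p))               # stands in for the observed matrix
Sigma_hat = W.T @ W / n - sigma_a ** 2 * np.eye(p)

vals, vecs = np.linalg.eigh(Sigma_hat)
print(vals.min())                             # negative: rank(W'W) <= n < p

v = vecs[:, 0]                                # eigenvector of the smallest eigenvalue
for t in (1.0, 10.0, 100.0):
    beta = t * v
    print(0.5 * beta @ Sigma_hat @ beta)      # 0.5 * t^2 * vals.min(), decreasing without bound
```

The linear and penalty terms in (4) grow at most linearly in $t$, so they cannot offset the negatively quadratic term.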
For problem (i), Loh and Wainwright [62] converted the unconstrained optimization problem into a constrained one by adding an $\ell_{1}$-norm restriction on $\beta$. They suggested applying the projected gradient descent algorithm to solve the restricted optimization problem and acquire the global optimal solution for the true regression coefficient vector $\beta^{*}$. Nevertheless, the penalized objective function of the optimization problem is still nonconvex. To address this issue, Datta and Zou [63] suggested substituting $\widehat{\Sigma}_{W}$ by its positive semi-definite projection matrix $(\widehat{\Sigma}_{W})_{+}$, and they proposed the convex conditioned Lasso (CoCoLasso). Further, Zheng et al. [64] introduced a balanced estimation that prevents overfitting while maintaining estimation accuracy by combining $(\widehat{\Sigma}_{W})_{+}$ with a concave penalty. Tao et al. [65] constructed a modified least squares loss function using a positive semi-definite projection of the estimated covariance matrix and proposed the calibrated zero-norm regularized least squares (CaZnRLS) estimator of the regression coefficients. Rosenbaum and Tsybakov [66,67] proposed the matrix uncertainty (MU) selector and its improved version, the compensated MU selector, for high-dimensional linear models with additive measurement errors in the covariates. Sørensen et al. [68] extended the MU selector to generalized linear models and developed the generalized matrix uncertainty (GMU) selector. Sørensen et al. [69] established theoretical results for the relevant variable selection methods. Based on the MU selector, Belloni et al. [70] introduced an estimator that achieves the minimax efficiency bound; they proved that the corresponding optimization problem can be converted into a second-order cone programming problem, which is solvable in polynomial time. Romeo and Thoresen [71] evaluated the performance of the MU selector in [66], the nonconvex Lasso in [62], and the CoCoLasso in [63] using simulation studies. Brown et al. [72] proposed a path-following iterative algorithm called Measurement Error Boosting (MEBoost), a computationally efficient method for variable selection in high-dimensional measurement error regression models. Nghiem and Potgieter [73] introduced a new estimation method called simulation-selection-extrapolation (SIMSELEX), which uses the Lasso in the simulation step and the group Lasso in the selection step. Jiang and Ma [74] drew on the idea of the nonconvex Lasso in [62] and proposed an estimator of the regression coefficients for high-dimensional Poisson models with measurement errors. Byrd and McGee [75] developed an iterative estimation method for high-dimensional generalized linear models with additive measurement errors based on the imputation-regularized optimization (IRO) algorithm in [76]. However, the error accumulation issue mentioned in problem (ii) has not been addressed in this literature.
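To illustrate the repair underlying the CoCoLasso-type methods, the sketch below replaces the nonconvex surrogate with a positive semi-definite projection. Note that Datta and Zou [63] project under the elementwise maximum norm, which requires an iterative algorithm; for brevity, this sketch uses the simpler Frobenius-norm projection (truncating negative eigenvalues at zero), which conveys the same idea but is not their exact estimator.

```python
import numpy as np

def psd_project(S):
    """Nearest positive semi-definite matrix in Frobenius norm:
    symmetrize, then clip negative eigenvalues at zero."""
    S = (S + S.T) / 2
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

# Sigma_plus = psd_project(Sigma_hat)   # convex surrogate: replaces Sigma_hat in (4)
```

With the projected matrix in place of $\widehat{\Sigma}_{W}$, the objective (4) becomes convex again and standard Lasso solvers apply.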
The aforementioned works place more emphasis on estimation and variable selection than on hypothesis testing. For high-dimensional regression models with clean data, research on hypothesis testing has made significant progress under various settings [77,78,79,80,81,82,83,84]. For high-dimensional measurement error models, hypothesis testing methods are equally crucial. However, the bias and instability caused by measurement errors make hypothesis testing extremely difficult. Recently, some progress has been achieved in statistical inference methods. Based on the multiplier bootstrap, Belloni [85] constructed simultaneous confidence intervals for the target parameters in high-dimensional linear measurement error models. Focusing on the case where a fixed number of covariates contain measurement errors, Li et al. [86] proposed a corrected decorrelated score test for the parameters corresponding to the error-prone covariates and constructed asymptotic confidence intervals for them. Huang et al. [87] proposed a new variable selection method based on the debiased CoCoLasso and proved that it achieves false discovery rate (FDR) control. Jiang et al. [88] developed Wald and score tests for high-dimensional Poisson measurement error models.
Compared with the estimation and hypothesis testing methods above, screening techniques for ultrahigh-dimensional measurement error models are relatively few. Nghiem et al. [89] introduced two screening methods, named corrected penalized marginal screening (PMSc) and corrected sure independence screening (SISc), for ultrahigh-dimensional linear measurement error models.
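As a rough illustration of the correction idea (our own hedged sketch, not the exact PMSc/SISc estimators of [89]): since the measurement errors are independent of the response, the marginal covariance between $w_{j}$ and $y$ remains unbiased for the covariance between $x_{j}$ and $y$, but the marginal variance must be corrected by subtracting the error variance before forming a marginal coefficient.

```python
import numpy as np

def corrected_marginal_screen(W, y, sigma_a2, d):
    """Screen by |corrected marginal slope|: cov(w_j, y) / (var(w_j) - sigma_a2).
    A hedged sketch of the correction idea, not the exact SISc estimator."""
    Wc = W - W.mean(axis=0)
    yc = y - y.mean()
    cov_wy = Wc.T @ yc / len(y)                 # unbiased for cov(x_j, y)
    var_x = Wc.var(axis=0) - sigma_a2           # corrected variance of the true x_j
    slopes = cov_wy / np.maximum(var_x, 1e-8)   # guard against tiny denominators
    return np.argsort(np.abs(slopes))[::-1][:d]
```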
This paper gives an overview of the estimation and hypothesis testing methods for high-dimensional measurement error regression models, as well as the variable screening methods for ultrahigh-dimensional measurement error models. The rest of this paper is organized as follows. In
Section 2, we review some estimation methods for linear models. We survey the estimation methods for generalized linear models in
Section 3.
Section 4 presents the recent advances in hypothesis testing methods for high-dimensional measurement error models.
Section 5 introduces the variable screening techniques for ultrahigh-dimensional linear measurement error models. We conclude the paper with some discussions in
Section 6.
Notations. Let $\mathbb{S}^{p}$ be the set of all $p \times p$ real symmetric matrices and $\mathbb{S}_{+}^{p}$ be the subset of $\mathbb{S}^{p}$ containing all positive semi-definite matrices in $\mathbb{S}^{p}$. We use $|S|$ to denote the cardinality of a set $S$. Let $S^{*} = \{j : \beta_{j}^{*} \neq 0\}$ be the index set of the nonzero parameters. For a vector $v = (v_{1}, \ldots, v_{p})^{\top}$, let $\|v\|_{q} = (\sum_{j=1}^{p} |v_{j}|^{q})^{1/q}$ denote its $\ell_{q}$ norm for $1 \le q < \infty$, and write $\|v\|_{\infty} = \max_{1 \le j \le p} |v_{j}|$. Denote by $v_{S}$ the subvector of $v$ with index set $S$. Denote by $\mathbf{1}$ the vector of all ones. For a matrix $M = (m_{ij})$, let $\|M\|_{\max} = \max_{i,j} |m_{ij}|$ and $\|M\|_{\infty} = \max_{i} \sum_{j} |m_{ij}|$. For constants $a$ and $b$, define $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We use $c$ and $C$ to denote positive constants that may vary throughout the paper. Finally, let $\xrightarrow{d}$ denote convergence in distribution.