Yimin Xiong is currently a master’s student under the supervision of Professor Weiping Zhang at the University of Science and Technology of China. His research focuses on variable selection.
Weiping Zhang received his Ph.D. degree from the University of Science and Technology of China (USTC), where he is currently a Professor. His research interests mainly focus on longitudinal data analysis and Bayesian analysis.
Graphical Abstract
Relationship between the MSE of estimators in QEN penalized extremile regression and sample size n with τ = 0.5 (left), and TP and FP for different penalized extremile regressions with high-dimensional and grouped data (right).
Abstract
Extremile regression, proposed in recent years, retains the advantage of quantile regression of fully characterizing the sample data through the choice of different levels, while also offering advantages over quantile and expectile regression owing to its explicit expression and more conservative estimation. Here, we propose a linear extremile regression model and introduce a variable selection method using a penalty called the quasi elastic net (QEN) to handle high-dimensional problems. Moreover, we propose an EM algorithm and establish the corresponding theoretical properties under mild conditions. In numerical studies, we compare the QEN penalty with the L0, L1, L2 and elastic net penalties, and the results show that the proposed method is effective and offers certain advantages in analysis.
Public Summary
We propose a quasi elastic net penalized linear extremile regression to deal with high-dimensional data, which leads to a sparse solution as well as being suitable for strongly collinear situations.
We adopt an EM algorithm to solve the L0 approximation problem efficiently, and further solve the quasi elastic net penalized optimization problem.
We demonstrate through numerical studies that the proposed quasi elastic net penalized linear extremile regression model is effective.
Although ordinary least squares (OLS) regression, which estimates the conditional mean, has been one of the most important methods in regression analysis, it can be sensitive to outliers. Koenker and Bassett[1] proposed least absolute deviation regression (LADR) to estimate quantiles, which effectively addresses this problem and analyzes the sample data through the choice of different quantile levels. However, because the absolute value function is not differentiable, the resulting optimization problems are inconvenient to solve. Newey and Powell[2] then proposed asymmetric least squares (ALS) regression, leading to expectiles. Considering that expectile regression lacks an explicit solution, Daouia et al.[3] proposed a weighted least squares regression and defined extremiles. As they discuss, extremiles offer conceptual simplicity, convenient calculation and good properties, and they are more appropriate than quantiles and expectiles as a risk protection tool in tail analysis.
Consider a response Y satisfying {\rm E}|Y|<\infty with cumulative distribution function F. For \tau\in(0,1), the unconditional \tau th extremile of Y is defined as

\xi_{\tau} = \mathop{\arg\min}_{\theta}\, {\rm E}\big[J_{\tau}(F(Y))\,(Y-\theta)^{2}\big], (1)

where J_{\tau}(\cdot)=K'_{\tau}(\cdot) and

K_{\tau}(t)=\begin{cases} 1-(1-t)^{s(\tau)}, & 0<\tau\leqslant \dfrac{1}{2}, \\ t^{s(1-\tau)}, & \dfrac{1}{2}<\tau<1, \end{cases} (2)

with s(\tau)=\dfrac{\log(1/2)}{\log(1-\tau)}. Under this definition, the unconditional \tau th quantile of Y can be obtained from (3). The same result can be obtained when 0<\tau\leqslant \dfrac{1}{2}, which means that (3) is equivalent to the quantile under the definition of K'_{\tau}(\cdot).
As we know, for a random variable Y, the special case \tau=\dfrac{1}{2} of the quantile and the expectile leads to its median and mean, respectively. Then, if we define a random variable Z with cumulative distribution function F_{Z}=K_{\tau}(F), it can be seen that the \tau th quantile and the \tau th extremile of Y represent the median and the mean of Z, respectively, mirroring the central behavior of quantiles and expectiles.
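Because K_{\tau} in (2) is itself a distribution function on [0, 1], the weights J_{\tau}(F(Y)) average to one and the minimizer in (1) is a J_{\tau}-weighted mean. The following sketch (a minimal illustration rather than the authors' code; it approximates J_{\tau} by a finite difference and plugs in the empirical distribution function) computes an unconditional sample extremile.

```python
import numpy as np

def K_tau(t, tau):
    """Distribution function K_tau(t) from Eq. (2)."""
    if tau <= 0.5:
        s = np.log(0.5) / np.log(1.0 - tau)      # s(tau) = log(1/2) / log(1 - tau)
        return 1.0 - (1.0 - t) ** s
    s = np.log(0.5) / np.log(tau)                # s(1 - tau) = log(1/2) / log(tau)
    return t ** s

def J_tau(t, tau, h=1e-6):
    """J_tau = K_tau' approximated by a central finite difference."""
    return (K_tau(np.minimum(t + h, 1.0), tau) - K_tau(np.maximum(t - h, 0.0), tau)) / (2.0 * h)

def sample_extremile(y, tau):
    """Empirical tau-th extremile: J_tau-weighted average with the empirical CDF plugged in."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ranks = np.argsort(np.argsort(y)) + 1        # ranks 1..n
    F_hat = ranks / (n + 1.0)                    # empirical CDF kept strictly inside (0, 1)
    w = J_tau(F_hat, tau)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
y = rng.normal(size=2000)
print(sample_extremile(y, 0.5))                  # close to the mean at tau = 1/2
print(sample_extremile(y, 0.9))                  # shifts toward the right tail
```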
Meanwhile, with the development of the internet and related industries, high-dimensional data have become increasingly common and the computational burden of data analysis has grown accordingly, so effective methods are needed to reduce the computation when extremile regression is applied in this setting. Variable selection is one effective measure, and various methods for selecting variables have been developed.
First, there are criterion-based methods, such as the PRESS criterion[4], the C_{p} criterion[5], the Akaike information criterion (AIC)[6] and the Bayesian information criterion (BIC)[7]. These methods select variables through separate selection and estimation steps, and because the correct variables are unknown, deviations in the selection step affect the result of the estimation step. Geisser and Eddy[8] then proposed the cross validation (CV) method to select variables without assumptions on the parameter estimation, which spawned hold-out validation[9], 5×2 cross validation[10] and other methods. Moreover, Candes and Tao[11] proposed the Dantzig selector (DS) method. Although the DS method does not rely on a specific function, it selects variables under a sparsity assumption on the unknown coefficients. Because the DS method applies the same amount of shrinkage to every coefficient, it may overshrink important coefficients; Dicker and Lin[12] therefore proposed the ADS method with different weights. Later, James et al.[13] proposed a DS method for generalized linear models, and Antoniadis et al.[14] extended the DS method to Cox models. Furthermore, Fan and Lv[15] proposed the sure independence screening (SIS) method to handle ultrahigh-dimensional data, and Fan et al.[16] proposed the improved iterative sure independence screening (ISIS) method to deal with collinearity between predictors. These screening methods ensure that important predictors are selected, measure the marginal correlation between each predictor and the response, and can be extended to varying-coefficient models and additive models. Finally, penalty-based variable selection has been widely used; it achieves selection by solving a corresponding optimization problem, which improves computational efficiency and estimation accuracy. Commonly used penalties include L_{p} with p\in(0,1)[17, 18], the lasso[19], SCAD[20], MCP[21] and the elastic net[22]. Generally, the L_{0} penalty is the most essential sparsity measure, as it directly penalizes the number of nonzero parameters, but optimization with the L_{0} penalty is NP-hard. Although the L_{1} penalty is a common surrogate owing to its convexity, it has limitations; for example, it can sometimes select the wrong model[23].
In this paper, we use the quasi elastic net, which combines the L_{0} and L_{2} penalties, to select variables in linear extremile regression, and we propose an efficient EM algorithm based on Liu and Li[24] that approximately solves the optimization problem with the L_{0} penalty. Notably, if the predictors are highly correlated, the L_{0} penalty alone performs poorly; we therefore incorporate ridge regression[25]. Owing to the properties of the L_{2} penalty, the quasi elastic net can handle highly correlated predictors and encourages a grouping effect. Moreover, we establish several theoretical properties under mild conditions.
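For concreteness, the quasi elastic net penalty referred to throughout can be written as an L_{0} term plus an L_{2} term with separate tuning parameters; the display below is a sketch of this penalty, consistent with the optimization problem introduced in Section 2.

```latex
% Quasi elastic net (QEN) penalty: an L0 term for sparsity plus an L2 (ridge) term
% that stabilizes highly correlated predictors and encourages a grouping effect.
P_{\rm QEN}(\boldsymbol{\beta}_{\tau})
  = \lambda_{1}\,\|\boldsymbol{\beta}_{\tau}\|_{0}
  + \lambda_{2}\,\|\boldsymbol{\beta}_{\tau}\|_{2}^{2},
\qquad
\|\boldsymbol{\beta}_{\tau}\|_{0} = \#\{\,j:\ \beta_{\tau,j}\neq 0\,\},
\quad
\|\boldsymbol{\beta}_{\tau}\|_{2}^{2} = \sum\nolimits_{j=1}^{p}\beta_{\tau,j}^{2}.
```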
The major contributions of this paper are as follows. First, we apply the proposed model to high-dimensional data while retaining its conceptual simplicity and convenient calculation. Second, we propose the quasi elastic net, combining the L_{0} and L_{2} penalties, which both yields a sparse solution and deals with highly correlated predictors. Third, we propose an efficient EM algorithm that approximately solves the optimization problem with the L_{0} penalty, and we establish its theoretical properties.
The remainder of this paper is organized as follows. Section 2 presents our model and explains the method. Section 3 provides the theoretical properties of the method. Sections 4 and 5 compare different methods on simulated and real data. Section 6 concludes. The appendix provides all proofs of the theorems and lemmas.
2. Linear extremile regression with QEN
Consider an n\times 1 response {\boldsymbol{y}} = (y_{1}, \cdots, y_{n})^{\rm T} and an n \times p predictor matrix X = ({\boldsymbol{x}}^{\rm T}_{1}, \cdots, {\boldsymbol{x}}^{\rm T}_{n})^{\rm T}. The linear \tau th extremile regression with a quasi elastic net penalty is defined as

\hat{{\boldsymbol{\beta}}}_{\tau} = \mathop{\arg\min}_{{\boldsymbol{\beta}}_{\tau}}\ \dfrac{1}{n}\sum_{i=1}^{n} J_{\tau}\big(\hat{F}(y_{i}|X)\big)\,(y_{i}-{\boldsymbol{x}}^{\rm T}_{i}{\boldsymbol{\beta}}_{\tau})^{2} + \lambda_{1}\|{\boldsymbol{\beta}}_{\tau}\|_{0} + \lambda_{2}\|{\boldsymbol{\beta}}_{\tau}\|^{2}_{2}, (7)

where \hat{F}(\cdot|X) is an estimated distribution function of {\boldsymbol{y}} given X (we prefer nonparametric and semiparametric estimators, especially the partial-linear single-index model[26], when dealing with high-dimensional data), {\boldsymbol{\beta}}_{\tau} = (\beta_{\tau, 1}, \cdots, \beta_{\tau, p})^{\rm T} \in \mathbb{R}^{p \times 1} is the vector of unknown coefficients, and \lambda_{1}, \lambda_{2}>0 are tuning parameters.
Let R\subset \{ 1, \cdots, p\} be the index set with \beta_{j}\ne 0 and O\subset \{ 1, \cdots, p\} the index set with \beta_{j} = 0, so that R\cup O = \{ 1, \cdots, p\}. Denoting W = {\rm diag}\big(J_{\tau}(\hat{F}(y_{1}|X))/n, \cdots, J_{\tau}(\hat{F}(y_{n}|X))/n\big), {\boldsymbol{y}}^{*} = ({\boldsymbol{y}}^{\rm T}W^{1/2}, {\bf{0}})^{\rm T} \in \mathbb{R}^{(n+p)\times 1}, X^{*} = (1+\lambda_{2})^{-1/2} (X^{\rm T}W^{1/2}, \lambda_{2}^{1/2}I)^{\rm T}\in \mathbb{R}^{(n+p) \times p} and {\boldsymbol{\beta}}^{*}_{\tau} = (1+\lambda_{2})^{1/2}{\boldsymbol{\beta}}_{\tau}, the optimization problem can be rewritten as

\hat{{\boldsymbol{\beta}}}^{*}_{\tau} = \mathop{\arg\min}_{{\boldsymbol{\beta}}^{*}_{\tau}}\ \|{\boldsymbol{y}}^{*}-X^{*}{\boldsymbol{\beta}}^{*}_{\tau}\|^{2}_{2} + \lambda_{1}\|{\boldsymbol{\beta}}^{*}_{\tau}\|_{0}. (8)

Note that the solution of (7) satisfies \hat{{\boldsymbol{\beta}}}_{\tau} = (1+\lambda_{2})^{-1/2}\hat{{\boldsymbol{\beta}}}^{*}_{\tau}. We then focus on the optimization problem (8), from which the following two equations are obtained.
For the first equation, \tilde{L}({\boldsymbol{\beta}}^{*}_{\tau}) is a quadratic function of {\boldsymbol{\beta}}^{*}_{\tau} when {\boldsymbol{\delta}} is known, so for j \in R its first-order partial derivative has an explicit form. It is clear that these two equations can be treated as the M-step and E-step of an EM algorithm. Naturally, the solution of (7) is then obtained through \hat{{\boldsymbol{\beta}}}_{\tau} = (1+\lambda_{2})^{-1/2}\hat{{\boldsymbol{\beta}}}^{*}_{\tau}. Unlike the elastic net, our L_{0} penalty is not affected by the rescaling \hat{{\boldsymbol{\beta}}}_{\tau} = (1+\lambda_{2})^{-1/2}\hat{{\boldsymbol{\beta}}}^{*}_{\tau} in the way an L_{1} penalty would be, which means that \hat{{\boldsymbol{\beta}}}_{\tau} does not incur a double amount of shrinkage and there is no need to correct the estimator \hat{{\boldsymbol{\beta}}}_{\tau}. This is one of the reasons why we choose the L_{0} penalty rather than L_{1}.
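The augmented quantities W, {\boldsymbol{y}}^{*} and X^{*} defined above can be formed directly from the data. The sketch below (a minimal illustration, assuming the weights J_{\tau}(\hat{F}(y_{i}|X)) have already been computed from some estimate of F) constructs them so that the QEN problem (7) becomes an L_{0}-penalized least squares problem in {\boldsymbol{\beta}}^{*}_{\tau}.

```python
import numpy as np

def augment(X, y, j_weights, lam2):
    """Build W, y* and X* as in Section 2, given raw weights j_weights[i] = J_tau(F_hat(y_i | X))."""
    n, p = X.shape
    w = np.asarray(j_weights, dtype=float) / n            # W = diag(J_tau(F_hat(y_i|X)) / n)
    W = np.diag(w)
    W_half = np.diag(np.sqrt(w))
    y_star = np.concatenate([W_half @ y, np.zeros(p)])    # y* = (y^T W^{1/2}, 0)^T in R^{n+p}
    X_star = np.vstack([W_half @ X,                       # X* = (1+lam2)^{-1/2} (X^T W^{1/2}, lam2^{1/2} I)^T
                        np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    return W, y_star, X_star
```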
As described above, these two equations can be processed by an EM algorithm, which is summarized as Algorithm 1.
Algorithm 1 The EM algorithm
Given \lambda_{1}, \lambda_{2}, a small number \varepsilon, and training data \{ X, {\boldsymbol{y}} \}.
Note that the initialization of {\boldsymbol{\beta}} is not unique; we tend to choose the solution of Eq. (7) with \lambda_{1} = 0, which provides a faster convergence rate. Remarkably, however, Eq. (14) also admits the trivial solution {\boldsymbol{\beta}} = {\bf{0}}.
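The exact E-step and M-step are given by the update equations referenced above. As a rough illustration only, the sketch below shows one common EM-style treatment of the L_{0} term in the spirit of Liu and Li[24]: an E-step that updates the weights {\boldsymbol{\delta}} from the current coefficients and an M-step that solves the resulting weighted ridge system; the specific {\boldsymbol{\delta}} update used here is an illustrative assumption, not the paper's exact formula.

```python
import numpy as np

def em_l0_sketch(X_star, y_star, lam1, beta_init, eps=1e-6, max_iter=500, tol=1e-8):
    """Schematic EM-style solver for min_b ||y* - X* b||^2 + lam1 * ||b||_0 (illustrative only)."""
    beta = np.asarray(beta_init, dtype=float).copy()
    G = X_star.T @ X_star
    c = X_star.T @ y_star
    for _ in range(max_iter):
        # E-step (assumed form): reweight so that beta_j^2 * delta_j approximates the L0 term
        delta = 1.0 / (beta ** 2 + eps)
        # M-step: the objective is quadratic in beta given delta, i.e., a weighted ridge system
        beta_new = np.linalg.solve(G + lam1 * np.diag(delta), c)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0   # treat tiny components as exact zeros, as in Section 4
    return beta
```

Initializing beta_init with the \lambda_{1} = 0 (ridge) solution, as recommended above, helps avoid the trivial zero solution.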
Determination of \lambda_{1} and \lambda_{2}. To implement the algorithm, the tuning parameters \lambda_{1} and \lambda_{2} must be chosen. Because of the huge advantage of L_{0}, we can directly determine \lambda_{1} by the variable selection criteria AIC (n>p) or BIC (n\ll p), taking \lambda_{1} = 2 or \log{n}, respectively, under the optimization problem (8). Then, we pick a grid of values for \lambda_{2}, such as the logspace sequence (0.01, 0.1, 1, 10, 100), and compare the candidate pairs (\lambda_{1}, \lambda_{2}) by k-fold (usually tenfold) cross validation under the optimization problem (7).
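The tuning scheme just described can be sketched as follows; fit_qen_extremile and validation_loss are hypothetical helper functions standing in for the penalized fit and the loss evaluated on held-out data.

```python
import numpy as np

def choose_tuning(X, y, fit_qen_extremile, validation_loss, n_folds=10, seed=0):
    """Fix lambda_1 by AIC/BIC and pick lambda_2 on a logspace grid by k-fold cross validation."""
    n, p = X.shape
    lam1 = 2.0 if n > p else np.log(n)             # lambda_1 = 2 (AIC, n > p) or log n (BIC, n << p)
    lam2_grid = [0.01, 0.1, 1.0, 10.0, 100.0]      # logspace grid for lambda_2
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    best_lam2, best_err = None, np.inf
    for lam2 in lam2_grid:
        cv_err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            beta_hat = fit_qen_extremile(X[train], y[train], lam1, lam2)
            cv_err += validation_loss(X[fold], y[fold], beta_hat)
        if cv_err < best_err:
            best_lam2, best_err = lam2, cv_err
    return lam1, best_lam2
```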
3. Theoretical properties
In this section, we establish the theoretical properties of linear extremile regression with the QEN penalty, including the convergence of the proposed EM algorithm, the grouping effect and the consistency of the estimator.
3.1 Convergence of the algorithm
This subsection focuses on the conditions that the proposed EM algorithm should satisfy; Theorem 3.1 and Lemma 3.1 below ensure its convergence.
Theorem 3.1. Given the response {\boldsymbol{y}}, the predictors X and an initial solution {\boldsymbol{\beta}}^{*(0)}_{\tau} = (\beta^{*(0)}_{\tau, 1}, \cdots, \beta^{*(0)}_{\tau, p}), let D = {\rm diag}(\beta^{*(0)}_{\tau, 1}, \cdots, \beta^{*(0)}_{\tau, p}). When the tuning parameters \lambda_{1} and \lambda_{2} satisfy the required condition, the estimator in (9) will converge to an optimal solution.
Lemma 3.1. Assume that \tilde{{\boldsymbol{x}}}^{\rm T}_{i}W\tilde{{\boldsymbol{x}}}_{j} = 0 for all i \neq j \in \{1, \cdots, p\}. Then the maximum value of \lambda_{1} meets the corresponding condition.
Note that both Theorem 3.1 and Lemma 3.1 impose only mild restrictions on \lambda_{1} and \lambda_{2}. Moreover, Theorem 3.1 indicates that the EM algorithm leads the estimator to converge to an optimal solution under proper \lambda_{1}, \lambda_{2} and initial solution {\boldsymbol{\beta}}^{*(0)}_{\tau}. The initial solution should avoid the trivial solution {\bf{0}}, and we generally choose the solution of extremile regression with the L_{2} penalty as the initial solution to accelerate convergence.
Theorem 3.2 gives the relationship between the final solution of the proposed EM algorithm and the optimization problem with the L_{0} penalty.
Theorem 3.2. Given a proper {\boldsymbol{\beta}}^{*(0)}_{\tau}, the final solution of the proposed EM algorithm is an optimal solution to the L_{0} approximation problem that minimizes (8).
3.2 Grouping effect
In n \ll p problems[27], grouped variables are sometimes important, as discussed by Zou and Hastie[22]. Many approaches have been proposed in the literature, for instance, using principal component analysis to find sets of highly correlated genes[28], using supervised learning methods to select groups of predictive genes[29], and using regularized regression to find grouped genes[30]. Under certain conditions, the proposed QEN encourages a grouping effect just as the elastic net does, which means that the regression coefficients tend to be equal when the corresponding predictors are highly correlated. We have the following lemma.
Lemma 3.2. Consider the weighted optimization problem (17), where w_{i} > 0, \forall i = 1, \cdots, m, are weights, \lambda>0 is a tuning parameter and f(\cdot) is a positive function for {\boldsymbol{\beta}} \ne {\bf{0}}. Assume \tilde{{\boldsymbol{x}}}_{i} = \tilde{{\boldsymbol{x}}}_{j}, i\neq j \in \{1, \cdots, p\}, where \tilde{{\boldsymbol{x}}}_{i} is the i th column of X = ({\boldsymbol{x}}^{\rm T}_{1}, \cdots, {\boldsymbol{x}}^{\rm T}_{n})^{\rm T}. Then we have the following three conclusions:
(a) If f(\cdot) is strictly convex, then \hat{\beta}_{i} = \hat{\beta}_{j} for all \lambda>0.
(b) If f({\boldsymbol{\beta}}) = \| {\boldsymbol{\beta}} \|_{0}, then \hat{\beta}_{i} \hat{\beta}_{j}\geqslant 0, and the solution of (17) is unique if and only if \hat{\beta}_{i} = \hat{\beta}_{j} = 0. Furthermore, if \hat{\beta}_{i} \hat{\beta}_{j}>0, then \hat{{\boldsymbol{\beta}}}^{*} defined in (18) is another minimizer for any s\in (0, 1); and if exactly one of \hat{\beta}_{i} and \hat{\beta}_{j} is 0, we can set s to 0 in (18) to obtain another minimizer.
(c) If f(\cdot) is the QEN penalty, then \hat{\beta}_{i} \hat{\beta}_{j}\geqslant 0. Furthermore, if \hat{\beta}_{i} \hat{\beta}_{j}>0, then \hat{\beta}_{i} = \hat{\beta}_{j}.
This lemma provides the motivation for choosing the QEN penalty rather than L_{0}, as the L_{0} penalty may not lead to a unique solution. Although the proposed QEN penalty is not strictly convex, it guarantees the grouping effect when the corresponding coefficients are nonzero. Theorem 3.3 then quantifies the relationship between the coefficients.
Theorem 3.3. Assume that the response {\boldsymbol{y}} satisfies {\boldsymbol{y}}^{\rm T}W{\boldsymbol{y}} = 1 and that the observation matrix X satisfies \tilde{{\boldsymbol{x}}}^{\rm T}_{i}W\tilde{{\boldsymbol{x}}}_{j} = \rho_{ij}, where \rho_{ij} \in (-1, 1). Suppose \hat{{\boldsymbol{\beta}}}_{\tau} = (\hat{\beta}_{\tau, 1}, \cdots, \hat{\beta}_{\tau, p})^{\rm T} is the minimizer of the optimization problem (7). Then, when \hat{\beta}_{\tau, i} \hat{\beta}_{\tau, j}\neq 0, i\neq j \in \{1, \cdots, p\}, the inequality bounding the distance D(i, j) holds with probability tending to 1.
The distance function D(i, j) measures the difference between \hat{\beta}_{\tau, i} and \hat{\beta}_{\tau, j}: when \rho_{ij} approaches 1, the difference between \hat{\beta}_{\tau, i} and \hat{\beta}_{\tau, j} is almost 0, while \hat{\beta}_{\tau, i} and -\hat{\beta}_{\tau, j} are almost equal when \rho_{ij} is close to -1. It is clear that only the tuning parameter \lambda_{2} of \| {\boldsymbol{\beta}}_{\tau}\|^{2}_{2} affects the upper bound of D(i, j), which reflects the grouping effect of the L_{2} penalty.
It is worth noting that \tilde{{\boldsymbol{x}}}^{\rm T}_{i}W(-\tilde{{\boldsymbol{x}}}_{j}) = -\rho_{ij} is close to 1 when \rho_{ij} is close to -1, and the coefficient corresponding to -\tilde{{\boldsymbol{x}}}_{j} should be -\hat{\beta}_{\tau, j}. This means that \hat{\beta}_{\tau, i} and -\hat{\beta}_{\tau, j} are almost equal, as | \hat{\beta}_{\tau, i}+\hat{\beta}_{\tau, j} | is close to 0 when \rho_{ij} is close to -1.
3.3 Properties of the estimator
In this subsection, we present the properties of the estimator. In the optimization problem (8), the augmented sample size n+p is always larger than the number of predictors p, so (8) remains a low-dimensional problem even when (7) is high-dimensional, which helps relax the regularity conditions. Let {\boldsymbol{\beta}}_{\tau 0} denote the true value and O = \{1\leqslant j\leqslant p: \beta_{\tau 0, j} = 0\}\subset \{ 1, \cdots, p\}. The following conditions are needed for the theoretical properties.
(C1) \log p = o(n) as n\to \infty.
(C2) Given \tau, \epsilon_{i} = y_{i}-{\boldsymbol{x}}^{\rm T}_{i}{\boldsymbol{\beta}}_{\tau 0}, i = 1, \cdots, n, are independent and identically distributed random variables with mean 0 and variance \sigma^{2}<\infty.
(C3) There exists a constant K>0 such that \lambda_{\rm max}\Big(\dfrac{X^{\rm T}WX+\lambda_{2} I}{(1+\lambda_{2})(n+p)}\Big)\leqslant K < \infty, where \lambda_{\rm max}(A) denotes the largest eigenvalue of a matrix A.
(C4) \dfrac{{\rm max}_{1\leqslant j\leqslant p}\tilde{{\boldsymbol{x}}}_{j}^{\rm T}W\tilde{{\boldsymbol{x}}}_{j}+\lambda_{2}}{\sqrt{(1+\lambda_{2})(n+p)}} = O(\sqrt{\log{p(n+p)}}) or O(1) as n, p\to \infty.
These conditions are relatively mild. (C1) restricts the relationship between the sample size n and the number of predictors p, and the proposed method applies to ultrahigh-dimensional cases such as p = \exp{(n^{\alpha})} for 0<\alpha<1. (C2) restricts the distribution of the errors. (C3) is a standard linear regression condition. (C4) is a condition following Liu and Li[24]. (C5) requires the model to be sparse. Under these conditions, we establish the consistency of the proposed estimator \hat{{\boldsymbol{\beta}}}_{\tau} as follows.
Theorem 3.4. Assume that conditions (C1)–(C5) hold and that \lambda_{2} is properly chosen. Let \hat{{\boldsymbol{\beta}}}_{\tau} be the minimizer of (7) under the restriction \|{\boldsymbol{\beta}}_{\tau}\|_{0}\leqslant n(\nu) induced by \lambda_{1}. Then:
(b) with probability tending to 1, \hat{\beta}_{\tau, j} = 0, \forall j\in O.
Theorem 3.4 indicates that, under the mild conditions (C1)–(C5), the proposed estimator is consistent and, with probability tending to 1, its components are zero whenever the corresponding true components are zero.
4. Simulation study
In this section, the first example tests the efficiency of the proposed algorithm for penalized linear extremile regression, the second shows the relationship between the MSE of the proposed QEN estimator and the sample size n, and the last two illustrate the advantages of the QEN. Estimated coefficients are treated as zero if their absolute value is smaller than 10^{-6}. We simulate data from the true model

y = \boldsymbol{x}{\boldsymbol{\beta}}+\epsilon. (24)

The components of \boldsymbol{x} are standard normal in Examples 4.1, 4.2 and 4.3. Within each example, the simulated data consist of a training set, an independent tuning set and a testing set. The training set has sample size n, while the tuning set and the testing set, of sizes n and 100n, respectively, are generated in the same way; the whole process is repeated 100 times to compute average values. The details of each example are as follows.
Example 4.1. We generated data from model (24) with n = 20. We set {\boldsymbol{\beta}} = (3, 1.5, 0, 0, 2, 0, 0, 0) and \epsilon \sim N(0, 1). The pairwise correlation between \tilde{{\boldsymbol{x}}}_{i} and \tilde{{\boldsymbol{x}}}_{j} was set to r^{|i-j|} with r = 0, 0.5, 0.9, and we let \tau = 0.5.
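As an illustration, the design of Example 4.1 can be simulated as in the sketch below (an assumption-level sketch of the stated settings, imposing the correlation r^{|i-j|} through the corresponding covariance matrix).

```python
import numpy as np

def generate_example_4_1(n=20, r=0.5, seed=0):
    """Simulate (X, y) from model (24) with beta = (3, 1.5, 0, 0, 2, 0, 0, 0) and corr(x_i, x_j) = r^|i-j|."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta.size
    Sigma = r ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type correlation matrix
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)              # standard normal components
    y = X @ beta + rng.normal(size=n)                                    # epsilon ~ N(0, 1)
    return X, y, beta
```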
Example 4.2. We generated data from model (24) with n = 10, 100, 1000. We set {\boldsymbol{\beta}} = (3, 1.5, 0, 0, 2, 0, 0, 0) and \epsilon \sim N(0, 1). The pairwise correlation between \tilde{{\boldsymbol{x}}}_{i} and \tilde{{\boldsymbol{x}}}_{j} was set to r^{|i-j|} with r = 0, 0.5, 0.9, and we let \tau = 0.5.
Example 4.3. We generated data from model (24) with n = 20 and \epsilon \sim t(3). The pairwise correlation between \tilde{{\boldsymbol{x}}}_{i} and \tilde{{\boldsymbol{x}}}_{j} was set to r^{|i-j|} with r = 0.5, 0.9, and we let \tau = 0.9, 0.97.
Example 4.4. We generated data from model (24) with n = 20 and grouped predictors constructed with \epsilon_{i}^{x}\stackrel{\text{i.i.d.}}{\sim}N(0, 0.01), i = 1, \cdots, 15, and we let \tau = 0.9, 0.97. In this model, there are three equally important groups of 5 predictors each and 985 pure noise variables. An ideal method would select only the 15 true features and set the coefficients of the 985 noise variables to 0.
In these examples, we consider the following indicators: #SF, the average number of selected predictors; P, the percentage of repetitions in which the true model is selected; MSE, the average mean squared error; CPU-time, the average CPU time in seconds; TP, the average number of estimated nonzero components whose true values are nonzero; and FP, the average number of estimated nonzero components whose true values are zero.
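Given the true and estimated coefficient vectors, the selection indicators can be computed as in the following sketch (components with absolute value below 10^{-6} are treated as zero, as stated above).

```python
import numpy as np

def selection_metrics(beta_true, beta_hat, tol=1e-6):
    """Compute #SF, TP, FP and whether exactly the true model was selected (contributes to P)."""
    selected = np.abs(beta_hat) > tol                 # estimated nonzero components
    truly_nonzero = np.abs(beta_true) > 0             # nonzero components in the true value
    n_sf = int(selected.sum())                        # #SF: number of selected predictors
    tp = int(np.sum(selected & truly_nonzero))        # TP: correctly selected predictors
    fp = int(np.sum(selected & ~truly_nonzero))       # FP: falsely selected predictors
    exact = bool(np.all(selected == truly_nonzero))   # exact recovery of the true model
    return n_sf, tp, fp, exact
```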
Table 1 shows the performance of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under \tau = 0.5, i.e., OLS regression with different penalties, and indicates that the proposed algorithm is efficient. L_{0} and QEN have clear advantages in CPU time, and QEN achieves a higher rate of correct selection than EN. Moreover, L_{0} can be superior to QEN when r is small, but QEN becomes more robust as r increases because of its L_{2} component.
Table 1. Results of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under different r settings, \tau = 0.5, 100 repetitions (standard deviations are shown in parentheses).
Table 2 shows the performance of extremile regression with the QEN penalty under different sample sizes and \tau = 0.5. Although the QEN penalty performs best when r is small, for fixed r both #SF and the frequency of selecting the true model increase with n, and the corresponding MSE decreases significantly.
Table 2. Results of extremile regression with the QEN penalty under different r and n settings, \tau = 0.5, 100 repetitions (standard deviations are shown in parentheses).
Tables 3 and 4 show the performance of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties in general and high-dimensional settings with \tau = 0.9, 0.97. Although L_{1} and EN select more variables than L_{0} and QEN, the TP/FP ratio of QEN is much larger, which means that QEN selects the correct variables more effectively. At the same time, EN and QEN are more robust as \tau increases, and they also perform better when r becomes larger. In the grouped-variable setting, EN and QEN perform better, confirming the grouping effect established in the theoretical properties.
Table 3. Results of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under different r and \tau settings, 100 repetitions (standard deviations are shown in parentheses).
Table 4. Results of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under different \tau settings, 100 repetitions (standard deviations are shown in parentheses).
These simulations all indicate that the QEN penalty is competent at variable selection because of its L_{0} component, which is an advantage of the proposed method for high-dimensional problems. On the other hand, although L_{0} performs no worse than QEN when the correlations between predictors are weak, the proposed QEN performs better as the correlations increase, and its estimator always has a lower MSE. Moreover, the proposed QEN tends to select more predictors than the L_{0} penalty, especially in tail analysis, which means that more information can be retained.
5. Real data analysis
Our real data about communities and crime were downloaded from UCI①. The variables in the dataset, covering various community attributes, were included if there was any plausible connection to crime (p = 122), with the goal attribute PerCapitaViolentCrimes (total number of violent crimes per 100k population); all numeric data were normalized to the decimal range 0.00–1.00 using an unsupervised equal-interval binning method. Redmond and Baveja[31] used this dataset to build a crime similarity system (CSS), an artificial-intelligence software tool that helps police departments develop a strategic viewpoint toward decision-making and uses the socioeconomic, crime and enforcement profiles of cities to generate a list of communities that are the best candidates to cooperate and share experiences. We aim to apply QEN penalized linear extremile regression to select variables that are strongly correlated with PerCapitaViolentCrimes under different extreme values of \tau.
In each repetition, we randomly choose 20 samples from the 319 complete samples as a tuning set and 20 samples as a training set, while the remaining 279 samples are used as the testing set to compute the test error. The performance over 100 repetitions of penalized extremile regression with different penalties and different \tau is summarized in Table 5, where the test error is the estimation error on the testing set and #SF is the average number of selected predictors.
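Each repetition follows the split described above; in the sketch below, fit_tune_and_test is a hypothetical wrapper around the penalized extremile fit (tuned on the tuning set) and the test-error computation.

```python
import numpy as np

def one_repetition(X, y, fit_tune_and_test, seed):
    """One repetition: 20 tuning samples, 20 training samples, 279 testing samples out of 319."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                      # len(y) == 319 complete samples
    tune, train, test = idx[:20], idx[20:40], idx[40:]
    return fit_tune_and_test(X[train], y[train], X[tune], y[tune], X[test], y[test])

# Average the test error and #SF over 100 repetitions, e.g.:
# results = [one_repetition(X, y, fit_tune_and_test, seed=s) for s in range(100)]
```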
Table 5. Results on the Communities and Crime data for extremile regression with the L_{0}, L_{1}, EN and QEN penalties under different \tau settings, 100 repetitions (standard deviations are shown in parentheses).
Table 5 shows the performance of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties over 100 repetitions with \tau = 0.9, 0.95, 0.97 on the Communities and Crime data. The test errors of the different penalties differ little, but QEN selects fewer variables with smaller standard deviations than L_{0} and L_{1} and is therefore more robust.
6. Conclusions
In this paper, we propose a linear extremile regression model with the QEN penalty to select variables in high-dimensional problems. The proposed QEN penalty retains the advantages of the L_{0} and L_{2} penalties, which means that it yields sparse solutions and handles highly correlated predictors. We also propose an EM algorithm that approximately solves optimization problems with the L_{0} penalty and establish the corresponding theoretical properties under mild conditions. In summary, the proposed method has useful implications for extending traditional extremile regression theory.
Future work can proceed in several directions. First, penalized extremile regression can be extended to nonparametric and semiparametric models. Second, the proposed EM algorithm can be applied to the L_{p} penalty with p\in (0, 2), so optimization problems with different penalties can be considered. Finally, methods for determining the true values of the unknown coefficients under different \tau are needed to improve the evaluation standards of variable selection methods in extremile regression.
Acknowledgements
We thank the reviewers for their perceptive comments and suggestions. This work was supported by the National Natural Science Foundation of China (12171450).
Conflict of interest
The authors declare that they have no conflict of interest.
References
[2] Newey W K, Powell J L. Asymmetric least squares estimation and testing. Econometrica, 1987, 55: 819–847. DOI: 10.2307/1911031
[3] Daouia A, Gijbels I, Stupfler G. Extremiles: A new perspective on asymmetric least squares. Journal of the American Statistical Association, 2019, 114 (527): 1366–1381. DOI: 10.1080/01621459.2018.1498348
[4] Allen D M. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 1974, 16 (1): 125–127. DOI: 10.1080/00401706.1974.10489157
[6] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974, 19 (6): 716–723. DOI: 10.1109/TAC.1974.1100705
[7] Schwarz G. Estimating the dimension of a model. The Annals of Statistics, 1978, 6 (2): 461–464. DOI: 10.1214/aos/1176344136
[8] Geisser S, Eddy W F. A predictive approach to model selection. Journal of the American Statistical Association, 1979, 74 (365): 153–160. DOI: 10.1080/01621459.1979.10481632
[9] Devroye L, Wagner T. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 1979, 25 (5): 601–604. DOI: 10.1109/TIT.1979.1056087
[10] Dietterich T G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 1998, 10 (7): 1895–1923. DOI: 10.1162/089976698300017197
[11] Candes E, Tao T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2007, 35 (6): 2313–2351. DOI: 10.1214/009053606000001523
[12] Dicker L, Lin X. Parallelism, uniqueness, and large-sample asymptotics for the Dantzig selector. Canadian Journal of Statistics, 2013, 41 (1): 23–35. DOI: 10.1002/cjs.11151
[13] James G M, Radchenko P, Lv J. DASSO: Connections between the Dantzig selector and lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2009, 71 (1): 127–142. DOI: 10.1111/j.1467-9868.2008.00668.x
[14] Antoniadis A, Fryzlewicz P, Letué F. The Dantzig selector in Cox’s proportional hazards model. Scandinavian Journal of Statistics, 2010, 37 (4): 531–552. DOI: 10.1111/j.1467-9469.2009.00685.x
[15] Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008, 70 (5): 849–911. DOI: 10.1111/j.1467-9868.2008.00674.x
[16] Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 2011, 106 (494): 544–557. DOI: 10.1198/jasa.2011.tm09779
[17] Liu Z, Lin S, Tan M. Sparse support vector machines with Lp penalty for biomarker identification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008, 7 (1): 100–107. DOI: 10.1109/TCBB.2008.17
[18] Mazumder R, Friedman J H, Hastie T. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 2011, 106 (495): 1125–1138. DOI: 10.1198/jasa.2011.tm09738
[19] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996, 58 (1): 267–288. DOI: 10.1111/j.2517-6161.1996.tb02080.x
[20] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001, 96 (456): 1348–1360. DOI: 10.1198/016214501753382273
[21] Zhang C H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 2010, 38 (2): 894–942. DOI: 10.1214/09-AOS729
[22] Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005, 67 (2): 301–320. DOI: 10.1111/j.1467-9868.2005.00503.x
[23] Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 2006, 101 (476): 1418–1429. DOI: 10.1198/016214506000000735
[24] Liu Z, Li G. Efficient regularized regression with L0 penalty for variable selection and network construction. Computational and Mathematical Methods in Medicine, 2016, 2016: 3456153. DOI: 10.1155/2016/3456153
[25] Tihonov A N. Solution of incorrectly formulated problems and the regularization method. Soviet Math., 1963, 4: 1035–1038.
[26] Wang J, Xue L, Zhu L, et al. Estimation for a partial-linear single-index model. The Annals of Statistics, 2010, 38 (1): 246–274. DOI: 10.1214/09-AOS712
[27] West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences, 2001, 98 (20): 11462–11467. DOI: 10.1073/pnas.201162998
[28] Hastie T, Tibshirani R, Eisen M B, et al. ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 2000, 1: research0003.1. DOI: 10.1186/gb-2000-1-2-research0003
[29] Hastie T, Tibshirani R, Botstein D, et al. Supervised harvesting of expression trees. Genome Biology, 2001, 2: research0003.1. DOI: 10.1186/gb-2001-2-1-research0003
[30] Segal M S, Dahlquist K D, Conklin B R. Regression approaches for microarray data analysis. Journal of Computational Biology, 2003, 10 (6): 961–980. DOI: 10.1089/106652703322756177
[31] Redmond M, Baveja A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 2002, 141 (3): 660–678. DOI: 10.1016/S0377-2217(01)00264-8