Yimin Xiong is currently a master’s student under the supervision of Professor Weiping Zhang at the University of Science and Technology of China. His research focuses on variable selection.
Weiping Zhang received his Ph.D. degree from the University of Science and Technology of China (USTC), where he is currently a Professor. His research interests mainly focus on longitudinal data analysis and Bayesian analysis.
Graphical Abstract
Relationship between the MSE of estimators in QEN penalized extremile regression and sample size n with τ = 0.5 (left), and TP and FP for different penalized extremile regressions with high-dimensional and grouped data (right).
Abstract
Extremile regression, proposed in recent years, retains the advantage of quantile regression of fully characterizing the sample data through the choice of different levels, while also offering advantages over quantile and expectile regression owing to its explicit expression and more conservative estimation. Here, we propose a linear extremile regression model and introduce a variable selection method using a penalty called the quasi elastic net (QEN) to handle high-dimensional problems. Moreover, we propose an EM algorithm and establish the corresponding theoretical properties under mild conditions. In numerical studies, we compare the QEN penalty with the L0, L1, L2 and elastic net penalties, and the results show that the proposed method is effective and offers certain advantages in analysis.
Public Summary
We propose a quasi elastic net penalized linear extremile regression to deal with high-dimensional data, which leads to a sparse solution as well as being suitable for strongly collinear situations.
We adopt an EM algorithm to solve the L0 approximation problem efficiently, and further solve the quasi elastic net penalized optimization problem.
We demonstrate through numerical studies that the proposed quasi elastic net penalized linear extremile regression model is effective.
Although ordinary least squares (OLS) regression, which estimates the conditional mean, has been one of the most important methods in regression analysis, it can be sensitive to outliers. Koenker and Bassett[1] proposed least absolute deviation regression (LADR) to estimate quantiles, which effectively addresses this problem and analyzes the sample data through the choice of different quantile levels. However, because the absolute value function is not differentiable, the resulting optimization problems are inconvenient to solve. Newey and Powell[2] then proposed asymmetric least squares (ALS) regression, leading to expectiles. Considering that expectile regression lacks an explicit solution, Daouia et al.[3] proposed a weighted least squares regression and defined extremiles. As they discuss, extremiles offer conceptual simplicity, convenient calculation and good properties, and they are more appropriate than quantiles and expectiles as a risk protection tool in tail analysis.
Consider a response Y satisfying {\rm E}|Y|<\infty with cumulative distribution function F. For \tau\in(0,1), the unconditional \tau th extremile of Y is defined as

\xi_{\tau} = \mathop{\arg\min}_{\theta}\, {\rm E}\big[J_{\tau}(F(Y))\,(Y-\theta)^{2}\big], (1)

where J_{\tau}(\cdot)=K'_{\tau}(\cdot) and

K_{\tau}(t)=\begin{cases} 1-(1-t)^{s(\tau)}, & 0<\tau\leqslant \dfrac{1}{2}, \\ t^{s(1-\tau)}, & \dfrac{1}{2}<\tau<1, \end{cases} (2)

with s(\tau)=\dfrac{\log(1/2)}{\log(1-\tau)}. Under this definition, the unconditional \tau th quantile of Y can be obtained from (3). The same result can be obtained when 0<\tau\leqslant \dfrac{1}{2}, which means that (3) is equivalent to the quantile under the definition of K'_{\tau}(\cdot).
As we know, for a random variable Y, the special case \tau=\dfrac{1}{2} of the quantile and the expectile leads to its median and mean, respectively. Then, if we define a random variable Z with cumulative distribution function F_{Z}=K_{\tau}(F), it can be seen that the \tau th quantile and the \tau th extremile of Y represent the median and the mean of Z, respectively, mirroring the central behavior of quantiles and expectiles.
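Because K_{\tau} in (2) is itself a distribution function on [0, 1], the weights J_{\tau}(F(Y)) average to one and the minimizer in (1) is a J_{\tau}-weighted mean. The following sketch (a minimal illustration rather than the authors' code; it approximates J_{\tau} by a finite difference and plugs in the empirical distribution function) computes an unconditional sample extremile.

```python
import numpy as np

def K_tau(t, tau):
    """Distribution function K_tau(t) from Eq. (2)."""
    if tau <= 0.5:
        s = np.log(0.5) / np.log(1.0 - tau)      # s(tau) = log(1/2) / log(1 - tau)
        return 1.0 - (1.0 - t) ** s
    s = np.log(0.5) / np.log(tau)                # s(1 - tau) = log(1/2) / log(tau)
    return t ** s

def J_tau(t, tau, h=1e-6):
    """J_tau = K_tau' approximated by a central finite difference."""
    return (K_tau(np.minimum(t + h, 1.0), tau) - K_tau(np.maximum(t - h, 0.0), tau)) / (2.0 * h)

def sample_extremile(y, tau):
    """Empirical tau-th extremile: J_tau-weighted average with the empirical CDF plugged in."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ranks = np.argsort(np.argsort(y)) + 1        # ranks 1..n
    F_hat = ranks / (n + 1.0)                    # empirical CDF kept strictly inside (0, 1)
    w = J_tau(F_hat, tau)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
y = rng.normal(size=2000)
print(sample_extremile(y, 0.5))                  # close to the mean at tau = 1/2
print(sample_extremile(y, 0.9))                  # shifts toward the right tail
```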
Meanwhile, with the development of the internet and related industries, high-dimensional data have become increasingly common and the computational burden of data analysis has grown accordingly, so effective methods are needed to reduce the computation when extremile regression is applied in this setting. Variable selection is one effective measure, and various methods for selecting variables have been developed.
First, there are criterion-based methods, such as the PRESS criterion[4], the C_{p} criterion[5], the Akaike information criterion (AIC)[6] and the Bayesian information criterion (BIC)[7]. These methods select variables through separate selection and estimation steps, and because the correct variables are unknown, deviations in the selection step affect the result of the estimation step. Geisser and Eddy[8] then proposed the cross validation (CV) method to select variables without assumptions on the parameter estimation, which spawned hold-out validation[9], 5×2 cross validation[10] and other methods. Moreover, Candes and Tao[11] proposed the Dantzig selector (DS) method. Although the DS method does not rely on a specific function, it selects variables under a sparsity assumption on the unknown coefficients. Because the DS method applies the same amount of shrinkage to every coefficient, it may overshrink important coefficients; Dicker and Lin[12] therefore proposed the ADS method with different weights. Later, James et al.[13] proposed a DS method for generalized linear models, and Antoniadis et al.[14] extended the DS method to Cox models. Furthermore, Fan and Lv[15] proposed the sure independence screening (SIS) method to handle ultrahigh-dimensional data, and Fan et al.[16] proposed the improved iterative sure independence screening (ISIS) method to deal with collinearity between predictors. These screening methods ensure that important predictors are selected, measure the marginal correlation between each predictor and the response, and can be extended to varying-coefficient models and additive models. Finally, penalty-based variable selection has been widely used; it achieves selection by solving a corresponding optimization problem, which improves computational efficiency and estimation accuracy. Commonly used penalties include L_{p} with p\in(0,1)[17, 18], the lasso[19], SCAD[20], MCP[21] and the elastic net[22]. Generally, the L_{0} penalty is the most essential sparsity measure, as it directly penalizes the number of nonzero parameters, but optimization with the L_{0} penalty is NP-hard. Although the L_{1} penalty is a common surrogate owing to its convexity, it has limitations; for example, it can sometimes select the wrong model[23].
In this paper, we use the quasi elastic net, which combines the L_{0} and L_{2} penalties, to select variables in linear extremile regression, and we propose an efficient EM algorithm based on Liu and Li[24] that approximately solves the optimization problem with the L_{0} penalty. Notably, if the predictors are highly correlated, the L_{0} penalty alone performs poorly; we therefore incorporate ridge regression[25]. Owing to the properties of the L_{2} penalty, the quasi elastic net can handle highly correlated predictors and encourages a grouping effect. Moreover, we establish several theoretical properties under mild conditions.
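For concreteness, the quasi elastic net penalty referred to throughout can be written as an L_{0} term plus an L_{2} term with separate tuning parameters; the display below is a sketch of this penalty, consistent with the optimization problem introduced in Section 2.

```latex
% Quasi elastic net (QEN) penalty: an L0 term for sparsity plus an L2 (ridge) term
% that stabilizes highly correlated predictors and encourages a grouping effect.
P_{\rm QEN}(\boldsymbol{\beta}_{\tau})
  = \lambda_{1}\,\|\boldsymbol{\beta}_{\tau}\|_{0}
  + \lambda_{2}\,\|\boldsymbol{\beta}_{\tau}\|_{2}^{2},
\qquad
\|\boldsymbol{\beta}_{\tau}\|_{0} = \#\{\,j:\ \beta_{\tau,j}\neq 0\,\},
\quad
\|\boldsymbol{\beta}_{\tau}\|_{2}^{2} = \sum\nolimits_{j=1}^{p}\beta_{\tau,j}^{2}.
```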
The major contributions of this paper are as follows. First, we apply the proposed model to high-dimensional data while retaining its conceptual simplicity and convenient calculation. Second, we propose the quasi elastic net, combining the L_{0} and L_{2} penalties, which both yields a sparse solution and deals with highly correlated predictors. Third, we propose an efficient EM algorithm that approximately solves the optimization problem with the L_{0} penalty, and we establish its theoretical properties.
The remainder of this paper is organized as follows. Section 2 presents our model and explains the method. Section 3 provides the theoretical properties of the method. Sections 4 and 5 compare different methods on simulated and real data. Section 6 concludes. The appendix provides all proofs of the theorems and lemmas.
2. Linear extremile regression with QEN
Consider an n\times 1 response {\boldsymbol{y}} = (y_{1}, \cdots, y_{n})^{\rm T} and an n \times p predictor matrix X = ({\boldsymbol{x}}^{\rm T}_{1}, \cdots, {\boldsymbol{x}}^{\rm T}_{n})^{\rm T}. The linear \tau th extremile regression with a quasi elastic net penalty is defined as

\hat{{\boldsymbol{\beta}}}_{\tau} = \mathop{\arg\min}_{{\boldsymbol{\beta}}_{\tau}}\ \dfrac{1}{n}\sum_{i=1}^{n} J_{\tau}\big(\hat{F}(y_{i}|X)\big)\,(y_{i}-{\boldsymbol{x}}^{\rm T}_{i}{\boldsymbol{\beta}}_{\tau})^{2} + \lambda_{1}\|{\boldsymbol{\beta}}_{\tau}\|_{0} + \lambda_{2}\|{\boldsymbol{\beta}}_{\tau}\|^{2}_{2}, (7)

where \hat{F}(\cdot|X) is an estimated distribution function of {\boldsymbol{y}} given X (we prefer nonparametric and semiparametric estimators, especially the partial-linear single-index model[26], when dealing with high-dimensional data), {\boldsymbol{\beta}}_{\tau} = (\beta_{\tau, 1}, \cdots, \beta_{\tau, p})^{\rm T} \in \mathbb{R}^{p \times 1} is the vector of unknown coefficients, and \lambda_{1}, \lambda_{2}>0 are tuning parameters.
Let R\subset \{ 1, \cdots, p\} be the index set with \beta_{j}\ne 0 and O\subset \{ 1, \cdots, p\} the index set with \beta_{j} = 0, so that R\cup O = \{ 1, \cdots, p\}. Denoting W = {\rm diag}\big(J_{\tau}(\hat{F}(y_{1}|X))/n, \cdots, J_{\tau}(\hat{F}(y_{n}|X))/n\big), {\boldsymbol{y}}^{*} = ({\boldsymbol{y}}^{\rm T}W^{1/2}, {\bf{0}})^{\rm T} \in \mathbb{R}^{(n+p)\times 1}, X^{*} = (1+\lambda_{2})^{-1/2} (X^{\rm T}W^{1/2}, \lambda_{2}^{1/2}I)^{\rm T}\in \mathbb{R}^{(n+p) \times p} and {\boldsymbol{\beta}}^{*}_{\tau} = (1+\lambda_{2})^{1/2}{\boldsymbol{\beta}}_{\tau}, the optimization problem can be rewritten as

\hat{{\boldsymbol{\beta}}}^{*}_{\tau} = \mathop{\arg\min}_{{\boldsymbol{\beta}}^{*}_{\tau}}\ \|{\boldsymbol{y}}^{*}-X^{*}{\boldsymbol{\beta}}^{*}_{\tau}\|^{2}_{2} + \lambda_{1}\|{\boldsymbol{\beta}}^{*}_{\tau}\|_{0}. (8)

Note that the solution of (7) satisfies \hat{{\boldsymbol{\beta}}}_{\tau} = (1+\lambda_{2})^{-1/2}\hat{{\boldsymbol{\beta}}}^{*}_{\tau}. We then focus on the optimization problem (8), from which the following two equations are obtained.
For the first equation, \tilde{L}({\boldsymbol{\beta}}^{*}_{\tau}) is a quadratic function of {\boldsymbol{\beta}}^{*}_{\tau} when {\boldsymbol{\delta}} is known, so for j \in R its first-order partial derivative has an explicit form. It is clear that these two equations can be treated as the M-step and E-step of an EM algorithm. Naturally, the solution of (7) is then obtained through \hat{{\boldsymbol{\beta}}}_{\tau} = (1+\lambda_{2})^{-1/2}\hat{{\boldsymbol{\beta}}}^{*}_{\tau}. Unlike the elastic net, our L_{0} penalty is not affected by the rescaling \hat{{\boldsymbol{\beta}}}_{\tau} = (1+\lambda_{2})^{-1/2}\hat{{\boldsymbol{\beta}}}^{*}_{\tau} in the way an L_{1} penalty would be, which means that \hat{{\boldsymbol{\beta}}}_{\tau} does not incur a double amount of shrinkage and there is no need to correct the estimator \hat{{\boldsymbol{\beta}}}_{\tau}. This is one of the reasons why we choose the L_{0} penalty rather than L_{1}.
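The augmented quantities W, {\boldsymbol{y}}^{*} and X^{*} defined above can be formed directly from the data. The sketch below (a minimal illustration, assuming the weights J_{\tau}(\hat{F}(y_{i}|X)) have already been computed from some estimate of F) constructs them so that the QEN problem (7) becomes an L_{0}-penalized least squares problem in {\boldsymbol{\beta}}^{*}_{\tau}.

```python
import numpy as np

def augment(X, y, j_weights, lam2):
    """Build W, y* and X* as in Section 2, given raw weights j_weights[i] = J_tau(F_hat(y_i | X))."""
    n, p = X.shape
    w = np.asarray(j_weights, dtype=float) / n            # W = diag(J_tau(F_hat(y_i|X)) / n)
    W = np.diag(w)
    W_half = np.diag(np.sqrt(w))
    y_star = np.concatenate([W_half @ y, np.zeros(p)])    # y* = (y^T W^{1/2}, 0)^T in R^{n+p}
    X_star = np.vstack([W_half @ X,                       # X* = (1+lam2)^{-1/2} (X^T W^{1/2}, lam2^{1/2} I)^T
                        np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    return W, y_star, X_star
```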
As described above, these two equations can be processed by an EM algorithm, which is summarized as Algorithm 1.
Algorithm 1 The EM algorithm
Given \lambda_{1}, \lambda_{2}, a small number \varepsilon, and training data \{ X, {\boldsymbol{y}} \}.
Note that the initialization of {\boldsymbol{\beta}} is not unique; we tend to choose the solution of Eq. (7) with \lambda_{1} = 0, which provides a faster convergence rate. Remarkably, however, Eq. (14) also admits the trivial solution {\boldsymbol{\beta}} = {\bf{0}}.
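The exact E-step and M-step are given by the update equations referenced above. As a rough illustration only, the sketch below shows one common EM-style treatment of the L_{0} term in the spirit of Liu and Li[24]: an E-step that updates the weights {\boldsymbol{\delta}} from the current coefficients and an M-step that solves the resulting weighted ridge system; the specific {\boldsymbol{\delta}} update used here is an illustrative assumption, not the paper's exact formula.

```python
import numpy as np

def em_l0_sketch(X_star, y_star, lam1, beta_init, eps=1e-6, max_iter=500, tol=1e-8):
    """Schematic EM-style solver for min_b ||y* - X* b||^2 + lam1 * ||b||_0 (illustrative only)."""
    beta = np.asarray(beta_init, dtype=float).copy()
    G = X_star.T @ X_star
    c = X_star.T @ y_star
    for _ in range(max_iter):
        # E-step (assumed form): reweight so that beta_j^2 * delta_j approximates the L0 term
        delta = 1.0 / (beta ** 2 + eps)
        # M-step: the objective is quadratic in beta given delta, i.e., a weighted ridge system
        beta_new = np.linalg.solve(G + lam1 * np.diag(delta), c)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0   # treat tiny components as exact zeros, as in Section 4
    return beta
```

Initializing beta_init with the \lambda_{1} = 0 (ridge) solution, as recommended above, helps avoid the trivial zero solution.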
Determination of \lambda_{1} and \lambda_{2}. To implement the algorithm, the tuning parameters \lambda_{1} and \lambda_{2} must be chosen. Because of the huge advantage of L_{0}, we can directly determine \lambda_{1} by the variable selection criteria AIC (n>p) or BIC (n\ll p), taking \lambda_{1} = 2 or \log{n}, respectively, under the optimization problem (8). Then, we pick a grid of values for \lambda_{2}, such as the logspace sequence (0.01, 0.1, 1, 10, 100), and compare the candidate pairs (\lambda_{1}, \lambda_{2}) by k-fold (usually tenfold) cross validation under the optimization problem (7).
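The tuning scheme just described can be sketched as follows; fit_qen_extremile and validation_loss are hypothetical helper functions standing in for the penalized fit and the loss evaluated on held-out data.

```python
import numpy as np

def choose_tuning(X, y, fit_qen_extremile, validation_loss, n_folds=10, seed=0):
    """Fix lambda_1 by AIC/BIC and pick lambda_2 on a logspace grid by k-fold cross validation."""
    n, p = X.shape
    lam1 = 2.0 if n > p else np.log(n)             # lambda_1 = 2 (AIC, n > p) or log n (BIC, n << p)
    lam2_grid = [0.01, 0.1, 1.0, 10.0, 100.0]      # logspace grid for lambda_2
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    best_lam2, best_err = None, np.inf
    for lam2 in lam2_grid:
        cv_err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            beta_hat = fit_qen_extremile(X[train], y[train], lam1, lam2)
            cv_err += validation_loss(X[fold], y[fold], beta_hat)
        if cv_err < best_err:
            best_lam2, best_err = lam2, cv_err
    return lam1, best_lam2
```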
3. Theoretical properties
In this section, we establish the theoretical properties of linear extremile regression with the QEN penalty, including the convergence of the proposed EM algorithm, the grouping effect and the consistency of the estimator.
3.1 Convergence of the algorithm
This subsection focuses on the conditions that the proposed EM algorithm should satisfy; Theorem 3.1 and Lemma 3.1 below ensure its convergence.
Theorem 3.1. Given the response {\boldsymbol{y}}, the predictors X and an initial solution {\boldsymbol{\beta}}^{*(0)}_{\tau} = (\beta^{*(0)}_{\tau, 1}, \cdots, \beta^{*(0)}_{\tau, p}), let D = {\rm diag}(\beta^{*(0)}_{\tau, 1}, \cdots, \beta^{*(0)}_{\tau, p}). When the tuning parameters \lambda_{1} and \lambda_{2} satisfy the required condition, the estimator in (9) will converge to an optimal solution.
Lemma 3.1. Assume that \tilde{{\boldsymbol{x}}}^{\rm T}_{i}W\tilde{{\boldsymbol{x}}}_{j} = 0 for all i \neq j \in \{1, \cdots, p\}. Then the maximum value of \lambda_{1} meets the corresponding condition.
Note that both Theorem 3.1 and Lemma 3.1 impose only mild restrictions on \lambda_{1} and \lambda_{2}. Moreover, Theorem 3.1 indicates that the EM algorithm leads the estimator to converge to an optimal solution under proper \lambda_{1}, \lambda_{2} and initial solution {\boldsymbol{\beta}}^{*(0)}_{\tau}. The initial solution should avoid the trivial solution {\bf{0}}, and we generally choose the solution of extremile regression with the L_{2} penalty as the initial solution to accelerate convergence.
Theorem 3.2 gives the relationship between the final solution of the proposed EM algorithm and the optimization problem with the L_{0} penalty.
Theorem 3.2. Given a proper {\boldsymbol{\beta}}^{*(0)}_{\tau}, the final solution of the proposed EM algorithm is an optimal solution to the L_{0} approximation problem that minimizes (8).
3.2 Grouping effect
In n \ll p problems[27], grouped variables are sometimes important, as discussed by Zou and Hastie[22]. Many approaches have been proposed in the literature, for instance, using principal component analysis to find sets of highly correlated genes[28], using supervised learning methods to select groups of predictive genes[29], and using regularized regression to find grouped genes[30]. Under certain conditions, the proposed QEN encourages a grouping effect just as the elastic net does, which means that the regression coefficients tend to be equal when the corresponding predictors are highly correlated. We have the following lemma.
Lemma 3.2. Consider the weighted optimization problem (17), where w_{i} > 0, \forall i = 1, \cdots, m, are weights, \lambda>0 is a tuning parameter and f(\cdot) is a positive function for {\boldsymbol{\beta}} \ne {\bf{0}}. Assume \tilde{{\boldsymbol{x}}}_{i} = \tilde{{\boldsymbol{x}}}_{j}, i\neq j \in \{1, \cdots, p\}, where \tilde{{\boldsymbol{x}}}_{i} is the i th column of X = ({\boldsymbol{x}}^{\rm T}_{1}, \cdots, {\boldsymbol{x}}^{\rm T}_{n})^{\rm T}. Then we have the following three conclusions:
(a) If f(\cdot) is strictly convex, then \hat{\beta}_{i} = \hat{\beta}_{j} for all \lambda>0.
(b) If f({\boldsymbol{\beta}}) = \| {\boldsymbol{\beta}} \|_{0}, then \hat{\beta}_{i} \hat{\beta}_{j}\geqslant 0, and the solution of (17) is unique if and only if \hat{\beta}_{i} = \hat{\beta}_{j} = 0. Furthermore, if \hat{\beta}_{i} \hat{\beta}_{j}>0, then \hat{{\boldsymbol{\beta}}}^{*} defined in (18) is another minimizer for any s\in (0, 1); and if exactly one of \hat{\beta}_{i} and \hat{\beta}_{j} is 0, we can set s to 0 in (18) to obtain another minimizer.
(c) If f(\cdot) is the QEN penalty, then \hat{\beta}_{i} \hat{\beta}_{j}\geqslant 0. Furthermore, if \hat{\beta}_{i} \hat{\beta}_{j}>0, then \hat{\beta}_{i} = \hat{\beta}_{j}.
This lemma provides the motivation for choosing the QEN penalty rather than L_{0}, as the L_{0} penalty may not lead to a unique solution. Although the proposed QEN penalty is not strictly convex, it guarantees the grouping effect when the corresponding coefficients are nonzero. Theorem 3.3 then quantifies the relationship between the coefficients.
Theorem 3.3. Assume that the response {\boldsymbol{y}} satisfies {\boldsymbol{y}}^{\rm T}W{\boldsymbol{y}} = 1 and that the observation matrix X satisfies \tilde{{\boldsymbol{x}}}^{\rm T}_{i}W\tilde{{\boldsymbol{x}}}_{j} = \rho_{ij}, where \rho_{ij} \in (-1, 1). Suppose \hat{{\boldsymbol{\beta}}}_{\tau} = (\hat{\beta}_{\tau, 1}, \cdots, \hat{\beta}_{\tau, p})^{\rm T} is the minimizer of the optimization problem (7). Then, when \hat{\beta}_{\tau, i} \hat{\beta}_{\tau, j}\neq 0, i\neq j \in \{1, \cdots, p\}, the inequality bounding the distance D(i, j) holds with probability tending to 1.
The distance function D(i, j) measures the difference between \hat{\beta}_{\tau, i} and \hat{\beta}_{\tau, j}: when \rho_{ij} approaches 1, the difference between \hat{\beta}_{\tau, i} and \hat{\beta}_{\tau, j} is almost 0, while \hat{\beta}_{\tau, i} and -\hat{\beta}_{\tau, j} are almost equal when \rho_{ij} is close to -1. It is clear that only the tuning parameter \lambda_{2} of \| {\boldsymbol{\beta}}_{\tau}\|^{2}_{2} affects the upper bound of D(i, j), which reflects the grouping effect of the L_{2} penalty.
It is worth noting that \tilde{{\boldsymbol{x}}}^{\rm T}_{i}W(-\tilde{{\boldsymbol{x}}}_{j}) = -\rho_{ij} is close to 1 when \rho_{ij} is close to -1, and the coefficient corresponding to -\tilde{{\boldsymbol{x}}}_{j} should be -\hat{\beta}_{\tau, j}. This means that \hat{\beta}_{\tau, i} and -\hat{\beta}_{\tau, j} are almost equal, as | \hat{\beta}_{\tau, i}+\hat{\beta}_{\tau, j} | is close to 0 when \rho_{ij} is close to -1.
3.3 Properties of the estimator
In this subsection, we present the properties of the estimator. In the optimization problem (8), the augmented sample size n+p is always larger than the number of predictors p, so (8) remains a low-dimensional problem even when (7) is high-dimensional, which helps relax the regularity conditions. Let {\boldsymbol{\beta}}_{\tau 0} denote the true value and O = \{1\leqslant j\leqslant p: \beta_{\tau 0, j} = 0\}\subset \{ 1, \cdots, p\}. The following conditions are needed for the theoretical properties.
(C1) \log p = o(n) as n\to \infty.
(C2) Given \tau, \epsilon_{i} = y_{i}-{\boldsymbol{x}}^{\rm T}_{i}{\boldsymbol{\beta}}_{\tau 0}, i = 1, \cdots, n, are independent and identically distributed random variables with mean 0 and variance \sigma^{2}<\infty.
(C3) There exists a constant K>0 such that \lambda_{\rm max}\Big(\dfrac{X^{\rm T}WX+\lambda_{2} I}{(1+\lambda_{2})(n+p)}\Big)\leqslant K < \infty, where \lambda_{\rm max}(A) denotes the largest eigenvalue of a matrix A.
(C4) \dfrac{{\rm max}_{1\leqslant j\leqslant p}\tilde{{\boldsymbol{x}}}_{j}^{\rm T}W\tilde{{\boldsymbol{x}}}_{j}+\lambda_{2}}{\sqrt{(1+\lambda_{2})(n+p)}} = O(\sqrt{\log{p(n+p)}}) or O(1) as n, p\to \infty.
These conditions are relatively mild. (C1) restricts the relationship between the sample size n and the number of predictors p, and the proposed method applies to ultrahigh-dimensional cases such as p = \exp{(n^{\alpha})} for 0<\alpha<1. (C2) restricts the distribution of the errors. (C3) is a standard linear regression condition. (C4) is a condition following Liu and Li[24]. (C5) requires the model to be sparse. Under these conditions, we establish the consistency of the proposed estimator \hat{{\boldsymbol{\beta}}}_{\tau} as follows.
Theorem 3.4. Assume that conditions (C1)–(C5) hold and that \lambda_{2} is properly chosen. Let \hat{{\boldsymbol{\beta}}}_{\tau} be the minimizer of (7) under the restriction \|{\boldsymbol{\beta}}_{\tau}\|_{0}\leqslant n(\nu) induced by \lambda_{1}. Then:
(b) with probability tending to 1, \hat{\beta}_{\tau, j} = 0, \forall j\in O.
Theorem 3.4 indicates that, under the mild conditions (C1)–(C5), the proposed estimator is consistent and, with probability tending to 1, its components are zero whenever the corresponding true components are zero.
4. Simulation study
In this section, the first example tests the efficiency of the proposed algorithm for penalized linear extremile regression, the second shows the relationship between the MSE of the proposed QEN estimator and the sample size n, and the last two illustrate the advantages of the QEN. Estimated coefficients are treated as zero if their absolute value is smaller than 10^{-6}. We simulate data from the true model

y = \boldsymbol{x}{\boldsymbol{\beta}}+\epsilon. (24)

The components of \boldsymbol{x} are standard normal in Examples 4.1, 4.2 and 4.3. Within each example, the simulated data consist of a training set, an independent tuning set and a testing set. The training set has sample size n, while the tuning set and the testing set, of sizes n and 100n, respectively, are generated in the same way; the whole process is repeated 100 times to compute average values. The details of each example are as follows.
Example 4.1. We generated data from model (24) with n = 20. We set {\boldsymbol{\beta}} = (3, 1.5, 0, 0, 2, 0, 0, 0) and \epsilon \sim N(0, 1). The pairwise correlation between \tilde{{\boldsymbol{x}}}_{i} and \tilde{{\boldsymbol{x}}}_{j} was set to r^{|i-j|} with r = 0, 0.5, 0.9, and we let \tau = 0.5.
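As an illustration, the design of Example 4.1 can be simulated as in the sketch below (an assumption-level sketch of the stated settings, imposing the correlation r^{|i-j|} through the corresponding covariance matrix).

```python
import numpy as np

def generate_example_4_1(n=20, r=0.5, seed=0):
    """Simulate (X, y) from model (24) with beta = (3, 1.5, 0, 0, 2, 0, 0, 0) and corr(x_i, x_j) = r^|i-j|."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta.size
    Sigma = r ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type correlation matrix
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)              # standard normal components
    y = X @ beta + rng.normal(size=n)                                    # epsilon ~ N(0, 1)
    return X, y, beta
```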
Example 4.2. We generated data from model (24) with n = 10, 100, 1000. We set {\boldsymbol{\beta}} = (3, 1.5, 0, 0, 2, 0, 0, 0) and \epsilon \sim N(0, 1). The pairwise correlation between \tilde{{\boldsymbol{x}}}_{i} and \tilde{{\boldsymbol{x}}}_{j} was set to r^{|i-j|} with r = 0, 0.5, 0.9, and we let \tau = 0.5.
Example 4.3. We generated data from model (24) with n = 20 and \epsilon \sim t(3). The pairwise correlation between \tilde{{\boldsymbol{x}}}_{i} and \tilde{{\boldsymbol{x}}}_{j} was set to r^{|i-j|} with r = 0.5, 0.9, and we let \tau = 0.9, 0.97.
Example 4.4. We generated data from model (24) with n = 20 and grouped predictors constructed with \epsilon_{i}^{x}\stackrel{\text{i.i.d.}}{\sim}N(0, 0.01), i = 1, \cdots, 15, and we let \tau = 0.9, 0.97. In this model, there are three equally important groups of 5 predictors each and 985 pure noise variables. An ideal method would select only the 15 true features and set the coefficients of the 985 noise variables to 0.
In these examples, we consider the following indicators: #SF, the average number of selected predictors; P, the percentage of repetitions in which the true model is selected; MSE, the average mean squared error; CPU-time, the average CPU time in seconds; TP, the average number of estimated nonzero components whose true values are nonzero; and FP, the average number of estimated nonzero components whose true values are zero.
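Given the true and estimated coefficient vectors, the selection indicators can be computed as in the following sketch (components with absolute value below 10^{-6} are treated as zero, as stated above).

```python
import numpy as np

def selection_metrics(beta_true, beta_hat, tol=1e-6):
    """Compute #SF, TP, FP and whether exactly the true model was selected (contributes to P)."""
    selected = np.abs(beta_hat) > tol                 # estimated nonzero components
    truly_nonzero = np.abs(beta_true) > 0             # nonzero components in the true value
    n_sf = int(selected.sum())                        # #SF: number of selected predictors
    tp = int(np.sum(selected & truly_nonzero))        # TP: correctly selected predictors
    fp = int(np.sum(selected & ~truly_nonzero))       # FP: falsely selected predictors
    exact = bool(np.all(selected == truly_nonzero))   # exact recovery of the true model
    return n_sf, tp, fp, exact
```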
Table 1 shows the performance of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under \tau = 0.5, i.e., OLS regression with different penalties, and indicates that the proposed algorithm is efficient. L_{0} and QEN have clear advantages in CPU time, and QEN achieves a higher rate of correct selection than EN. Moreover, L_{0} can be superior to QEN when r is small, but QEN becomes more robust as r increases because of its L_{2} component.
Table 1. Results of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under different r settings, \tau = 0.5, 100 repetitions (standard deviations are shown in parentheses).
Table 2 shows the performance of extremile regression with the QEN penalty under different sample sizes and \tau = 0.5. Although the QEN penalty performs best when r is small, for fixed r both #SF and the frequency of selecting the true model increase with n, and the corresponding MSE decreases significantly.
Table 2. Results of extremile regression with the QEN penalty under different r and n settings, \tau = 0.5, 100 repetitions (standard deviations are shown in parentheses).
Tables 3 and 4 show the performance of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties in general and high-dimensional settings with \tau = 0.9, 0.97. Although L_{1} and EN select more variables than L_{0} and QEN, the TP/FP ratio of QEN is much larger, which means that QEN selects the correct variables more effectively. At the same time, EN and QEN are more robust as \tau increases, and they also perform better when r becomes larger. In the grouped-variable setting, EN and QEN perform better, confirming the grouping effect established in the theoretical properties.
Table 3. Results of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under different r and \tau settings, 100 repetitions (standard deviations are shown in parentheses).
Table 4. Results of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties under different \tau settings, 100 repetitions (standard deviations are shown in parentheses).
These simulations all indicate that the QEN penalty is competent at variable selection because of its L_{0} component, which is an advantage of the proposed method for high-dimensional problems. On the other hand, although L_{0} performs no worse than QEN when the correlations between predictors are weak, the proposed QEN performs better as the correlations increase, and its estimator always has a lower MSE. Moreover, the proposed QEN tends to select more predictors than the L_{0} penalty, especially in tail analysis, which means that more information can be retained.
5. Real data analysis
Our real data about communities and crime were downloaded from UCI①. The variables in the dataset, covering various community attributes, were included if there was any plausible connection to crime (p = 122), with the goal attribute PerCapitaViolentCrimes (total number of violent crimes per 100k population); all numeric data were normalized to the decimal range 0.00–1.00 using an unsupervised equal-interval binning method. Redmond and Baveja[31] used this dataset to build a crime similarity system (CSS), an artificial-intelligence software tool that helps police departments develop a strategic viewpoint toward decision-making and uses the socioeconomic, crime and enforcement profiles of cities to generate a list of communities that are the best candidates to cooperate and share experiences. We aim to apply QEN penalized linear extremile regression to select variables that are strongly correlated with PerCapitaViolentCrimes under different extreme values of \tau.
In each repetition, we randomly choose 20 samples from the 319 complete samples as a tuning set and 20 samples as a training set, while the remaining 279 samples are used as the testing set to compute the test error. The performance over 100 repetitions of penalized extremile regression with different penalties and different \tau is summarized in Table 5, where the test error is the estimation error on the testing set and #SF is the average number of selected predictors.
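Each repetition follows the split described above; in the sketch below, fit_tune_and_test is a hypothetical wrapper around the penalized extremile fit (tuned on the tuning set) and the test-error computation.

```python
import numpy as np

def one_repetition(X, y, fit_tune_and_test, seed):
    """One repetition: 20 tuning samples, 20 training samples, 279 testing samples out of 319."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                      # len(y) == 319 complete samples
    tune, train, test = idx[:20], idx[20:40], idx[40:]
    return fit_tune_and_test(X[train], y[train], X[tune], y[tune], X[test], y[test])

# Average the test error and #SF over 100 repetitions, e.g.:
# results = [one_repetition(X, y, fit_tune_and_test, seed=s) for s in range(100)]
```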
Table 5. Results on the Communities and Crime data for extremile regression with the L_{0}, L_{1}, EN and QEN penalties under different \tau settings, 100 repetitions (standard deviations are shown in parentheses).
Table 5 shows the performance of extremile regression without penalty and with the L_{0}, L_{1}, L_{2}, EN and QEN penalties over 100 repetitions with \tau = 0.9, 0.95, 0.97 on the Communities and Crime data. The test errors of the different penalties differ little, but QEN selects fewer variables with smaller standard deviations than L_{0} and L_{1} and is therefore more robust.
6. Conclusions
In this paper, we propose a linear extremile regression model with the QEN penalty to select variables in high-dimensional problems. The proposed QEN penalty retains the advantages of the L_{0} and L_{2} penalties, which means that it yields sparse solutions and handles highly correlated predictors. We also propose an EM algorithm that approximately solves optimization problems with the L_{0} penalty and establish the corresponding theoretical properties under mild conditions. In summary, the proposed method has useful implications for extending traditional extremile regression theory.
Future work can proceed in several directions. First, penalized extremile regression can be extended to nonparametric and semiparametric models. Second, the proposed EM algorithm can be applied to the L_{p} penalty with p\in (0, 2), so optimization problems with different penalties can be considered. Finally, methods for determining the true values of the unknown coefficients under different \tau are needed to improve the evaluation standards of variable selection methods in extremile regression.
Acknowledgements
We thank the reviewers for their perceptive comments and suggestions. This work was supported by the National Natural Science Foundation of China (12171450).
Conflict of interest
The authors declare that they have no conflict of interest.
References
[2] Newey W K, Powell J L. Asymmetric least squares estimation and testing. Econometrica, 1987, 55: 819–847. DOI: 10.2307/1911031
[3] Daouia A, Gijbels I, Stupfler G. Extremiles: A new perspective on asymmetric least squares. Journal of the American Statistical Association, 2019, 114 (527): 1366–1381. DOI: 10.1080/01621459.2018.1498348
[4] Allen D M. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 1974, 16 (1): 125–127. DOI: 10.1080/00401706.1974.10489157
[6] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974, 19 (6): 716–723. DOI: 10.1109/TAC.1974.1100705
[7] Schwarz G. Estimating the dimension of a model. The Annals of Statistics, 1978, 6 (2): 461–464. DOI: 10.1214/aos/1176344136
[8] Geisser S, Eddy W F. A predictive approach to model selection. Journal of the American Statistical Association, 1979, 74 (365): 153–160. DOI: 10.1080/01621459.1979.10481632
[9] Devroye L, Wagner T. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 1979, 25 (5): 601–604. DOI: 10.1109/TIT.1979.1056087
[10] Dietterich T G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 1998, 10 (7): 1895–1923. DOI: 10.1162/089976698300017197
[11] Candes E, Tao T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2007, 35 (6): 2313–2351. DOI: 10.1214/009053606000001523
[12] Dicker L, Lin X. Parallelism, uniqueness, and large-sample asymptotics for the Dantzig selector. Canadian Journal of Statistics, 2013, 41 (1): 23–35. DOI: 10.1002/cjs.11151
[13] James G M, Radchenko P, Lv J. DASSO: Connections between the Dantzig selector and lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2009, 71 (1): 127–142. DOI: 10.1111/j.1467-9868.2008.00668.x
[14] Antoniadis A, Fryzlewicz P, Letué F. The Dantzig selector in Cox’s proportional hazards model. Scandinavian Journal of Statistics, 2010, 37 (4): 531–552. DOI: 10.1111/j.1467-9469.2009.00685.x
[15] Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008, 70 (5): 849–911. DOI: 10.1111/j.1467-9868.2008.00674.x
[16] Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 2011, 106 (494): 544–557. DOI: 10.1198/jasa.2011.tm09779
[17] Liu Z, Lin S, Tan M. Sparse support vector machines with Lp penalty for biomarker identification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008, 7 (1): 100–107. DOI: 10.1109/TCBB.2008.17
[18] Mazumder R, Friedman J H, Hastie T. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 2011, 106 (495): 1125–1138. DOI: 10.1198/jasa.2011.tm09738
[19] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996, 58 (1): 267–288. DOI: 10.1111/j.2517-6161.1996.tb02080.x
[20] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001, 96 (456): 1348–1360. DOI: 10.1198/016214501753382273
[21] Zhang C H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 2010, 38 (2): 894–942. DOI: 10.1214/09-AOS729
[22] Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005, 67 (2): 301–320. DOI: 10.1111/j.1467-9868.2005.00503.x
[23] Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 2006, 101 (476): 1418–1429. DOI: 10.1198/016214506000000735
[24] Liu Z, Li G. Efficient regularized regression with L0 penalty for variable selection and network construction. Computational and Mathematical Methods in Medicine, 2016, 2016: 3456153. DOI: 10.1155/2016/3456153
[25] Tihonov A N. Solution of incorrectly formulated problems and the regularization method. Soviet Math., 1963, 4: 1035–1038.
[26] Wang J, Xue L, Zhu L, et al. Estimation for a partial-linear single-index model. The Annals of Statistics, 2010, 38 (1): 246–274. DOI: 10.1214/09-AOS712
[27] West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences, 2001, 98 (20): 11462–11467. DOI: 10.1073/pnas.201162998
[28] Hastie T, Tibshirani R, Eisen M B, et al. ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 2000, 1: research0003.1. DOI: 10.1186/gb-2000-1-2-research0003
[29] Hastie T, Tibshirani R, Botstein D, et al. Supervised harvesting of expression trees. Genome Biology, 2001, 2: research0003.1. DOI: 10.1186/gb-2001-2-1-research0003
[30] Segal M S, Dahlquist K D, Conklin B R. Regression approaches for microarray data analysis. Journal of Computational Biology, 2003, 10 (6): 961–980. DOI: 10.1089/106652703322756177
[31] Redmond M, Baveja A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 2002, 141 (3): 660–678. DOI: 10.1016/S0377-2217(01)00264-8