Lasso algorithm

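In statistics and machine learning, the lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to improve the prediction accuracy and interpretability of a statistical model. It was proposed in 1996 by Robert Tibshirani, professor of statistics at Stanford University, building on Leo Breiman's nonnegative garrote (NNG).[1][2] The lasso was originally formulated for least squares models. This simple case reveals many important properties of the estimator, such as its relationship to ridge regression (also known as Tikhonov regularization) and to best subset selection, and the connection between lasso coefficient estimates and soft thresholding. It also shows that, as in standard linear regression, the lasso coefficient estimates need not be unique when covariates are collinear.

Although it was first defined for least squares, lasso regularization extends in a straightforward way to many statistical models, including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[3][4] The lasso's ability to select a subset of covariates depends on the form of the constraint and can be interpreted in several ways, including in terms of geometry, Bayesian statistics, and convex analysis.

The lasso is closely related to basis pursuit denoising.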

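History

Tibshirani originally introduced the lasso to improve the prediction accuracy and interpretability of regression models. He modified the model fitting process so that only a subset of the covariates is used in the final model rather than all of them. The method builds on Breiman's nonnegative garrote, which has a similar purpose but works somewhat differently.

Before the lasso, the most widely used method for choosing covariates was stepwise selection. That approach improves prediction accuracy only in certain cases, for example when a few covariates have a strong relationship with the outcome; in other cases it can make prediction worse. At the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression reduces prediction error by shrinking large regression coefficients, which reduces overfitting. It does not select covariates, however, so it does not help to make the model simpler or easier to interpret.

The lasso combines both approaches. It forces the sum of the absolute values of the regression coefficients to be less than a fixed value, which forces some coefficients to zero and thereby selects a simpler model that excludes the corresponding covariates. This is similar to ridge regression, in which the sum of the squares of the coefficients is forced to be less than a fixed value. The difference is that ridge regression only shrinks the coefficients and does not set any of them to zero.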

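Basic form

The lasso was originally designed for least squares, and the least squares case illustrates many of its properties in a simple setting.

Least squares

Consider a sample consisting of N cases, each of which consists of p covariates and a single outcome. Let $y_{i}$ be the outcome and $x_{i}:=(x_{i1},x_{i2},\ldots ,x_{ip})^{T}$ be the covariate vector for the i-th case. The lasso then solves

$\min _{\beta _{0},\beta }\left\{{\frac {1}{N}}\sum _{i=1}^{N}(y_{i}-\beta _{0}-x_{i}^{T}\beta )^{2}\right\}$ subject to $\sum _{j=1}^{p}|\beta _{j}|\leq t$ .[1]

Here $t$ is a prespecified free parameter that determines the degree of regularization. Let $X$ be the covariate matrix, so that $X_{ij}=(x_{i})_{j}$ and $x_{i}^{T}$ is the i-th row of $X$ . The expression can then be written more compactly as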

\min _{\beta _{0},\beta }\left\{{\frac {1}{N}}\left\|y-\beta _{0}1_{N}-X\beta \right\|_{2}^{2}\right\}{\text{ subject to }}\|\beta \|_{1}\leq t

where $\|u\|_{p}=\left(\sum _{i}|u_{i}|^{p}\right)^{1/p}$ is the standard $\ell ^{p}$ norm (the sum runs over all entries of $u$ ) and $1_{N}$ is an $N\times 1$ vector of ones.

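Writing ${\bar {x}}$ for the mean of the covariate vectors $x_{i}$ and ${\bar {y}}$ for the mean of the outcomes $y_{i}$ , the estimate of the intercept is ${\hat {\beta }}_{0}={\bar {y}}-{\bar {x}}^{T}\beta$ , so that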

y_{i}-{\hat {\beta }}_{0}-x_{i}^{T}\beta =y_{i}-({\bar {y}}-{\bar {x}}^{T}\beta )-x_{i}^{T}\beta =(y_{i}-{\bar {y}})-(x_{i}-{\bar {x}})^{T}\beta ,

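For this reason it is standard to work with centered (zero-mean) variables. The covariates are also typically standardized $\textstyle \left(\sum _{i=1}^{N}x_{ij}^{2}=1\right)$ so that the solution does not depend on the measurement scale.

With centered variables, the objective can be rewritten as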

\min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}\right\}{\text{ subject to }}\|\beta \|_{1}\leq t.

Its Lagrangian form is

\min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}+\lambda \|\beta \|_{1}\right\}

where the exact relationship between $t$ and $\lambda$ depends on the data.
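The Lagrangian form can be minimized numerically in several ways. The following is a minimal sketch using proximal gradient descent (ISTA), which is one common choice and not necessarily the method used in the cited papers. The function names, the synthetic data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Move each entry of z towards zero by alpha, stopping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def lasso_objective(X, y, b, lam):
    """(1/N) * ||y - X b||_2^2 + lam * ||b||_1, as in the Lagrangian form."""
    return np.sum((y - X @ b) ** 2) / X.shape[0] + lam * np.sum(np.abs(b))

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize the Lagrangian lasso objective by proximal gradient steps."""
    N, p = X.shape
    # Lipschitz constant of the gradient of the smooth part (1/N)||y - Xb||^2.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / N
    step = 1.0 / lipschitz
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = (2.0 / N) * X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Synthetic, centered data (an assumption made for illustration only).
rng = np.random.default_rng(0)
N, p, lam = 100, 8, 0.5
X = rng.standard_normal((N, p))
X -= X.mean(axis=0)
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(N)
y -= y.mean()  # centering removes the need for an intercept

b_hat = lasso_ista(X, y, lam)
t_implied = np.sum(np.abs(b_hat))  # constraint level t matching this lam on this data
print(b_hat, t_implied)
```

The last line illustrates the data dependence of the link between $t$ and $\lambda$ : the value of $t$ that corresponds to a given $\lambda$ is only known once the problem has been solved on the data at hand.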

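Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

Assume first that the covariates are orthonormal, so that $(x_{i}\mid x_{j})=\delta _{ij}$ , where $(\cdot \mid \cdot )$ is the inner product and $\delta _{ij}$ is the Kronecker delta. Equivalently, in matrix form, $X^{T}X=I$ . Using subgradient methods, it can then be shown that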

{\begin{aligned}{\hat {\beta }}_{j}={}&S_{N\lambda /2}({\hat {\beta }}_{j}^{\text{OLS}})={\hat {\beta }}_{j}^{\text{OLS}}\max \left(0,1-{\frac {N\lambda }{2|{\hat {\beta }}_{j}^{\text{OLS}}|}}\right)\\&{\text{ where }}{\hat {\beta }}^{\text{OLS}}=(X^{T}X)^{-1}X^{T}y=X^{T}y\end{aligned}}

[1]

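$S_{\alpha }$ denotes the soft thresholding operator, which moves values towards zero by $\alpha$ and sets them exactly to zero once they are small enough. The closely related hard thresholding operator, denoted $H_{\alpha }$ , instead sets small values to zero and leaves larger values unchanged.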
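The two operators can be sketched in a few lines. The example below also checks the orthonormal-design solution numerically. For the objective as written above (a factor $1/N$ on the squared error and no factor $1/2$ ), setting the subgradient to zero gives a soft threshold of $N\lambda /2$ , and that is the value the sketch uses. The data, names, and `lam` are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, alpha):
    """S_alpha: shrink towards zero by alpha, exact zero when |z| <= alpha."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def hard_threshold(z, alpha):
    """H_alpha: keep entries with |z| >= alpha unchanged, zero out the rest."""
    return np.where(np.abs(z) >= alpha, z, 0.0)

z = np.array([-3.0, -0.5, 0.2, 1.0, 2.5])
print(soft_threshold(z, 1.0))
print(hard_threshold(z, 1.0))

# Orthonormal design: the reduced QR factor has orthonormal columns, so X^T X = I.
rng = np.random.default_rng(1)
N, p, lam = 50, 4, 0.02
X, _ = np.linalg.qr(rng.standard_normal((N, p)))
y = rng.standard_normal(N)

b_ols = X.T @ y                               # OLS solution when X^T X = I
b_lasso = soft_threshold(b_ols, N * lam / 2)  # closed-form lasso solution

def objective(b):
    """Lagrangian lasso objective (1/N)||y - Xb||^2 + lam*||b||_1."""
    return np.sum((y - X @ b) ** 2) / N + lam * np.sum(np.abs(b))

print(b_ols, b_lasso, objective(b_lasso))
```

Because the objective is convex, any small perturbation of `b_lasso` should not decrease `objective`, which gives a simple way to sanity-check the closed form.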

This can be compared with ridge regression, where the objective is to minimize

\min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\|y-X\beta \|_{2}^{2}+\lambda \|\beta \|_{2}^{2}\right\}

which yields

{\hat {\beta }}_{j}=(1+N\lambda )^{-1}{\hat {\beta }}_{j}^{\text{OLS}}.

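Ridge regression therefore shrinks all OLS coefficients by the same factor $(1+N\lambda )^{-1}$ and performs no variable selection.

The lasso can also be compared with best subset selection, where the objective is to minimize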

\min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {1}{N}}\left\|y-X\beta \right\|_{2}^{2}+\lambda \|\beta \|_{0}\right\}

where $\|\cdot \|_{0}$ is the " $\ell ^{0}$ norm", defined as the number of nonzero entries of the vector. In this case it can be shown that

{\hat {\beta }}_{j}=H_{\sqrt {N\lambda }}\left({\hat {\beta }}_{j}^{\text{OLS}}\right)={\hat {\beta }}_{j}^{\text{OLS}}\mathrm {I} \left(\left|{\hat {\beta }}_{j}^{\text{OLS}}\right|\geq {\sqrt {N\lambda }}\right)

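where $H_{\alpha }$ is the hard thresholding operator and $\mathrm {I}$ is the indicator function (1 if its argument is true and 0 otherwise).

In summary, the lasso estimate shares features of both ridge regression and best subset selection: it shrinks the magnitude of the coefficients, as ridge regression does, and it sets some of them to zero, as best subset selection does. Moreover, whereas ridge regression scales every coefficient by a constant factor, the lasso moves every coefficient towards zero by a constant amount and sets it to zero once it reaches zero.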
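A small numerical comparison of the three closed-form solutions under orthonormal covariates may make the difference concrete. The OLS coefficients, `N`, and `lam` below are made-up values for illustration. The lasso threshold is taken as $N\lambda /2$ , which is what the objective with the $1/N$ factor implies.

```python
import numpy as np

# Hypothetical OLS coefficients under an orthonormal design.
b_ols = np.array([3.0, -1.5, 0.4, -0.1])
N, lam = 10, 0.1  # so N * lam = 1

# Lasso: translate every coefficient towards zero by N*lam/2, clipping at zero.
b_lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - N * lam / 2, 0.0)

# Ridge: scale every coefficient by the same factor 1 / (1 + N*lam).
b_ridge = b_ols / (1.0 + N * lam)

# Best subset: keep coefficients with |b| >= sqrt(N*lam) unchanged, drop the rest.
b_subset = np.where(np.abs(b_ols) >= np.sqrt(N * lam), b_ols, 0.0)

for name, b in [("lasso", b_lasso), ("ridge", b_ridge), ("best subset", b_subset)]:
    print(f"{name:12s}", b, "zeros:", int(np.sum(b == 0)))
```

In this example ridge leaves every coefficient nonzero, best subset keeps the surviving coefficients at their OLS values, and the lasso both shrinks the survivors and zeroes the small ones.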

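General form

Lasso regularization can be extended to other objective functions, such as those for generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.[1][5] Given the objective function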

{\frac {1}{N}}\sum _{i=1}^{N}f(x_{i},y_{i},\alpha ,\beta )

the lasso-regularized version of the estimator is the solution to

\min _{\alpha ,\beta }{\frac {1}{N}}\sum _{i=1}^{N}f(x_{i},y_{i},\alpha ,\beta ){\text{ subject to }}\|\beta \|_{1}\leq t

Here only $\beta$ is penalized, while $\alpha$ is free to take any allowed value, just as $\beta _{0}$ was not penalized in the basic case.
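As one instance of the general form, $f$ can be taken to be the logistic loss, which gives $\ell ^{1}$ -penalized logistic regression. The sketch below uses the penalized (Lagrangian) version of the problem rather than the constrained one, and solves it by proximal gradient descent. Both choices, along with all names and the synthetic data, are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, a):
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def penalized_loss(alpha, beta, X, y, lam):
    """Mean logistic loss f plus lam * ||beta||_1 (alpha is not penalized)."""
    eta = alpha + X @ beta
    return np.mean(np.logaddexp(0.0, eta) - y * eta) + lam * np.sum(np.abs(beta))

def l1_logistic(X, y, lam, n_iter=2000):
    """Proximal gradient for l1-penalized logistic regression with labels in {0, 1}."""
    N, p = X.shape
    A = np.column_stack([np.ones(N), X])
    # The logistic Hessian is bounded by A^T A / (4N), which gives a safe step size.
    step = 4.0 * N / np.linalg.norm(A, 2) ** 2
    alpha, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        resid = 1.0 / (1.0 + np.exp(-(alpha + X @ beta))) - y
        alpha -= step * resid.mean()  # plain gradient step: no penalty on alpha
        beta = soft_threshold(beta - step * (X.T @ resid) / N, step * lam)
    return alpha, beta

rng = np.random.default_rng(0)
N, p = 200, 6
X = rng.standard_normal((N, p))
eta_true = 0.5 + X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-eta_true))).astype(float)

alpha_hat, beta_hat = l1_logistic(X, y, lam=0.05)
_, beta_big = l1_logistic(X, y, lam=10.0)  # a very large penalty zeroes all of beta
print(alpha_hat, beta_hat, beta_big)
```

The intercept `alpha` is updated by an ordinary gradient step, while `beta` goes through the soft-thresholding step, which mirrors the statement that only $\beta$ is penalized.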

Interpretations

Geometric interpretation

Forms of the constraint regions for lasso and ridge regression.

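The lasso can set some coefficients to zero, while ridge regression cannot. Geometrically, this is due to the different shapes of their constraint boundaries. Both methods can be interpreted as minimizing the same objective function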

\min _{\beta _{0},\beta }\left\{{\frac {1}{N}}\left\|y-\beta _{0}-X\beta \right\|_{2}^{2}\right\}

but under different constraints: $\|\beta \|_{1}\leq t$ for the lasso and $\|\beta \|_{2}^{2}\leq t$ for ridge regression.

The figure shows that the constraint region defined by the $\ell ^{1}$ norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the $\ell ^{2}$ norm is a circle (in general an n-sphere), which is rotationally invariant and therefore has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to touch a corner (or a higher-dimensional equivalent) of the cross-polytope, at which some components of $\beta$ are exactly zero. On an n-sphere, by contrast, the boundary points at which some components of $\beta$ are zero are not distinguished from the others, so the convex object is no more likely to touch a point where some components of $\beta$ are zero than one where none of them are.
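The role of the corners can be illustrated with the simplest constrained problem, in which $X$ is the identity and the task is to find the point of the constraint region closest to a given vector `v`. The sketch below projects the same point onto an $\ell ^{1}$ ball and onto an $\ell ^{2}$ ball. The sort-based $\ell ^{1}$ projection is a standard technique but is not taken from the source, and the radius convention ( $\|\beta \|_{2}\leq t$ rather than $\|\beta \|_{2}^{2}\leq t$ ) and the test point are assumptions.

```python
import numpy as np

def project_l2_ball(v, t):
    """Closest point to v in {b : ||b||_2 <= t}: rescale if v is outside."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= t else v * (t / norm)

def project_l1_ball(v, t):
    """Closest point to v in {b : ||b||_1 <= t} via sorting and soft thresholding."""
    if np.sum(np.abs(v)) <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]           # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - t) / k > 0)[0][-1]
    theta = (css[rho] - t) / (rho + 1)     # amount by which every entry is shrunk
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

v = np.array([2.0, 0.3])
p1 = project_l1_ball(v, 1.0)  # lands on a corner of the rotated square
p2 = project_l2_ball(v, 1.0)  # lands on the circle, both components nonzero
print(p1, p2)
```

For this `v` the $\ell ^{1}$ projection sits at the corner where the second coordinate is exactly zero, while the $\ell ^{2}$ projection only rescales `v` and keeps both coordinates nonzero.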

Making λ easier to interpret with an accuracy-simplicity tradeoff

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of $\lambda$ .[6] It is assumed that $X$ is standardized with z-scores and that $y$ is centered (zero mean). Let $\beta _{0}$ represent the hypothesized regression coefficients and let $b_{OLS}$ refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values.[7] This results in

\min _{\beta \in \mathbb {R} ^{p}}\left\{{\frac {(y-X\beta )'(y-X\beta )}{(y-X\beta _{0})'(y-X\beta _{0})}}+2\lambda \sum _{i=1}^{p}{\frac {|\beta _{i}-\beta _{0,i}|}{q_{i}}}\right\}

where $q_{i}$ is specified below. The first fraction represents relative accuracy, the second fraction relative simplicity, and $\lambda$ balances between the two.

Solution paths for the $\ell _{1}$ norm and $\ell _{2}$ norm when $b_{OLS}=2$ and $\beta _{0}=0$ .

Given a single regressor, relative simplicity can be defined by specifying $q_{i}$ as $|b_{OLS}-\beta _{0}|$ , which is the maximum amount of deviation from $\beta _{0}$ when $\lambda =0$ . Assuming that $\beta _{0}=0$ , the solution path can be defined in terms of $R^{2}$ :

b_{\ell _{1}}={\begin{cases}(1-\lambda /R^{2})b_{OLS}&{\mbox{if }}\lambda \leq R^{2},\\0&{\mbox{if }}\lambda >R^{2}.\end{cases}}

If $\lambda =0$ , the ordinary least squares solution (OLS) is used. The hypothesized value of $\beta _{0}=0$ is selected if $\lambda$ is bigger than $R^{2}$ . Furthermore, if $R^{2}=1$ , then $\lambda$ represents the proportional influence of $\beta _{0}=0$ . In other words, $\lambda \times 100\%$ measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution.

If an $\ell _{2}$ -norm is used to penalize deviations from zero given a single regressor, the solution path is given by

b_{\ell _{2}}={\bigg (}1+{\frac {\lambda }{R^{2}(1-\lambda )}}{\bigg )}^{-1}b_{OLS}.

Like $b_{\ell _{1}}$ , $b_{\ell _{2}}$ moves in the direction of the point $(\lambda =R^{2},b=0)$ when $\lambda$ is close to zero; but unlike $b_{\ell _{1}}$ , the influence of $R^{2}$ diminishes in $b_{\ell _{2}}$ if $\lambda$ increases (see figure).
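The two single-regressor solution paths can be evaluated directly from the formulas above. In the sketch below $\beta _{0}=0$ and $b_{OLS}=2$ follow the figure caption, while the value $R^{2}=0.5$ and the grid of $\lambda$ values are assumptions picked for illustration.

```python
import numpy as np

def b_l1(lam, r2, b_ols):
    """Rescaled l1 path: linear shrinkage that reaches zero at lam = R^2."""
    lam = np.asarray(lam, dtype=float)
    return np.where(lam <= r2, (1.0 - lam / r2) * b_ols, 0.0)

def b_l2(lam, r2, b_ols):
    """Rescaled l2 path, defined for 0 <= lam < 1; it approaches zero but never reaches it."""
    lam = np.asarray(lam, dtype=float)
    return b_ols / (1.0 + lam / (r2 * (1.0 - lam)))

lams = np.linspace(0.0, 0.9, 10)
path_l1 = b_l1(lams, r2=0.5, b_ols=2.0)
path_l2 = b_l2(lams, r2=0.5, b_ols=2.0)
for lam, a, b in zip(lams, path_l1, path_l2):
    print(f"lambda={lam:.1f}  b_l1={a:.3f}  b_l2={b:.3f}")
```

Both paths start at the OLS value when $\lambda =0$ . The $\ell _{1}$ path is exactly zero for every $\lambda \geq R^{2}$ , whereas the $\ell _{2}$ path stays strictly positive over the whole grid.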
Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from $\beta _{0}$ ) is also determined by a regressor's contribution to $R^{2}$ accuracy. First,

R^{2}=1-{\frac {(y-Xb)'(y-Xb)}{(y-X\beta _{0})'(y-X\beta _{0})}}.

An $R^{2}$ of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized $\beta _{0}$ values. The individual contribution of deviating from each hypothesis can be computed with the $p\times p$ matrix

R^{\otimes }=(X'{\tilde {y}}_{0})(X'{\tilde {y}}_{0})'(X'X)^{-1}({\tilde {y}}_{0}'{\tilde {y}}_{0})^{-1},

where ${\tilde {y}}_{0}=y-X\beta _{0}$ . If $b=b_{OLS}$ when $R^{2}$ is computed, then the diagonal elements of $R^{\otimes }$ sum to $R^{2}$ . The diagonal $R^{\otimes }$ values may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the $i^{th}$ diagonal element of $R^{\otimes }$ simply corresponds to the $r^{2}$ value between $x_{i}$ and $y$ .
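The claim that the diagonal of $R^{\otimes }$ sums to $R^{2}$ at the OLS solution can be checked numerically. The sketch below builds $R^{\otimes }$ from the formula above on synthetic data; the data-generating values and the choice $\beta _{0}=0$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize with z-scores
y = X @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(N)
y = y - y.mean()                          # center the outcome
beta0 = np.zeros(p)                       # hypothesized coefficients

y0 = y - X @ beta0                        # y-tilde_0 in the text
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b_ols
r2 = 1.0 - (resid @ resid) / (y0 @ y0)

g = X.T @ y0                              # the vector X' y-tilde_0
R_otimes = np.outer(g, g) @ np.linalg.inv(X.T @ X) / (y0 @ y0)

print("R^2            :", r2)
print("trace(R_otimes):", np.trace(R_otimes))
print("diagonal       :", np.diag(R_otimes))
```

The trace equals $R^{2}$ because it reduces to the squared length of the projection of ${\tilde {y}}_{0}$ onto the column space of $X$ , divided by ${\tilde {y}}_{0}'{\tilde {y}}_{0}$ .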

A rescaled version of the adaptive lasso can be obtained by setting $q_{{\mbox{adaptive lasso}},i}=|b_{OLS,i}-\beta _{0,i}|$ .[8] If regressors are uncorrelated, the moment that the $i^{th}$ parameter is activated is given by the $i^{th}$ diagonal element of $R^{\otimes }$ . Assuming for convenience that $\beta _{0}$ is a vector of zeros,

b_{i}={\begin{cases}(1-\lambda /R_{ii}^{\otimes })b_{OLS,i}&{\mbox{if }}\lambda \leq R_{ii}^{\otimes },\\0&{\mbox{if }}\lambda >R_{ii}^{\otimes }.\end{cases}}

That is, if regressors are uncorrelated, $\lambda$ again specifies the minimal influence of $\beta _{0}$ . Even when regressors are correlated, the first time that a regression parameter is activated occurs when $\lambda$ is equal to the highest diagonal element of $R^{\otimes }$ .

These results can be compared to a rescaled version of the lasso by defining $q_{{\mbox{lasso}},i}={\frac {1}{p}}\sum _{l}|b_{OLS,l}-\beta _{0,l}|$ , which is the average absolute deviation of $b_{OLS}$ from $\beta _{0}$ . Assuming that regressors are uncorrelated, then the moment of activation of the $i^{th}$ regressor is given by

{\tilde {\lambda }}_{{\text{lasso}},i}={\frac {1}{p}}{\sqrt {R_{ii}^{\otimes }}}\sum _{l=1}^{p}{\sqrt {R_{ll}^{\otimes }}}.

For $p=1$ , the moment of activation is again given by ${\tilde {\lambda }}_{{\text{lasso}},i}=R^{2}$ . If $\beta _{0}$ is a vector of zeros and a subset of $p_{B}$ relevant parameters are equally responsible for a perfect fit of $R^{2}=1$ , then this subset is activated at a $\lambda$ value of ${\frac {1}{p}}$ . The moment of activation of a relevant regressor then equals ${\frac {1}{p}}{\frac {1}{\sqrt {p_{B}}}}p_{B}{\frac {1}{\sqrt {p_{B}}}}={\frac {1}{p}}$ . In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.[6]
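The delay caused by irrelevant regressors can be reproduced with the activation formulas for uncorrelated regressors. In the sketch below the diagonal of $R^{\otimes }$ is supplied directly; the values $p=10$ and $p_{B}=4$ and the function names are assumptions for illustration.

```python
import numpy as np

def lasso_activation(r_diag):
    """Activation moments of the rescaled lasso for uncorrelated regressors.

    r_diag holds the diagonal of R_otimes; in the uncorrelated case these are
    r^2 values, so they are assumed to be nonnegative here.
    """
    r_diag = np.asarray(r_diag, dtype=float)
    root = np.sqrt(r_diag)
    return root * root.sum() / r_diag.size

def adaptive_lasso_activation(r_diag):
    """For the rescaled adaptive lasso the activation moment is the diagonal itself."""
    return np.asarray(r_diag, dtype=float)

p, p_B = 10, 4
r_diag = np.zeros(p)
r_diag[:p_B] = 1.0 / p_B  # p_B relevant regressors share a perfect fit, R^2 = 1

print(lasso_activation(r_diag))           # relevant regressors activate at 1/p
print(adaptive_lasso_activation(r_diag))  # relevant regressors activate at 1/p_B
```

With these inputs the relevant regressors are activated at $1/p=0.1$ by the rescaled lasso but at $1/p_{B}=0.25$ by the rescaled adaptive lasso, so adding irrelevant regressors (larger $p$ ) pushes the lasso activation to smaller $\lambda$ values while leaving the adaptive lasso unchanged.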

Bayesian interpretation

Laplace distributions are sharply peaked at their mean with more probability density concentrated there compared to a normal distribution.

Just as ridge regression can be interpreted as linear regression for which the coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression for which the coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.[1]
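The correspondence can be made explicit with a short derivation. It assumes independent Gaussian noise with variance $\sigma ^{2}$ and independent Laplace priors with scale $\tau$ on each coefficient; both symbols are introduced here for illustration and do not appear in the source.

```latex
\begin{aligned}
p(\beta \mid y) &\propto \exp\!\left(-\frac{\|y-X\beta\|_2^2}{2\sigma^2}\right)
  \prod_{j=1}^{p}\frac{1}{2\tau}\exp\!\left(-\frac{|\beta_j|}{\tau}\right),\\
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\|y-X\beta\|_2^2
  +\frac{1}{\tau}\|\beta\|_1+\text{const},\\
\hat\beta_{\text{MAP}} &= \arg\min_{\beta}\left\{\frac{1}{N}\|y-X\beta\|_2^2
  +\lambda\|\beta\|_1\right\},\qquad \lambda=\frac{2\sigma^2}{N\tau}.
\end{aligned}
```

Under these assumptions the posterior mode coincides with the lasso estimate in Lagrangian form, and a tighter prior (smaller $\tau$ ) corresponds to a larger penalty $\lambda$ . Replacing the Laplace prior with a normal prior turns the $\ell ^{1}$ term into a squared $\ell ^{2}$ term, which gives ridge regression.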

Convex relaxation interpretation

Lasso can also be viewed as a convex relaxation of the best subset selection regression problem, which is to find the subset of $\leq k$ covariates that results in the smallest value of the objective function for some fixed $k\leq n$ , where $n$ is the total number of covariates. The " $\ell ^{0}$ norm", $\|\cdot \|_{0}$ (the number of nonzero entries of a vector), is the limiting case of " $\ell ^{p}$ norms" of the form $\textstyle \|x\|_{p}=\left(\sum _{i=1}^{n}|x_{i}|^{p}\right)^{1/p}$ (the quotation marks signify that these are not really norms for $p<1$ , since $\|\cdot \|_{p}$ is not convex for $p<1$ and the triangle inequality does not hold). Therefore, since $p=1$ is the smallest value for which the " $\ell ^{p}$ norm" is convex (and therefore actually a norm), lasso is, in some sense, the best convex approximation to the best subset selection problem, since the region defined by $\|x\|_{1}\leq t$ is the convex hull of the region defined by $\|x\|_{p}\leq t$ for $p<1$ .
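Two of the claims above are easy to check numerically: the triangle inequality fails for $p<1$ , and the sum $\sum _{i}|x_{i}|^{p}$ approaches the number of nonzero entries as $p$ tends to zero. The vectors and exponents in this sketch are arbitrary illustrative choices.

```python
import numpy as np

def lp(x, p):
    """The 'l^p norm' (sum |x_i|^p)^(1/p); a true norm only for p >= 1."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.sum(x ** p) ** (1.0 / p)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
for p in (0.5, 1.0, 2.0):
    lhs, rhs = lp(x + y, p), lp(x, p) + lp(y, p)
    print(f"p={p}: ||x+y||={lhs:.3f}  ||x||+||y||={rhs:.3f}  triangle holds: {lhs <= rhs}")

# As p -> 0, sum |v_i|^p tends to the count of nonzero entries (the 'l^0 norm').
v = np.array([3.0, 0.0, -0.5])
print(np.sum(np.abs(v) ** 1e-6), np.count_nonzero(v))
```

For $p=0.5$ the left-hand side is 4 while the right-hand side is 2, so the triangle inequality is violated, whereas it holds for $p=1$ and $p=2$ .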

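Applications

The lasso has been applied in economics and finance, where it has been found to improve prediction and to select variables that are sometimes overlooked. Examples include corporate bankruptcy prediction[9] and high-growth firm prediction.[10]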

References

[1] Tibshirani, Robert. 1996. "Regression Shrinkage and Selection via the lasso". Journal of the Royal Statistical Society. Series B (Methodological) 58 (1). Wiley: 267–88. http://www.jstor.org/stable/2346178.

[2] Breiman, Leo. . Technometrics. 1995-11-01, 37 (4): 373–384 [2017-10-06]. ISSN 0040-1706. doi:10.2307/1269730. (Archived from the original on 2020-06-08.)

[3] Tibshirani, Robert. . Journal of the Royal Statistical Society. Series B (Methodological). 1996, 58 (1): 267–288 [2016-07-25]. (Archived from the original on 2020-11-17.)

[4] Tibshirani, Robert. . Statistics in Medicine. 1997-02-28, 16 (4): 385–395. ISSN 1097-0258. doi:10.1002/(sici)1097-0258(19970228)16:4%3C385::aid-sim380%3E3.0.co;2-3.

[6] Hoornweg, Victor. . . Hoornweg Press. 2018 [2023-08-08]. ISBN 978-90-829188-0-9. (Archived from the original on 2023-11-02.)

[7] Motamedi, Fahimeh; Sanchez, Horacio; Mehri, Alireza; Ghasemi, Fahimeh. . Bioinformatics. October 2021, 37 (19): 469–475. ISSN 1367-4803. PMID 34979024. doi:10.1093/bioinformatics/btab659.

[8] Zou, Hui. (PDF). 2006 [2023-08-08]. (Archived (PDF) from the original on 2021-07-11.)

[9] Shaonan, Tian; Yu, Yan; Guo, Hui. . Journal of Banking & Finance. 2015, 52 (1): 89–100. doi:10.1016/j.jbankfin.2014.12.003.

[10] Coad, Alex; Srhoj, Stjepan. . Small Business Economics. 2020, 55 (1): 541–565. doi:10.1007/s11187-019-00203-3.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
