Mean squared prediction error

In statistics the mean squared prediction error (MSPE), also known as mean squared error of the predictions, of a smoothing, curve fitting, or regression procedure is the expected value of the squared prediction errors (PE), the square difference between the fitted values implied by the predictive function ${\widehat {g}}$ and the values of the (unobservable) true value g. It is an inverse measure of the explanatory power of ${\widehat {g}},$ and can be used in the process of cross-validation of an estimated model. Knowledge of g would be required in order to calculate the MSPE exactly; in practice, MSPE is estimated.^[1]

Formulation

If the smoothing or fitting procedure has projection matrix (i.e., hat matrix) L, which maps the observed values vector $y$ to predicted values vector ${\hat {y}}=Ly,$ then PE and MSPE are formulated as:

\operatorname {PE_{i}} =g(x_{i})-{\widehat {g}}(x_{i}),

\operatorname {MSPE} =\operatorname {E} \left[\operatorname {PE} _{i}^{2}\right]=\sum _{i=1}^{n}\operatorname {PE} _{i}^{2}/n.

The MSPE can be decomposed into two terms: the squared bias (mean error) of the fitted values and the variance of the fitted values:

\operatorname {MSPE} =\operatorname {ME} ^{2}+\operatorname {VAR} ,

\operatorname {ME} =\operatorname {E} \left[{\widehat {g}}(x_{i})-g(x_{i})\right]

\operatorname {VAR} =\operatorname {E} \left[\left({\widehat {g}}(x_{i})-\operatorname {E} \left[{g}(x_{i})\right]\right)^{2}\right].

The quantity $SSPE= n MSPE$ is called sum squared prediction error. The root mean squared prediction error is the square root of MSPE: $RMSPE= \sqrt MSPE$ .

Computation of MSPE over out-of-sample data

The mean squared prediction error can be computed exactly in two contexts. First, with a data sample of length n, the data analyst may run the regression over only q of the data points (with q < n), holding back the other n – q data points with the specific purpose of using them to compute the estimated model’s MSPE out of sample (i.e., not using data that were used in the model estimation process). Since the regression process is tailored to the q in-sample points, normally the in-sample MSPE will be smaller than the out-of-sample one computed over the n – q held-back points. If the increase in the MSPE out of sample compared to in sample is relatively slight, that results in the model being viewed favorably. And if two models are to be compared, the one with the lower MSPE over the n – q out-of-sample data points is viewed more favorably, regardless of the models’ relative in-sample performances. The out-of-sample MSPE in this context is exact for the out-of-sample data points that it was computed over, but is merely an estimate of the model’s MSPE for the mostly unobserved population from which the data were drawn.

Second, as time goes on more data may become available to the data analyst, and then the MSPE can be computed over these new data.

Estimation of MSPE over the population

When the model has been estimated over all available data with none held back, the MSPE of the model over the entire population of mostly unobserved data can be estimated as follows.

For the model $y_{i}=g(x_{i})+\sigma \varepsilon _{i}$ where $\varepsilon _{i}\sim {\mathcal {N}}(0,1)$ , one may write

n\cdot \operatorname {MSPE} (L)=g^{\text{T}}(I-L)^{\text{T}}(I-L)g+\sigma ^{2}\operatorname {tr} \left[L^{\text{T}}L\right].

Using in-sample data values, the first term on the right side is equivalent to

\sum _{i=1}^{n}\left(\operatorname {E} \left[g(x_{i})-{\widehat {g}}(x_{i})\right]\right)^{2}=\operatorname {E} \left[\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}\right]-\sigma ^{2}\operatorname {tr} \left[\left(I-L\right)^{T}\left(I-L\right)\right].

Thus,

n\cdot \operatorname {MSPE} (L)=\operatorname {E} \left[\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}\right]-\sigma ^{2}\left(n-\operatorname {tr} \left[L\right]\right).

If $\sigma ^{2}$ is known or well-estimated by ${\widehat {\sigma }}^{2}$ , it becomes possible to estimate MSPE by

n\cdot \operatorname {\widehat {MSPE}} (L)=\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}-{\widehat {\sigma }}^{2}\left(n-\operatorname {tr} \left[L\right]\right).

Colin Mallows advocated this method in the construction of his model selection statistic C_p, which is a normalized version of the estimated MSPE:

C_{p}={\frac {\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}}{{\widehat {\sigma }}^{2}}}-n+2p.

where p the number of estimated parameters p and ${\widehat {\sigma }}^{2}$ is computed from the version of the model that includes all possible regressors. That concludes this proof.

References

↑ Pindyck, Robert S.; Rubinfeld, Daniel L. (1991). "Forecasting with Time-Series Models". Econometric Models & Economic Forecasts (3rd ed.). New York: McGraw-Hill. pp. 516–535. ISBN 0-07-050098-3.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[1] Pindyck, Robert S.; Rubinfeld, Daniel L. (1991). "Forecasting with Time-Series Models". Econometric Models & Economic Forecasts (3rd ed.). New York: McGraw-Hill. pp. 516–535. ISBN 0-07-050098-3.

[1]

Machine learning evaluation metrics
Regression	MSE MAE sMAPE MAPE MASE MSPE RMS RMSE/RMSD R² MDA MAD
Classification	F-score P4 Accuracy Precision Recall Kappa MCC AUC ROC Sensitivity and specificity Logarithmic Loss
Clustering	Silhouette Calinski-Harabasz index Davies-Bouldin Dunn index Hopkins statistic Jaccard index Rand index Similarity measure SMC SimHash
Ranking	MRR DCG NDCG AP
Computer Vision	PSNR SSIM IoU
NLP	Perplexity BLEU
Deep Learning Related Metrics	Inception score FID
Recommender system	Coverage Intra-list Similarity
Similarity	Cosine similarity Euclidean distance Pearson correlation coefficient
Confusion matrix

Formulation

Computation of MSPE over out-of-sample data

Estimation of MSPE over the population

See also

References