Title: Multicollinearity and the solutions
Author: shiyiming    Date: 2012-3-22 21:55
From Dapangmao's blog on sas-analysis
[Figure: scatter plot of y against x2, showing the overall regression line and the x1-grouped partial regression lines - http://4.bp.blogspot.com/--Zz3fr5Fjg8/T2lIOAYHVcI/AAAAAAAAA_o/sd5nmXvLXFA/s1600/SGPlot1.png]
In <a href="http://www.amazon.com/Regression-Methods-Statistics-textbooks-monographs/dp/0824766474/ref=sr_1_1?ie=UTF8&qid=1332172741&sr=8-1">his book</a>, Rudolf Freund described a confounding phenomenon that arises when fitting a linear regression. The small data set below contains three variables: a dependent variable (y) and two independent variables (x1 and x2). Using x2 alone to fit y, the estimated parameter of x2 is positive, 0.78. Using x1 and x2 together to fit y, however, the parameter of x2 becomes -1.29, which is hard to explain since x2 and y clearly have a positive correlation.<br />
<pre><code>
data raw;
input y x1 x2;
cards;
2 0 2
3 2 6
2 2 7
7 2 5
6 4 9
8 4 8
10 4 7
7 6 10
8 6 11
12 6 9
11 8 15
14 8 13
;
run;
ods graphics on / border = off;
proc sgplot data = raw;
   /* overall fit of y on x2 */
   reg x = x2 y = y;
   /* partial fits of y on x2 within each level of x1 */
   reg x = x2 y = y / group = x1 datalabel = x1;
run;
</code></pre>
[Figure: http://3.bp.blogspot.com/-JLtalNH5b40/T2lK8TwzAKI/AAAAAAAAA_w/usCX3SyF9s4/s1600/Presentation1.png]
The reason is that x1 and x2 are strongly correlated with each other. The diagnostics look fine when x2 alone is used to fit y. However, putting x1 and x2 together into the regression model introduces multicollinearity, which shows up as severe heteroskedasticity and a skewed distribution of the residuals, violating the <a href="http://www.sasanalysis.com/2011/07/10-minute-tutorial-for-linear.html">assumptions</a> for OLS regression. As shown in the top scatter plot, 0.78 is the slope of the regression line from y ~ x2 (the longest straight line), while -1.29 is actually the slope of the partial regression lines from y ~ x2 | x1 (the four short segments).<br />
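One quick way to quantify the collinearity (a minimal sketch, not part of the original post; PROC CORR and the VIF and COLLIN options of PROC REG are standard SAS features) is to check the pairwise correlation of the predictors along with the variance inflation factors and condition indices; a common rule of thumb is that a VIF above 10 or a very large condition index signals serious multicollinearity.<br />
<pre><code>
/* A sketch for quantifying the collinearity between x1 and x2 (not in the original post) */
proc corr data = raw;
   var x1 x2;                      /* pairwise Pearson correlation of the predictors */
run;
proc reg data = raw;
   model y = x1 x2 / vif collin;   /* variance inflation factors and condition indices */
   ods select parameterestimates collindiag;
run;
</code></pre>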
<pre><code>
/* fit y with x2 alone: the slope of x2 is 0.78 */
proc reg data = raw;
   model y = x2;
   ods select parameterestimates diagnosticspanel;
run;
/* fit y with x1 and x2 together: the slope of x2 flips to -1.29 */
proc reg data = raw;
   model y = x1 x2;
   ods select parameterestimates diagnosticspanel;
run;
</code></pre><b>Solutions:</b><br />
[Figure: PROC REG diagnostics panel - http://2.bp.blogspot.com/-MSiPPghR-g4/T2lMjrq4JXI/AAAAAAAAA_4/FjthFobzbGs/s1600/DiagnosticsPanel2.png]
1. Drop a variable<br />
Standing alone, x1 seems like a better predictor (higher R-square and lower MSE) than x2. The easiest way to remove this multicollinearity is to keep only x1 in the model.<br />
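To verify that claim, one can fit each predictor by itself and compare the fit statistics; below is a minimal sketch (the model labels single_x1 and single_x2 are added here for readability and are not from the original post).<br />
<pre><code>
/* A sketch for comparing the two single-predictor models side by side */
proc reg data = raw;
   single_x1: model y = x1;   /* expected: higher R-square and lower MSE */
   single_x2: model y = x2;
   ods select fitstatistics;
run;
</code></pre>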
<pre><code>
/* solution 1: keep only x1 in the model */
proc reg data = raw;
   model y = x1;
   ods select parameterestimates diagnosticspanel;
run;
</code></pre>
[Figure: scree plot from PROC PRINCOMP - http://2.bp.blogspot.com/-XCWo81NBAdk/T2eOVsSWEPI/AAAAAAAAA_U/Bj_C83lilsY/s1600/ScreePlot5.png]
2. Principal component regression<br />
If we want to keep both variables to avoid losing information, principal component regression is a good option. PCA transforms the correlated variables into orthogonal factors. In this case, the first principal component explains 97.77% of the total variance, which is sufficient for the subsequent regression. SAS's <a href="http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_pls_sect004.htm">PLS procedure</a> can also perform principal component regression.<br />
<br />
<pre><code>
/* extract orthogonal principal components from x1 and x2 */
proc princomp data = raw out = pca;
   ods select screeplot corr eigenvalues eigenvectors;
   var x1 x2;
run;
/* regress y on the first principal component only */
proc reg data = pca;
   model y = prin1;
   ods select parameterestimates diagnosticspanel;
run;
</code></pre>
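Since the post mentions that the PLS procedure can do the same job, here is a minimal sketch of principal component regression through PROC PLS (METHOD=PCR and NFAC= are standard PROC PLS options; keeping one factor mirrors the PRINCOMP approach above).<br />
<pre><code>
/* A sketch of principal component regression with PROC PLS */
proc pls data = raw method = pcr nfac = 1;
   model y = x1 x2;
run;
</code></pre>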