|
楼主

楼主 |
发表于 2012-3-14 04:21:33
|
只看该作者
Count Data Models in SAS®
From LCChien's blog on blogspot
原文載點:[<a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cts=1331653599525&ved=0CCkQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.176.1766%26rep%3Drep1%26type%3Dpdf&ei=jGVfT4CNM6mnsAKy19j7Bw&usg=AFQjCNHGgPrkhBj4iPh30Ig5Hh251j5jQg&sig2=lxnDgMUyWfiKn0hC1L4vvw">LINK</a>]<br /><br />在統計應用領域,我們經常會遇到所謂的"計數資料"(count data),比方說就診人數,死亡人數,顧客人數等等。在分析這些計數資料時,最常使用的模型莫過於卜瓦松模型(Poisson model)。但在實際的應用上,使用者經常需要面對兩大問題:一是過度離散(overdispersion)問題,二是觀測值有太多零(excess zeros)的問題。這些問題通常都會導致卜瓦松模型估計的結果有偏誤。因此 WenSui Liu 和 Jimmy Cela 在 2008 年 SAS Global Forum 發表了一篇技術文件,介紹一些其他比較適合處理計數資料的模型。<br /><br /><a name='more'></a><b><span style="font-size: large;">。資料</span></b><br />該文使用的資料是 Deb and Trivedi (1997) 所發表的一篇關於醫療保健在老人上的利用研究所使用的資料。裡面包含了:<br />應變數 -<br />1. HOSP = 住院次數<br />因變數 -<br />2. EXCLHLTH = 自我健康評量(1=極佳,0=其他)<br />3. POORHLTH = 自我健康評量(1=極差,0=其他)<br />4. NUMCHRON = 慢性病數目<br />5. AGE = 年齡<br />6. MALE = 性別(1=男性,0=女性)<br />7. SCHOOL = 教育程度(以年為單位)<br />8. PRIVINS = 保險(1=有健保,0=沒有健保)<br /><br />由於應變數 HOSP 的變異數是平均值的兩倍,這表示該筆資料有過度離散的問題。此外,4406 份樣本中,有 3541 人沒有任何住院記錄,所以這筆資料也有太多零的問題。我們從下面這張圖也可以看出一些端倪:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111676671_6438414" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="344" src="http://www.diigo.com/item/t/2043829_111676671_6438414" width="640" /></a></div><br />這張圖是將樣本的觀測機率(紅線)以及實際的卜瓦松分配的機率密度函數(藍線)拿來比較。我們可以發現,在 HOSP = 1 處,紅線和藍線有著極大的差距。因此,接下來將這筆資料利用數個不同的模型來配飾,來比較估計參數的不同以及觀測機率和預測機率的差異。<br /><br /><b><span style="font-size: large;">。卜瓦松模型</span></b><br />程式:<br /><code>/* METHOD 1: PROC NLMIXED */<br />proc nlmixed data = tblNMES;<br /> parms b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;<br /> mu = exp(b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE + <br /> b5 * MALE + b6 * SCHOOL + b7 * PRIVINS);<br /> ll = -mu + HOSP * log(mu) - log(fact(HOSP));<br /> model HOSP ~ general(ll);<br /> predict mu out = poi_out (rename = (pred = Yhat)); <br />run;<br /><br />/* METHOD 2: PROC COUNTREG */<br />proc countreg data = tblNMES type = poisson; <br /> model HOSP = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS;<br />run; </code><br /><br />原作者使用比較複雜的 PROC NLMIXED 和 PROC GOUNTREG 程序來完成模型配飾。這兩種方法與 PROC GENMOD 做出來的結果應該會一致。簡單來講,在 PROC NLMIXED 程序裡面,要先設定每個參數的初始值(params),然後把還沒加上 link function 前的公式打出來(mu),然後把 log-likelihood 寫好(ll),最後才是配飾模型(model)。而 PROC COUNTREG 程序則比較單純,只要把後面的 type= 設定成 poisson 即可。<br /><br />報表:<br /><code> Model Fit Summary<br /> Log Likelihood -3046<br /> AIC 6108<br /> SBC 6159<br /><br /> Parameter Estimates<br /><br /> Standard Approx<br /> Parameter Estimate Error t Value Pr > |t|<br /> Intercept -3.329044 0.339728 -9.80 <.0001<br /> exclhlth -0.723412 0.175644 -4.12 <.0001<br /> poorhlth 0.626157 0.067858 9.23 <.0001<br /> numchron 0.264462 0.018277 14.47 <.0001<br /> age 0.186406 0.042014 4.44 <.0001<br /> male 0.103186 0.056274 1.83 0.0667<br /> school -0.000206 0.007871 -0.03 0.9791<br /> privins 0.108652 0.069251 1.57 0.1167</code><br /><br /><br />觀測機率與預測機率比較圖:<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111695391_6439481" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="336" src="http://www.diigo.com/item/t/2043829_111695391_6439481" width="640" /></a></div><br />毫不意外地,在沒有用任何調整的情況下用卜瓦松模型去估計,則預測機率值在 HOSP = 0 和 HOSP = 1 處明顯地可發現與實際機率值有差異。當然利用這種圖來評量過度離散是比較主觀。作者提到了兩種簡單的方法來檢定。在此不額外說明演算過程。重點是這兩種檢定方法會產生一個 p-value 供判斷使用。只要出現 p-value < 0.05 就表示過度離散的情況存在。<br /><br /><b>(1) Cameron and Trivedi (1996)</b><br />程式:<br /><code>data ols_tmp;<br /><br /> set poi_out; <br /> dep = ((HOSP - Yhat) ** 2 - HOSP) / Yhat; <br />run;<br /><br />proc reg data = ols_tmp;<br /> model dep = Yhat / noint; /* FIT A OLS REGRESSION WITHOUT INTERCEPT */<br />run; quit;</code><br /><br />報表:<br /><code> Parameter Estimates<br /> Parameter Standard<br /> Variable Label DF Estimate Error t Value Pr > |t|<br /> Yhat Predicted Value 1 1.63419 0.22609 7.23 <b><span style="color: red;"><.0001</span></b></code><br /><br />用此法檢定出來的 p-value < .0001,所以表示過度離散的現象存在於這筆資料中。<br /><br /><b>(2) Greene (2002)</b><br />程式:<br /><code>proc iml;<br /> use poi_out; <br /> read all var {HOSP} into y;<br /> read all var {Yhat} into yhat;<br /> close poisson_out;<br /> e = (y - yhat);<br /> n = nrow(y);<br /> ybar = y`[, :];<br /> LM = (e` * e - n * ybar) ** 2 / (2 * yhat` * yhat); <br /> Pvalue = 1 - probchi(LM, 1);<br /> print LM Pvalue;<br />quit;</code><br /><br />報表:<br /><code> LM PVALUE<br /> 794.14707 <b><span style="color: red;">0</span></b></code><br /><br />同樣地,用此法算出來的 p-value 逼近於零,所以最後預測機率值和觀測機率值出現落差是完全不意外。<br /><br /><b><span style="font-size: large;">。負二項模型</span></b><br />負二項模型的 PROC NLMIXED 程序內的寫法跟卜瓦松模型很像,唯一要注意的就是他的 log-likelihood 長得不一樣。看起來是複雜些。所以還是用 PROC COUNTREG 程序來做比較簡單。 程式如下:<br /><code>/* METHOD 1: PROC NLMIXED */<br />proc nlmixed data = tblNMES;<br /> parms b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;<br /> mu = exp(b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE + <br /> b5 * MALE + b6 * SCHOOL + b7 * PRIVINS);<br /> <span style="color: red;">ll = lgamma(HOSP + 1 / alpha) - lgamma(HOSP + 1) - lgamma(1 / alpha) +<br /> HOSP * log(alpha * mu) - <br /> (HOSP + 1 / alpha) * log(1 + alpha * mu);</span><br /> model HOSP ~ general(ll);<br /> predict mu out = nb_out (rename = (pred = Yhat)); <br />run;<br /><br />/* METHOD 2: PROC COUNTREG */<br />proc countreg data = tblNMES type = negativebinom method = qn; <br /> model HOSP = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS;<br />run;</code><br /><br />報表:<br /><code> Model Fit Summary<br /> Log Likelihood -2857<br /> AIC 5731<br /> SBC 5789<br /><br /> Parameter Estimates<br /> Standard Approx<br /> Parameter Estimate Error t Value Pr > |t|<br /> Intercept -3.752640 0.446835 -8.40 <.0001<br /> exclhlth -0.697875 0.193318 -3.61 0.0003<br /> poorhlth 0.613926 0.095392 6.44 <.0001<br /> numchron 0.289418 0.026470 10.93 <.0001<br /> age 0.238444 0.055265 4.31 <.0001<br /> male 0.153862 0.073033 2.11 0.0351<br /> school -0.002271 0.010203 -0.22 0.8238<br /> privins 0.093922 0.090494 1.04 0.2993<br /> _Alpha 1.766727 0.160492 11.01 <.0001</code><br /><br />觀測機率與預測機率比較圖:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111681693_6438630" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="348" src="http://www.diigo.com/item/t/2043829_111681693_6438630" width="640" /></a></div>從上圖可以明顯發現,用負二項模型來估計顯然是比較好的,因為觀測機率值和預測機率值兩條線幾乎重疊在一起。<br /><br /><b><span style="font-size: large;">。跨欄(Hurdle)模型</span></b><br />跨欄模型是 Dr. Mullahy 於 1986 年研發,他是假設當應變數是零,則服從二項分配,當大於零時,則服從另一個機率密度函數長得很像卜瓦松分配的分配(該分配沒有實際名字)。其機率密度函數可以用下面公式來表示:<br /><br /><div style="text-align: center;"><a href="http://www.codecogs.com/eqnedit.php?latex=\large%20f(Y_{i}|X_{i})=\left\{\begin{matrix}%20\theta_{i}%20&%20for\;%20Y_{i}=0\\%20&%20\frac{(1-\theta_{i})exp(-u_{i})u_{i}^{Y_{i}}}{(1-exp(-u_{i})Y_{i}!)}\;%20for\;%20Y_{i}%3E0%20\end{matrix}\left." target="_blank"><img src="http://latex.codecogs.com/gif.latex?\large f(Y_{i}|X_{i})=\left\{\begin{matrix} \theta_{i} & for\; Y_{i}=0\\ & \frac{(1-\theta_{i})exp(-u_{i})u_{i}^{Y_{i}}}{(1-exp(-u_{i})Y_{i}!)}\; for\; Y_{i}>0 \end{matrix}\left." title="\large f(Y_{i}|X_{i})=\left\{\begin{matrix} \theta_{i} & for\; Y_{i}=0\\ & \frac{(1-\theta_{i})exp(-u_{i})u_{i}^{Y_{i}}}{(1-exp(-u_{i})Y_{i}!)}\; for\; Y_{i}>0 \end{matrix}\left." /></a></div><br />因此其 log-likelihood 就變得非常複雜:<br /><div style="text-align: center;"><a href="http://www.codecogs.com/eqnedit.php?latex=\large%20LL=\sum_{i=1}^{n}[I(Y_{i}=0)log(\theta_{i})@plus;I(Y_{i}%3E0)(log(1-\theta_{i})-u_{i}@plus;Y_{i}log(u_{i})-log(1-exp(-u_{i}))-log(Y_{i}!))]" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\large LL=\sum_{i=1}^{n}[I(Y_{i}=0)log(\theta_{i})+I(Y_{i}>0)(log(1-\theta_{i})-u_{i}+Y_{i}log(u_{i})-log(1-exp(-u_{i}))-log(Y_{i}!))]" title="\large LL=\sum_{i=1}^{n}[I(Y_{i}=0)log(\theta_{i})+I(Y_{i}>0)(log(1-\theta_{i})-u_{i}+Y_{i}log(u_{i})-log(1-exp(-u_{i}))-log(Y_{i}!))]" /></a></div><br />所以這個模型只能用 PROC NLMIXED 來處理。程式如下:<br /><code>/* METHOD 1: PROC NLMIXED */<br />proc nlmixed data = tblNMES tech = dbldog;<br /> parms a0 = 0 a1 = 0 a2 = 0 a3 = 0 a4 = 0 a5 = 0 a6 = 0 a7 = 0<br /> b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;<br /> eta0 = a0 + a1 * EXCLHLTH + a2 * POORHLTH + a3 * NUMCHRON + a4 * AGE + <br /> a5 * MALE + a6 * SCHOOL + a7 * PRIVINS;<br /> exp_eta0 = exp(eta0);<br /> p0 = exp_eta0 / (1 + exp_eta0);<br /> etap = b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE + <br /> b5 * MALE + b6 * SCHOOL + b7 * PRIVINS;<br /> exp_etap = exp(etap);<br /> if HOSP = 0 then ll = log(p0);<br /> else ll = log(1 - p0) - exp_etap + HOSP * etap - lgamma(HOSP + 1)<br /> - log(1 - exp(-exp_etap));<br /> model HOSP ~ general(ll);<br /> predict exp_etap out = hdl_out1 (keep = pred HOSP rename = (pred = Yhat));<br /> predict p0 out = hdl_out2 (keep = pred rename = (pred = p0));<br />run;</code><br /><br />報表:<br /><code> Fit Statistics<br /> -2 Log Likelihood 5758.4<br /> AIC (smaller is better) 5790.4<br /> AICC (smaller is better) 5790.6<br /> BIC (smaller is better) 5892.7<br /><br /> Parameter Estimates<br /><br /> Standard<br /> Parameter Estimate Error DF t Value Pr > |t| Alpha<br /> a0 4.2311 0.4889 4406 8.65 <.0001 0.05<br /> a1 0.5826 0.1991 4406 2.93 0.0035 0.05<br /> a2 -0.6953 0.1073 4406 -6.48 <.0001 0.05<br /> a3 -0.3078 0.02890 4406 -10.65 <.0001 0.05<br /> a4 -0.2752 0.06061 4406 -4.54 <.0001 0.05<br /> a5 -0.1948 0.08008 4406 -2.43 0.0151 0.05<br /> a6 -0.00593 0.01126 4406 -0.53 0.5982 0.05<br /> a7 -0.01924 0.09944 4406 -0.19 0.8466 0.05<br /> b0 -0.4693 0.5627 4406 -0.83 0.4043 0.05<br /> b1 -0.9422 0.4949 4406 -1.90 0.0570 0.05<br /> b2 0.3373 0.1008 4406 3.35 0.0008 0.05<br /> b3 0.1426 0.02967 4406 4.81 <.0001 0.05<br /> b4 -0.01229 0.06834 4406 -0.18 0.8573 0.05<br /> b5 -0.03854 0.09227 4406 -0.42 0.6762 0.05<br /> b6 -0.01815 0.01290 4406 -1.41 0.1597 0.05<br /> b7 0.2589 0.1139 4406 2.27 0.0231 0.05</code><br /><br /><br />觀測機率與預測機率比較圖:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111682160_6438659" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="348" src="http://www.diigo.com/item/t/2043829_111682160_6438659" width="640" /></a></div><br />上圖顯示預測機率值跟觀測機率值也相當接近。<br /><br /><b><span style="font-size: large;">。零膨脹(Zero-inflated: ZIP)模型</span></b><br />以下皆簡稱 ZIP 模型。此模型是專門處理觀測值裡面有太多零的問題。在 PROC NLMIXED 和 PROC COUNTREG 程序裡面都可以處理。特別注意的是,在 PROC COUNTREG 程序裡面,除了要把 type 設定成 zip 外,還要在 model 後面加上額外的指令。如下面紅色程式碼所示。<br /><br />程式:<br /><code>/* METHOD 1: PROC COUNTREG */<br />proc countreg data = tblNMES type = zip;<br /> model HOSP = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS<br /> / <span style="color: red;">zi(link = logistic, var = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS)</span>;<br />run;<br /><br />/* METHOD 2: PROC NLMIXED */<br />proc nlmixed data = tblNMES tech = dbldog;<br /> parms a0 = 0 a1 = 0 a2 = 0 a3 = 0 a4 = 0 a5 = 0 a6 = 0 a7 = 0<br /> b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;<br /> eta0 = a0 + a1 * EXCLHLTH + a2 * POORHLTH + a3 * NUMCHRON + a4 * AGE + <br /> a5 * MALE + a6 * SCHOOL + a7 * PRIVINS;<br /> exp_eta0 = exp(eta0);<br /> p0 = exp_eta0 / (1 + exp_eta0);<br /> etap = b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE + <br /> b5 * MALE + b6 * SCHOOL + b7 * PRIVINS;<br /> exp_etap = exp(etap);<br /> if HOSP = 0 then ll = log(p0 + (1 - p0) * exp(-exp_etap));<br /> else ll = log(1 - p0) + HOSP * etap - exp_etap - lgamma(HOSP + 1);<br /> model HOSP ~ general(ll);<br /> predict exp_etap out = zip_out1 (keep = pred HOSP rename = (pred = Yhat));<br /> predict p0 out = zip_out2 (keep = pred rename = (pred = p0));<br />run;</code><br /><br /><br />報表:<br /><code> Model Fit Summary<br /> Log Likelihood -2878<br /> AIC 5788<br /> SBC 5890<br /><br /> Parameter Estimates<br /> Standard Approx<br /> Parameter Estimate Error t Value Pr > |t|<br /> Intercept -0.366506 0.572032 -0.64 0.5217<br /> exclhlth -0.919990 0.458460 -2.01 0.0448<br /> poorhlth 0.324926 0.101157 3.21 0.0013<br /> numchron 0.127746 0.033867 3.77 0.0002<br /> age -0.024359 0.068806 -0.35 0.7233<br /> male -0.059629 0.099133 -0.60 0.5475<br /> school -0.012473 0.013520 -0.92 0.3562<br /> privins 0.229208 0.114004 2.01 0.0444<br /> Inf_Intercept 4.265976 0.971218 4.39 <.0001<br /> Inf_exclhlth -0.369944 0.717395 -0.52 0.6061<br /> Inf_poorhlth -0.589745 0.195174 -3.02 0.0025<br /> Inf_numchron -0.280116 0.062396 -4.49 <.0001<br /> Inf_age -0.405962 0.119765 -3.39 0.0007<br /> Inf_male -0.334773 0.162429 -2.06 0.0393<br /> Inf_school -0.019390 0.022126 -0.88 0.3808<br /> Inf_privins 0.224859 0.196133 1.15 0.2516</code><br /><br /><br />觀測機率與預測機率比較圖:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111682382_6438671" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="344" src="http://www.diigo.com/item/t/2043829_111682382_6438671" width="640" /></a></div><br />從上圖看來 ZIP 模型的表現也不錯!<br /><br /><b><span style="font-size: large;">。ZIP-tau 模型</span></b><br />這是上面 ZIP 模型的延伸。他的表現比 ZIP 模型好一點點,因為 AIC 比較小,但也沒有差太多。程式如下:<br /><code>/* METHOD 1: PROC NLMIXED */<br />proc nlmixed data = tblNMES;<br /> parms b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0 tau = 1;<br /> eta0 = tau * (b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE + <br /> b5 * MALE + b6 * SCHOOL + b7 * PRIVINS);<br /> exp_eta0 = exp(eta0);<br /> p0 = exp_eta0 / (1 + exp_eta0);<br /> etap = b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE + <br /> b5 * MALE + b6 * SCHOOL + b7 * PRIVINS;<br /> exp_etap = exp(etap);<br /> if HOSP = 0 then ll = log(p0 + (1 - p0) * exp(-exp_etap));<br /> else ll = log(1 - p0) + HOSP * etap - exp_etap - lgamma(HOSP + 1);<br /> model HOSP ~ general(ll);<br /> predict exp_etap out = zip_out1 (keep = pred HOSP rename = (pred = Yhat));<br /> predict p0 out = zip_out2 (keep = pred rename = (pred = p0));<br />run; </code><br /><br /><br />報表:<br /><code> Fit Statistics<br /> -2 Log Likelihood 5768.7<br /> AIC (smaller is better) 5786.7<br /> AICC (smaller is better) 5786.7<br /> BIC (smaller is better) 5844.2<br /><br /> Parameter Estimates<br /> Standard<br /> Parameter Estimate Error DF t Value Pr > |t| Alpha<br /> b0 -1.3944 0.2698 4406 -5.17 <.0001 0.05<br /> b1 -0.2685 0.09606 4406 -2.80 0.0052 0.05<br /> b2 0.3223 0.05980 4406 5.39 <.0001 0.05<br /> b3 0.1391 0.02195 4406 6.34 <.0001 0.05<br /> b4 0.1040 0.02789 4406 3.73 0.0002 0.05<br /> b5 0.07254 0.03383 4406 2.14 0.0321 0.05<br /> b6 -0.00039 0.004641 4406 -0.08 0.9331 0.05<br /> b7 0.04292 0.04216 4406 1.02 0.3087 0.05<br /> tau -1.8406 0.4585 4406 -4.01 <.0001 0.05</code><br /><br />觀測機率與預測機率比較圖:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111682570_6438680" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="348" src="http://www.diigo.com/item/t/2043829_111682570_6438680" width="640" /></a></div><br />另外,作者提到一個名為 Vuong test 的檢定方法,可以提供數值化的證據來比較 ZIP 模型和其他模型的差異。最主要的用意是,我們已經可以從數個比較圖發現,無論是負二項模型、跨欄模型還是 ZIP 模型,表現的都比卜瓦松模型要好,但如果有個比較科學的證據來證明,而不是主觀用圖形來判定的話,會比較能夠信服。Vuong test的算法如下:<br /><br /><div style="text-align: center;"><a href="http://www.codecogs.com/eqnedit.php?latex=\large%20m_{i}=log(\frac{P_{1}(Y_{i}|X_{i})}{P_{2}(Y_{i}|X_{i})})" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\large m_{i}=log(\frac{P_{1}(Y_{i}|X_{i})}{P_{2}(Y_{i}|X_{i})})" title="\large m_{i}=log(\frac{P_{1}(Y_{i}|X_{i})}{P_{2}(Y_{i}|X_{i})})" /></a></div><br />上式是把每個觀測值用兩個模型都算出預測值出來,然後相除取對數。之後再用這些m值去計算一個V統計量:<br /><br /><div style="text-align: center;"><a href="http://www.codecogs.com/eqnedit.php?latex=\large%20V=\frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n}m_{i})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(m_{i}-\bar{m})^{2}}}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\large V=\frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n}m_{i})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(m_{i}-\bar{m})^{2}}}" title="\large V=\frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n}m_{i})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(m_{i}-\bar{m})^{2}}}" /></a></div><br />當 V > 1.96 時,就表示第一個模型比較好,當 V < -1.96 時,就表示第二個模型比較好。若介於 -1.96 和 1.96 之間,就表示兩模型沒有顯著差異。以卜瓦松模型和ZIP模型比較為例,其SAS程式如下:<br /><code>data poi_pred (keep = poi_prob); <br /> set poi_out; /* OUTPUT FROM POISSON REGRESSION */<br /> do i = 0 to 8;<br /> poi_prob = pdf('poisson', i , Yhat);<br /> if hosp = i then output;<br /> end;<br />run;<br /><br />data zip_pred (keep = zip_prob);<br /> merge zip_out1 zip_out2; /* OUTPUT FROM ZIP REGRESSION */<br /> do i = 0 to 8;<br /> if i = 0 then zip_prob = p0 + (1 - p0) * pdf('poisson', i, Yhat);<br /> else zip_prob = (1 - p0) * pdf('poisson', i, Yhat);<br /> if hosp = i then output;<br /> end;<br />run;<br /><br />data compare;<br /> merge poi_pred zip_pred;<br /> m = log(zip_prob / poi_prob);<br />run;<br /><br />proc sql;<br />select<br /> mean(m) as mbar,<br /> std(m) as s,<br /> sqrt(count(*)) * mean(m) / std(m) as v<br />from<br /> compare;<br />quit;</code><br /><br />報表:<br /><code> mbar s v<br /> 0.038138 0.375956 <span style="color: red;">6.733444 </span></code><br /><br />結果顯示 V 統計量為 6.73 大於 1.96,所以表示 ZIP 模型比卜瓦松模型要來的好。<br /><br /><b><span style="font-size: large;">。比較</span></b><br />把所有估計參數值並排在一起比較:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.diigo.com/item/t/2043829_111685699_6438913" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="268" src="http://www.diigo.com/item/t/2043829_111685699_6438913" width="640" /></a></div><br />從 AIC 這個最直接的證據來看,都可以顯示出直接用卜瓦松模型去估計參數是最不適合的。而負二項模型擁有最小的 AIC 值,所以最好。<br /><b><br /></b><br /><b>CONTACT INFORMATION </b><br />Wensui Liu, Statistical Project Manager<br />ChoicePoint Precision Marketing, Alpharetta, GA<br />Email: <!-- e --><a href="mailto:wensui.liu@choicepoint.com">wensui.liu@choicepoint.com</a><!-- e --><br /><br />Jimmy Cela, AVP<br />ChoicePoint Precision Marketing, Alpharetta, GA<br />Email: <!-- e --><a href="mailto:jimmy.cela@choicepoint.com">jimmy.cela@choicepoint.com</a><!-- e --><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6268919072942670865-7910254849827852366?l=sugiclub.blogspot.com' alt='' /></div> |
|