SAS中文论坛

 找回密码
 立即注册

扫一扫,访问微社区

查看: 962|回复: 0
打印 上一主题 下一主题

A test to count missing values for large data

[复制链接]

49

主题

76

帖子

1462

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
1462
楼主
 楼主| 发表于 2011-9-19 23:36:06 | 只看该作者

A test to count missing values for large data

From Dapangmao's blog on sas-analysis

<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-9P4IRzHAraU/TnddcKSfvQI/AAAAAAAAAwI/MEs_cB2EoVo/s1600/Presentation1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://1.bp.blogspot.com/-9P4IRzHAraU/TnddcKSfvQI/AAAAAAAAAwI/MEs_cB2EoVo/s400/Presentation1.jpg" width="400" /></a></div>This morning Rick introduced <a href="http://blogs.sas.com/content/iml/2011/09/19/count-the-number-of-missing-values-for-each-variable/">how to count the missing frequencies for all variables in a dataset</a>, including character and numeric variables. He provided two solutions by either PROC FREQ or PROC IML. I have a petty macro based on PROC SQL’s nmiss() function to do the same job.  In this big data era, I am interested in those SAS codes’ efficiencies to check large data. <br />
<br />
Then I simulated a 1e8 size dataset (5 numerical and 5 character variables) and tested the 3 methods. According to the results, PROC FREQ is slightly better than PROC SQL in both memory usage and speed. However, a macro may be needed to integrate a number of output tables by PROC FREQ for reporting <a href="http://support.sas.com/kb/44/124.html">(the demo macro</a> by SAS Knowledge Base is primitive). Sadly, PROC IML used out all the 250MB memories the server allocated to me and can’t finish this test. Similarly, the memory volume should be considered for other matrix languages like R and Matlab. <br />
<pre style="background-color: #ebebeb; border: 1px dashed rgb(153, 153, 153); color: #000001; font-size: 14px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>
**********************(1) SIMULATION STEP**************************************;
data test;
   array _c{*} $ c1-c5;
   array _n{*} n1-n5;
    do i = 1 to 1e8;
       do j = 1 to 5;
          if ranuni(1234)&gt;0.2 then _c[j]= 'A';
          else call missing(_c[j]);
        if ranuni(4321)&gt;0.3 then _n[j] = rannor(0);
          else call missing(_n[j]);
       end;
       output;
       drop i j;
    end;
run;

**********************(2) MODULE-BUILDING STEP*********************************;
%macro countmiss(data = );
   ods listing close;
   ods output variables = _varlist;
   proc contents data = &amp;data;
   run;
   proc sql;
      alter table _varlist
      add nmiss num
      ;
      select count(*) into :nvar
      from _varlist
      ;
      select count(*) into :nobs
      from &amp;data
      ;
   quit;
   %do i = 1 %to &amp;nvar;
      proc sql;
         select cats('nmiss(', variable, ')') into :miss_var
         from _varlist
         where num = &amp;i
         ;
         select &amp;miss_var into: miss_num
         from &amp;data
         ;
         update _varlist
         set nmiss = &amp;miss_num
         where num = &amp;i
         ;
      quit;
   %end;
   proc sort data = _varlist;
     by num;
   run;
   ods listing;
   proc report data = _varlist nowd split = '*';
      columns  type num variable label nmiss pcnmiss;
      define type / group;
      define variable / width = 15;
      define nmiss / display 'Missing*Number';
      define pcnmiss / computed format = percent8.3
                  'Missing*Percentage' width = 10;
      compute pcnmiss;
         pcnmiss = nmiss / &amp;nobs;
      endcomp;
   run;
%mend countmiss();

proc format;
value $missfmt ' '='Missing' other='Not Missing';
value  missfmt  . ='Missing' other='Not Missing';
run;

**********************(3) TESTING STEP***************************************;
options fullstimer;
*******************METHOD(1): PROC SQL***************************************;
%countmiss(data = test);

*******************METHOD(2): PROC FREQ*************************************;
proc freq data = TEST;
   format _CHAR_ $missfmt.;
   tables _CHAR_ / missing missprint nocum nopercent;
   format _NUMERIC_ missfmt.;
   tables _NUMERIC_ / missing missprint nocum nopercent;
run;
*******************METHOD(3): PROC IML***************************************;
proc iml;
   use test;
   read all var _NUM_ into x[colname=nNames];
   n = countn(x,"col");
   nmiss = countmiss(x,"col");
   read all var _CHAR_ into x[colname=cNames];
   close one;
   c = countn(x,"col");
   cmiss = countmiss(x,"col");
   Names = cNames || nNames;
   rNames = {"    Missing", "Not Missing"};
   cnt = (cmiss // c) || (nmiss // n);
   print cnt[r=rNames c=Names label=""];
quit;
********************END OF ALL CODING*****************************************;
</code></pre><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3256159328630041416-6164289802432737186?l=www.sasanalysis.com' alt='' /></div><img src="http://feeds.feedburner.com/~r/SasAnalysis/~4/mAA92t2WqF0" height="1" width="1"/>
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

QQ|小黑屋|手机版|Archiver|SAS中文论坛  

GMT+8, 2025-5-7 23:12 , Processed in 0.068717 second(s), 19 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.

快速回复 返回顶部 返回列表