SAS中文论坛

标题: 请高手过招 [打印本页]

作者: shiyiming    时间: 2012-2-16 09:23
标题: 请高手过招
假设一个数据集有两个变量:child 和 parent
DATA sample;
INPUT child $ parent $;
DATALINES;
a b
c b
d c
e g
f e
;
run;

例如:
a 的 parent 是 b,
c 的 parent 是 b,
d 的 parent 是 c,
e 的 parent 是 g,
f 的 parent 是 e,
等等。。。

现在需要创建第三个变量highest_parent, 使得最终的数据集成为:
DATA sample;
INPUT child $ parent $ highest_parent $;
DATALINES;
a b b
c b b
d c b
e g g
f e g
;
run;

a 的 parent 是 b,highest_parent 是 b
c 的 parent 是 b,highest_parent 是 b
d 的 parent 是 c,但 c 的 parent 是 b, 所以highest_parent 是 b,
相似的
e 的 parent 是 g,highest_parent 是 g
f 的 parent 是 e,highest_parent 是 g
等等。。。
如何用SAS实现?
作者: shiyiming    时间: 2012-2-16 13:59
标题: Re: 请高手过招
[code:3a6kqgig]
proc sql noprint;
create table sample2 as
select s1.child,s1.parent,coalesce(s2.hp,s1.parent) as hp
from sample as s1 left join sample(rename=(parent=hp)) as s2
on s1.parent=s2.child
order by 1;
quit;
[/code:3a6kqgig]
作者: shiyiming    时间: 2012-2-16 16:40
标题: Re: 请高手过招
过招似乎是比武大擂台的意思?
作者: shiyiming    时间: 2012-2-16 16:43
标题: Re: 请高手过招
1. Use breadth first algorithm until you find the deepest.
2. you don't have to rename at first inorder to avoid name conflict and data overwite in sql. this method is only used in data step.
[quote="zhangzachary":13uv3f5j][code:13uv3f5j]
proc sql noprint;
create table sample2 as
select s1.child,s1.parent,coalesce(s2.hp,s1.parent) as hp
from sample as s1 left join sample(rename=(parent=hp)) as s2
on s1.parent=s2.child
order by 1;
quit;
[/code:13uv3f5j][/quote:13uv3f5j]
作者: shiyiming    时间: 2012-2-16 17:21
标题: Re: 请高手过招
[quote:cv0ke04g]e 的 parent 是 c,
...
e 的 parent 是 g, [/quote:cv0ke04g]
e的parent是g还是b?
作者: shiyiming    时间: 2012-2-16 22:40
标题: Re: 请高手过招
[quote="jeozu":39plzahb]1. Use breadth first algorithm until you find the deepest.
2. you don't have to rename at first inorder to avoid name conflict and data overwite in sql. this method is only used in data step.
[quote="zhangzachary":39plzahb][code:39plzahb]
proc sql noprint;
create table sample2 as
select s1.child,s1.parent,coalesce(s2.hp,s1.parent) as hp
from sample as s1 left join sample(rename=(parent=hp)) as s2
on s1.parent=s2.child
order by 1;
quit;
[/code:39plzahb][/quote:39plzahb][/quote:39plzahb]

嗯,对,好像不用rename,直接用s2.parent就行……我比较习惯用data step, 所以在想用SQL解决一些问题的时候容易混淆一些东西……
对于第一条,意思是要找更好的算法?还是要研究透了把最优的东西写出来?~ 我只是刚入门,第一目的是解决问题,优化算法等还需要多多学习啊~~
作者: shiyiming    时间: 2012-2-17 03:12
标题: Re: 请高手过招
[quote="hopewell":tfwi7vu5][quote:tfwi7vu5]e 的 parent 是 c,
...
e 的 parent 是 g, [/quote:tfwi7vu5]
e的parent是g还是b?[/quote:tfwi7vu5]

sorry for your confusion. corrections have been made. actually the sample data set contains a hierarchy structure
作者: shiyiming    时间: 2012-2-17 03:21
标题: Re: 请高手过招
[quote="zhangzachary":2ic1wsjg][code:2ic1wsjg]
proc sql noprint;
create table sample2 as
select s1.child,s1.parent,coalesce(s2.hp,s1.parent) as hp
from sample as s1 left join sample(rename=(parent=hp)) as s2
on s1.parent=s2.child
order by 1;
quit;
[/code:2ic1wsjg][/quote:2ic1wsjg]

After you add two more data lines, for example,
DATA sample;
INPUT child $ parent $;
DATALINES;
a b
c b
d c
e g
f e
[b:2ic1wsjg]t d
s t[/b:2ic1wsjg];
run;

it doesn't work.
For the last row, s->t , t->d, d->c, c->b, so the highest_parent for s is b.
作者: shiyiming    时间: 2012-2-17 10:04
标题: Re: 请高手过招
[code:kmi4n4ym]data regular;
    set sample;
    rename child=c parent=highest_parent;
run;
data out(drop=rc c);
    length c highest_parent $8;
    if _n_=1 then do;
        declare hash h(dataset:'regular');
        rc=h.defineKey('c');
        rc=h.defineData('highest_parent');
        rc=h.defineDone();
        call missing(c,highest_parent);
    end;
    set sample;
    highest_parent=parent;
    do while(h.find(key:highest_parent)=0);
    end;
run;[/code:kmi4n4ym]
作者: shiyiming    时间: 2012-2-17 13:47
标题: Re: 请高手过招
[quote="jerrya00":3dy4xkum][quote="zhangzachary":3dy4xkum][code:3dy4xkum]
proc sql noprint;
create table sample2 as
select s1.child,s1.parent,coalesce(s2.hp,s1.parent) as hp
from sample as s1 left join sample(rename=(parent=hp)) as s2
on s1.parent=s2.child
order by 1;
quit;
[/code:3dy4xkum][/quote:3dy4xkum]

After you add two more data lines, for example,
DATA sample;
INPUT child $ parent $;
DATALINES;
a b
c b
d c
e g
f e
[b:3dy4xkum]t d
s t[/b:3dy4xkum];
run;

it doesn't work.
For the last row, s->t , t->d, d->c, c->b, so the highest_parent for s is b.[/quote:3dy4xkum]
好吧~ 的确没有考虑到位~ 以后写程序的确要了解透问题,考虑全面才行啊~
hopewell大大的 do while 用得很精彩,学习了!~
作者: shiyiming    时间: 2012-2-17 22:57
标题: Re: 请高手过招
pf! Using hash searching is absolutely right solution to sovle this sort of problem!


[quote="hopewell":znd11mcn][code:znd11mcn]data regular;
    set sample;
    rename child=c parent=highest_parent;
run;
data out(drop=rc c);
    length c highest_parent $8;
    if _n_=1 then do;
        declare hash h(dataset:'regular');
        rc=h.defineKey('c');
        rc=h.defineData('highest_parent');
        rc=h.defineDone();
        call missing(c,highest_parent);
    end;
    set sample;
    highest_parent=parent;
    do while(h.find(key:highest_parent)=0);
    end;
run;[/code:znd11mcn][/quote:znd11mcn]
作者: shiyiming    时间: 2012-2-18 22:53
标题: Re: 请高手过招
深度优先~~跟Oracle的start with ... connect by相似的思路。
精妙的hash,学习了。

我原来都是用广度优先,感觉数据量大的时候不受内存影响,且速度较快,loop的次数有限。看来我的思路都限制在大数据上了。

[quote="hopewell":2rw35q2y][code:2rw35q2y]data regular;
    set sample;
    rename child=c parent=highest_parent;
run;
data out(drop=rc c);
    length c highest_parent $8;
    if _n_=1 then do;
        declare hash h(dataset:'regular');
        rc=h.defineKey('c');
        rc=h.defineData('highest_parent');
        rc=h.defineDone();
        call missing(c,highest_parent);
    end;
    set sample;
    highest_parent=parent;
    do while(h.find(key:highest_parent)=0);
    end;
run;[/code:2rw35q2y][/quote:2rw35q2y]




欢迎光临 SAS中文论坛 (http://www.mysas.net/forum/) Powered by Discuz! X3.2