Help needed: regular expressions on Chinese-language descriptions

shiyiming · posted 2010-4-2 13:05:04

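Given a set of keywords, I want to find the observations whose Chinese description contains one of those keywords.
/* Create the test data set */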
data data_src;
length desc $300;
input desc;
cards;
程斌因癌症而到医院治疗
;
run;

data test;
set data_src;
if _N_ = 1 then do;
retain patternID1;
retain patternID2;
pattern1 = "/瘫/";
pattern2 = "/癌/";
patternID1 = prxparse(pattern1);
patternID2 = prxparse(pattern2);
if prxmatch(patternID1,desc)^=0 then do;
call prxsubstr(patternID1,desc, position, length);
if position ^= 0 then do;
match = substr(desc, position, length);
end;
end;
else if prxmatch(patternID2,desc)^=0 then do;
call prxsubstr(patternID2,desc, position, length);
if position ^= 0 then do;
match = substr(desc, position, length);
end;
end;
run;
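Expected result: match='癌'
Actual result: match='瘫'.
My initial diagnosis is that each Chinese character is being split into two 8-bit bytes: the second byte of '程' followed by the first byte of '斌' happens to be exactly the two-byte encoding of '瘫'.
Can anyone help me fix this? Thanks very much.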
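
The byte-level diagnosis above can be checked outside SAS. The sketch below uses Python purely as a byte inspector (it is not SAS code) and assumes the session encoding is a GBK-family code page such as MS-936. The helper `find_on_char_boundary` is a hypothetical name introduced here for illustration; it shows one way to reject matches that start in the middle of a double-byte character.

```python
# Illustration only: Python is used to inspect bytes; this is not SAS code.
# Assumption: the SAS session stores text in a GBK-family encoding (e.g. MS-936).
text = "程斌因癌症而到医院治疗"
raw = text.encode("gbk")

tan = "瘫".encode("gbk")   # two bytes
ai = "癌".encode("gbk")    # two bytes

# A byte-oriented search (what a byte-based regex engine effectively does)
# finds the bytes of 瘫 straddling 程 and 斌, at byte offset 1.
byte_pos = raw.find(tan)

# A character-oriented search on the decoded string finds no 瘫 at all.
char_pos = text.find("瘫")  # -1


def find_on_char_boundary(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle only if it starts on a GBK
    character boundary; return -1 if there is no such match."""
    i = 0
    while i < len(haystack):
        if haystack.startswith(needle, i):
            return i
        # In GBK, a byte in 0x81-0xFE is a lead byte of a two-byte character;
        # anything else is treated here as a single-byte character.
        i += 2 if 0x81 <= haystack[i] <= 0xFE else 1
    return -1


print(byte_pos, char_pos)
print(find_on_char_boundary(raw, tan), find_on_char_boundary(raw, ai))
```

Under that encoding assumption, the byte search reports a false hit for 瘫 while the boundary-aware search reports none and still locates 癌. This is consistent with the suspicion that the pattern is being matched byte by byte rather than character by character.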

shiyiming · posted 2010-4-2 14:09:01

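I don't know regular expressions, so I have no suggestions for your program itself, but the do group after if _n_=1 appears to be missing its end.
If the goal is simply to look for keywords in Chinese text, you could try the NLS function kindex():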
[code:3hcvieq7]data data_src;
length desc $300;
input desc;
datalines;
程斌因癌症而到医院治疗
某某脑淤血瘫痪
某某骨折
某某因癌症瘫痪
;

data test;
length match $20;
set data_src;
if kindex(desc,'瘫') then match=strip(catx(' ',match,'瘫'));
if kindex(desc,'癌') then match=strip(catx(' ',match,'癌'));
run;[/code:3hcvieq7]

shiyiming · posted 2010-4-2 14:41:27

This is really a character-set problem. Could data test(encoding="ms-936"); be changed to an encoding that is suited to reading Chinese?
