Question

NLP With Ruta Script

I have created a Decision Data rule for entity extraction and am performing NLP with a RUTA script in Pega. My requirement is to extract a policy number from an email.

S represents an alphanumeric character; A represents a numeric character.

The policy number has the following formats:
1) With hyphens: SS-SSSSSSS-AAA
2) Without hyphens (space-separated): SS SSSSSSS AAA
3) Without spaces or hyphens: SSSSSSSSSAAA
4) Optionally, the policy number can also be prefixed with 1, so 1SS-SSSSSSS-AAA, 1SS SSSSSSS AAA and 1SSSSSSSSSAAA are valid combinations as well.

So the policy number has three parts: the first part has length 2 (SS), the second part has length 7 (SSSSSSS) and the third part has length 3 (AAA). An optional "1" can be prefixed to the policy number as a fourth part.

I have written a script for this, but it is not working for the combinations in which the policy number is prefixed with 1.

Below is code from script:

  PACKAGE uima.ruta.example;
  Document{-> RETAINTYPE(SPACE)};

  DECLARE VarA;
  DECLARE VarC;
  DECLARE VarE;

  ("1")? W{REGEXP(".{2}")} ("-"|SPACE)? ((W* NUM* W* NUM* W* NUM* W*)|(NUM* W* NUM* W* NUM* W* NUM*)){REGEXP(".{7}")} ("-"|SPACE)? W{REGEXP(".{3}")->MARK(EntityType,1,6)};

  (W* NUM*){REGEXP(".{2}")} ("-"|SPACE)? ((W* NUM* W* NUM* W* NUM* W*)|(NUM* W* NUM* W* NUM* W* NUM*)){REGEXP(".{7}")} ("-"|SPACE)? W{REGEXP(".{3}")->MARK(EntityType,1,5)};

  ((W|NUM)(NUM|W)*){REGEXP("(?i)\\b[1]{0,1}[A-Z0-9]{2}[A-Z0-9]{7}[A-Z]{3}\\b" )->MARK(EntityType)};

Valid policy numbers: AB-CD123EF-GHI, 1AB-CD123EF-GHI, ABCD123EFGHI, 23-456ABC7-GHI, 123-456ABC7-GHI, 1A3-456ABC7-GHI, 12A-456ABC7-GHI, etc.

I am not able to handle the combinations 123-456ABC7-GHI, 1A3-456ABC7-GHI and 12A-456ABC7-GHI.

Please help me write a correct script that covers all possible combinations. Thanks in advance.

Correct Answer
April 13, 2019 - 8:40am

The UIMA Ruta seed annotations NUM and W each cover a whole number or a whole word. Therefore, digit runs such as 
23456 or 123456 cannot be split into sub-annotations by Ruta.
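
To make the point above concrete, here is a rough Python stand-in for the seed tokenization. It is only an approximation written for illustration (the regex and the name seed_tokens are assumptions, not Ruta's actual lexer): a run of letters plays the role of W, a run of digits plays the role of NUM, and any other non-space character is its own token.

```python
import re

# Approximation only (assumed, not Ruta's real lexer):
# letter run ~ W, digit run ~ NUM, any other non-space character ~ its own token.
SEED_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def seed_tokens(text):
    """Split text into maximal letter runs, digit runs and single symbols."""
    return SEED_TOKEN.findall(text)

for sample in ("1AB-CD123EF-GHI", "123-456ABC7-GHI", "12A-456ABC7-GHI"):
    print(sample, "->", seed_tokens(sample))
```

Under this approximation, 1AB starts with a separate "1" token, whereas 123 and 12 are single digit runs, so a rule element that expects a literal "1" followed by a two-character word has no token boundary to match against.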
A solution would be to use a pure regexp rule to annotate all the mentioned examples:

"\\w{2,3}[\\-|\\s]?\\w{2,3}" -> EntityType;

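The one-line rule above can span at most seven characters, so it would not cover a full policy number on its own. A fuller pattern for the format in the question might look like the sketch below, written in Python so it can be tried outside Pega. The layout is an assumption read off the question's examples and its own regex: the third part is treated as three letters ([A-Z]{3}), even though the legend calls A numeric, because every example ends in GHI. The names POLICY_NUMBER and find_policy_numbers are hypothetical.

```python
import re

# Assumed layout (taken from the question's examples, not a confirmed spec):
#   optional "1", 2 alphanumerics, optional "-" or space,
#   7 alphanumerics, optional "-" or space, 3 letters.
POLICY_NUMBER = re.compile(
    r"\b1?[A-Z0-9]{2}[- ]?[A-Z0-9]{7}[- ]?[A-Z]{3}\b",
    re.IGNORECASE,
)

def find_policy_numbers(text):
    """Return every substring of text that fits the assumed policy layout."""
    return POLICY_NUMBER.findall(text)

email = "Please renew 123-456ABC7-GHI and 1A3-456ABC7-GHI before Friday."
print(find_policy_numbers(email))
```

If the pattern behaves as intended on real emails, it could presumably be placed in a Ruta regexp rule of the same form as the one above, with each backslash doubled inside the string literal. Because a space is accepted as a separator, a short word directly before a seven-character token could produce a false match, so checking against sample emails is advisable.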