Question

Text extraction - language issue

Hi There,

I am defining an Entity Extraction with a Ruta script. I defined my entity with support for ALL languages.

My rule detects a pattern like 12-1234-1234567-12. This works fine (the entity is detected in the test) only if I add some words to the test input. If I test just the pattern, the entity is not recognized.

As an example:

Test 1: "12-1234-1234567-12" -> entity not detected

Test 2: "Account 12-1234-1234567-12" -> entity detected

This Entity Extraction will be used in an email channel, and users might just send an email with the bank account number and no other words.

Let me know if it's clear or you need more information.

Cheers, Giovanni

Correct Answer
October 11, 2019 - 6:13am

Yes, that is the expected behavior. We cannot expect natural language processing to detect a language from numbers alone. The input has to be a semantic statement in a given language.

However, there is a workaround for your problem. In your case, the Text Analyzer fails to detect a language. You can force the Text Analyzer to fall back to a specific language when the language is undetected. This setting is on the 'Advanced' tab of the Text Analyzer: Enable fallback language if the language is undetected.
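
To illustrate why the fallback helps, here is a minimal Python sketch of the behavior described above. It is not Pega or Ruta code: `detect_language`, `extract_accounts`, and `ACCOUNT_RE` are hypothetical names, the detector is a toy stand-in that only checks for letters, and the regex is an assumption based on the sample value 12-1234-1234567-12.

```python
import re
from typing import List, Optional

# Assumed shape of the account number: 2-4-7-2 digits separated by hyphens.
ACCOUNT_RE = re.compile(r"(?<!\d)\d{2}-\d{4}-\d{7}-\d{2}(?!\d)")


def detect_language(text: str) -> Optional[str]:
    """Toy stand-in for language detection (hypothetical).

    Returns None when the text contains no letters, mimicking a detector
    that has nothing to work with when the input is digits only.
    """
    return "en" if any(ch.isalpha() for ch in text) else None


def extract_accounts(text: str, fallback_language: Optional[str] = None) -> List[str]:
    """Run the entity pattern only when a language is known."""
    language = detect_language(text) or fallback_language
    if language is None:
        # No language and no fallback: the extraction step is skipped.
        return []
    return ACCOUNT_RE.findall(text)


print(extract_accounts("12-1234-1234567-12"))
print(extract_accounts("Account 12-1234-1234567-12"))
print(extract_accounts("12-1234-1234567-12", fallback_language="en"))
```

In this sketch the first call finds nothing because no language is resolved, while the second and third calls find the account number. The pattern itself matches in all three inputs; only the language step differs, which is what the fallback setting addresses.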

Comments

October 9, 2019 - 8:09am

Hi Giovanni,

Have you checked this Pega community article on "Creating Entity extraction rules for text analytics":

https://community.pega.com/knowledgebase/articles/decision-management-overview/creating-entity-extraction-rules-text-analytics

Hope this helps!

Cheers,

Pega
October 9, 2019 - 6:44pm
Response to Santhosh_Holla

Thanks for your reply.

Yes, I saw that document, although it's from v7 (2017) and a few things might be different in v8.

My Entity Extraction works if the piece of content (e.g. an email) containing my entity also has some other words. In that case the engine recognizes a language and extracts my entity.

But if the piece of content has just my entity (e.g. 12-1234-1234567-12), then the engine doesn't identify the language and doesn't extract my entity.

Is this the correct behaviour? Is there a way to work around this situation?

Cheers.

Pega
October 25, 2019 - 12:55am
Response to Vikas@Multichannel

Thank you very much.