presidio-structured misidentifies email as URL #1316

ardband · 2024-02-28T04:02:27Z

Presidio-structured incorrectly identifies an email address as a URL within the extracted entities. This can be observed in the following example output:

StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'URL', 'city': 'LOCATION', 'state': 'LOCATION'})

The value in the "email" column ("[email protected]") is mistakenly identified as a URL ("URL") instead of an email address ("EMAIL_ADDRESS") during entity extraction.

miltonsim · 2024-02-28T09:10:13Z

I've also encountered this issue.

The issue mainly stems from the _find_most_common_entity() method, which picks the entity with the highest count. For the email addresses in test_structured.csv, the URL detections outnumber the email detections, albeit with low confidence.

Observed behavior:

Entity Count: {'URL': 6, 'EMAIL_ADDRESS': 3}
Confidence Scores: {'EMAIL_ADDRESS': [1.0, 1.0, 1.0], 'URL': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}

The emails are recognised accurately, but the URL detections are more numerous and therefore win the count, despite their lower confidence.
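To make the effect concrete, here is a minimal sketch that replays the numbers above with a plain counter. It does not call Presidio; the `detections` list is just the observed counts and scores written out by hand.

```python
from collections import Counter

# Detections as reported above, as (entity_type, confidence) pairs.
detections = [("EMAIL_ADDRESS", 1.0)] * 3 + [("URL", 0.5)] * 6

# Count-only selection: confidence is ignored entirely.
counts = Counter(entity for entity, _ in detections)
most_common_entity = counts.most_common(1)[0][0]

print(counts)
print(most_common_entity)
```

Because only the counts are compared, the six low-confidence URL hits beat the three high-confidence email hits.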

I would like to suggest two potential improvements:

1. Adapting _find_most_common_entity() to Consider Confidence Scores: It might be beneficial to adjust the method to account for the actual confidence scores provided by the recognizer results.
2. Enhancing the URL Recognizer: Improving the recognizer's ability to differentiate between URLs and email addresses could help reduce this type of misidentification.

I'm keen to contribute to making these improvements and would love to work on refining the logic. Any thoughts or feedback on these suggestions would be greatly appreciated!

omri374 · 2024-02-28T13:27:24Z

Thanks for the feedback! The URL recognizer detects parts of emails as well (e.g. microsoft.com is a URL inside [email protected]), which makes it detect more URLs than emails.
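A rough illustration of that overlap, and of one possible way to filter it out, is sketched below. The two regular expressions are simplified stand-ins and are not Presidio's actual patterns; the address `someone@microsoft.com` and the helper names `find_spans` and `drop_contained` are made up for the example.

```python
import re

# Simplified, hypothetical patterns (not the ones Presidio's recognizers use).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DOMAIN_RE = re.compile(r"\b(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}\b")


def find_spans(pattern, text, entity):
    """Return (entity, start, end) for every match of pattern in text."""
    return [(entity, m.start(), m.end()) for m in pattern.finditer(text)]


def drop_contained(urls, emails):
    """Drop URL spans that lie entirely inside an email span."""
    return [
        u for u in urls
        if not any(e[1] <= u[1] and u[2] <= e[2] for e in emails)
    ]


text = "someone@microsoft.com"  # hypothetical example address
emails = find_spans(EMAIL_RE, text, "EMAIL_ADDRESS")
urls = find_spans(DOMAIN_RE, text, "URL")
filtered = drop_contained(urls, emails)

print(emails)
print(urls)
print(filtered)
```

In this sketch the domain part of the address is matched as a URL on its own, and the containment check removes it because the span sits inside the email span. A standalone domain with no surrounding email would be kept.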

I think that a good way forward here would be to allow the user to decide on a strategy for selecting the entity. In some cases, we would want the entity with the majority of cases, in others we'd like the one that has the highest confidence, and in others we might want a mix of the two (e.g. most common entity, if confidence > 0.5).
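One way the three strategies could look is sketched here in plain Python. The function name `select_entity`, the strategy names, and the `threshold` parameter are assumptions for illustration and are not taken from the Presidio code base. The "mixed" branch also assumes two things the comment above leaves open: that "confidence" means the mean score of the most common entity, and that the fallback is the highest-confidence entity.

```python
from collections import defaultdict


def select_entity(detections, strategy="most_common", threshold=0.5):
    """Pick one entity type from (entity, confidence) pairs.

    Hypothetical helper; names and defaults are illustrative only.
    """
    scores = defaultdict(list)
    for entity, confidence in detections:
        scores[entity].append(confidence)
    if not scores:
        return None

    def most_common():
        return max(scores, key=lambda e: len(scores[e]))

    def highest_confidence():
        return max(scores, key=lambda e: max(scores[e]))

    if strategy == "most_common":
        return most_common()
    if strategy == "highest_confidence":
        return highest_confidence()
    if strategy == "mixed":
        # Assumption: keep the most common entity only if its mean
        # confidence exceeds the threshold, else fall back.
        candidate = most_common()
        mean = sum(scores[candidate]) / len(scores[candidate])
        return candidate if mean > threshold else highest_confidence()
    raise ValueError(f"unknown strategy: {strategy}")


# The counts and scores reported earlier in this thread.
detections = [("EMAIL_ADDRESS", 1.0)] * 3 + [("URL", 0.5)] * 6

by_count = select_entity(detections, "most_common")
by_confidence = select_entity(detections, "highest_confidence")
by_mixed = select_entity(detections, "mixed", threshold=0.5)
print(by_count, by_confidence, by_mixed)
```

With the data from this issue, only the count-based strategy returns URL; the other two return EMAIL_ADDRESS, since the URL scores do not exceed 0.5.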

A quick fix could be to update the structured analysis once finalized, in case the column's name is "email" but the detection is actually "URL".
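As a sketch of that quick fix, the snippet below patches a plain dictionary shaped like the entity_mapping shown at the top of this issue. The helper `fix_email_columns` is hypothetical, and matching on the exact column name "email" (case-insensitively) is an assumption; real data may need a looser rule.

```python
def fix_email_columns(entity_mapping):
    """Return a copy where 'email' columns detected as URL become EMAIL_ADDRESS.

    Hypothetical post-processing step, applied after the analysis is finalized.
    """
    return {
        column: (
            "EMAIL_ADDRESS"
            if column.lower() == "email" and entity == "URL"
            else entity
        )
        for column, entity in entity_mapping.items()
    }


# The mapping reported in the original post.
mapping = {"name": "PERSON", "email": "URL", "city": "LOCATION", "state": "LOCATION"}
fixed = fix_email_columns(mapping)
print(fixed)
```

The original dictionary is left unchanged, and all other columns keep their detected entity.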

If you're interested in creating a PR, I'd be happy to review it and discuss.

miltonsim mentioned this issue Feb 29, 2024

feat: Implement user-defined entity selection strategies in Presidio Structured #1319

Merged

5 tasks
