presidio-structured misidentifies email as URL #1316
Comments
I've also encountered this issue. The issue mainly stems from the Observed behavior:
The emails are accurately recognised but are outnumbered by the URL identifications due to their higher frequency, despite lower confidence levels. I would like to suggest two potential improvements:
I'm keen to contribute to making these improvements and would love to work on refining the logic. Any thoughts or feedback on these suggestions would be greatly appreciated!
Thanks for the feedback! The URL recognizer also detects parts of emails (e.g. microsoft.com is a URL inside [email protected]), which makes it detect more URLs than emails.
I think a good way forward here would be to let the user decide on a strategy for selecting the entity. In some cases we would want the entity detected in the majority of cases, in others the one with the highest confidence, and in others a mix of the two (e.g. the most common entity, if confidence > 0.5).
A quick fix could be to update the structured analysis once it is finalized, in case the column's name is "email" but the detected entity is actually "URL". If you're interested in creating a PR, I'd be happy to review it and discuss.
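To make the strategies above concrete, here is a rough sketch in plain Python. It is not presidio-structured code: `select_entity`, `apply_column_name_hint`, the strategy names, and the scores are all hypothetical and only illustrate the idea. Per-column detections are assumed to be available as `(entity_type, score)` pairs, and `EMAIL_ADDRESS` is used as the email entity label.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Assumption: each detection in a column is an (entity_type, score) pair.
Detection = Tuple[str, float]


def select_entity(
    detections: List[Detection],
    strategy: str = "most_common",
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Pick one entity type for a column (hypothetical helper)."""
    if not detections:
        return None
    if strategy == "highest_confidence":
        # Entity type of the single highest-scoring detection.
        return max(detections, key=lambda d: d[1])[0]
    if strategy == "mixed":
        # Most common entity, counting only detections above the threshold.
        detections = [d for d in detections if d[1] > min_confidence]
        if not detections:
            return None
    elif strategy != "most_common":
        raise ValueError(f"unknown strategy: {strategy}")
    counts = Counter(entity for entity, _ in detections)
    return counts.most_common(1)[0][0]


def apply_column_name_hint(
    entity_mapping: Dict[str, str], hints: Dict[str, str]
) -> Dict[str, str]:
    """Quick fix: override a 'URL' result when the column name hints otherwise."""
    fixed = dict(entity_mapping)
    for column, entity in entity_mapping.items():
        expected = hints.get(column.lower())
        if expected is not None and entity == "URL":
            fixed[column] = expected
    return fixed


# Illustrative scores only: each email also yields two lower-confidence URL hits.
email_column = [
    ("EMAIL_ADDRESS", 1.0), ("URL", 0.5), ("URL", 0.5),
    ("EMAIL_ADDRESS", 1.0), ("URL", 0.5), ("URL", 0.5),
]
by_count = select_entity(email_column, "most_common")
by_score = select_entity(email_column, "highest_confidence")
by_mixed = select_entity(email_column, "mixed", min_confidence=0.5)

mapping = {"name": "PERSON", "email": "URL", "city": "LOCATION", "state": "LOCATION"}
patched = apply_column_name_hint(mapping, {"email": "EMAIL_ADDRESS"})
```

With these made-up numbers, counting alone picks `URL`, while the confidence-based and mixed strategies pick `EMAIL_ADDRESS`. The column-name hint only rewrites a column whose result is `URL`, so other mappings are left untouched.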
Presidio-structured incorrectly identifies an email address as a URL within the extracted entities. This can be observed in the following example output:
StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'URL', 'city': 'LOCATION', 'state': 'LOCATION'})
The value in the "email" column ("[email protected]") is mistakenly identified as a URL ("URL") instead of an email address ("EMAIL") during entity extraction.