feat: Implement user-defined entity selection strategies in Presidio Structured #1319
Conversation
Thanks @miltonsim! I left a few minor comments.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Please also add a note to the docs about the strategies: https://github.com/microsoft/presidio/blob/main/docs/structured/index.md
@omri374, thanks for your feedback! I've incorporated the requested changes. However, I encountered an issue where my newly added test cases failed in the Azure Pipeline. The assertion assert Could you shed some light on this discrepancy and how I can resolve it?
Thanks for the updates @miltonsim! I'm not sure where the discrepancy is coming from. It could be the seed used for sampling, or a different version of the spaCy model. To reduce noise in unit tests, perhaps it would be easier to choose an entity that is detected with simpler logic, such as a phone number, email address, or credit card number?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Hi @miltonsim, would you be interested in continuing to work on this? I would suggest replacing the non-deterministic entities with others to make sure the tests pass in all environments.
@omri374 Apologies for the delay. I'll provide an update by 14 March (Tues).
Thanks!
'street' column is incorrectly identified as a location by CI/CD pipeline
@omri374 Thank you for waiting. I figured that removing the street column might be the best solution. Please help me run the pipeline to check whether it passes!
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Could this update be pushed to PyPI? Only the original release is available at the moment. https://pypi.org/project/presidio-structured/#history
Hi @elbud, we plan to have a new release in the next couple of weeks.
Change Description
I implemented an additional `selection_strategy` option in the `generate_analysis()` function, allowing users to define their preferred strategy for entity selection. Previously, the function only supported selection based on the most common entity, which could result in a small number of high-confidence entities being overshadowed by a larger number of low-confidence entities. Now, users can choose between three strategies: most common, highest_confidence, or mixed.

I would love to hear feedback and suggestions!
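To illustrate how the three strategies differ, here is a minimal standalone sketch. It is not the code from this PR: the `select_entity` helper, the `(entity_type, score)` tuple format, the strategy string `"most_common"`, and the `mixed_threshold` parameter are all assumptions made for illustration. In particular, the rule used for `mixed` (take the highest-confidence entity if its score exceeds a threshold, otherwise fall back to the most common one) is one plausible reading of the description above.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical representation of per-cell detections for one column:
# each item is (entity_type, confidence_score).
Detection = Tuple[str, float]


def select_entity(
    detections: List[Detection],
    strategy: str = "most_common",
    mixed_threshold: float = 0.5,  # assumed parameter, for illustration only
) -> Optional[str]:
    """Pick a single entity type for a column from its detections."""
    if not detections:
        return None

    # Entity type that appears most often, regardless of score.
    most_common = Counter(entity for entity, _ in detections).most_common(1)[0][0]
    # Entity type of the single highest-scoring detection.
    top_entity, top_score = max(detections, key=lambda d: d[1])

    if strategy == "most_common":
        return most_common
    if strategy == "highest_confidence":
        return top_entity
    if strategy == "mixed":
        # Assumed rule: trust the top score only if it clears the threshold.
        return top_entity if top_score > mixed_threshold else most_common
    raise ValueError(f"Unknown selection strategy: {strategy!r}")


# Three low-confidence PERSON hits versus one high-confidence EMAIL_ADDRESS hit.
sample = [
    ("PERSON", 0.30),
    ("PERSON", 0.35),
    ("PERSON", 0.40),
    ("EMAIL_ADDRESS", 1.00),
]
```

With this sample, `most_common` selects `PERSON`, while `highest_confidence` selects `EMAIL_ADDRESS`, which is the overshadowing scenario the description mentions.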
Issue reference
This PR fixes issue #1316
Checklist