Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement user-defined entity selection strategies in Presidio Structured #1319

Merged
merged 9 commits into from
Mar 20, 2024

Conversation

miltonsim
Copy link
Contributor

@miltonsim miltonsim commented Feb 29, 2024

Change Description

I implemented an additional selection_strategy option in the generate_analysis() function, allowing users to define their preferred strategy for entity selection. Previously, the function only enabled selection based on the most common entity, which could result in scenarios where a small number of high-confidence entities were overshadowed by a larger number of low-confidence entities. Now, users have the flexibility to choose between three strategies: most common, highest_confidence or mixed.

I would love to hear to hear feedback and suggestions!

Issue reference

This PR fixes issue #1316

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@omri374
Copy link
Contributor

omri374 commented Feb 29, 2024

Thanks @mitonsim! Left a few minor comments

@omri374
Copy link
Contributor

omri374 commented Feb 29, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor

omri374 commented Feb 29, 2024

Please also add a note to the docs about the strategies: https://github.com/microsoft/presidio/blob/main/docs/structured/index.md

@miltonsim
Copy link
Contributor Author

@omri374, thanks for your feedback! I've incorporated the requested changes.

However, I encountered an issue where my newly added test cases failed in the Azure Pipeline. The assertion assert key_recognizer_result_map["street"].entity_type == "NON_PII" is not passing because the 'street' column is incorrectly identified as a location, whereas, in my local environment, it's correctly identified as NON_PII.

Could you shed some light on this discrepancy and how I can resolve it?

@omri374
Copy link
Contributor

omri374 commented Mar 3, 2024

Thanks for the updates @miltonsim! I'm not sure where the discrepancy is coming from. It could be the seed used for sampling, or a different version of the spaCy model. To reduce noise in unit tests, perhaps it would be easier to choose an entity detected using a simpler logic? like a phone number, email, credit card?

@omri374
Copy link
Contributor

omri374 commented Mar 5, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor

omri374 commented Mar 14, 2024

Hi @miltonsim would you be interested in continuing to work on this? I would suggest to replace the non-deterministic entities with others to make sure tests pass on all environments.

@miltonsim
Copy link
Contributor Author

@omri374 Apologies for the delay. I'll provide an update by 14 March (Tues)

@omri374
Copy link
Contributor

omri374 commented Mar 14, 2024

Thanks!

'street' column is incorrectly identified as a location by CI/CD pipeline
@miltonsim
Copy link
Contributor Author

@omri374 Thank you for waiting. I figured that removing the street column might be the best solution. Please help me run the pipeline to check if it passes!

@omri374
Copy link
Contributor

omri374 commented Mar 18, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374 omri374 requested a review from Jakob-98 March 19, 2024 11:52
@omri374
Copy link
Contributor

omri374 commented Mar 20, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374 omri374 merged commit db8ff82 into microsoft:main Mar 20, 2024
31 checks passed
@elbud
Copy link

elbud commented Jun 3, 2024

Could this update be pushed to PyPi? There is only the original release available at the moment. https://pypi.org/project/presidio-structured/#history

@omri374
Copy link
Contributor

omri374 commented Jun 3, 2024

Hi @elbud, we plan to have a new release in the next couple of weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants