Improved the logic of conflict handling in AnonymizerEngine #1196
…a method in AnonymizerEngine
@microsoft-github-policy-service agree |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Hi @omri374, do I need to change anything in the code? |
Commenter does not have sufficient privileges for PR 1196 in repo microsoft/presidio |
@VMD7 thanks for this addition. Could you please add a test case which checks the new logic specifically? |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Hi @omri374
The problem with this test case is that it yields overlapping entities but does not verify that unmasking or decrypting the anonymized text gives back the original text. My suggestion is to modify this test case with the changes below in
I will commit these changes as well. Please let me know your thoughts. |
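The round-trip property described in this comment can be sketched without Presidio. The toy example below is only an illustration: `mask` and `unmask` are hypothetical helpers (not Presidio operators), spans are plain `(start, end)` tuples, and the placeholder tokens are an assumption. It shows the kind of assertion such a test could make, and why overlapping spans can break it.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) offsets, end exclusive


def mask(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: replace each span with a token, right to left."""
    mapping: Dict[str, str] = {}
    for i, (start, end) in enumerate(sorted(spans, reverse=True)):
        token = f"<PII_{i}>"
        mapping[token] = text[start:end]  # remember what the token replaced
        text = text[:start] + token + text[end:]
    return text, mapping


def unmask(masked: str, mapping: Dict[str, str]) -> str:
    """Hypothetical helper: put the remembered text back for each token."""
    for token, original in mapping.items():
        masked = masked.replace(token, original)
    return masked


original = "call John Smith now"

# Non-overlapping spans: "John " and "Smith now".
masked_ok, mapping_ok = mask(original, [(5, 10), (10, 19)])
round_trip_ok = unmask(masked_ok, mapping_ok) == original

# Overlapping spans: "John Smith" and "Smith now" share "Smith".
masked_bad, mapping_bad = mask(original, [(5, 15), (10, 19)])
round_trip_bad = unmask(masked_bad, mapping_bad) == original
```

In this toy setup the round trip holds for the non-overlapping spans and fails for the overlapping ones, which is the behaviour a test of the real engine would need to guard against.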
…nage_to_anonymize_successfully
Hi @VMD7. There are two use cases for conflict resolution:
Does your suggested change solve both use cases? If it only solves one, maybe it would be better to tackle it in the text-building part rather than the entity-conflicts part. |
Hi @VMD7, thanks for creating this PR! Would you be interested in continuing to work on it? Perhaps we can collaborate? |
Hi @omri374, |
Hi @VMD7, as discussed, we should have a configuration allowing conflict resolution to happen in different ways.

class ConflictResolutionStrategy(Enum):
    """Conflict resolution strategy.

    The strategy to use when there is a conflict between two entities.

    TEXT: The conflict will be resolved on the output text level, assuring that
    the output text will not contain any overlapping entities.
    ENTITIES: The conflict will be resolved on the entities level, assuring that
    the output entities will not overlap. This is useful for de-anonymization.
    TEXT_AND_ENTITIES: The conflict will be resolved on both the output text and the
    output entities level.
    NONE: No conflict resolution will be performed.
    """

    TEXT = "text"
    ENTITIES = "entities"
    TEXT_AND_ENTITIES = "text_and_entities"
    NONE = "none"
|
Hi @omri374
Case A: TEXT = "text"
Case B: ENTITIES = "entities"
Case C: TEXT_AND_ENTITIES = "text_and_entities"
Case D: NONE = "none"
Please let me know your thoughts on this. |
Thanks @VMD7. I suppose we don't have to implement all the possible options. How would you suggest simplifying? Implement A, C, D? |
Hi @omri374, I am not confident about implementing case A, because we would somehow have to change the entities and their positions to avoid the overlap on the text side. So we would indirectly be changing the entity positions even for text-side handling. What do you think? Should I proceed with the code changes? |
Yes, let's go with C and D. Should we have this as an enum anyway? If for instance we'd like to add a new strategy for conflict handling, we wouldn't have to change the API. WDYT? |
Yes, sure, using an enum will be better. I will add cases C and D with the enum, add the related test cases, and commit the updated code. |
Azure Pipelines successfully started running 1 pipeline(s). |
Looks great!
Thinking about the default case, where ConflictResolutionStrategy is None, maybe it's better to pass a default value instead of None.
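As a rough illustration of this suggestion, the sketch below gives the strategy parameter an enum member as its default, so callers never pass `None`. It is not the code from this PR: `resolve_conflicts` is a hypothetical helper, spans are plain `(start, end)` tuples, only cases C and D from the thread are modelled, and the choice of `NONE` as the default member is an assumption.

```python
from enum import Enum
from typing import List, Tuple, Union

Span = Tuple[int, int]  # (start, end) offsets, end exclusive


class ConflictResolutionStrategy(Enum):
    # Only the two strategies agreed on in the thread (cases C and D).
    TEXT_AND_ENTITIES = "text_and_entities"
    NONE = "none"


def resolve_conflicts(
    spans: List[Span],
    strategy: Union[ConflictResolutionStrategy, str] = ConflictResolutionStrategy.NONE,
) -> List[Span]:
    """Hypothetical helper: apply the chosen strategy to (start, end) spans."""
    # Enum lookup accepts a member or its string value; None raises ValueError.
    strategy = ConflictResolutionStrategy(strategy)
    if strategy is ConflictResolutionStrategy.NONE:
        return list(spans)
    # TEXT_AND_ENTITIES (assumed behaviour): trim each span so that it starts
    # no earlier than the end of the previously kept span.
    resolved: List[Span] = []
    for start, end in sorted(spans):
        if resolved:
            start = max(start, resolved[-1][1])
        if start < end:
            resolved.append((start, end))
    return resolved
```

With a member as the default, omitting the argument selects a well-defined strategy, and adding a new member later does not change the function signature.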
…residio into testing-anonymizer-engine
Hi @omri374, for the default case: yes, instead of using ConflictResolutionStrategy.NONE we can use ConflictResolutionStrategy.DEFAULT, as it sounds more relevant. One more thought, should we make three cases: Or should I simply replace ConflictResolutionStrategy.NONE with ConflictResolutionStrategy.DEFAULT for now? Please let me know your thoughts. |
Hi, thanks for the updates! Are these good descriptors of the strategies? We should also mention that b will also trigger a. |
Hi @omri374 |
Hi @omri374 |
Thanks! looks great. One last comment if I may :)
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Co-authored-by: Omri Mendels <[email protected]>
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Thank you @VMD7 for this contribution! |
Thanks @omri374 for your support and guidance. |
Improved the logic of the _remove_conflicts_and_get_text_manipulation_data method in AnonymizerEngine for an edge case: when entities intersect, the overlapping portion causes the original text data to be changed.
Change Description
-> The _remove_conflicts_and_get_text_manipulation_data method already handles all other scenarios. The issue arises only when two different entities detect PII and share a common portion of text.
-> If we raise this as a conflict in the _is_result_conflicted_with_other_elements method, both entities might get lost, so this might not be the right way to handle it.
-> So, once _remove_conflicts_and_get_text_manipulation_data has collected the results for all entities in the _unique_text_metadata_elements list at the end of the method, we first sort them in ascending order by the "start" key.
-> Then we check whether any conflict occurs; if so, we adjust the start and end positions without removing the entities.
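The sort-then-adjust steps above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `adjust_intersections` is a hypothetical name, entities are modelled as plain dicts with "start" and "end" keys, and the rule that the later entity gives up the shared characters is an assumption. It also assumes that duplicate and fully contained entities were already handled by the earlier logic, so only partial overlaps remain.

```python
from typing import Any, Dict, List

Element = Dict[str, Any]


def adjust_intersections(elements: List[Element]) -> List[Element]:
    """Sort by "start" and trim partial overlaps without dropping entities."""
    # Work on copies so the caller's list is left untouched.
    ordered = sorted((dict(e) for e in elements), key=lambda e: e["start"])
    for previous, current in zip(ordered, ordered[1:]):
        if current["start"] < previous["end"]:
            # Assumption: the later entity gives up the shared characters.
            current["start"] = previous["end"]
    return ordered


results = [
    {"entity_type": "B", "start": 10, "end": 19},
    {"entity_type": "A", "start": 5, "end": 15},
]
adjusted = adjust_intersections(results)
```

Here both entities survive: entity A keeps offsets 5 to 15, and entity B is moved to start at 15, so the two no longer share any text.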
Issue reference
This PR fixes issue #1195
Checklist