Improved the logic of conflict handling in AnonymizerEngine #1196

VMD7 · 2023-10-28T20:34:52Z

Improved the logic of _remove_conflicts_and_get_text_manipulation_data method in AnonymizerEngine for a edge case when entities gets intersected then they are changing the original text data due to overlap portion.

Change Description

-> In _remove_conflicts_and_get_text_manipulation_data method all the scenarios handled. Only, the issue coming when two different entities detects PII and have some common portion of text in between them.
-> Now if we try to raise this as conflict in method _is_result_conflicted_with_other_elements. Then both entities might get lost, so this might not be the right way to handle it.
-> So, after getting the list of result of all entities of _remove_conflicts_and_get_text_manipulation_data method at end in _unique_text_metadata_elements list. Will first sort them ascending order by considering the "start" key.
-> Then will check is there any conflict is happening or not, if yes then will adjust the start and end positions, without removing them.

Issue reference

This PR fixes issue #1195

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

…a method in AnonymizerEngine

VMD7 · 2023-10-28T20:37:25Z

@microsoft-github-policy-service agree

omri374 · 2023-10-30T10:26:25Z

/azp run

azure-pipelines · 2023-10-30T10:26:38Z

Azure Pipelines successfully started running 1 pipeline(s).

VMD7 · 2023-11-03T07:43:43Z

Hi @omri374 do I need to change anything in code.
-> The 2/45 test cases are failing due to the updated change, because the test cases you made also changing the original text. So, please check the test cases for the same.
-> There are some another tests failing they are related with Linting, I will work on it.

azure-pipelines · 2023-11-03T20:39:26Z

Commenter does not have sufficient privileges for PR 1196 in repo microsoft/presidio

omri374 · 2023-11-03T22:00:31Z

@VMD7 thanks for this addition. Could you please add a test case which checks the new logic specifically?

omri374 · 2023-11-03T22:00:44Z

/azp run

azure-pipelines · 2023-11-03T22:01:00Z

Azure Pipelines successfully started running 1 pipeline(s).

VMD7 · 2023-11-04T07:01:31Z

Hi @omri374
-> Added the test case for new logic of conflict management.
-> It passes the all test cases except 1 which is below

def test_given_custom_anonymizer_then_we_manage_to_anonymize_successfully():
    engine = AnonymizerEngine()
    text = (
        "Fake card number 4151 3217 6243 3448.com that "
        "overlaps with nonexisting URL."
    )
    analyzer_result = RecognizerResult("CREDIT_CARD", 17, 36, 0.8)
    analyzer_result2 = RecognizerResult("URL", 32, 40, 0.8)
    anonymizer_config = OperatorConfig("custom", {"lambda": lambda x: f"<ENTITY: {x}>"})
    result = engine.anonymize(
        text, [analyzer_result, analyzer_result2], {"DEFAULT": anonymizer_config}
    ).text
    resp = (
        "Fake card number <ENTITY: 4151 3217 6243 3448>"
        "<ENTITY: 3448.com> that overlaps with nonexisting URL."
    )
    assert result == resp

Now, if you observe the problem with this test case is that. It gives the overlapped entities. But, It fails to ensure that after unmasking or decrypting the text do we get the original text back. So, my suggestion would be we should modify this test case by below changes in resp such as. The expected resp should be:

resp = (
        "Fake card number <ENTITY: 4151 3217 6243 3448>"
        "<ENTITY:  .com> that overlaps with nonexisting URL."
    )

I will commit the changes for this as well. Please let me know your thoughts on it.

…nage_to_anonymize_successfully

omri374 · 2023-11-05T09:17:46Z

Hi @VMD7. There are two use cases for conflict resolution:

Getting the de-anonymized text correctly with no duplications
For each entity, regardless if it is conflicting with another or not, we should be able to de-anonymize the anonymized value and get the previous value, the original spans and the anonymized spans.

Is your suggested change solving those two use cases? If it only solves 1, maybe it would be better to tackle it on the text building part, and not the entity conflicts part.

omri374 · 2023-12-13T16:57:57Z

Hi @VMD7, thanks for creating this PR! Would you be interested in continuing to work on it? Perhaps we can collaborate?

VMD7 · 2023-12-13T17:53:24Z

Hi @omri374,
Thanks for acknowledging. I wanted to work on this thread. But, little bit got stuck in that. Your collaboration definitely help me to implement further.

omri374 · 2023-12-21T12:16:03Z

Hi @VMD7, as discussed, we should have a configuration allowing conflict resolution to happen in different ways.
Do you think something like this would make sense?

class ConflictResolutionStrategy(Enum):
    """Conflict resolution strategy.

    The strategy to use when there is a conflict between two entities.
    TEXT: The conflict will be resolved on the output text level, assuring that
    the output text will not contain any overlapping entities.
    ENTITIES: The conflict will be resolved on the entities level, assuring that
    the output entities will not overlap. This is useful for de-anonymization.
    TEXT_AND_ENTITIES: The conflict will be resolved on both the output text and the
    output entities level.
    NONE: No conflict resolution will be performed.
    """


    TEXT = "text"
    ENTITIES = "entities"
    TEXT_AND_ENTITIES = "text_and_entities"
    NONE = "none"

VMD7 · 2023-12-21T15:54:24Z

Hi @omri374
Yes, this conflict resolution strategy looks nice. I have following doubts for each case, please guide me on this.

Case A: TEXT = "text"::
-> I think we have to modify in the EngineBase class -> _operate method .
-> I have gave a try to directly modify the text without modifying the entites span or entities text. Its working for the all the operators.
-> The one problem I could see is that in encryption and decryption. In encryption we are getting fine output. In decryption also we are getting perfect output without overlapping the text. But, the text spans are getting changed for decrypt because while encrypting we are giving modified text. So, its expected behaviour. (Please let me know your thoughts on this).

Case B: ENTITIES = "entities"::
-> One doubt in this case, if we remove the conflict from entities completely then there won't be any overlapping problem of text. So, indirectly we are completing the third case here as well.

Case C: TEXT_AND_ENTITIES = "text_and_entities"::
-> Yes the earlier written piece of code works for this scenario. Also, if any additional changes needed please let me know.

Case D: NONE = "none"::
-> Are we planning to ignore all the conflicts (#1156 ) like the conflicts resolved by some assumptions as well.

Please let me know your thoughts on this.

omri374 · 2023-12-26T12:42:51Z

Thanks @VMD7. I suppose we don't have to implement all the possible options. How would you suggest to simplify? Implement A,C,D?

VMD7 · 2023-12-27T05:26:10Z

Hi @omri374
For simplicity can we go with C and D only. Like if user wants to complete remove the conflicts then they will use C case and if user want to go with current conflict management (that contains some assumptions) then they will choose D. This can be implemented using simple remove_conflicts flag in anonymizer engine.

To implement case A - I am not getting confident because somehow we have to change the entity and their position to avoid the overlap on text side. So, indirectly we are changing the entity positions for text side handling.

What do you say on this. Should I proceed for the code changes.

omri374 · 2023-12-27T07:42:11Z

Yes, let's go with C and D. Should we have this as an enum anyway? If for instance we'd like to add a new strategy for conflict handling, we wouldn't have to change the API. WDYT?

VMD7 · 2023-12-27T08:43:25Z

Yes sure using enum it will be better. I will add case C and D with enum and test cases related with that and commit the update code.

…en_we_manage_to_anonymize_successfully" This reverts commit a41d437.

…residio into testing-anonymizer-engine

azure-pipelines · 2023-12-31T09:56:21Z

Azure Pipelines successfully started running 1 pipeline(s).

omri374

Looks great!
Thinking about the default case, where ConflictResolutionStrategy is None, maybe it's better to pass a default value instead of None.

presidio-anonymizer/tests/test_anonymizer_engine.py

…residio into testing-anonymizer-engine

VMD7 · 2023-12-31T12:20:19Z

Hi @omri374
Added one more test case where conflict_strategy = None. Pleas check it once and let me know if any further changes required.

For default case - Yes right, instead of using ConflictResolutionStrategy.NONE we can use ConflictResolutionStrategy.DEFAULT as it sounds more relevant.

One more thought should we make three cases:
a) ConflictResolutionStrategy.DEFAULT -> This will be our default strategy (that is current strategy)
b) ConflictResolutionStrategy.TEXT_AND_ENTITIES -> This will further removes the conflicts from text and entities (The written piece of code already).
c) ConflictResolutionStrategy.NONE -> This will be our new strategy where we will not handle any type of conflicts will give the results as it that are coming from analyzer engine.

Or should I simply replace the ConflictResolutionStrategy.NONE by ConflictResolutionStrategy.DEFAULT for now. Please let me know your thoughts on it.

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

presidio-anonymizer/tests/test_anonymizer_engine.py

omri374 · 2023-12-31T20:14:16Z

Hi, thanks for the updates!
I agree with you that we should have a default here which is identical to the previous logic.
We should make sure the description of each strategy is clear, and I'm not sure DEFAULT and TEXT_AND_ENTITIES would say much to a user interested in understanding which strategy to choose.

Are these good descriptors of the strategies?
a. MERGE_SIMILAR_OR_CONTAINED (default one we have today)
b. REMOVE_INTERSECTIONS (new logic)
c. NONE (do nothing, including skipping a)

We should also mention that b will also trigger a.
WDYT?

VMD7 · 2024-01-02T06:02:47Z

Hi @omri374
Yes, these descriptions looks good. I will update the code by considering these cases. As well as will add some more test cases for updated logic of intersection where we are considering the score as well and None case as well.

…residio into testing-anonymizer-engine

VMD7 · 2024-01-02T12:39:29Z

Hi @omri374
Updated all the required changes. Please review it once and let me know if any improvements needed.

omri374

Thanks! looks great. One last comment if I may :)

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

omri374 · 2024-01-08T20:36:23Z

/azp run

azure-pipelines · 2024-01-08T20:36:35Z

Azure Pipelines successfully started running 1 pipeline(s).

Co-authored-by: Omri Mendels <[email protected]>

omri374 · 2024-01-09T11:54:40Z

/azp run

azure-pipelines · 2024-01-09T11:54:54Z

Azure Pipelines successfully started running 1 pipeline(s).

omri374 · 2024-01-09T12:12:06Z

Thank you @VMD7 for this contribution!

VMD7 · 2024-01-09T13:07:42Z

Thanks @omri374 for your support and guidance.

Improved the logic of _remove_conflicts_and_get_text_manipulation_dat…

a05610a

…a method in AnonymizerEngine

VMD7 and others added 2 commits November 4, 2023 01:45

Resolved the linting errors

dd3fb7d

Merge branch 'microsoft:main' into testing-anonymizer-engine

0dadff6

Merge branch 'main' into testing-anonymizer-engine

28ab0fc

Added the test case for new logic of conflict handling

8b7d7ee

Modifying the test case named test_given_custom_anonymizer_then_we_ma…

a41d437

…nage_to_anonymize_successfully

Merge branch 'main' into testing-anonymizer-engine

3b08b82

Merge branch 'main' into testing-anonymizer-engine

2d01950

Merge branch 'main' into testing-anonymizer-engine

9421c6e

VMD7 and others added 3 commits December 27, 2023 19:30

Revert "Modifying the test case named test_given_custom_anonymizer_th…

a69729a

…en_we_manage_to_anonymize_successfully" This reverts commit a41d437.

Merge branch 'testing-anonymizer-engine' of https://github.com/VMD7/p…

5c59798

…residio into testing-anonymizer-engine

Merge branch 'microsoft:main' into testing-anonymizer-engine

45ba4e9

omri374 reviewed Dec 31, 2023

View reviewed changes

presidio-anonymizer/tests/test_anonymizer_engine.py Outdated Show resolved Hide resolved

VMD7 added 2 commits December 31, 2023 17:27

Added one more test case where conflict_stragy=None

4a81c6a

Merge branch 'testing-anonymizer-engine' of https://github.com/VMD7/p…

e584729

…residio into testing-anonymizer-engine

omri374 reviewed Dec 31, 2023

View reviewed changes

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py Outdated Show resolved Hide resolved

presidio-anonymizer/tests/test_anonymizer_engine.py Outdated Show resolved Hide resolved

VMD7 and others added 2 commits January 2, 2024 10:15

Utilizing score for entity span changes

4622602

Merge branch 'microsoft:main' into testing-anonymizer-engine

660a6f3

VMD7 added 4 commits January 2, 2024 18:01

Added all three conflict resolutions

66ca4ff

Added the respective conflict resolutions logic

4e40570

Added all types of conflict resolution test cases

2e0694e

Merge branch 'testing-anonymizer-engine' of https://github.com/VMD7/p…

2731375

…residio into testing-anonymizer-engine

omri374 reviewed Jan 2, 2024

View reviewed changes

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py Outdated Show resolved Hide resolved

Removed the NONE option from conflict resoluiton strategy

c9d539b

omri374 reviewed Jan 8, 2024

View reviewed changes

presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py Outdated Show resolved Hide resolved

Merge branch 'main' into testing-anonymizer-engine

5c1e82e

VMD7 and others added 3 commits January 9, 2024 07:38

Update presidio-anonymizer/presidio_anonymizer/anonymizer_engine.py

09dc282

Co-authored-by: Omri Mendels <[email protected]>

Handled linting error

489ebba

Removed extra space

f8ca304

omri374 approved these changes Jan 9, 2024

View reviewed changes

omri374 merged commit 629c5dc into microsoft:main Jan 9, 2024
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved the logic of conflict handling in AnonymizerEngine #1196

Improved the logic of conflict handling in AnonymizerEngine #1196

VMD7 commented Oct 28, 2023 •

edited

Loading

VMD7 commented Oct 28, 2023

omri374 commented Oct 30, 2023

azure-pipelines bot commented Oct 30, 2023

VMD7 commented Nov 3, 2023 •

edited

Loading

azure-pipelines bot commented Nov 3, 2023

omri374 commented Nov 3, 2023

omri374 commented Nov 3, 2023

azure-pipelines bot commented Nov 3, 2023

VMD7 commented Nov 4, 2023 •

edited

Loading

omri374 commented Nov 5, 2023

omri374 commented Dec 13, 2023

VMD7 commented Dec 13, 2023

omri374 commented Dec 21, 2023

VMD7 commented Dec 21, 2023 •

edited

Loading

omri374 commented Dec 26, 2023

VMD7 commented Dec 27, 2023

omri374 commented Dec 27, 2023

VMD7 commented Dec 27, 2023

azure-pipelines bot commented Dec 31, 2023

omri374 left a comment

VMD7 commented Dec 31, 2023

omri374 commented Dec 31, 2023 •

edited

Loading

VMD7 commented Jan 2, 2024 •

edited

Loading

VMD7 commented Jan 2, 2024

omri374 left a comment

omri374 commented Jan 8, 2024

azure-pipelines bot commented Jan 8, 2024

omri374 commented Jan 9, 2024

azure-pipelines bot commented Jan 9, 2024

omri374 commented Jan 9, 2024

VMD7 commented Jan 9, 2024

Improved the logic of conflict handling in AnonymizerEngine #1196

Improved the logic of conflict handling in AnonymizerEngine #1196

Conversation

VMD7 commented Oct 28, 2023 • edited Loading

Change Description

Issue reference

Checklist

VMD7 commented Oct 28, 2023

omri374 commented Oct 30, 2023

azure-pipelines bot commented Oct 30, 2023

VMD7 commented Nov 3, 2023 • edited Loading

azure-pipelines bot commented Nov 3, 2023

omri374 commented Nov 3, 2023

omri374 commented Nov 3, 2023

azure-pipelines bot commented Nov 3, 2023

VMD7 commented Nov 4, 2023 • edited Loading

omri374 commented Nov 5, 2023

omri374 commented Dec 13, 2023

VMD7 commented Dec 13, 2023

omri374 commented Dec 21, 2023

VMD7 commented Dec 21, 2023 • edited Loading

omri374 commented Dec 26, 2023

VMD7 commented Dec 27, 2023

omri374 commented Dec 27, 2023

VMD7 commented Dec 27, 2023

azure-pipelines bot commented Dec 31, 2023

omri374 left a comment

Choose a reason for hiding this comment

VMD7 commented Dec 31, 2023

omri374 commented Dec 31, 2023 • edited Loading

VMD7 commented Jan 2, 2024 • edited Loading

VMD7 commented Jan 2, 2024

omri374 left a comment

Choose a reason for hiding this comment

omri374 commented Jan 8, 2024

azure-pipelines bot commented Jan 8, 2024

omri374 commented Jan 9, 2024

azure-pipelines bot commented Jan 9, 2024

omri374 commented Jan 9, 2024

VMD7 commented Jan 9, 2024

VMD7 commented Oct 28, 2023 •

edited

Loading

VMD7 commented Nov 3, 2023 •

edited

Loading

VMD7 commented Nov 4, 2023 •

edited

Loading

VMD7 commented Dec 21, 2023 •

edited

Loading

omri374 commented Dec 31, 2023 •

edited

Loading

VMD7 commented Jan 2, 2024 •

edited

Loading