Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved the logic of conflict handling in AnonymizerEngine #1196

Merged
merged 31 commits into from
Jan 9, 2024

Conversation

VMD7
Copy link
Contributor

@VMD7 VMD7 commented Oct 28, 2023

Improved the logic of _remove_conflicts_and_get_text_manipulation_data method in AnonymizerEngine for a edge case when entities gets intersected then they are changing the original text data due to overlap portion.

Change Description

-> In _remove_conflicts_and_get_text_manipulation_data method all the scenarios handled. Only, the issue coming when two different entities detects PII and have some common portion of text in between them.
-> Now if we try to raise this as conflict in method _is_result_conflicted_with_other_elements. Then both entities might get lost, so this might not be the right way to handle it.
-> So, after getting the list of result of all entities of _remove_conflicts_and_get_text_manipulation_data method at end in _unique_text_metadata_elements list. Will first sort them ascending order by considering the "start" key.
-> Then will check is there any conflict is happening or not, if yes then will adjust the start and end positions, without removing them.

Issue reference

This PR fixes issue #1195

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@VMD7
Copy link
Contributor Author

VMD7 commented Oct 28, 2023

@microsoft-github-policy-service agree

@omri374
Copy link
Contributor

omri374 commented Oct 30, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VMD7
Copy link
Contributor Author

VMD7 commented Nov 3, 2023

Hi @omri374 do I need to change anything in code.
-> The 2/45 test cases are failing due to the updated change, because the test cases you made also changing the original text. So, please check the test cases for the same.
-> There are some another tests failing they are related with Linting, I will work on it.

Copy link

Commenter does not have sufficient privileges for PR 1196 in repo microsoft/presidio

@omri374
Copy link
Contributor

omri374 commented Nov 3, 2023

@VMD7 thanks for this addition. Could you please add a test case which checks the new logic specifically?

@omri374
Copy link
Contributor

omri374 commented Nov 3, 2023

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VMD7
Copy link
Contributor Author

VMD7 commented Nov 4, 2023

Hi @omri374
-> Added the test case for new logic of conflict management.
-> It passes the all test cases except 1 which is below

def test_given_custom_anonymizer_then_we_manage_to_anonymize_successfully():
    engine = AnonymizerEngine()
    text = (
        "Fake card number 4151 3217 6243 3448.com that "
        "overlaps with nonexisting URL."
    )
    analyzer_result = RecognizerResult("CREDIT_CARD", 17, 36, 0.8)
    analyzer_result2 = RecognizerResult("URL", 32, 40, 0.8)
    anonymizer_config = OperatorConfig("custom", {"lambda": lambda x: f"<ENTITY: {x}>"})
    result = engine.anonymize(
        text, [analyzer_result, analyzer_result2], {"DEFAULT": anonymizer_config}
    ).text
    resp = (
        "Fake card number <ENTITY: 4151 3217 6243 3448>"
        "<ENTITY: 3448.com> that overlaps with nonexisting URL."
    )
    assert result == resp

Now, if you observe the problem with this test case is that. It gives the overlapped entities. But, It fails to ensure that after unmasking or decrypting the text do we get the original text back. So, my suggestion would be we should modify this test case by below changes in resp such as. The expected resp should be:

resp = (
        "Fake card number <ENTITY: 4151 3217 6243 3448>"
        "<ENTITY:  .com> that overlaps with nonexisting URL."
    )

I will commit the changes for this as well. Please let me know your thoughts on it.

@omri374
Copy link
Contributor

omri374 commented Nov 5, 2023

Hi @VMD7. There are two use cases for conflict resolution:

  1. Getting the de-anonymized text correctly with no duplications
  2. For each entity, regardless if it is conflicting with another or not, we should be able to de-anonymize the anonymized value and get the previous value, the original spans and the anonymized spans.

Is your suggested change solving those two use cases? If it only solves 1, maybe it would be better to tackle it on the text building part, and not the entity conflicts part.

@omri374
Copy link
Contributor

omri374 commented Dec 13, 2023

Hi @VMD7, thanks for creating this PR! Would you be interested in continuing to work on it? Perhaps we can collaborate?

@VMD7
Copy link
Contributor Author

VMD7 commented Dec 13, 2023

Hi @omri374,
Thanks for acknowledging. I wanted to work on this thread. But, little bit got stuck in that. Your collaboration definitely help me to implement further.

@omri374
Copy link
Contributor

omri374 commented Dec 21, 2023

Hi @VMD7, as discussed, we should have a configuration allowing conflict resolution to happen in different ways.
Do you think something like this would make sense?

class ConflictResolutionStrategy(Enum):
    """Conflict resolution strategy.

    The strategy to use when there is a conflict between two entities.
    TEXT: The conflict will be resolved on the output text level, assuring that
    the output text will not contain any overlapping entities.
    ENTITIES: The conflict will be resolved on the entities level, assuring that
    the output entities will not overlap. This is useful for de-anonymization.
    TEXT_AND_ENTITIES: The conflict will be resolved on both the output text and the
    output entities level.
    NONE: No conflict resolution will be performed.
    """


    TEXT = "text"
    ENTITIES = "entities"
    TEXT_AND_ENTITIES = "text_and_entities"
    NONE = "none"

@VMD7
Copy link
Contributor Author

VMD7 commented Dec 21, 2023

Hi @omri374
Yes, this conflict resolution strategy looks nice. I have following doubts for each case, please guide me on this.

Case A: TEXT = "text"::
-> I think we have to modify in the EngineBase class -> _operate method .
-> I have gave a try to directly modify the text without modifying the entites span or entities text. Its working for the all the operators.
-> The one problem I could see is that in encryption and decryption. In encryption we are getting fine output. In decryption also we are getting perfect output without overlapping the text. But, the text spans are getting changed for decrypt because while encrypting we are giving modified text. So, its expected behaviour. (Please let me know your thoughts on this).

Case B: ENTITIES = "entities"::
-> One doubt in this case, if we remove the conflict from entities completely then there won't be any overlapping problem of text. So, indirectly we are completing the third case here as well.

Case C: TEXT_AND_ENTITIES = "text_and_entities"::
-> Yes the earlier written piece of code works for this scenario. Also, if any additional changes needed please let me know.

Case D: NONE = "none"::
-> Are we planning to ignore all the conflicts (#1156 ) like the conflicts resolved by some assumptions as well.

Please let me know your thoughts on this.

@omri374
Copy link
Contributor

omri374 commented Dec 26, 2023

Thanks @VMD7. I suppose we don't have to implement all the possible options. How would you suggest to simplify? Implement A,C,D?

@VMD7
Copy link
Contributor Author

VMD7 commented Dec 27, 2023

Hi @omri374
For simplicity can we go with C and D only. Like if user wants to complete remove the conflicts then they will use C case and if user want to go with current conflict management (that contains some assumptions) then they will choose D. This can be implemented using simple remove_conflicts flag in anonymizer engine.

To implement case A - I am not getting confident because somehow we have to change the entity and their position to avoid the overlap on text side. So, indirectly we are changing the entity positions for text side handling.

What do you say on this. Should I proceed for the code changes.

@omri374
Copy link
Contributor

omri374 commented Dec 27, 2023

Yes, let's go with C and D. Should we have this as an enum anyway? If for instance we'd like to add a new strategy for conflict handling, we wouldn't have to change the API. WDYT?

@VMD7
Copy link
Contributor Author

VMD7 commented Dec 27, 2023

Yes sure using enum it will be better. I will add case C and D with enum and test cases related with that and commit the update code.

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@omri374 omri374 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!
Thinking about the default case, where ConflictResolutionStrategy is None, maybe it's better to pass a default value instead of None.

presidio-anonymizer/tests/test_anonymizer_engine.py Outdated Show resolved Hide resolved
@VMD7
Copy link
Contributor Author

VMD7 commented Dec 31, 2023

Hi @omri374
Added one more test case where conflict_strategy = None. Pleas check it once and let me know if any further changes required.

For default case - Yes right, instead of using ConflictResolutionStrategy.NONE we can use ConflictResolutionStrategy.DEFAULT as it sounds more relevant.

One more thought should we make three cases:
a) ConflictResolutionStrategy.DEFAULT -> This will be our default strategy (that is current strategy)
b) ConflictResolutionStrategy.TEXT_AND_ENTITIES -> This will further removes the conflicts from text and entities (The written piece of code already).
c) ConflictResolutionStrategy.NONE -> This will be our new strategy where we will not handle any type of conflicts will give the results as it that are coming from analyzer engine.

Or should I simply replace the ConflictResolutionStrategy.NONE by ConflictResolutionStrategy.DEFAULT for now. Please let me know your thoughts on it.

@omri374
Copy link
Contributor

omri374 commented Dec 31, 2023

Hi, thanks for the updates!
I agree with you that we should have a default here which is identical to the previous logic.
We should make sure the description of each strategy is clear, and I'm not sure DEFAULT and TEXT_AND_ENTITIES would say much to a user interested in understanding which strategy to choose.

Are these good descriptors of the strategies?
a. MERGE_SIMILAR_OR_CONTAINED (default one we have today)
b. REMOVE_INTERSECTIONS (new logic)
c. NONE (do nothing, including skipping a)

We should also mention that b will also trigger a.
WDYT?

@VMD7
Copy link
Contributor Author

VMD7 commented Jan 2, 2024

Hi @omri374
Yes, these descriptions looks good. I will update the code by considering these cases. As well as will add some more test cases for updated logic of intersection where we are considering the score as well and None case as well.

@VMD7
Copy link
Contributor Author

VMD7 commented Jan 2, 2024

Hi @omri374
Updated all the required changes. Please review it once and let me know if any improvements needed.

Copy link
Contributor

@omri374 omri374 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! looks great. One last comment if I may :)

@omri374
Copy link
Contributor

omri374 commented Jan 8, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor

omri374 commented Jan 9, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor

omri374 commented Jan 9, 2024

Thank you @VMD7 for this contribution!

@omri374 omri374 merged commit 629c5dc into microsoft:main Jan 9, 2024
22 checks passed
@VMD7
Copy link
Contributor Author

VMD7 commented Jan 9, 2024

Thanks @omri374 for your support and guidance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants