Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] Add rotation reference to brute force lookup #1669

Closed
wants to merge 5 commits into from

Conversation

haianhng31
Copy link
Contributor

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I like a lot of the interface proposals, including the new rotation enum.

I spoke to Eric offline, who had look into the PDQ algorithm, and I think there might be a problem with the approach that is hard to fix, though I think you have made some of the right supporting interface.

Namely, my assertion that you can "rotate" a PDQ hash may not actually be true. The rotation code you have copied evaluates rotations on the DCT matrix and not the PDQ hash itself. The output of the DCT -> PDQ looks like it might be destroying information, and so there might not be a valid equivalence from image -> rotate -> hash and image -> hash -> rotate.

We can still get the same effects as rotating the hash, but we may have to do it in more steps (we can discuss here async or over a meeting).

  1. Simplest option - move rotating the image as a helper in PhotoContent, then just hash all 8 rotations (we may want to do this anyways), but we will need to go in more steps.
  2. Requires research - Using solution 1 or by mathing out the matrix work + torbens in the PDQ code, we can check to see whether there are actually valid rotations of the PDQ hash itself, and then resuming from there.

In either case, I think we should retain your proposed enum, and add some functionality

Top level changes suggested in current version of code:

  1. Let's avoid adding it to the base-level interface, because this doesn't apply to text, and since it's not universal I don't think we should add it just yet - instead add it to PDQ distance
  2. For your test plan, look into how we might demonstrate this using either the hash or match commands.

FLIPPLUS1 = auto()
FLIPMINUS1 = auto()

class PDQHashRotations:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ymmv: python modules are already pretty namepsaced, so there's usually not an additional benefit to using classes like this unless.

What do you think about removing this wrapping class and having them just be functions in this module? The only one that is likely to be called by the caller is the rotate(hash, rotation_type) version.

)

rotator = PDQHashRotations()
rotations = rotator._try_all_rotation(hash2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: by naming this function with a leading underscore (_try_...), you are signaling you don't want it to be used outside of the module. However, you are then using it outside of the module!

@@ -68,10 +68,28 @@ def compare_hash(
hash1: str,
hash2: str,
pdq_dist_threshold: int = PDQ_CONFIDENT_MATCH_THRESHOLD,
try_rotation: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's default this to being true for now, though we may need to come back to it in the future,

)

@classmethod
def from_dist(
cls, dist: index.SignalSimilarityInfo, threshold: index.SignalSimilarityInfo
cls, dist: index.SignalSimilarityInfo, threshold: index.SignalSimilarityInfo, rotation_type: RotationType = RotationType.ORIGINAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: I like this idea! However, this concept applies to more than just photos and videos, the big alternatives being URLs and Text, where rotations doesn't make sense, so I don't think it should go onto the base class at this time.

However, you should add it to PDQ distnace.

@@ -0,0 +1,98 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing tests!

@@ -0,0 +1,102 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: Add a docstring at the top that links to the explainer for how rotations work and a one line of what this module is for

https://peps.python.org/pep-0257/

Comment on lines +7 to +14
ORIGINAL = auto()
ROTATE90 = auto()
ROTATE180 = auto()
ROTATE270 = auto()
FLIPX = auto()
FLIPY = auto()
FLIPPLUS1 = auto()
FLIPMINUS1 = auto()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While auto() is very fancy, it can lead to very challenging bugs when using it with Enum which can cause the numbers to change if you re-order these! Therefore, I recommend one of the following:

  1. Use StrEnum instead, which makes auto() the lowercase version of the names
  2. Manually type out all the numbers

Also consider adding a brief note explaining these rotations.

import numpy as np
from enum import Enum, auto

class RotationType(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am convinced by you that this Enum should be generic, but I think we should move it under content_base instead. Perhaps it can live here for now: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/content_type/content_base.py

WDYT?

Comment on lines +6 to +7
class TestPDQHashRotations:
SAMPLE_HASH = "f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: While this test does test some things, something it doesn't tell us is whether the rotations actually represent the rotated images! We shouldn't land that unless we can prove that.

# Convert hex string to binary
binary = bin(int(hash_str, 16))[2:].zfill(256)
# Convert to 16x16 matrix of floats
return np.array([float(int(b)) for b in binary]).reshape(16, 16)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: Why turn to floats?

ignorable: Take a look at some of the other np helpers like frombuffer() to see if you can get something that is a native operation in numpy, which is likely to be implemented as a c binding, which will be faster than doing the equivalent logic in python.

@haianhng31
Copy link
Contributor Author

The output of the DCT -> PDQ looks like it might be destroying information, and so there might not be a valid equivalence from image -> rotate -> hash and image -> hash -> rotate.

I've just written and tried to run the test cases with compare_hash using the same pictures (with one rotated) and the match result is false. So I confirm that rotating the hash doesn't work.

I will now try rotating the image in PhotoContent instead and will update later if this approach works. I will also do some research to see if there's a way to rotate the hash itself

@Dcallies
Copy link
Contributor

Thanks for submitting this as a draft @haianhng31 - I often recommend engineers on my team to "have the discussion with code" - it's much easier to talk about concrete things, and I can also give you some pointers on other things you've written.

@haianhng31
Copy link
Contributor Author

@Dcallies Please help me confirm if this is the file I should be working with to rotate the image before hashing & matching and if this approach sounds good:

https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/cli/match_cmd.py#L219-L225

  • I'm thinking that in the _match_file function, I will check whether the type is photo. If yes, i will rotate the object then brute force it with index.query

https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/content_type/photo.py

  • I will add the rotation helpers to this file

@Dcallies
Copy link
Contributor

Dcallies commented Oct 28, 2024

Let's break this up into multiple steps / PRs.

  1. PR one should add the helper functions in PhotoContent using pillow. It doesn't need a unittest, and we'll test it in PR 2. photo.py is the right place for this
  2. PR 2, we should add a new command line option (--rotations) to turn each photo file added into 8 photos that are all hashed. The test plan can be the inputs/outputs of the CLI, though there is a way to write a unittest if you want
  3. P3, we should update the match function to also take a --rotations, that builds on the code we used in PR2 to query all 8 rotations.

@haianhng31 haianhng31 changed the title Add rotation reference to brute force lookup [py-tx] Add rotation reference to brute force lookup Oct 30, 2024
@haianhng31 haianhng31 closed this Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[py-tx][mlh] Add rotation reference to brute force lookup
3 participants