Skip to content
David Callies edited this page Apr 21, 2022 · 70 revisions

Introduction

An diagram of HMA

Hasher-Matcher-Actioner (HMA) is an open-source, turnkey trust and safety tool. You can submit content to your own instance of HMA to scan through content on your platform and flag potential community standards violations. You can configure rules in HMA to automatically take actions (such as enqueue to a review system) when these potential violations are flagged.

HMA can pull in data from Facebook's ThreatExchange API (if you are a member) and use that to flag content.

Why would I want HMA?

HMA can be either a tool in your existing content moderation strategy, or the starting point of a wider moderation ecosystem on your platform. At it's core, HMA provides "similarity detection" or "copy detection", which allows detecting perceptually same or similar content that you (or someone else) has already seen. If the specific algorithms that HMA comes with only meet some of your needs, there are interfaces to plug into other solutions (including solutions that aren't just copy-detection, and can handle never-before-seen content), or to only use the subset of functionality you need. Where HMA might shine is integrating with collaborative trust & safety solutions, such as ThreatExchange.

What kinds of capabilities does HMA have?

Over time, HMA may gain more functionality. Additionally, HMA was designed to play nicely with other systems, and so missing functionality can be added by interfacing with other solutions.

  • βœ… Ready
  • 🚧 In development or planned 2022
  • πŸ“‹ Planned / Long Term
Content Type Matching Capability
Photos βœ… PDQ πŸ“‹ PDQ+OCR
Videos βœ… MD5 πŸ“‹ TMK+PDQF πŸ“‹ vPDQ
Text πŸ“‹ Hamming πŸ“‹ TLSH
URL πŸ“‹ Exact match πŸ“‹ MD5

Where is the data hosted?

You run your own instance of HMA and have control of the contents you evaluate. You end up having to pay the hosting costs as a result. If someone else runs an instance and says you can call it, then they host the data.

HMA can download matching signals from APIs hosted by someone else.

How does HMA use external APIs?

If you configure it to, HMA will connect to external APIs (like ThreatExchange) to get signals and hashes to compare against.

HMA does not share any data that you do not explicitly share by configuring it to do so. No metrics, no telemetry, etc. You can configure it to give feedback on signals that others have hosted (SEEN, true/false positive reporting), but it won't do so if you don't configure it to.

Can I use HMA without connecting to external APIs?

Right now HMA can only match against collections of signals stored in ThreatExchange, but you can use ThreatExchange's privacy controls to only share those signals with yourself. This isn't quite ideal if you want to keep everything inside of your platform, and we are aiming to provide local-only support eventually.

How long does it take to start using HMA?

You can get a test deployment up in roughly an hour, especially if you are already familiar with tools such as Terraform.

The time to fully integrate into your infrastructure might require:

  1. Setting up any custom AWS environment things you need (VPCs, routing, access controls, SSO)
  2. Adding a hook in your content flow to trigger an API call for HMA to evaluate content.
  3. Adding an endpoint in your admin tooling to receive callbacks from HMA to react to content it has flagged.
  4. Setting configuration in HMA to download the right datasets and route matches.
  5. Running some kind of experiment to slowly turn up traffic into HMA, make a judgement on the performance of the results.

A well-motivated engineer with access to all the resources they would need might take 1-2 weeks to do the above.

What scale can HMA run at?

We have a target of processing 4k images/sec. In practice, we can currently hit 1.3k images/sec. However, the bottleneck is the hashing component. If you move the hashing to occur inside your own infrastructure, the "MA" version of HMA can hit 4k+ images/sec.

How expensive is it to run HMA?

HMA is built off AWS lambda. If it's getting no traffic, it's almost-but-not-quite-free to run (queue polling events currently lead to ~$1 of charges a month in our testing, but please monitor and use limits in your setup).

As of last benchmarking in March 2021, the cost for 1MB images was 1 cent per 1000 images. Computing hashes in your own infrastructure can reduce the cost, as hashing is the most expensive component.

I already have some content moderation tools, and don't need all the bells and whistles.

You can only use the subsets of HMA that make sense for your platform. Check out What are the Different Ways that I Can Use HMA?

If AWS itself is a dealbreaker, you can try adapting the code to work with other clouds, or just use the underlying libraries, which live in this same repo.

Clone this wiki locally