David Callies edited this page Aug 24, 2021 · 70 revisions

Introduction

A diagram of HMA

Hasher-Matcher-Actioner (HMA) is an open-source, turnkey trust and safety tool. You can submit content from your platform to your own instance of HMA, which scans it and flags potential community standards violations. You can configure rules in HMA to automatically take actions (such as enqueueing to a review system) when these potential violations are flagged.

HMA can pull in data from Facebook's ThreatExchange API (if you are a member) and use that to flag content.

Why would I want HMA?

You might want HMA to scan your platform for content that may violate your terms of service, based on a known list of violating content. It is also a good fit if you already participate in ThreatExchange and don't want to invest in writing integrations from scratch.

What kinds of capabilities does HMA have?

As of now, only PDQ image similarity for photos is supported. PDQ is a "copy detection" algorithm. It can help find copies, or images that a human would say are the same. It can't find new content the way machine learning can. Over time, HMA may gain more functionality.
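To make "copy detection" concrete, here is a minimal sketch of how two PDQ hashes can be compared. It is not HMA's own matching code. It assumes that hashes are exchanged as 64-character hex strings (256 bits) and that a Hamming distance of 31 or less counts as a match; both the representation and the threshold are assumptions you should check against your own data. The function names are made up for this example.

```python
# Sketch: comparing two PDQ hashes by Hamming distance.
# Assumptions (not stated on this page): hashes are 64-character hex
# strings (256 bits), and 31 is used as the match threshold.

PDQ_HEX_LENGTH = 64   # 256 bits / 4 bits per hex digit
MATCH_THRESHOLD = 31  # assumed default; tune for your own needs


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two hex-encoded hashes."""
    if len(hash_a) != PDQ_HEX_LENGTH or len(hash_b) != PDQ_HEX_LENGTH:
        raise ValueError("expected 64-character hex strings")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_match(hash_a: str, hash_b: str, threshold: int = MATCH_THRESHOLD) -> bool:
    """Treat two hashes as the same image if few enough bits differ."""
    return hamming_distance(hash_a, hash_b) <= threshold
```

A small distance means the two images are likely copies of each other (for example after resizing or recompression). A large distance says nothing about whether the content is violating, which is why this approach cannot find new content.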

  • ✅ Ready
  • 🚧 In development or planned 2021
  • 📋 Planned / Long Term

Content Type | Matching Capability
Photos       | ✅ PDQ, 📋 PDQ+OCR
Videos       | 🚧 MD5, 📋 TMK+PDQF
Text         | 📋 Hamming, 📋 TBD Hashing Algorithm
URL          | TBD

Where is the data hosted?

You run your own instance of HMA and have control of the content you evaluate. As a result, you also pay the hosting costs. If someone else runs an instance and lets you call it, then they host the data.

HMA can download matching signals from APIs hosted by someone else.

How does HMA use external APIs?

If you configure it to, HMA will connect to external APIs (like ThreatExchange) to get signals and hashes to compare against.

HMA does not share any data that you do not explicitly share by configuring it to do so. No metrics, no telemetry, etc. You can configure it to give feedback on signals that others have hosted (SEEN, true/false positive reporting), but it won't do so if you don't configure it to.

Can I use HMA without connecting to external APIs?

Right now HMA can only match against collections of signals stored in ThreatExchange, but you can use ThreatExchange's privacy controls to only share those signals with yourself. This isn't quite ideal if you want to keep everything inside of your platform, and we are aiming to provide local-only support eventually.

How long does it take to start using HMA?

You can get a test deployment up in roughly an hour, especially if you are already familiar with tools such as Terraform.

The time to fully integrate into your infrastructure might require:

  1. Setting up any custom AWS environment configuration you need (VPCs, routing, access controls, SSO).
  2. Adding a hook in your content flow to trigger an API call for HMA to evaluate content.
  3. Adding an endpoint in your admin tooling to receive callbacks from HMA to react to content it has flagged.
  4. Setting configuration in HMA to download the right datasets and route matches.
  5. Running some kind of experiment to slowly turn up traffic into HMA and judge the performance of the results.
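
As one way to approach the last step, here is a minimal sketch of a deterministic percentage rollout that you could place in the content-flow hook. It is not part of HMA. The function name, the idea of keying on a content ID, and the 0.01% bucket granularity are all assumptions for illustration; the actual call that submits content to HMA is left out.

```python
import hashlib


def should_send_to_hma(content_id: str, rollout_percent: float) -> bool:
    """Deterministically pick a stable slice of content to send to HMA.

    Hypothetical helper: the same content_id always lands in the same
    bucket, so raising rollout_percent only ever adds content.
    """
    digest = hashlib.sha256(content_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # rollout_percent=1.5 selects buckets 0..149, i.e. 1.5% of content.
    return bucket < rollout_percent * 100
```

Starting at a small percentage and raising it in steps lets you watch match quality and AWS charges before all traffic flows through HMA.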

A well-motivated engineer with access to all the resources they would need might take 1-2 weeks to do the above.

What scale can HMA run at?

We have a target of processing 4k images/sec. In practice, we can currently hit 1.3k images/sec. The bottleneck is the hashing component: if you move the hashing to occur inside your own infrastructure, the "MA" version of HMA can hit 4k+ images/sec.
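
To relate these rates to your own volume, the sketch below converts a sustained images/sec figure into images per day and checks whether a daily volume fits. The `headroom` factor (using at most half of sustained capacity, to leave room for traffic peaks) is an assumption for illustration, not an HMA recommendation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def images_per_day(images_per_second: float) -> int:
    """Images processed in one day at a constant sustained rate."""
    return int(images_per_second * SECONDS_PER_DAY)


def fits_within(daily_volume: int, images_per_second: float,
                headroom: float = 0.5) -> bool:
    """True if daily_volume uses at most `headroom` of daily capacity.

    headroom is an assumed safety margin; real traffic is rarely flat.
    """
    return daily_volume <= images_per_day(images_per_second) * headroom
```

At 1.3k images/sec this works out to about 112 million images per day, and at 4k images/sec to about 346 million.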

How expensive is it to run HMA?

HMA is built on AWS Lambda. If it's getting no traffic, it's almost-but-not-quite-free to run (queue polling events currently lead to ~$1 of charges a month in our testing, but please monitor and use limits in your setup).

As of last benchmarking in March 2021, the cost for 1MB images was 1 cent per 1000 images. Computing hashes in your own infrastructure can reduce the cost, as hashing is the most expensive component.
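
As a rough worked example of these figures, the sketch below estimates a monthly bill from a daily image count. It assumes the March 2021 benchmark rate still holds, that your images are around 1MB, and that the roughly $1/month idle charge is simply added on top; actual AWS charges may differ, so treat the output as an order-of-magnitude estimate.

```python
COST_PER_1000_IMAGES_USD = 0.01  # March 2021 benchmark, 1MB images
IDLE_COST_PER_MONTH_USD = 1.00   # approximate queue-polling charges


def estimate_monthly_cost_usd(images_per_day: int, days: int = 30) -> float:
    """Estimate monthly cost in USD from a constant daily image volume.

    Assumption: the idle polling charge is added to the per-image
    processing cost rather than being absorbed by it.
    """
    processing = images_per_day * days / 1000 * COST_PER_1000_IMAGES_USD
    return round(processing + IDLE_COST_PER_MONTH_USD, 2)
```

For example, 1,000,000 images per day over 30 days comes to about $300 in processing charges, plus the idle charge.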

I already have some integrity tools, and don't need all the bells and whistles.

You can use only the subsets of HMA that you need. Check out "What are the Different Ways that I Can Use HMA?"

If AWS itself is a dealbreaker, you can try adapting the code to work with other clouds, or just use the underlying libraries, which live in this same repo.
