scrape_pdf/README.md at master · sooshie/scrape_pdf · GitHub

A basic script based that uses PDFMiner to decompress streams, and then looks inside the streams

Currently it attempts to pull out IPs, hashes, URLs, and hostnames.

Requires:

pip install dnspython
grab uniaccept from here
pip install pdfminer

Then after you've done that, you'll likely want to get the newest TLD list.
Open a Python interpreter then:

import uniaccept
uniaccept.refreshtlddb("/tmp/tld-list.txt")

Feel free to change the location of the tld-list.txt file, the scrape-pdf.py script expects it in the CWD.