
Identifying Disinformation Websites Using Infrastructure Features

This repository contains the source code and training data for the academic paper, available here.

Structure

  • bin - Contains all entry points for the system

  • system - All source code for the system, including fetching and classifying data

  • analysis - All code to analyze classification performance

  • disinfo_train.tar.gz - Compressed .sql dump of our training-data database

Installation

Steps to Develop on disinfo-infra

  1. Create a Python virtual environment (optional) to avoid conflicts with locally installed packages: python3 -m venv name-of-environment
  2. Activate the virtual environment to install packages in an isolated environment: source name-of-environment/bin/activate
  3. Install the required dependencies for disinfo-infra via pip: pip install -r requirements.txt
  4. Set up the development install via setup.py: python setup.py develop. This allows changes to be picked up while developing without reinstalling on each change
  5. Develop!
  6. To deactivate the virtual environment: deactivate

Steps to install and run the code (NOT FOR DEVELOPMENT)

  1. Download source
  2. pip install -r requirements.txt
  3. python setup.py install

Entry Points

disinfo_net_data_fetch.py - continually fetches new domains and raw data for those domains. Implemented domain pipes include Reddit, Twitter, Certstream, and DomainTools.

disinfo_net_train_classifier.py - script that trains the classifier from designated training data.

disinfo_net_classify.py - script to classify the raw data fetched by disinfo_net_data_fetch.py. It extracts features, classifies websites, and inserts the results into a database table named by the user. It can run in "live" mode, constantly classifying new domains as they are fetched by disinfo_net_data_fetch.py, or it can classify an entire database of candidate domains at once.
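
For illustration only, here is a minimal sketch of the two modes; the callables fetch, classify, and insert are hypothetical placeholders for the repository's real feature-extraction, model, and Postgres helpers:

```python
# A sketch of the two classification modes; fetch, classify, and insert are
# hypothetical callables, not the repository's actual API.
import time
from typing import Callable, Iterable

def classify_batch(fetch: Callable[[], Iterable[dict]],
                   classify: Callable[[dict], dict],
                   insert: Callable[[dict], None]) -> None:
    """Classify every candidate domain currently in the raw-data table."""
    for raw_row in fetch():        # raw certificate/whois/html/dns records
        insert(classify(raw_row))  # extract features, label, store result

def classify_live(fetch, classify, insert, poll_seconds: int = 30) -> None:
    """'Live' mode: repeatedly classify new domains as the fetcher adds them."""
    while True:
        classify_batch(fetch, classify, insert)
        time.sleep(poll_seconds)   # wait for disinfo_net_data_fetch.py to insert more
```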

System Structure

  1. Orchestrate - contains a conductor class that handles thread creation for domain pipes and worker threads, a worker thread class that fetches raw data for a domain, and a classification thread class that extracts features from raw data and classifies the domain.

  2. Classify - classes and functions for training the classifier, extracting features, and classifying candidate domains.

  3. Features - classes to both fetch raw data and extract features from that raw data.

  4. Pipe - contains an abstract base class (ABC) for domain pipes that defines a standard interface for what the system expects when a domain is processed; current implementations of this ABC include the Reddit, Twitter, Certstream, and DomainTools domain pipes (see the sketch after this list).

  5. Postgres - classes to interact with a Postgres database, including inserting, checking, and retrieving data.

  6. util - various utility classes, including classes to unshorten URLs, extract TLDs, and determine ownership of IP addresses.
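
For illustration, a hypothetical sketch of what such a domain-pipe ABC might look like; the class and method names here are assumptions, not the repository's exact interface:

```python
# A hypothetical sketch of the domain-pipe ABC; names are illustrative,
# not the repository's exact interface.
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class DomainPipe(ABC):
    """Standard interface each domain source implements so the conductor
    can treat Reddit, Twitter, Certstream, and DomainTools pipes alike."""

    platform: str  # value recorded in the database's platform column

    @abstractmethod
    def domains(self) -> Iterator[Tuple[str, str]]:
        """Yield (domain, post_id) pairs as they are discovered."""

class RedditPipe(DomainPipe):
    platform = "reddit"

    def domains(self) -> Iterator[Tuple[str, str]]:
        # A real implementation would stream posts from Reddit and extract
        # outbound links; a fixed pair is yielded here for illustration.
        yield ("example.com", "t3_abc123")
```

The point of the ABC is that the conductor can spawn one thread per pipe without caring which platform the domains come from.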

Database Entries

Our system works in two parts. The first, the data-fetching script, inserts raw data into a database table structured as follows:

 Attribute (Type) {
 domain (Text) (Primary Key),
 certificate (Text),
 whois (Text),
 html (Text),
 dns (Text),
 post_id (Text),
 platform (Text),
 insertion_time (UTC)
}

Where each attribute is:

  • domain - the unique domain with which the rest of the data is associated
  • certificate - the certificate, in raw string format, of the domain
  • whois - the whois response in raw string format
  • html - the raw HTML source of the homepage of the domain
  • dns - the IP address(es) that the domain was found to map to
  • post_id - the unique ID of the post on the given platform (for verification purposes)
  • platform - the platform on which the domain was posted
  • insertion_time - time of insertion into the database
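
As a sketch of how a row might be inserted, assuming conn is an open psycopg2 connection and raw_data is a hypothetical name for the table above:

```python
# A sketch of inserting one fetched record; assumes conn is an open psycopg2
# connection and raw_data is a hypothetical table name matching the schema above.
def insert_raw_row(conn, row: dict) -> None:
    """Insert a single domain's raw data, skipping domains already present."""
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO raw_data
                   (domain, certificate, whois, html, dns,
                    post_id, platform, insertion_time)
               VALUES (%s, %s, %s, %s, %s, %s, %s, NOW())
               ON CONFLICT (domain) DO NOTHING""",  # domain is the primary key
            (row["domain"], row["certificate"], row["whois"],
             row["html"], row["dns"], row["post_id"], row["platform"]),
        )
    conn.commit()
```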

The second part of our system, which classifies a domain given its raw data, inserts those classifications into a database table structured as follows:

 Attribute (Type) {
 domain (Text) (Primary Key),
 classification (Text) (one of: news, non_news, disinformation),
 probabilities (JSON),
 insertion_time (UTC)
}

Where each attribute is:

  • domain - the unique domain with which the rest of the data is associated
  • classification - the classification assigned by the classifier, one of: news, non_news, disinformation
  • probabilities - the probability of each class above, in JSON dictionary format
  • insertion_time - time of insertion into the database
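
A sketch of querying this table, again assuming psycopg2; the table name classifications stands in for whatever name the user chose:

```python
# A sketch of querying the results table; assumes conn is an open psycopg2
# connection and "classifications" is the user-chosen table name.
import json

def high_confidence_disinfo(conn, threshold: float = 0.9):
    """Yield domains labeled disinformation with probability >= threshold."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT domain, probabilities FROM classifications "
            "WHERE classification = %s",
            ("disinformation",),
        )
        for domain, probabilities in cur:
            # psycopg2 decodes json/jsonb columns to dicts; handle strings too
            probs = (probabilities if isinstance(probabilities, dict)
                     else json.loads(probabilities))
            if probs.get("disinformation", 0.0) >= threshold:
                yield domain
```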

Finally, we provide a prepopulated training database containing the raw data for all of our training domains, in the following format:

 Attribute (Type) {
 domain (Text) (Primary Key),
 target (Text) (one of: news, non_news, disinformation),
 certificate (Text),
 whois (Text),
 html (Text),
 dns (Text)
}

Where each attribute is the same as in our raw data table, with target being the known label for the domain.
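
A sketch of loading this table for use with disinfo_net_train_classifier.py, assuming pandas and a hypothetical table name train (restore disinfo_train.tar.gz into Postgres first):

```python
# A sketch of loading the restored training data into pandas; the table name
# "train" and the connection parameters are assumptions, not the repository's.
import pandas as pd
import psycopg2

conn = psycopg2.connect(dbname="disinfo_train")
df = pd.read_sql(
    "SELECT domain, target, certificate, whois, html, dns FROM train", conn)

X_raw = df[["certificate", "whois", "html", "dns"]]  # inputs for feature extraction
y = df["target"]  # known labels: news / non_news / disinformation
```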

Using the Chrome extension

Navigate your Chrome browser to chrome://extensions and enable Developer mode in the top right.

Click "Load Unpacked" and upload the contents of the src/plugin/ directory.

For more detail, see the Chrome developer tutorial.

Navigate to any of the sites listed in src/plugin/classified_sites.txt (for example, needtoknow.news) and you will see the warning message.