Wisdom of Polarized Crowds

This repo contains the code for the paper

Feng Shi, Misha Teplitskiy, Eamon Duede & James A. Evans, "The wisdom of polarized crowds," Nature Human Behaviour 3, 329–336 (2019).

Project Workflow

  • Identify Wikipedia pages on politics, social issues, and science. See the paper for how these corpora were identified.
  • Calculate political alignment for the editors of the political pages (alignment.ipynb).
  • Use Wikipedia's API to retrieve the total number of edits made to Wikipedia by each editor whose alignment is assessed (user_query.ipynb), and remove editors whose usernames are IP addresses, i.e., anonymous editors (see the API sketch after this list).
  • Download and parse the dump files for Wikipedia's complete snapshot (including all editing histories) as of 2016-12-01. (See Wikipedia dump parsers for details.) In total, 217 dump files are parsed, using the following parsers:
    • attack_parser: extract page title, time, editor name, attack score, and aggression score for each edit. It uses wiki-detox to compute the attack and aggression scores, and relies on revision_differ.py and diff_match_patch.py (both included) from wikihadoop to extract the difference between two snapshots of a page. Each line of the output corresponds to an edit.
    • edit_info_parser: extract page title, time, editor name, and current page length for each edit. Each line of the output corresponds to an edit.
    • policy_parser: extract page title, policy name, and number of mentions of the policy for each page. It uses a pre-compiled list of Wikipedia policies and guidelines (wiki_policies.tsv). Each line of the output corresponds to a (title, policy) combination.
    • quality_parser: assess the quality of each page. It uses wikiclass and revscoring to calculate the quality of each page. Each line of the output corresponds to a page.
    • tf_parser: calculate word frequencies for each page. It relies on the stopword dictionary from NLTK to remove stopwords. Each line of the output corresponds to a (page, word) combination.
    • word_radius_parser: extract page title, number of words, and radius of each page. It relies on the stopword dictionary from NLTK to remove stopwords and uses fastText word embeddings to represent words as numeric vectors; the radius of a page is then defined as the median distance from the page's word vectors to their centroid (see the sketch after this list). Each line of the output corresponds to a page.
  • Each parser generates 217 TSV files containing the information extracted from the dumps. Tar those files into a single tarball; data_wrangling.ipynb combines the files in the tarball and prepares the data for further analysis.
  • The analysis is done in the analysis.ipynb notebook and relies on data generated in all the steps above, saved in the Data folder. To help reproduce the analysis, some of the data files are already included in the Data folder (large files are compressed due to GitHub's file size limits). Three data files, social_issue_pages.tsv, science_pages.tsv, and talk_pages.tsv, are too large to be included.
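The editor-query step above (user_query.ipynb) can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names are invented here, but the MediaWiki API call (action=query, list=users, usprop=editcount) is the standard way to retrieve a registered editor's total edit count, and the ipaddress module is one way to drop IP-address usernames.

```python
# Hedged sketch of the editor-metadata step; helper names are illustrative.
import ipaddress
import requests

API = "https://en.wikipedia.org/w/api.php"

def is_ip_username(username):
    """Return True if the username is an IPv4/IPv6 address (anonymous editor)."""
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False

def total_edit_counts(usernames):
    """Query the MediaWiki API for each registered editor's total edit count."""
    counts = {}
    for name in usernames:
        if is_ip_username(name):
            continue  # skip anonymous (IP) editors
        resp = requests.get(API, params={
            "action": "query",
            "list": "users",
            "ususers": name,
            "usprop": "editcount",
            "format": "json",
        })
        users = resp.json().get("query", {}).get("users", [])
        if users and "editcount" in users[0]:
            counts[name] = users[0]["editcount"]
    return counts
```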

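Similarly, here is a minimal sketch of the radius measure used by word_radius_parser, assuming the fastText embeddings have already been loaded into a dict mapping words to NumPy vectors (the parser's actual input handling and tokenization differ):

```python
# Page "radius": median Euclidean distance from a page's word vectors
# to their centroid. `word_vectors` is assumed to map words to fastText vectors.
import numpy as np
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def page_radius(tokens, word_vectors):
    """Return the median distance from a page's word vectors to their centroid."""
    vectors = np.array([word_vectors[w] for w in tokens
                        if w not in STOPWORDS and w in word_vectors])
    if vectors.size == 0:
        return float("nan")  # no embeddable, non-stopword tokens on the page
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return float(np.median(distances))
```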