
Reddit News Homophily Dataset

@corradomonti corradomonti released this 13 Feb 19:04

This data set is described in the paper: Corrado Monti, Jacopo D’Ignazi, Michele Starnini, and Gianmarco De Francisci Morales. 2023. "Evidence of Demographic rather than Ideological Segregation in News Discussion on Reddit." In Proceedings of the ACM Web Conference 2023 (WWW ’23).

For each year from 2016 to 2020 (inclusive), the data set contains the following three CSV files.
Each username is consistently replaced with an anonymized string.

  • YEAR_news_authors.csv: for each Reddit user included in the analysis (non-bot users with at least 25 messages on r/news and at least one submission in 5 different subreddits in that year), this file reports their anonymized username and their score on the age, gender, partisan, and affluence axes. Scores are quantile-normalized, so that, e.g., a score of 0.25 indicates the 25th percentile. The axes correspond to the probability of being young (low) or old (high) for age, male or female for gender, left-leaning or right-leaning for partisan, and poor or rich for affluence.

  • YEAR_news_graph.csv: each line corresponds to a comment on r/news in that year. The file lists an anonymized id for the submission under which the comment appears, the author of the comment, the author of the parent comment to which this comment replies, and the sentiment of the text of the interaction. This can be seen as a weighted graph among users.

  • YEAR_news_submissions.csv: each line corresponds to a submission on r/news, including the anonymized id of the submission, the anonymized username of its author, the total number of comments it received, and the topic of the submission.
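
As a minimal sketch of how the rows of YEAR_news_graph.csv can be read as a weighted graph among users, the snippet below aggregates comments into directed edges from the comment author to the parent author. The column names (`submission_id`, `author`, `parent_author`, `sentiment`), the numeric sentiment values, and the sample rows are all assumptions made for illustration; check the header of the actual file before reusing it.

```python
import csv
import io
from collections import defaultdict

# Tiny synthetic stand-in for YEAR_news_graph.csv.
# Column names and values are hypothetical, not the dataset's documented header.
SAMPLE = """submission_id,author,parent_author,sentiment
s1,u_a,u_b,0.5
s1,u_b,u_a,-0.2
s2,u_a,u_b,0.1
"""


def build_weighted_graph(csv_file):
    """Aggregate comment rows into directed edges (author -> parent author)."""
    counts = defaultdict(int)
    sentiment_sum = defaultdict(float)
    for row in csv.DictReader(csv_file):
        edge = (row["author"], row["parent_author"])
        counts[edge] += 1
        sentiment_sum[edge] += float(row["sentiment"])  # assumes numeric sentiment
    return {
        edge: {"n_comments": n, "mean_sentiment": sentiment_sum[edge] / n}
        for edge, n in counts.items()
    }


graph = build_weighted_graph(io.StringIO(SAMPLE))
for (src, dst), attrs in sorted(graph.items()):
    print(src, "->", dst, attrs)
```

To run this on a real file, pass an open file handle (for example `open("2016_news_graph.csv", newline="")`) instead of the `io.StringIO` sample. Using the comment count or the mean sentiment as the edge weight is one possible choice among several.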

See the paper for more details about how we extracted this information.
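
As a rough illustration of what a quantile-normalized score in YEAR_news_authors.csv means, the sketch below maps raw values to their empirical quantile, so that 0.25 corresponds to the 25th percentile. This is a generic empirical-quantile computation under our own assumptions, not necessarily the exact procedure used in the paper; the function name and the raw values are hypothetical.

```python
import bisect


def quantile_normalize(values):
    """Map each value to the fraction of values less than or equal to it."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    return [bisect.bisect_right(sorted_vals, v) / n for v in values]


# Hypothetical raw scores on one axis for four users.
raw_scores = [3.0, -1.0, 0.5, 2.0]
normalized = quantile_normalize(raw_scores)
print(normalized)
```

With this definition, the user with the lowest raw score among four gets 0.25 and the user with the highest gets 1.0.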
The total number of users (nodes) and comments (edges) considered per year is:

| Year | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|
| N. nodes | 27976 | 34060 | 31997 | 21225 | 29045 |
| N. edges | 1166076 | 1390243 | 1221779 | 793569 | 1067614 |
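
For a quick sense of scale, the counts above can be turned into a rough number of comments per user for each year. This is a small sketch that simply divides edges by nodes; it assumes each edge is one comment and each node is one included user, and the variable names are our own.

```python
# Values copied from the per-year counts listed above.
nodes = {2016: 27976, 2017: 34060, 2018: 31997, 2019: 21225, 2020: 29045}
edges = {2016: 1166076, 2017: 1390243, 2018: 1221779, 2019: 793569, 2020: 1067614}

# Rough average number of comments (edges) per user (node) in each year.
comments_per_user = {year: edges[year] / nodes[year] for year in nodes}

for year, ratio in sorted(comments_per_user.items()):
    print(f"{year}: {ratio:.1f} comments per user")
```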