This data set is described in the paper: Corrado Monti, Jacopo D’Ignazi, Michele Starnini, and Gianmarco De Francisci Morales. 2023. "Evidence of Demographic rather than Ideological Segregation in News Discussion on Reddit." In Proceedings of the ACM Web Conference 2023 (WWW ’23).
For each year from 2016 to 2020 (inclusive), the data set contains the following three CSV files.
Each username is consistently replaced with an anonymized string.
- YEAR_news_authors.csv: for each Reddit user included in the analysis (non-bot users with at least 25 messages on r/news and at least one submission in 5 different subreddits in that year), this file reports their anonymized username and their scores on the age, gender, partisan, and affluence axes. Scores are quantile-normalized, so that, e.g., a score of 0.25 indicates the 25th percentile. The axes respectively correspond to the probability of being young (low) or old (high), male or female, left-leaning or right-leaning, and poor or rich.
- YEAR_news_graph.csv: each line corresponds to a comment on r/news in that year. The file lists an anonymized id for the submission under which the comment appears, the author of the comment, the author of the parent comment to which this comment replies, and the sentiment of the text of the interaction. This can be seen as a weighted graph among users.
- YEAR_news_submissions.csv: each line corresponds to a submission on r/news, including the anonymized id of the submission, the username of its author, the total number of comments received, and the topic of the submission.
See the paper for more details about how we extracted this information.
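The reading of YEAR_news_graph.csv as a weighted graph among users can be sketched as follows. This is a minimal illustration, not the procedure used in the paper. The column names `author`, `parent_author`, and `sentiment` are hypothetical placeholders, and the sketch assumes that the sentiment column is numeric; check the header row of the actual files and adjust the constants accordingly.

```python
import csv
from collections import defaultdict

# Hypothetical column names: replace them with the ones in the real header row.
AUTHOR_COL = "author"
PARENT_COL = "parent_author"
SENTIMENT_COL = "sentiment"


def build_weighted_edges(rows):
    """Aggregate comment rows into directed user-to-user edges.

    Each row is a mapping (e.g. from csv.DictReader). Rows with an empty
    parent author or sentiment are skipped, which is an assumption about
    how missing values might appear.
    Returns {(author, parent_author): (n_comments, mean_sentiment)}.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in rows:
        if not row.get(PARENT_COL) or not row.get(SENTIMENT_COL):
            continue
        edge = (row[AUTHOR_COL], row[PARENT_COL])
        counts[edge] += 1
        totals[edge] += float(row[SENTIMENT_COL])
    return {edge: (counts[edge], totals[edge] / counts[edge]) for edge in counts}


def load_graph(path):
    """Read one YEAR_news_graph.csv file and return its weighted edges."""
    with open(path, newline="", encoding="utf-8") as handle:
        return build_weighted_edges(csv.DictReader(handle))
```

Calling `load_graph("2016_news_graph.csv")` would return one entry per ordered pair of users, with the number of replies and their mean sentiment as two possible edge weights. Because usernames are anonymized consistently, the same keys can be looked up in YEAR_news_authors.csv to attach the demographic scores to each node.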
The total number of considered users (nodes) and comments (edges) per year is:
Year | 2016 | 2017 | 2018 | 2019 | 2020 |
---|---|---|---|---|---|
N. nodes | 27976 | 34060 | 31997 | 21225 | 29045 |
N. edges | 1166076 | 1390243 | 1221779 | 793569 | 1067614 |
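
As a quick sanity check on the table, the average number of comments per considered user can be computed for each year. The snippet below is only a sketch: it copies the counts from the table and assumes that each edge is one comment row and each node is one user.

```python
# Counts copied from the table above.
nodes = {2016: 27976, 2017: 34060, 2018: 31997, 2019: 21225, 2020: 29045}
edges = {2016: 1166076, 2017: 1390243, 2018: 1221779, 2019: 793569, 2020: 1067614}


def comments_per_user(year):
    """Average number of comments (edges) per considered user (node)."""
    return edges[year] / nodes[year]


for year in sorted(nodes):
    print(year, round(comments_per_user(year), 1))
```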