This repository contains keyword blacklists, along with lists of other content such as URLs and images, that trigger censorship in apps used in China. With the exception of WeChat, these lists were reverse engineered and are the exhaustive lists of keywords used to trigger censorship on these platforms.
The full details on data collection and analysis methods and results are available below.
The research below tracks daily changes to censorship in three different chat apps used in China: TOM-Skype, Sina UC, and LINE. Overall, our chat app data consists of over 4,000 blacklisted keywords.
- Three Researchers, Five Conjectures: An Empirical Analysis of TOM-Skype Censorship and Surveillance
- Chat program censorship and surveillance in China: Tracking TOM-Skype and Sina UC
- Asia Chats: Investigating Regionally-based Keyword Censorship in LINE
Data: TOM-Skype and Sina UC, LINE
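Tracking changes over time amounts to comparing one snapshot of a blacklist with the next. The sketch below is a minimal illustration of that comparison, not the collection code used in the research. It assumes each snapshot is simply a collection of keyword strings; the function name `diff_snapshots` and the placeholder keywords are hypothetical.

```python
from typing import Dict, Iterable, List


def diff_snapshots(previous: Iterable[str], current: Iterable[str]) -> Dict[str, List[str]]:
    """Return the keywords added to and removed from a blacklist between two snapshots."""
    prev_set, curr_set = set(previous), set(current)
    return {
        "added": sorted(curr_set - prev_set),    # in the newer snapshot only
        "removed": sorted(prev_set - curr_set),  # in the older snapshot only
    }


# Placeholder keywords (hypothetical, not taken from the datasets).
day_one = ["keyword_a", "keyword_b"]
day_two = ["keyword_b", "keyword_c"]
print(diff_snapshots(day_one, day_two))
# {'added': ['keyword_c'], 'removed': ['keyword_a']}
```

Set differences ignore ordering and duplicates, which is usually what is wanted when the question is only which keywords entered or left a list.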
The research below tracks hourly changes to censorship in three different live streaming apps in China: YY, Sina Show, and 9158; and documents the keywords censored by GuaGua, which does not include a mechanism for downloading updates to its censorship blacklists. Overall, our live-streaming data consists of over 20,000 blacklisted keywords.
- Every Rose Has Its Thorn: Censorship and Surveillance on Social Video Platforms in China
- Harmonized Histories? A year of fragmented censorship across Chinese live streaming applications
Data: Original live-streaming data (2015), Updated live-streaming data (2017)
Our research on mobile games analyzes domestic Chinese games as well as international games that have been altered to comply with Chinese regulations. Overall, we found hundreds of mobile games performing censorship, collectively censoring over 100,000 unique blacklisted keywords.
Data: Mobile games
This research analyzes Chinese censorship in open source projects. We extracted over 1,000 Chinese keyword blacklists from open source projects on GitHub, collectively spanning over 200,000 unique blacklisted keywords.
Data: Open source blacklists
Our research on WeChat censorship uses sample testing to determine what type of content, such as words, URLs, and images, can be communicated over the platform and which content is censored. We have studied what categorical content WeChat generally filters in addition to what content WeChat filters in response to specific events.
- One App, Two Systems: How WeChat uses one censorship policy in China and another internationally
- We (can’t) Chat: “709 Crackdown” Discussions Blocked on Weibo and WeChat
- Remembering Liu Xiaobo: Analyzing censorship of the death of Liu Xiaobo on WeChat and Weibo
- Managing the Message: What you can’t say about the 19th National Communist Party Congress on WeChat
- (Can’t) Picture This: An Analysis of Image Filtering on WeChat Moments (paper)
Data: Keywords and URLs (November 2016), 709 Crackdown keywords and images (April 2017), Liu Xiaobo keywords and images (July 2017), 19th Party Congress keywords (November 2017), Image filtering test data (May 2018)
Our research measuring Apple's filtering of product engravings uses sample testing to discover keywords that cannot be engraved in each of six different regions: United States, Canada, Japan, Taiwan, Hong Kong, and mainland China.
Data: Keyword filtering rules
On Tencent's QQMail, we discovered that the presence of certain combinations of keywords in an email message triggers its censorship. However, the presence of other combinations, which we call extenuating combinations, deactivates the censorship of some censored keywords.
Data: Censored and extenuating keyword combinations
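The interaction between censored and extenuating combinations can be sketched as follows. This is an illustrative model only, not QQMail's actual implementation. It assumes that a combination counts as present when every one of its keywords appears somewhere in the message as a substring, and that each censored combination carries its own list of extenuating combinations. The names `is_censored` and `combination_present` and the placeholder keywords are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Combination = FrozenSet[str]
# Each rule pairs a censored combination with the extenuating combinations
# assumed to deactivate it.
Rule = Tuple[Combination, List[Combination]]


def combination_present(text: str, combo: Combination) -> bool:
    """A combination is present when all of its keywords occur in the text."""
    return all(keyword in text for keyword in combo)


def is_censored(text: str, rules: List[Rule]) -> bool:
    """Return True if a censored combination is present and not extenuated."""
    for censored, extenuating in rules:
        if not combination_present(text, censored):
            continue
        # An extenuating combination, if present, deactivates this rule.
        if any(combination_present(text, e) for e in extenuating):
            continue
        return True
    return False


# Placeholder keywords (hypothetical, not taken from the dataset).
rules = [(frozenset({"alpha", "beta"}), [frozenset({"gamma"})])]
print(is_censored("alpha and beta", rules))         # True
print(is_censored("alpha only", rules))             # False
print(is_censored("alpha, beta and gamma", rules))  # False
```

The third call shows the extenuating effect: the censored combination is still present, but the additional keyword suppresses the match.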
Datasets include raw keyword lists collected from the applications. Many also include processed data, such as translations and categorizations of keywords. Keywords were translated to English using a combination of machine and human translation. By interpreting these translations alongside contextual information, we coded each keyword into content categories grouped under six general themes according to a codebook.
All data is provided under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International and available in full here and summarized here.