The goal of this assignment is to understand the concept of bias by exploring data on Wikipedia articles. We look specifically at articles on political figures from various countries. This data is combined with a dataset of country populations and with ORES, a machine learning service used to estimate the quality of each article.
With this combined data, we conduct the following analysis:
- Find the countries with the greatest and least coverage of politicians on Wikipedia relative to population
- Find the countries with the highest and lowest proportion of high-quality articles
- Rank geographic regions by articles per person and by proportion of high-quality articles
Data acquisition, processing and analysis steps are all recorded in the IPython notebook hcds-a2-bias.ipynb.
.
├── README.md
├── clean
│   ├── articles_no_ratings.csv
│   ├── wp_wpds_countries-no_match.csv
│   └── wp_wpds_politicians_by_country.csv
├── hcds-a2-bias.ipynb
└── raw
    ├── WPDS_2018_data.csv
    ├── WPDS_2018_data_continents.csv
    ├── ores_data.csv
    └── page_data.csv
The data is obtained from multiple sources as listed below.
- Wikipedia politicians by country dataset is stored at raw/page_data.csv
The Wikipedia article data is available at https://figshare.com/articles/Untitled_Item/5513449. It contains articles on political figures by country, in a file named page_data.csv. The data is licensed under CC-BY 4.0.
Column | Description |
---|---|
page | Title of the Wikipedia article |
country | country of origin |
rev_id | revision id of the article |
- Population data is stored at raw/WPDS_2018_data.csv
The population data contains the populations of 207 countries as of 2018. It was provided to us as part of the assignment and comes from the World Population Data Sheet published by the Population Reference Bureau. No license is attributed to this data, so by default all rights are reserved by its owners.
Column | Description |
---|---|
Geography | Country or region |
Population mid-2018 (millions) | population in millions |
- Population data with continents is stored at raw/WPDS_2018_data_continents.csv
This dataset is an extension of the previous one: a new continent column was added, with the tagging done manually in Excel.
Column | Description |
---|---|
Geography | Country or region |
Population mid-2018 (millions) | population in millions |
continent | region the country belongs to |
- ORES ratings
The Objective Revision Evaluation Service (ORES) is used to estimate the quality of an article. The documentation for the ORES API can be found at https://www.mediawiki.org/wiki/ORES. We make RESTful API calls to this service to obtain the quality of each Wikipedia article and store the results in raw/ores_data.csv.
Column | Description |
---|---|
rev_id | revision id of the Wikipedia article |
ratings | ratings as defined below |
The ratings of the articles are listed below, from highest quality to lowest:
FA - Featured article
GA - Good article
B - B-class article
C - C-class article
Start - Start-class article
Stub - Stub-class article
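The ORES lookup described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the endpoint URL, the `wp10` model name and the response layout are assumptions based on the ORES v3 API and may have changed, and the helper names are hypothetical. It uses `urllib` from the standard library instead of `requests` so that it runs without extra dependencies.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint and model name for the ORES v3 scoring API; check the
# ORES documentation before relying on either of them.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"
ORES_MODEL = "wp10"


def build_ores_url(rev_ids):
    """Build a scoring URL for one batch of revision ids."""
    query = urllib.parse.urlencode({
        "models": ORES_MODEL,
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return ORES_ENDPOINT + "?" + query


def parse_ores_response(payload):
    """Map rev_id -> predicted class, or None when no score was returned."""
    scores = payload.get("enwiki", {}).get("scores", {})
    ratings = {}
    for rev_id, models in scores.items():
        score = models.get(ORES_MODEL, {}).get("score")
        ratings[int(rev_id)] = score["prediction"] if score else None
    return ratings


def fetch_ratings(rev_ids, user_agent="hcds-a2-bias (hypothetical contact)"):
    """Fetch predictions for one batch of revision ids (network call)."""
    request = urllib.request.Request(
        build_ores_url(rev_ids), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return parse_ores_response(json.load(response))
```

Revisions for which the service returns no score map to `None` here, so they can be written out separately rather than silently dropped.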
As part of the data processing step we perform the following operations:
- Merge the ORES ratings with the Wikipedia articles dataset
- Combine the rated articles with the country population data
- Clean up the merged data
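The processing steps above can be sketched as a pair of joins. This is a simplified illustration under stated assumptions: it uses plain Python dictionaries rather than pandas, the function name `merge_datasets` is hypothetical, and matching countries by lowercased name is an assumption about how the clean-up was done.

```python
def merge_datasets(pages, ratings, populations):
    """Join articles with ORES ratings and country populations.

    pages:       list of dicts with "page", "country" and "rev_id"
    ratings:     dict mapping rev_id -> ORES rating
    populations: dict mapping country name -> population in millions
    Returns (merged_rows, unrated_pages, unmatched_countries).
    """
    # Normalise country names so that "Tuvalu" and "tuvalu " match.
    pop_by_country = {
        name.strip().lower(): millions for name, millions in populations.items()
    }
    merged, unrated, unmatched = [], [], set()
    for page in pages:
        rating = ratings.get(page["rev_id"])
        if rating is None:
            # No ORES rating for this revision: keep it aside.
            unrated.append(page)
            continue
        country = page["country"].strip().lower()
        if country not in pop_by_country:
            # Country has no population entry: record it as a non-match.
            unmatched.add(country)
            continue
        merged.append({
            "country": country,
            "article_name": page["page"],
            "revision_id": page["rev_id"],
            "article_quality": rating,
            # Convert millions to an absolute head count.
            "population": round(pop_by_country[country] * 1_000_000),
        })
    return merged, unrated, sorted(unmatched)
```

Keeping the unrated pages and unmatched countries as separate return values makes it easy to inspect what was excluded from the final dataset.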
The final dataset is stored at clean/wp_wpds_politicians_by_country.csv
Column | Description |
---|---|
country | country name |
article_name | article name |
revision_id | revision id of the article |
article_quality | quality score obtained from ORES API |
population | The population of the country. |
We perform the following analysis and report the results:
- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
index | country | population | total_articles | coverage (%) |
---|---|---|---|---|
166 | tuvalu | 10000 | 54 | 0.540000 |
115 | nauru | 10000 | 52 | 0.520000 |
135 | san marino | 30000 | 81 | 0.270000 |
108 | monaco | 40000 | 40 | 0.100000 |
93 | liechtenstein | 40000 | 28 | 0.070000 |
161 | tonga | 100000 | 63 | 0.063000 |
103 | marshall islands | 60000 | 37 | 0.061667 |
68 | iceland | 400000 | 201 | 0.050250 |
3 | andorra | 80000 | 34 | 0.042500 |
61 | grenada | 100000 | 36 | 0.036000 |
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
index | country | population | total_articles | coverage (%) |
---|---|---|---|---|
69 | india | 1371300000 | 978 | 0.000071 |
70 | indonesia | 265200000 | 209 | 0.000079 |
34 | china | 1393800000 | 1126 | 0.000081 |
173 | uzbekistan | 32900000 | 28 | 0.000085 |
51 | ethiopia | 107500000 | 101 | 0.000094 |
82 | korea, north | 25600000 | 35 | 0.000137 |
178 | zambia | 17700000 | 25 | 0.000141 |
159 | thailand | 66200000 | 112 | 0.000169 |
112 | mozambique | 30500000 | 58 | 0.000190 |
13 | bangladesh | 166400000 | 319 | 0.000192 |
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
index | country | total_articles | high_quality_articles | ratio (%) |
---|---|---|---|---|
82 | korea, north | 35 | 6 | 17.142857 |
104 | mauritania | 48 | 6 | 12.500000 |
31 | central african republic | 66 | 8 | 12.121212 |
132 | romania | 336 | 39 | 11.607143 |
137 | saudi arabia | 116 | 13 | 11.206897 |
166 | tuvalu | 54 | 5 | 9.259259 |
19 | bhutan | 33 | 3 | 9.090909 |
44 | dominica | 12 | 1 | 8.333333 |
155 | syria | 128 | 10 | 7.812500 |
18 | benin | 91 | 7 | 7.692308 |
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
index | country | total_articles | high_quality_articles | ratio (%) |
---|---|---|---|---|
143 | slovakia | 116 | 0 | 0.0 |
30 | cape verde | 37 | 0 | 0.0 |
112 | mozambique | 58 | 0 | 0.0 |
38 | costa rica | 147 | 0 | 0.0 |
108 | monaco | 40 | 0 | 0.0 |
43 | djibouti | 37 | 0 | 0.0 |
107 | moldova | 423 | 0 | 0.0 |
167 | uganda | 184 | 0 | 0.0 |
49 | eritrea | 16 | 0 | 0.0 |
50 | estonia | 149 | 0 | 0.0 |
- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
continent | total_articles | population | coverage (articles per person) |
---|---|---|---|
OCEANIA | 3119 | 3.978000e+07 | 0.000078 |
EUROPE | 15829 | 7.345900e+08 | 0.000022 |
LATIN AMERICA AND THE CARIBBEAN | 5166 | 6.282700e+08 | 0.000008 |
AFRICA | 6839 | 1.172400e+09 | 0.000006 |
NORTHERN AMERICA | 1913 | 3.652000e+08 | 0.000005 |
ASIA | 11506 | 4.513100e+09 | 0.000003 |
- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
continent | high_quality_articles | total_articles | ratio (fraction) |
---|---|---|---|
NORTHERN AMERICA | 99 | 1913 | 0.051751 |
ASIA | 305 | 11506 | 0.026508 |
OCEANIA | 64 | 3119 | 0.020519 |
EUROPE | 316 | 15829 | 0.019963 |
AFRICA | 124 | 6839 | 0.018131 |
LATIN AMERICA AND THE CARIBBEAN | 69 | 5166 | 0.013357 |
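The two metrics reported in the tables above can be computed with a small aggregation. The sketch below is illustrative rather than the notebook's code: it uses plain Python instead of pandas, and the names `summarize` and `rank` are hypothetical. It returns plain fractions; the per-country tables appear to report the same values multiplied by 100.

```python
from collections import defaultdict

# Ratings counted as high quality in this analysis.
HIGH_QUALITY = {"FA", "GA"}


def summarize(rows, key, population_of):
    """Aggregate article counts per group.

    rows:          dicts containing `key` (for example "country" or
                   "continent") and "article_quality"
    population_of: dict mapping group -> population (absolute head count)
    """
    totals = defaultdict(int)
    high = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row["article_quality"] in HIGH_QUALITY:
            high[group] += 1
    summary = {}
    for group, total in totals.items():
        summary[group] = {
            "total_articles": total,
            "high_quality_articles": high[group],
            # Articles per person.
            "coverage": total / population_of[group],
            # Share of articles rated FA or GA.
            "ratio": high[group] / total,
        }
    return summary


def rank(summary, metric, n=10, descending=True):
    """Return the n groups with the highest (or lowest) value of `metric`."""
    ordered = sorted(summary, key=lambda g: summary[g][metric], reverse=descending)
    return ordered[:n]
```

Calling `summarize` with `key="country"` gives the per-country figures, and calling it with `key="continent"` and regional population totals gives the regional rankings.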
To perform the above steps, the following Python libraries were used:
- requests - To perform API requests
- pandas - Python data processing library
- numpy - scientific computing package
At first glance, some of the titles in page_data.csv are clearly not politicians. Examples include "Information Minister of the Palestinian Nation", "Finance Minister of ..", and "List of politicians in Poland". We are trying to analyze the number of articles on 'political figures', but we currently do not filter out such titles. This can affect every step of the downstream analysis.
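One way such titles could be flagged is a simple keyword check. This is only a sketch of a possible filter that the current analysis does not apply: the patterns are hypothetical, chosen from the examples above, and are neither exhaustive nor free of false positives.

```python
import re

# Hypothetical keyword patterns based on the example titles; illustrative only.
NON_PERSON_PATTERNS = [
    re.compile(r"^list of\b", re.IGNORECASE),
    re.compile(r"\bminister of\b", re.IGNORECASE),
]


def looks_like_non_person(title):
    """Return True when a title matches one of the non-person patterns."""
    return any(pattern.search(title) for pattern in NON_PERSON_PATTERNS)
```

Titles flagged this way would still need manual review before being removed, since a keyword match alone does not prove that an article is not about a person.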
Since the analysis is performed only on English Wikipedia articles, we do not account for articles written in each country's native language, which biases the analysis. I would expect to find more articles from English-speaking countries, and I would expect those articles to be of higher quality, since English is the first language there. Non-English-speaking countries might find their politician articles to be of higher quality in their native language. Not accounting for articles written in other languages might therefore introduce a sampling bias.
2. What potential sources of bias did you discover in the course of your data processing and analysis?
Looking at the top 10 countries by relative quality, we see North Korea at the top of the list. This result is quite suspect, since North Korea has a bad reputation in the media and its government is generally considered oppressive. It is also not surprising that the countries with the smallest populations have the highest coverage, since even a few dozen articles are large relative to a population in the tens of thousands. With so few articles, a single high-quality article can also lift a country in the relative-quality ranking, as with Dominica (1 of 12 articles).
The most populous countries, such as India, Indonesia and China, have the least coverage relative to population. This makes me question whether the metric we are using is the right one. Population might not be a good basis for calculating coverage: a country with twice the population does not necessarily have twice as many politicians or twice as many English Wikipedia articles. Also, the scale of populations (millions and billions) is not comparable to the number of articles (hundreds and thousands). Article counts depend on many other factors, as described in the next answer.
3. How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you created?
The data in the analysis can be enriched by adding more context to each article, such as who the author is and where the author resides. It might be helpful to know whether the person writing the article is actually a citizen of the country.
The output of articles from a nation depends on other factors, such as the literacy rate, access to the Internet and Wikipedia, and the censorship laws in that country. When treating Wikipedia as the dataset, one should also consider articles written in other languages. Having such additional information can help us address bias and make sure we are not reporting false information.
Assignment Instructions Citation: Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset.