A2: Bias in Data

The goal of this assignment is to understand the concept of bias by exploring data on Wikipedia articles. We look in particular at articles on political figures from various countries. This data is combined with a dataset of country populations and with ORES, a machine learning service used to estimate the quality of each article.

With this combined data, we conduct the following analyses:

  • Find the countries with the greatest and least coverage of politicians on Wikipedia relative to population
  • Find the countries with the highest and lowest proportion of high-quality articles
  • Rank geographic regions by articles per person and by proportion of high-quality articles

Data acquisition, processing, and analysis steps are all recorded in the IPython notebook hcds-a2-bias.ipynb.

Directory Structure

.
├── README.md
├── clean
│   ├── articles_no_ratings.csv
│   ├── wp_wpds_countries-no_match.csv
│   └── wp_wpds_politicians_by_country.csv
├── hcds-a2-bias.ipynb
└── raw
    ├── WPDS_2018_data.csv
    ├── WPDS_2018_data_continents.csv
    ├── ores_data.csv
    └── page_data.csv

Data

The data is obtained from multiple sources as listed below.

  1. Wikipedia politicians by country dataset is stored at raw/page_data.csv

The Wikipedia article data is available at https://figshare.com/articles/Untitled_Item/5513449. It contains articles on political figures by country, in a file titled page_data.csv. The data is licensed under the CC-BY 4.0 license.

| Column | Description |
|---|---|
| page | Title of the Wikipedia article |
| country | Country of origin |
| rev_id | Revision id of the article |

  2. Population data is stored at raw/WPDS_2018_data.csv

The population data contains populations for 207 countries as of 2018 and was provided to us as part of the assignment. It is taken from the World Population Data Sheet published by the Population Reference Bureau. No license is attributed to this data, so by default all rights are reserved by its owners.

| Column | Description |
|---|---|
| Geography | Country or region |
| Population mid-2018 (millions) | Population in millions |

  3. Population data with continents is stored at raw/WPDS_2018_data_continents.csv

This dataset is an extension of the previous one. A new column, continent, was added, and the tagging was done manually in Excel.

| Column | Description |
|---|---|
| Geography | Country or region |
| Population mid-2018 (millions) | Population in millions |
| continent | Region the country belongs to |

  4. ORES ratings

The Objective Revision Evaluation Service (ORES) is used to estimate the quality of an article. The documentation for the ORES API can be found at https://www.mediawiki.org/wiki/ORES. We make RESTful API calls to this service to retrieve the predicted quality of each Wikipedia article and store the results in raw/ores_data.csv.
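As a rough illustration of what such a call might look like, the sketch below builds a scoring URL for a batch of revision ids and extracts the predicted class from a response. The endpoint path, the `wp10` model name, and the JSON layout are assumptions about the ORES v3 API and should be checked against the documentation linked above; the helper names `build_ores_url` and `extract_ratings` are hypothetical, and the sample response is hand-written, not real ORES output.

```python
from urllib.parse import urlencode

# Assumed endpoint for the ORES v3 scoring API on English Wikipedia.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"


def build_ores_url(rev_ids, model="wp10"):
    """Build a scoring URL for a batch of revision ids (joined with '|')."""
    query = urlencode({"models": model,
                       "revids": "|".join(str(r) for r in rev_ids)})
    return f"{ORES_ENDPOINT}?{query}"


def extract_ratings(response_json, model="wp10"):
    """Map each rev_id to its predicted class, or None if no score came back."""
    scores = response_json["enwiki"]["scores"]
    ratings = {}
    for rev_id, models in scores.items():
        score = models.get(model, {}).get("score")
        ratings[int(rev_id)] = score["prediction"] if score else None
    return ratings


# Hand-written sample in the assumed response shape.
sample = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

url = build_ores_url([1001, 1002])
ratings = extract_ratings(sample)
print(url)
print(ratings)
```

In practice the URL would be fetched with something like `requests.get(url).json()` and the resulting mapping written out as the rev_id and ratings columns. Revisions that come back without a score map to None here so they can be set aside rather than silently dropped.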

| Column | Description |
|---|---|
| rev_id | rev_id of the Wikipedia article |
| ratings | Ratings as defined below |

The article ratings are listed below, from highest to lowest quality:

FA - Featured article

GA - Good article

B - B-class article

C - C-class article

Start - Start-class article

Stub - Stub-class article

Data Processing

As part of the data processing step, we perform the following operations:

  1. Merge the ORES ratings with the Wikipedia articles dataset
  2. Combine the rated articles with the country population data
  3. Clean up the merged data
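The three steps above can be sketched in plain Python as two lookups per article. This is only an illustration under stated assumptions: the notebook itself uses pandas, the function name `merge_datasets` is hypothetical, and the sample rows are made up. Articles without a rating and countries without a population match are collected separately, which presumably corresponds to clean/articles_no_ratings.csv and clean/wp_wpds_countries-no_match.csv.

```python
def merge_datasets(pages, ratings, populations):
    """Join articles with ORES ratings and country populations.

    pages:       list of dicts with keys page, country, rev_id
    ratings:     dict mapping rev_id -> rating
    populations: dict mapping country -> population in millions
    Returns (merged_rows, unrated_pages, unmatched_countries).
    """
    merged, unrated, unmatched = [], [], set()
    for row in pages:
        rating = ratings.get(row["rev_id"])
        if rating is None:            # step 1: no ORES rating for this revision
            unrated.append(row)
            continue
        population = populations.get(row["country"])
        if population is None:        # step 2: country missing from population data
            unmatched.add(row["country"])
            continue
        merged.append({               # step 3: keep only the cleaned-up columns
            "country": row["country"],
            "article_name": row["page"],
            "revision_id": row["rev_id"],
            "article_quality": rating,
            "population": round(population * 1_000_000),
        })
    return merged, unrated, sorted(unmatched)


# Made-up rows for illustration only.
pages = [
    {"page": "Example Politician A", "country": "tuvalu", "rev_id": 1},
    {"page": "Example Politician B", "country": "tuvalu", "rev_id": 2},
    {"page": "Example Politician C", "country": "atlantis", "rev_id": 3},
]
ratings = {1: "GA", 3: "Stub"}    # rev_id 2 received no rating
populations = {"tuvalu": 0.01}    # in millions, as in the population file

merged, unrated, unmatched = merge_datasets(pages, ratings, populations)
print(merged)
print(unrated)
print(unmatched)
```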

The final data is stored as

Final Dataset

The final dataset is found here.

| Column | Description |
|---|---|
| country | Country name |
| article_name | Article name |
| revision_id | Revision id of the article |
| article_quality | Quality score obtained from the ORES API |
| population | Population of the country |

Data Analysis

We perform the following analysis and report the results:

  1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

| index | country | population | total_articles | coverage (%) |
|---|---|---|---|---|
| 166 | tuvalu | 10000 | 54 | 0.540000 |
| 115 | nauru | 10000 | 52 | 0.520000 |
| 135 | san marino | 30000 | 81 | 0.270000 |
| 108 | monaco | 40000 | 40 | 0.100000 |
| 93 | liechtenstein | 40000 | 28 | 0.070000 |
| 161 | tonga | 100000 | 63 | 0.063000 |
| 103 | marshall islands | 60000 | 37 | 0.061667 |
| 68 | iceland | 400000 | 201 | 0.050250 |
| 3 | andorra | 80000 | 34 | 0.042500 |
| 61 | grenada | 100000 | 36 | 0.036000 |
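The coverage figures in the table above are consistent with articles divided by population, expressed as a percentage (for Tuvalu, 54 / 10000 × 100 = 0.54). A minimal sketch of that calculation follows; the function name `coverage_percent` is hypothetical, and the three rows are copied from the coverage tables in this README.

```python
def coverage_percent(total_articles, population):
    """Politician articles as a percentage of the country's population."""
    return 100.0 * total_articles / population


# (country, population, total_articles), taken from the coverage tables.
rows = [
    ("india", 1_371_300_000, 978),
    ("tuvalu", 10_000, 54),
    ("iceland", 400_000, 201),
]

# Rank countries from highest to lowest coverage.
ranked = sorted(rows, key=lambda r: coverage_percent(r[2], r[1]), reverse=True)
for country, population, total_articles in ranked:
    print(f"{country:10s} {coverage_percent(total_articles, population):.6f}")
```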
  2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

| index | country | population | total_articles | coverage (%) |
|---|---|---|---|---|
| 69 | india | 1371300000 | 978 | 0.000071 |
| 70 | indonesia | 265200000 | 209 | 0.000079 |
| 34 | china | 1393800000 | 1126 | 0.000081 |
| 173 | uzbekistan | 32900000 | 28 | 0.000085 |
| 51 | ethiopia | 107500000 | 101 | 0.000094 |
| 82 | korea, north | 25600000 | 35 | 0.000137 |
| 178 | zambia | 17700000 | 25 | 0.000141 |
| 159 | thailand | 66200000 | 112 | 0.000169 |
| 112 | mozambique | 30500000 | 58 | 0.000190 |
| 13 | bangladesh | 166400000 | 319 | 0.000192 |
  3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

| index | country | total_articles | high_quality_articles | ratio (%) |
|---|---|---|---|---|
| 82 | korea, north | 35 | 6 | 17.142857 |
| 104 | mauritania | 48 | 6 | 12.500000 |
| 31 | central african republic | 66 | 8 | 12.121212 |
| 132 | romania | 336 | 39 | 11.607143 |
| 137 | saudi arabia | 116 | 13 | 11.206897 |
| 166 | tuvalu | 54 | 5 | 9.259259 |
| 19 | bhutan | 33 | 3 | 9.090909 |
| 44 | dominica | 12 | 1 | 8.333333 |
| 155 | syria | 128 | 10 | 7.812500 |
| 18 | benin | 91 | 7 | 7.692308 |
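The ratio column above matches the share of a country's articles rated GA or FA, expressed as a percentage (for North Korea, 6 / 35 × 100 ≈ 17.14). A minimal sketch of that calculation is shown below. The function name `high_quality_percent` is hypothetical, and the split of the six high-quality articles between GA and FA in the sample list is invented for illustration; only the totals 6 and 35 come from the table.

```python
HIGH_QUALITY = {"FA", "GA"}  # classes counted as high quality


def high_quality_percent(ratings):
    """Percentage of ratings that are FA or GA; 0.0 for an empty list."""
    if not ratings:
        return 0.0
    high = sum(1 for rating in ratings if rating in HIGH_QUALITY)
    return 100.0 * high / len(ratings)


# 35 ratings with 6 high-quality ones; the class breakdown is made up.
sample_ratings = ["GA"] * 5 + ["FA"] + ["Stub"] * 20 + ["Start"] * 9

ratio = high_quality_percent(sample_ratings)
print(f"{ratio:.6f}")
```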
  4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

| index | country | total_articles | high_quality_articles | ratio (%) |
|---|---|---|---|---|
| 143 | slovakia | 116 | 0 | 0.0 |
| 30 | cape verde | 37 | 0 | 0.0 |
| 112 | mozambique | 58 | 0 | 0.0 |
| 38 | costa rica | 147 | 0 | 0.0 |
| 108 | monaco | 40 | 0 | 0.0 |
| 43 | djibouti | 37 | 0 | 0.0 |
| 107 | moldova | 423 | 0 | 0.0 |
| 167 | uganda | 184 | 0 | 0.0 |
| 49 | eritrea | 16 | 0 | 0.0 |
| 50 | estonia | 149 | 0 | 0.0 |
  5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

| continent | total_articles | population | coverage (articles per person) |
|---|---|---|---|
| OCEANIA | 3119 | 39780000 | 0.000078 |
| EUROPE | 15829 | 734590000 | 0.000022 |
| LATIN AMERICA AND THE CARIBBEAN | 5166 | 628270000 | 0.000008 |
| AFRICA | 6839 | 1172400000 | 0.000006 |
| NORTHERN AMERICA | 1913 | 365200000 | 0.000005 |
| ASIA | 11506 | 4513100000 | 0.000003 |
  6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

| continent | high_quality_articles | total_articles | ratio (fraction) |
|---|---|---|---|
| NORTHERN AMERICA | 99 | 1913 | 0.051751 |
| ASIA | 305 | 11506 | 0.026508 |
| OCEANIA | 64 | 3119 | 0.020519 |
| EUROPE | 316 | 15829 | 0.019963 |
| AFRICA | 124 | 6839 | 0.018131 |
| LATIN AMERICA AND THE CARIBBEAN | 69 | 5166 | 0.013357 |
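Unlike the country tables, the regional figures are plain fractions rather than percentages (for Oceania, 3119 / 39780000 ≈ 0.000078). The regional ranking can be sketched as summing articles and population per region first and dividing afterwards, rather than averaging per-country ratios. The function name `regional_coverage` is hypothetical, and the rows below are only a small subset of countries taken from the top-10 coverage table, so the resulting numbers will not match the full regional totals.

```python
from collections import defaultdict


def regional_coverage(rows):
    """rows: iterable of (region, population, total_articles) per country.

    Returns a dict of region -> articles per person, sorted descending.
    """
    articles = defaultdict(int)
    population = defaultdict(int)
    for region, pop, total_articles in rows:
        articles[region] += total_articles
        population[region] += pop
    coverage = {region: articles[region] / population[region]
                for region in articles}
    return dict(sorted(coverage.items(), key=lambda kv: kv[1], reverse=True))


# Subset of countries for illustration only.
rows = [
    ("OCEANIA", 10_000, 54),    # tuvalu
    ("OCEANIA", 10_000, 52),    # nauru
    ("OCEANIA", 100_000, 63),   # tonga
    ("EUROPE", 400_000, 201),   # iceland
    ("EUROPE", 80_000, 34),     # andorra
]

result = regional_coverage(rows)
for region, value in result.items():
    print(f"{region:10s} {value:.6f}")
```

Summing before dividing weights each country by its population, which is what "total count of politician articles as a proportion of total regional population" describes.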

To perform the above steps, the following Python libraries were used:

  1. requests - To perform API requests
  2. pandas - Python data processing library
  3. numpy - Scientific computing package

Reflection

1. What biases did you expect to find in the data, and why?

At first glance, some of the titles in page_data.csv are clearly not politicians. Examples include "Information Minister of the Palestinian Nation", "Finance Minister of ..", and "List of politicians in Poland". We are trying to analyze the number of articles on political figures, but we currently do not filter out such titles. This can affect every step of the downstream analysis.
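One way such titles could be screened out is with a simple pattern match on the title. The sketch below is a heuristic only and is not part of the current analysis: the pattern covers just the two kinds of titles named above, the function name `looks_like_person` is hypothetical, and "Jane Doe" is a placeholder rather than a real row from page_data.csv.

```python
import re

# Titles that start with "List of" or contain "Minister of" are treated
# as offices or list pages rather than individual politicians.
NON_PERSON_PATTERN = re.compile(r"^list of\b|\bminister of\b", re.IGNORECASE)


def looks_like_person(title):
    """Return True if the title does not match a known non-person pattern."""
    return NON_PERSON_PATTERN.search(title) is None


titles = [
    "Information Minister of the Palestinian Nation",
    "List of politicians in Poland",
    "Jane Doe",  # placeholder for an individual politician
]

kept = [title for title in titles if looks_like_person(title)]
print(kept)
```

A filter like this would need to be reviewed by hand, since it can miss other kinds of non-person pages.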

Since the analysis is performed only on English Wikipedia articles, it does not account for articles written in each country's native language. I would expect to find more articles, and articles of higher quality, for English-speaking countries, since English is their first language. Non-English-speaking countries might have higher-quality articles about their politicians in their native-language Wikipedia. Not accounting for articles written in other languages may therefore introduce a sampling bias into the analysis.

2. What potential sources of bias did you discover in the course of your data processing and analysis?

Looking at the top 10 countries by relative quality, we see North Korea at the top of the list. This result is quite suspect, since North Korea has a poor reputation in the media and its government is generally regarded as oppressive. It is also not surprising that the countries with the smallest populations have the highest coverage, since a very small population inflates any per-capita proportion.

The most populous countries, such as India, Indonesia, and China, have the lowest coverage relative to population. This makes me question whether the metric we are using is the right one. Population might not be a good basis for measuring coverage: doubling a population does not imply twice as many politicians or twice as many English Wikipedia articles. Also, the scale of populations (millions and billions) is not comparable to the scale of article counts (hundreds and thousands). Article output depends on many other factors, as described in the next answer.

3. How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you created?

The data in the analysis can be enriched by adding more context to each article, such as who the author is and where the author resides. It might be helpful to know whether the person writing the article is actually a citizen of the country.

The output of articles from a nation depends on other factors, such as the literacy rate, access to the Internet and Wikipedia, and the censorship laws in that country. When treating Wikipedia as the dataset, one should also consider the articles written in other languages. Having such additional information can help us address bias and make sure we are not reporting false information.

References

Assignment Instructions

Citation: Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset.

MIT License