A2: Bias in Data

The goal of this assignment is to understand the concept of bias by exploring data on Wikipedia articles. We look specifically at articles on political figures from various countries. This data is combined with a dataset of country populations and with ORES, a machine learning service used to estimate the quality of each article.

With this combined data, we conduct the following analyses:

Find the countries with the greatest and least coverage of politicians on Wikipedia relative to population
Find the countries with the highest and lowest proportion of high-quality articles
Rank geographic regions by articles per person and by proportion of high-quality articles

Data acquisition, processing and analysis steps are all recorded in the IPython notebook hcds-a2-bias.ipynb.

Directory Structure

.
├── README.md
├── clean
│   ├── articles_no_ratings.csv
│   ├── wp_wpds_countries-no_match.csv
│   └── wp_wpds_politicians_by_country.csv
├── hcds-a2-bias.ipynb
└── raw
    ├── WPDS_2018_data.csv
    ├── WPDS_2018_data_continents.csv
    ├── ores_data.csv
    └── page_data.csv

Data

The data is obtained from multiple sources as listed below.

Wikipedia politicians by country dataset is stored at raw/page_data.csv

The Wikipedia article data is available at https://figshare.com/articles/Untitled_Item/5513449. It contains articles on political figures by country, in a file titled page_data.csv. The data is licensed under CC-BY 4.0.

Column	Description
page	Title of the wikipedia article
country	country of origin
rev_id	revision id of the article

Population data is stored at raw/WPDS_2018_data.csv

The population data contains world populations for 207 countries as of 2018. The data was provided to us as part of the assignment. It is obtained from the World Population Data Sheet published by the Population Reference Bureau. No license is attributed to this data; by default, all rights are reserved by its owners.

Column	Description
Geography	Country or region
Population mid-2018 (millions)	population in millions

Population data with continents is stored at raw/WPDS_2018_data_continents.csv

This dataset is an extension of the previous one: a new column, continent, is added, and the tagging was done manually in Excel.

Column	Description
Geography	Country or region
Population mid-2018 (millions)	population in millions
continent	region the country belongs to

ORES ratings

The Objective Revision Evaluation Service (ORES) is used to estimate the quality of an article. The documentation for the ORES API can be found at https://www.mediawiki.org/wiki/ORES. We make RESTful API calls to this service to obtain the quality of Wikipedia articles and store the results in raw/ores_data.csv

Column	Description
rev_id	rev_id of the wikipedia article
ratings	ratings as defined below

The article ratings are listed below from highest quality to lowest:

FA - Featured article

GA - Good article

B - B-class article

C - C-class article

Start - Start-class article

Stub - Stub-class article
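
The rating lookup described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the endpoint URL, the model name `wp10`, the response layout, and the batch size of 50 are all assumptions that should be checked against the ORES documentation, and the helper names are hypothetical. The sample payload is hand-written, not real ORES output.

```python
from urllib.parse import urlencode

# Assumed ORES v3 endpoint for English Wikipedia; verify against the ORES docs.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"
MODEL = "wp10"  # assumed name of the article-quality model


def build_ores_url(rev_ids):
    """Build a scoring URL for one batch of revision ids."""
    query = urlencode({"models": MODEL, "revids": "|".join(str(r) for r in rev_ids)})
    return f"{ORES_ENDPOINT}?{query}"


def batches(items, size=50):
    """Yield successive fixed-size batches so each request stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_ores_response(payload):
    """Return {rev_id: rating}, skipping revisions that could not be scored."""
    ratings = {}
    for rev_id, models in payload["enwiki"]["scores"].items():
        score = models[MODEL].get("score")
        if score is not None:  # unscored revisions are assumed to carry an "error" key
            ratings[int(rev_id)] = score["prediction"]
    return ratings


# Hand-written sample in the assumed response shape (not real ORES output).
sample = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}
print(parse_ores_response(sample))
```

In practice each URL from `build_ores_url` would be fetched with `requests.get(url).json()` and the parsed ratings written to raw/ores_data.csv; revisions that come back without a score are the natural candidates for clean/articles_no_ratings.csv.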

Data Processing

As part of the data processing step, we perform the following operations:

Merge the ORES ratings with the Wikipedia articles dataset
Combine the rated articles with the country population data
Clean up the merged data
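
The three operations above might look like the following in pandas. This is a sketch on tiny made-up frames that reuse the raw column names documented earlier; the join strategy (left joins plus an indicator column) is an assumption about how the unmatched rows were separated, not a description of the notebook.

```python
import pandas as pd

# Tiny made-up frames using the raw column names documented above.
pages = pd.DataFrame({
    "page": ["Person A", "Person B", "Person C"],
    "country": ["Iceland", "Iceland", "Atlantis"],
    "rev_id": [1001, 1002, 1003],
})
ores = pd.DataFrame({"rev_id": [1001, 1003], "ratings": ["GA", "Stub"]})
population = pd.DataFrame({
    "Geography": ["Iceland"],
    "Population mid-2018 (millions)": [0.4],
})

# Step 1: attach ratings; a left join keeps unrated articles visible.
rated = pages.merge(ores, on="rev_id", how="left")
no_ratings = rated[rated["ratings"].isna()]
rated = rated.dropna(subset=["ratings"])

# Step 2: attach population; the indicator column flags unmatched countries.
merged = rated.merge(population, left_on="country", right_on="Geography",
                     how="left", indicator=True)
no_match = merged[merged["_merge"] == "left_only"]
matched = merged[merged["_merge"] == "both"]

# Step 3: rename to the final schema and convert millions to persons.
final = pd.DataFrame({
    "country": matched["country"],
    "article_name": matched["page"],
    "revision_id": matched["rev_id"],
    "article_quality": matched["ratings"],
    "population": (matched["Population mid-2018 (millions)"] * 1_000_000).round().astype(int),
})
print(final)
```

Here `no_ratings` and `no_match` play the role of clean/articles_no_ratings.csv and clean/wp_wpds_countries-no_match.csv, so that rows dropped from the analysis stay inspectable.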

The final data is stored as

Final Dataset

The final dataset is found here.

Column	Description
country	country name
article_name	article name
revision_id	revision id of the article
article_quality	quality score obtained from ORES API
population	The population of the country.

Data Analysis

We perform the following analysis and report the results:

Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

	country	population	total_articles	coverage (%)
166	tuvalu	10000	54	0.540000
115	nauru	10000	52	0.520000
135	san marino	30000	81	0.270000
108	monaco	40000	40	0.100000
93	liechtenstein	40000	28	0.070000
161	tonga	100000	63	0.063000
103	marshall islands	60000	37	0.061667
68	iceland	400000	201	0.050250
3	andorra	80000	34	0.042500
61	grenada	100000	36	0.036000

Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

	country	population	total_articles	coverage (%)
69	india	1371300000	978	0.000071
70	indonesia	265200000	209	0.000079
34	china	1393800000	1126	0.000081
173	uzbekistan	32900000	28	0.000085
51	ethiopia	107500000	101	0.000094
82	korea, north	25600000	35	0.000137
178	zambia	17700000	25	0.000141
159	thailand	66200000	112	0.000169
112	mozambique	30500000	58	0.000190
13	bangladesh	166400000	319	0.000192
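
The two coverage rankings can be reproduced along these lines. The tabulated coverage values are consistent with articles per person expressed as a percentage (for Tuvalu, 100 × 54 / 10000 = 0.54), so the sketch below assumes that formula; the three rows are copied from the tables above.

```python
import pandas as pd

# Three rows copied from the tables above.
df = pd.DataFrame({
    "country": ["tuvalu", "india", "iceland"],
    "population": [10_000, 1_371_300_000, 400_000],
    "total_articles": [54, 978, 201],
})

# Assumed formula: articles per person, expressed as a percentage.
df["coverage"] = 100 * df["total_articles"] / df["population"]

top = df.nlargest(2, "coverage")      # highest coverage first
bottom = df.nsmallest(2, "coverage")  # lowest coverage first
print(top[["country", "coverage"]])
print(bottom[["country", "coverage"]])
```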

Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

	country	total_articles	high_quality_articles	ratio (%)
82	korea, north	35	6	17.142857
104	mauritania	48	6	12.500000
31	central african republic	66	8	12.121212
132	romania	336	39	11.607143
137	saudi arabia	116	13	11.206897
166	tuvalu	54	5	9.259259
19	bhutan	33	3	9.090909
44	dominica	12	1	8.333333
155	syria	128	10	7.812500
18	benin	91	7	7.692308

Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

	country	total_articles
143	slovakia	116
30	cape verde	37
112	mozambique	58
38	costa rica	147
108	monaco	40
43	djibouti	37
107	moldova	423
167	uganda	184
49	eritrea	16
50	estonia	149
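
A per-country quality ratio of this kind can be derived from article-level rows as sketched below. The rows are made up and use the final-dataset column names; the ratio is computed as a percentage, which matches the top 10 table (for North Korea, 100 × 6 / 35 ≈ 17.14).

```python
import pandas as pd

HIGH_QUALITY = {"FA", "GA"}  # the two classes counted as high quality

# Made-up article-level rows in the final-dataset schema.
articles = pd.DataFrame({
    "country": ["a", "a", "a", "a", "b", "b"],
    "article_quality": ["GA", "Stub", "FA", "C", "Start", "Stub"],
})

# Flag each article, then count totals and high-quality articles per country.
articles["is_high_quality"] = articles["article_quality"].isin(HIGH_QUALITY)
by_country = articles.groupby("country").agg(
    total_articles=("article_quality", "size"),
    high_quality_articles=("is_high_quality", "sum"),
)
by_country["ratio"] = 100 * by_country["high_quality_articles"] / by_country["total_articles"]
print(by_country.sort_values("ratio", ascending=False))
```

Countries with no FA or GA articles end up with a ratio of exactly 0, so any ordering among them is arbitrary unless a tie-breaker is chosen.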

Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

	total_articles	population	coverage
continent
OCEANIA	3119	3.978000e+07	0.000078
EUROPE	15829	7.345900e+08	0.000022
LATIN AMERICA AND THE CARIBBEAN	5166	6.282700e+08	0.000008
AFRICA	6839	1.172400e+09	0.000006
NORTHERN AMERICA	1913	3.652000e+08	0.000005
ASIA	11506	4.513100e+09	0.000003

Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

	high_quality_articles	total_articles	ratio
continent
NORTHERN AMERICA	99	1913	0.051751
ASIA	305	11506	0.026508
OCEANIA	64	3119	0.020519
EUROPE	316	15829	0.019963
AFRICA	124	6839	0.018131
LATIN AMERICA AND THE CARIBBEAN	69	5166	0.013357
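
Both regional rankings sum counts within each region before dividing, so large countries weigh more than small ones. Note that the regional values above are plain fractions, not percentages (for Northern America, 99 / 1913 ≈ 0.051751). A minimal sketch on made-up per-country rows, assuming a `continent` column as in WPDS_2018_data_continents.csv:

```python
import pandas as pd

# Made-up per-country rows; `continent` follows WPDS_2018_data_continents.csv.
countries = pd.DataFrame({
    "country": ["x", "y", "z"],
    "continent": ["OCEANIA", "OCEANIA", "ASIA"],
    "population": [10_000, 90_000, 1_000_000],
    "total_articles": [50, 50, 100],
    "high_quality_articles": [5, 0, 10],
})

# Sum within each region first, then divide: totals over totals.
regions = countries.groupby("continent")[
    ["population", "total_articles", "high_quality_articles"]
].sum()
regions["coverage"] = regions["total_articles"] / regions["population"]
regions["ratio"] = regions["high_quality_articles"] / regions["total_articles"]
print(regions.sort_values("coverage", ascending=False))
```

Averaging the per-country ratios instead would give every country equal weight and generally produce a different ranking.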

To perform the above steps the following Python libraries were used

requests - To perform API requests
pandas - Python data processing library
numpy - scientific computing package

Reflection

1. What biases did you expect to find in the data, and why?

At first glance at page_data.csv, some of the titles are clearly not politicians. Examples include "Information Minister of the Palestinian Nation", "Finance Minister of ..", and "List of politicians in Poland". We are trying to analyze the number of articles on 'political figures', but we currently do not filter out such titles, which can affect every step of the downstream analysis.

Since the analysis is performed only on English Wikipedia articles, we do not account for articles written in each country's native language, and this biases the analysis. I would expect to find more articles from English-speaking countries, and I would expect those articles to be of higher quality, since English is the first language there. Non-English-speaking countries might find their politician articles to be of higher quality in their native language. Not accounting for articles written in other languages might introduce a sampling bias in the analysis.
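
One way such titles could be screened out is a small pattern filter, sketched below. The patterns are hypothetical heuristics derived only from the example titles quoted above; they are not part of the current analysis and would need manual review before use.

```python
import re

# Hypothetical patterns based on the example titles above; not exhaustive.
NON_PERSON_PATTERNS = [
    re.compile(r"^List of\b", re.IGNORECASE),
    re.compile(r"\bMinister of\b", re.IGNORECASE),
]


def looks_like_non_person(title):
    """Return True when a title matches any pattern for office or list pages."""
    return any(pattern.search(title) for pattern in NON_PERSON_PATTERNS)


titles = [
    "List of politicians in Poland",
    "Information Minister of the Palestinian Nation",
    "Jane Doe",  # placeholder standing in for a person's article
]
kept = [t for t in titles if not looks_like_non_person(t)]
print(kept)
```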

2. What potential sources of bias did you discover in the course of your data processing and analysis?

Looking at the top 10 countries by relative quality, we see North Korea at the top of the list. This result is quite suspect, since North Korea has a poor reputation in the public media and its government is generally regarded as oppressive. It is also not surprising to see the countries with the lowest populations have the highest coverage, since with so few people even a few dozen articles produce a large per-capita proportion.

The most populous countries, such as India, Indonesia and China, have the least coverage relative to population. This makes me question whether the metric we are using is the right one. Population might not be a good basis for calculating coverage: if the population doubles, that does not imply twice the number of politicians or twice the number of English Wikipedia articles. Also, the scale at which populations work (millions and billions) is not comparable to the number of high-quality articles (hundreds and thousands). Article output depends on many other factors, as described in the next answer.

3. How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you created?

The data in the analysis can be enriched by adding more context to each article, such as who the author is and where the author resides. It might be helpful to know whether the person writing the article is actually a citizen of the country.

The output of articles from a nation depends on other factors, such as the literacy rate, access to the Internet and Wikipedia, and the censorship laws in that country. When treating Wikipedia as the dataset, one should also consider the articles written in other languages. Having such additional information can help us address bias and make sure we are not reporting false information.

References

Assignment Instructions

Citation: Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset.

MIT License
