Wikipedia Page View History

Goal

This project displays the monthly page views of all English Wikipedia pages over time. Data is split by the site it was accessed on, such as mobile or desktop, and the newer pageview tracking numbers are joined with the older legacy tracking.

Setup

Environment

All scripts are run in Jupyter Notebook using Python 3.7.9. The full list of dependencies can be found in the requirements.txt file.

How to Run

All code used is contained in the hcds-a1-data-curation.ipynb notebook. Running all cells will collect and process all data. See the description in the notebook to skip the data acquisition or processing stages.

Stages

1. Data Acquisition

Five API requests are made: desktop and mobile usage from the legacy endpoint, and desktop, mobile-site and mobile-app usage from the pageviews endpoint. All responses are saved as JSON files in the data folder.
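As a rough sketch of what these five requests might look like, the snippet below builds the request URLs without sending them. The URL templates follow the public Wikimedia REST API, but the access values (for example `desktop-site` and `mobile-web`), the `agent="user"` filter, the project string, and the date range are assumptions for illustration, not values taken from the notebook.

```python
# Sketch only: builds the five request URLs. All parameter values are assumptions.
BASE = "https://wikimedia.org/api/rest_v1/metrics"
LEGACY = BASE + "/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}"
PAGEVIEWS = BASE + "/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}"

# (api name, URL template, access value) for each of the five requests.
REQUESTS = [
    ("pagecounts", LEGACY, "desktop-site"),
    ("pagecounts", LEGACY, "mobile-site"),
    ("pageviews", PAGEVIEWS, "desktop"),
    ("pageviews", PAGEVIEWS, "mobile-web"),
    ("pageviews", PAGEVIEWS, "mobile-app"),
]


def build_urls(start="2008010100", end="2021090100"):
    """Return {(api, access): url} for all five requests."""
    urls = {}
    for api, template, access in REQUESTS:
        urls[(api, access)] = template.format(
            project="en.wikipedia.org",
            access=access,
            agent="user",  # unused by the legacy template; str.format ignores extras
            granularity="monthly",
            start=start,  # timestamps use the YYYYMMDDHH form
            end=end,
        )
    return urls


if __name__ == "__main__":
    for key, url in build_urls().items():
        print(key, url)
```

Each URL could then be fetched with any HTTP client and the response body written to the data folder unchanged.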

2. Data Processing

The data from the API requests in Stage 1 is collected and combined, merging on year and month. The data is then pivoted into tabular format, with one column for each API data type. Finally, the "mobile-site" and "mobile-app" counts from the pageview API are combined into a single "mobile" count, and the total number of views (desktop + mobile) is calculated for both the pageview API and the legacy API. The resulting data is saved in the 'en-wikipedia_traffic_200712-202108.csv' file.
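The merge, pivot and total steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the function name `combine`, the item fields (`timestamp`, `views`, `count`), and the choice to fill missing months with 0 are all assumptions.

```python
from collections import defaultdict

# Value columns before totals are added (names follow the processed CSV).
COLUMNS = [
    "pagecount_desktop_views",
    "pagecount_mobile_views",
    "pageview_desktop_views",
    "pageview_mobile_views",
]


def combine(responses):
    """Merge API items into one row per (year, month).

    `responses` is a list of (column_name, items) pairs, one per API response.
    Passing both mobile-site and mobile-app items under "pageview_mobile_views"
    sums them into a single mobile column.
    """
    rows = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
    for column, items in responses:
        for item in items:
            ts = item["timestamp"]  # assumed to look like "2015070100"
            key = (ts[:4], ts[4:6])  # (year, month)
            # Assumption: pageview items carry "views", legacy items "count".
            rows[key][column] += item.get("views", item.get("count", 0))

    table = []
    for (year, month), cols in sorted(rows.items()):
        row = {"year": year, "month": month, **cols}
        row["pagecount_all_views"] = (
            cols["pagecount_desktop_views"] + cols["pagecount_mobile_views"]
        )
        row["pageview_all_views"] = (
            cols["pageview_desktop_views"] + cols["pageview_mobile_views"]
        )
        table.append(row)
    return table
```

The returned list of dictionaries can be written out with `csv.DictWriter` or loaded into a DataFrame.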

3. Analysis

The analysis graphs the values created in the processing step. Using the CSV file created in Stage 2, the values are plotted on a line chart using pandas and matplotlib. Year and month are combined into dates for the x-axis, and view counts are converted into billions for the y-axis. The API is encoded by color (blue for Pageview, green for Legacy), while the site accessed (All, Desktop, Mobile) is encoded by line style. The result is saved as "english_wikipedia_traffic.png".
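A minimal sketch of this encoding is shown below. It works on rows read as dictionaries rather than a pandas DataFrame, and the helper names, the specific line styles for Desktop and Mobile, and the figure size are assumptions. matplotlib is imported inside `plot` so the data preparation can be tried without it installed.

```python
from datetime import date

# Color encodes the API; line style encodes the site accessed.
COLORS = {"pageview": "blue", "pagecount": "green"}
STYLES = {"all": "-", "desktop": "--", "mobile": ":"}  # dashed/dotted are assumed


def to_series(rows, column):
    """Return (dates, values in billions) for one column, skipping empty months."""
    dates, values = [], []
    for row in rows:
        raw = row.get(column)
        if raw in (None, ""):
            continue
        value = float(raw)
        if value == 0:
            continue  # treat 0 as "no data for this API in this month"
        dates.append(date(int(row["year"]), int(row["month"]), 1))
        values.append(value / 1e9)
    return dates, values


def plot(rows, out_path="english_wikipedia_traffic.png"):
    """Draw one line per (API, site) pair and save the chart."""
    import matplotlib
    matplotlib.use("Agg")  # render to file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 6))
    for api, color in COLORS.items():
        for site, style in STYLES.items():
            dates, values = to_series(rows, f"{api}_{site}_views")
            ax.plot(dates, values, color=color, linestyle=style, label=f"{api} {site}")
    ax.set_xlabel("Month")
    ax.set_ylabel("Page views (billions)")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```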

Data

Permissions

All data is collected from the Wikimedia REST API and is used under the Creative Commons Attribution-ShareAlike 3.0 license.

Source

Data was collected through two API Endpoints:

Pageviews

This API makes available the pageviews from July 2015 onward. Data is broken out between desktop, mobile-site and mobile-app access, and can be filtered to exclude automated views such as web crawlers.

Legacy Pagecounts

This API makes available the pagecounts from January 2008 to July 2016. Data is broken out between desktop and mobile access, and is not filtered for web crawlers or bots.

Description

Raw Data

All files in the data folder contain the exact responses of the API requests in JSON format. Files are named "apiname_accesstype_firstmonth-lastmonth.json".
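The naming pattern can be sketched as a small helper. The function names and the YYYYMM month format are assumptions (the month format is inferred from the processed CSV's file name), and the response body is written as received so the saved file stays an exact copy.

```python
from pathlib import Path


def raw_filename(api, access, first_month, last_month):
    """Build "apiname_accesstype_firstmonth-lastmonth.json"."""
    return f"{api}_{access}_{first_month}-{last_month}.json"


def save_response(body, folder, api, access, first_month, last_month):
    """Write the raw response text unchanged and return the file path."""
    path = Path(folder) / raw_filename(api, access, first_month, last_month)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```

For example, `raw_filename("pageviews", "desktop", "201507", "202108")` gives `pageviews_desktop_201507-202108.json`.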

Processed Data

The processed data contains 164 rows and 6 view-count columns, plus the year and month columns, and is stored in the CSV file en-wikipedia_traffic_200712-202108.csv. The columns are:

year: The year in YYYY format
month: The month in MM format
pageview_desktop_views: The number of page views for that month accessed from a desktop computer, from the pageview API, excluding crawlers and beginning in July 2015
pagecount_desktop_views: The number of page views for that month accessed from a desktop computer, from the legacy API, including crawlers and spanning from January 2008 to July 2016
pagecount_mobile_views: The number of page views for that month accessed from a mobile phone, from the legacy API, including crawlers and spanning from January 2008 to July 2016
pageview_mobile_views: The number of page views for that month accessed from a mobile phone or app, from the pageview API, excluding crawlers and beginning in July 2015
pagecount_all_views: The number of page views for that month accessed in any form, from the legacy API, including crawlers and spanning from January 2008 to July 2016
pageview_all_views: The number of page views for that month accessed in any form, from the pageview API, excluding crawlers and beginning in July 2015
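A short sketch of loading the processed file and checking it against the columns listed above. The function name `load_traffic` is hypothetical, and the check only tests that each column is present, since the column order in the file is not stated here.

```python
import csv

EXPECTED_COLUMNS = [
    "year",
    "month",
    "pageview_desktop_views",
    "pagecount_desktop_views",
    "pagecount_mobile_views",
    "pageview_mobile_views",
    "pagecount_all_views",
    "pageview_all_views",
]


def load_traffic(handle):
    """Read the processed CSV from an open text handle into a list of dicts."""
    reader = csv.DictReader(handle)
    found = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in found]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)
```

Typical use would be `with open("en-wikipedia_traffic_200712-202108.csv", newline="") as f: rows = load_traffic(f)`.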

Result

Visualization

Analysis

A few trends are apparent from this visualization:

The growth of total page views until around 2017, then a relative plateau.
The rise of mobile page views from the beginning of data in 2015.
The impact of excluding web crawlers on the totals, visible as the gap between the solid blue and solid green line in 2016.

Considerations

Due to the change in measurement between the old and new APIs, data before and after 2015 cannot be compared directly. All Legacy API data includes bots and web crawlers, so it will always present a higher number than the newer measurements would. Thus the blue and green lines can be compared where they overlap to see the impact of automated page visits, but should not be directly used to form a trend.
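The overlap comparison can be sketched as below, following this document's reading that the gap reflects automated traffic. The function name is hypothetical, and it assumes months without data for an API are stored as empty or 0.

```python
def overlap_gap(rows):
    """Return (year, month, legacy_total - pageview_total) for months with both totals."""
    gaps = []
    for row in rows:
        legacy = float(row.get("pagecount_all_views") or 0)
        newer = float(row.get("pageview_all_views") or 0)
        if legacy > 0 and newer > 0:  # keep only months covered by both APIs
            gaps.append((row["year"], row["month"], legacy - newer))
    return gaps
```

Months covered by only one API are skipped, which keeps the comparison inside the overlap window.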

All data is self-reported by the Wikimedia Foundation and has not been tested by this author for accuracy or reliability.
