How to self-host a mirror of Wikipedia.org:
with Nginx, Kiwix, or MediaWiki/XOWA + Docker

Originally published 2019-09-08 on docs.sweeting.me.
The pretty HTML version is here and the source for this guide is on Github.


A summary of how to set up a full Wikipedia.org mirror using three different approaches.
DEMO: https://other-wiki.zervice.io

Intro

Did you know that Wikipedia.org just runs a mostly-traditional LAMP stack on ~350 servers? (as of 2019)

Unfortunately, Wikipedia attracts lots of hate from people and nation-states who object to certain articles or want to hide information from the public eye.

Wikipedia's infrastructure (2 racks in the USA, 1 in Holland, and 1 in Singapore, + CDNs) can't always stand up to large DDoS attacks, but thankfully they provide regular database dumps and static HTML archives to the public, and have permissive licensing that allows for rehosting with modification (even for profit!).

Growing up in China behind the GFW, I often experienced Wikipedia unavailability, and in light of the recent DDoS I decided to make a guide to help demystify the process of running a mirror. I'm also a big advocate for free access to information, and I'm the maintainer of a major internet archiving project called ArchiveBox (a self-hosted internet archiver powered by headless Chromium).

The aim of this guide is to encourage people to use these publicly available dumps to host Wikipedia mirrors, so that malicious actors don't succeed in limiting public access to one of the world's best sources of information.


Quickstart

A full English Wikipedia.org clone in 3 steps.

DEMO: https://other-wiki.zervice.io

# 1. Download the Kiwix-Serve static binary from https://www.kiwix.org/en/downloads/kiwix-serve/
wget 'https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64-3.0.1.tar.gz'
tar -xzf kiwix-tools_linux-x86_64-3.0.1.tar.gz && cd kiwix-tools_linux-x86_64-3.0.1

# 2. Download a compressed Wikipedia dump from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ (79GB, images included!)
wget --continue 'https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'

# 3. Start the kiwix server, then visit http://127.0.0.1:8888
./kiwix-serve --verbose --port 8888 "$PWD/wikipedia_en_all_maxi_2018-10.zim"

Getting Started

Wikipedia.org itself is powered by a PHP backend called MediaWiki, using MariaDB for data storage, Varnish and Memcached for request and query caching, and ElasticSearch for full-text search. Production Wikipedia.org also runs a number of extra plugins and modules on top of MediaWiki.

🖥 There are several ways to host your own mirror of Wikipedia (with varying complexity):

  1. Run a caching proxy in front of Wikipedia.org (disk used on-demand for cache, low CPU use)
  2. Serve the static HTML ZIM archive with Kiwix (10~80GB for compressed archive, low CPU use)
  3. Run a full MediaWiki server (hardest to set up, ~600GB for XML & database, high CPU use)
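As a rough sketch of how the disk figures above translate into a choice, the helper below maps free disk space to the heaviest method that would fit. The function name `suggest_method` and the 80GB / 600GB cut-offs are illustrative assumptions based on the estimates in the list, not hard limits.

```shell
#!/bin/sh
# Hypothetical helper: suggest the heaviest mirroring method that fits in
# the given amount of free disk space (in whole GB). The thresholds are the
# rough estimates from the list above, not exact requirements.
suggest_method() {
    free_gb="$1"
    if [ "$free_gb" -ge 600 ]; then
        echo "mediawiki"
    elif [ "$free_gb" -ge 80 ]; then
        echo "kiwix"
    else
        echo "proxy"
    fi
}

# Usage: check the free space (in GB) on the filesystem holding /opt/wiki
# free_gb=$(df -Pk /opt/wiki | awk 'NR==2 { print int($4 / 1024 / 1024) }')
# suggest_method "$free_gb"
```

Keep in mind that the smaller ZIM archives (no pictures) fit in far less than 80GB, and that a MediaWiki import temporarily needs much more space than the finished database.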

💅Don't expect it to look perfect on the first try

Setting up a Wikipedia mirror involves a complex dance between software, data, and devops, so beginners are encouraged to start with the static HTML archive or proxy before attempting to run a full MediaWiki server. Users should expect their mirrors to be able to serve articles with images and search, but should not expect it to look exactly like Wikipedia.org on the first try, or the second...

✅ Choosing an approach

Each method in this guide has its pros and cons. A caching proxy is the most lightweight option, but it's not a fully redundant mirror: if the upstream servers go down and a request comes in for a page that hasn't been cached yet, it will 404. The static ZIM mirror is lightweight to download and host (and requests are easy to cache) and it has full-text search, but it has no interactivity, talk page history, or Wikipedia-style category pages (though they are coming soon). MediaWiki/XOWA are the most complex, but they can provide a full working Wikipedia mirror complete with history revisions, users, talk pages, search, and more.

Running a full MediaWiki server is by far the hardest method to set up. Expect it to take multiple days/weeks depending on available system resources, and expect it to look fairly broken since the production Wikipedia.org team run many tweaks and plugins that take extra work to set up locally.

For more info, see the Wikipedia.org index of all dump types available, with descriptions.

Responsible Rehosting Warning

⚠️ Be aware that running a publicly-accessible mirror of Wikipedia.org with any kind of framing / content modifications / ads is strongly discouraged. Framing mirrors / proxy mirrors are still a good option for private use, but you need to take additional steps to mirror responsibly if you're setting up a proxy for public use (e.g. robots:noindex, takedown contact info, blocking unlicensed images, etc.).

Some mirrors load a page from the Wikimedia servers directly every time someone requests a page from them. They alter the text in some way, such as framing it with ads, then send it on to the reader. This is called remote loading, and it is an unacceptable use of Wikimedia server resources. Even remote loading websites with little legitimate traffic can generate significant load on our servers, due to search engine web crawlers. https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Remote_loading

Luckily, regardless of how you choose to rehost Wikipedia text, you are not breaking any terms and conditions or violating copyright law as long as you don't remove their copyright statements (however, note the article images and videos on Wikimedia.org may not be licensed for re-use).

Every contribution to the English Wikipedia has been licensed for re-use, including commercial, for-profit websites. Republication is not necessarily a breach of copyright, so long as the appropriate licenses are complied with. https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Things_you_need_to_know


[TOC]

See the HTML version of this guide for the best browsing experience. See pirate/wikipedia-mirror on Github for example config source, docker-compose files, binaries, folder structure, and more.


Tutorial


Prerequisites

  1. Provision a server to act as your Wikipedia mirror

    You can use a cheap VPS provider like DigitalOcean, Vultr, Hetzner, etc. For the static ZIM archive and MediaWiki server methods you will need significant disk space, so a home server with a cheap external HD may be a better option.

    The setup examples below are based on Ubuntu 19.04 running on a home server, however they should work across many other OS's with minimal tweaking (e.g. FreeBSD, macOS, Arch, etc.).

  2. Purchase a new domain or create a subdomain to host your mirror

    You can use Google Domains, NameCheap, GoDaddy, etc.; any registrar will work.

    In the setup examples below, replace wiki.example.com with the domain you chose.

  3. Point the DNS records for the domain to your mirror server

    Configure these records via your DNS provider (e.g. NameCheap, DigitalOcean, CloudFlare, etc.):

    • wiki.example.com A -> your server's public ip (the root domain)
    • en.wiki.example.com CNAME -> wiki.example.com (the wiki domain)
    • upload.wiki.example.com CNAME -> wiki.example.com (the uploads/media domain)
  4. Create a directory to store the project, and a dotenv file for your config options

    Not all of these values are needed for all the methods, but it's easier to just define all of them in one place and remove things later that turn out to be unneeded.

    mkdir -p /opt/wiki                  # change PROJECT_DIR below to match
    nano /opt/wiki/.env

    Create the .env config file in dotenv/bash syntax with the contents below. Make sure to replace the example values like wiki.example.com with your own.

    PROJECT_DIR="/opt/wiki"                   # folder for all project state
    CONFIG_DIR="$PROJECT_DIR/etc/nginx"
    CACHE_DIR="$PROJECT_DIR/data/cache"
    CERTS_DIR="$PROJECT_DIR/data/certs"
    LOGS_DIR="$PROJECT_DIR/data/logs"
    
    LANG="en"                                 # Wikipedia language to mirror
    LISTEN_PORT_HTTP="80"                     # public-facing HTTP port to bind
    LISTEN_PORT_HTTPS="443"                   # public-facing HTTPS port to bind
    LISTEN_HOST="wiki.example.com"            # root domain to listen on
    LISTEN_WIKI="$LANG.$LISTEN_HOST"          # wiki domain to listen on
    LISTEN_MEDIA="upload.$LISTEN_HOST"        # uploads domain to listen on
    
    UPSTREAM_HOST="wikipedia.org"             # main upstream domain
    UPSTREAM_WIKI="$LANG.$UPSTREAM_HOST"      # upstream domain for wiki
    UPSTREAM_MEDIA="upload.wikimedia.org"     # upstream domain for uploads
    
    # Only needed if using an nginx reverse proxy:
    SSL_CRT="$CERTS_DIR/$LISTEN_HOST.crt"
    SSL_KEY="$CERTS_DIR/$LISTEN_HOST.key"
    SSL_DH="$CERTS_DIR/$LISTEN_HOST.dh"
    
    CACHE_SIZE="100G"                         # or "500GB", "1GB", "200MB", etc.
    CACHE_REQUESTS="GET HEAD POST"            # or "GET HEAD", "any", etc.
    CACHE_RESPONSES="200 206 302"             # or "200 302 404", "any", etc.
    CACHE_DURATION="max"                      # or "1d", "30m", "12h", etc.
    
    ACCESS_LOG="'$LOGS_DIR/nginx.out' trace"  # or "off", etc.
    ERROR_LOG="'$LOGS_DIR/nginx.err' warn"    # or "off", etc.

    The setup steps below depend on this file existing and the config values being correct, so make sure you create it and replace all example values with your own before proceeding!
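Since everything later depends on this file, it can be worth sanity-checking it before moving on. The sketch below sources the file in a subshell and reports any core values that are empty. The function name `check_env` and the choice of which variables count as "required" are assumptions for illustration; adjust the list to the method you're using.

```shell
#!/bin/sh
# Hypothetical helper: source a dotenv file in a subshell and report any
# required variables that are unset or empty. Returns non-zero if any are
# missing, so the calling shell's own environment is never modified.
check_env() {
    env_file="$1"
    (
        . "$env_file" || exit 1
        missing=0
        for name in PROJECT_DIR LISTEN_HOST LISTEN_WIKI LISTEN_MEDIA \
                    UPSTREAM_HOST UPSTREAM_WIKI UPSTREAM_MEDIA; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "missing: $name" >&2
                missing=1
            fi
        done
        exit "$missing"
    )
}

# Usage:
# check_env /opt/wiki/.env && echo "config looks complete"
```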


Choosing a Wikipedia archive dump

ZIM Static HTML Dump

Wikipedia HTML dumps are provided in a highly-compressed web-archiving format called ZIM. They can be served using a ZIM server like Kiwix (the most common one), or ZimReader, GoZIM, & others.

ZIM archive dumps are usually published yearly, but the release schedule is not guaranteed. As of August 2019 the latest available dump containing all English articles is from October 2018:

wikipedia_en_all_mini_2019-09.zim (torrent) (10GB, mini English articles, no pictures or video)

wikipedia_en_all_nopic_2018-09.zim (torrent) (35GB, all English articles, no pictures or video)

wikipedia_en_all_maxi_2018-10.zim (torrent) (79GB, all English articles w/ pictures, no video)

wikipedia_en_simple_all_maxi_2020-01.zim (1.6GB, SimpleWiki English only, good for testing)

Download your chosen Wikipedia ZIM archive (e.g. wikipedia_en_all_maxi_2018-10.zim)

mkdir -p /opt/wiki/data/dumps && cd /opt/wiki/data/dumps

# Download via BitTorrent:
transmission-cli --download-dir . 'magnet:?xt=urn:btih:O2F3E2JKCEEBCULFP2E2MRUGEVFEIHZW'

# Or download via HTTPS from one of the mirrors:
wget -c 'https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'
wget -c 'https://ftpmirror.your.org/pub/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'
wget -c 'https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'

# Optionally after download, verify the length (fast) or MD5 checksum (slow):
stat --printf="%s" wikipedia_en_all_maxi_2018-10.zim | grep 83853668638
openssl dgst -md5 -binary wikipedia_en_all_maxi_2018-10.zim | openssl enc -base64 | grep 01eMQki29P9vD5F2h6zWwQ
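If you download more than one dump, the two checks above can be wrapped in a small function. This is only a sketch: `verify_dump` is a made-up name, and it assumes the published checksum is the base64 encoding of the raw MD5 digest, as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: verify a file's size in bytes and its base64-encoded
# MD5 digest. Prints the failing check to stderr and returns non-zero.
verify_dump() {
    file="$1"; want_size="$2"; want_md5="$3"

    # Cheap check first: compare the byte count.
    got_size=$(wc -c < "$file" | tr -d ' ')
    if [ "$got_size" != "$want_size" ]; then
        echo "size mismatch: got $got_size, want $want_size" >&2
        return 1
    fi

    # Slow check: hash the whole file and base64-encode the raw digest.
    got_md5=$(openssl dgst -md5 -binary "$file" | openssl enc -base64)
    case "$got_md5" in
        # Prefix match, so the expected value may omit the "==" padding
        "$want_md5"*) return 0 ;;
        *) echo "md5 mismatch: got $got_md5" >&2; return 1 ;;
    esac
}

# Usage:
# verify_dump wikipedia_en_all_maxi_2018-10.zim 83853668638 01eMQki29P9vD5F2h6zWwQ
```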

XML Database Dump

Database dumps are usually published monthly. As of August 2019, the latest dump containing all English articles is from July 2019:

enwiki-20190720-pages-articles.xml.bz2 (15GB, all English articles, no pictures/videos)

simplewiki-20170820-pages-meta-current.xml.bz2 (180MB, SimpleWiki only, good for testing)

Download your chosen Wikipedia XML dump (e.g. enwiki-20190720-pages-articles.xml.bz2)

mkdir -p /opt/wiki/data/dumps && cd /opt/wiki/data/dumps

# Download via BitTorrent:
transmission-cli --download-dir . 'magnet:?xl=16321006399&dn=enwiki-20190720-pages-articles.xml.bz2'

# Download via HTTP:
# lol no. no one wants to serve you a 15GB file via HTTP

Method #1: Run a caching proxy in front of Wikipedia.org

Complexity: Low
Minimal setup and operations requirements, no download of large dumps needed.
Disk space requirements: On-Demand
Disk is only used as pages are requested (can be 1gb up to 2TB+ depending on usage).
CPU requirements: Very Low
Lowest out of the three options, can be run on a tiny VPS or home-server.
Content freshness: Very Fresh
Configurable to cache content indefinitely or pull fresh data for every request.

a. Running with Nginx

Set the following options in your /opt/wiki/.env config file:

UPSTREAM_HOST=wikipedia.org
UPSTREAM_WIKI=en.wikipedia.org
UPSTREAM_MEDIA=upload.wikimedia.org

Then run all the setup steps below under Nginx Reverse Proxy to set up Nginx.

Then restart nginx to apply your config with systemctl restart nginx.

Your mirror should now be running and proxying requests to Wikipedia.org!

Visit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).

b. Running with Caddy

Alternatively, check out a similar setup that uses Caddy instead of Nginx as the reverse proxy: https://github.com/CristianCantoro/wikiproxy


Method #2: Serve the static HTML ZIM archive with Kiwix

Complexity: Moderate
Static binary makes it easy to run, but it requires downloading a large dump file.
Disk space requirements: >80GB
The ZIM archive is a highly-compressed collection of static HTML articles only.
CPU requirements: Very Low
Low, especially with a CDN in front (more than a proxy, but less than a full server).
Content freshness: Often Stale
ZIM archives are published yearly (ish) by Wikipedia.org.

First download a ZIM archive dump like wikipedia_en_all_maxi_2018-10.zim into /opt/wiki/data/dumps as described above.

a. Running with Docker

Run kiwix-serve with docker like so:

docker run \
    -v '/opt/wiki/data/dumps:/data' \
    -p 8888:80 \
    kiwix/kiwix-serve \
    'wikipedia_en_all_maxi_2018-10.zim'

Or create /opt/wiki/docker-compose.yml and run docker-compose up:

version: '3'
services:
  kiwix:
    image: kiwix/kiwix-serve
    command: 'wikipedia_en_all_maxi_2018-10.zim'
    ports:
      - '8888:80'
    volumes:
      - "./data/dumps:/data"

b. Running with the static binary

  1. Download the latest kiwix-serve binary for your OS & CPU architecture

    Find the latest release for your architecture here and copy its URL to download it below: https://download.kiwix.org/release/kiwix-tools/

    cd /opt/wiki
    wget 'https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64-3.0.1.tar.gz'
    tar -xzf 'kiwix-tools_linux-x86_64-3.0.1.tar.gz'
    mv 'kiwix-tools_linux-x86_64-3.0.1' 'bin'
  2. Run kiwix-serve, passing it a port to listen on and your ZIM archive file

    /opt/wiki/bin/kiwix-serve --port 8888 /opt/wiki/data/dumps/wikipedia_en_all_maxi_2018-10.zim

    Your server should now be running!

    Visit http://en.yourdomainhere.com:8888 to see it in action!
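kiwix-serve can take a moment to start on a large archive, so if you're scripting the setup it helps to wait for the port before moving on. A minimal sketch follows, assuming `curl` is installed; `wait_for_http` is a hypothetical helper name, not part of kiwix-tools.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers successfully, or give up
# after a number of attempts (roughly one second apart).
wait_for_http() {
    url="$1"; attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors (4xx/5xx)
        if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage:
# wait_for_http http://127.0.0.1:8888/ 60 && echo "kiwix-serve is up"
```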

Optional Nginx Reverse Proxy

Set the following options in your /opt/wiki/.env config file:

UPSTREAM_HOST=localhost:8888
UPSTREAM_WIKI=localhost:8888
UPSTREAM_MEDIA=upload.wikimedia.org

Then run all the setup steps below under Nginx Reverse Proxy to set up Nginx. To run nginx inside docker-compose next to Kiwix, see the Run Nginx via docker-compose section below.

Your mirror should now be running and proxying requests to kiwix-serve!

Visit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).


Method #3: Run a full MediaWiki server

Complexity: Very High
Complex multi-component setup with an intricate setup process and high resource use.
Disk space requirements: >550GB (>2TB needed for import phase)
The uncompressed database is very large (multiple TB with revision history and stubs).
CPU requirements: Moderate (very high during import phase)
Depends on usage, but it's the most demanding out of the 3 options.
Content freshness: Very fresh
Updated database dumps are published monthly (ish) by Wikipedia.org.

First download a database dump like enwiki-20190720-pages-articles.xml.bz2 into /opt/wiki/data/dumps as described above.

If you need to decompress it, pbzip2 is much faster than bzip2:

pbzip2 -v -d -k -m10000 enwiki-20190720-pages-articles.xml.bz2
# -m10000 tells it to use 10GB of RAM, adjust accordingly
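If you'd rather not hard-code the memory limit, one option is to derive it from what the machine has free. The sketch below uses half of the available memory, which is an arbitrary rule of thumb rather than a pbzip2 recommendation; `half_mem_mb` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: given available memory in kB (as reported by
# /proc/meminfo), print half of it in MB for use with pbzip2's -m flag.
half_mem_mb() {
    avail_kb="$1"
    echo $((avail_kb / 1024 / 2))
}

# Usage on Linux:
# avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
# pbzip2 -v -d -k "-m$(half_mem_mb "$avail_kb")" enwiki-20190720-pages-articles.xml.bz2
```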

a. Running with XOWA in Docker

https://github.com/QuantumObject/docker-xowa

docker run \
    -v /opt/wiki/data/xowa:/opt/xowa/ \
    -p 8888:80 \
    sblop/xowa_offline_wikipedia

Or create /opt/wiki/docker-compose.yml and run docker-compose up:

version: '3'
services:
  xowa:
    image: sblop/xowa_offline_wikipedia
    ports:
      - 8888:80
    volumes:
      - './data/xowa:/opt/xowa'

b. Running with MediaWiki in Docker

Configure your docker-compose.yml file

Default MediaWiki config file: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/DefaultSettings.php

Create the following /opt/wiki/docker-compose.yml file then run docker-compose up:

version: '3'
services:
  database:
    image: mariadb
    command: --max-allowed-packet=256M
    environment:
      MYSQL_DATABASE: wikipedia
      MYSQL_USER: wikipedia
      MYSQL_PASSWORD: wikipedia
      MYSQL_ROOT_PASSWORD: wikipedia
      
  mediawiki:
    image: mediawiki
    ports:
      - 8888:80
    depends_on:
      - database
    volumes:
      - './data/html:/var/www/html'
      # After initial setup, download LocalSettings.php into ./data/html
      # and uncomment the following line, then docker-compose restart
      # - ./LocalSettings.php:/var/www/html/LocalSettings.php

Then import the XML dump into the MediaWiki database:

Do not attempt to import it directly with importDump.php, it will take months:

php /var/www/html/maintenance/importDump.php enwiki-20170320-pages-articles-multistream.xml

Instead, convert the XML dump into compressed chunks of SQL then import individually:

Warning: For large imports (e.g. English) this process can still take 5+ days depending on the system.

apt install -y openjdk-8-jre zstd pbzip2

# Download patched mwdumper version and pre/post import SQL scripts
wget "https://github.com/pirate/wikipedia-mirror/raw/master/bin/mwdumper-1.26.jar"
wget "https://github.com/pirate/wikipedia-mirror/raw/master/preimport.sql"
wget "https://github.com/pirate/wikipedia-mirror/raw/master/postimport.sql"

DUMP_NAME="enwiki-20190720-pages-articles"

# Decompress the XML dump using all available cores and 10GB of memory
pbzip2 -v -d -k -m10000 "$DUMP_NAME.xml.bz2"

# Convert the XML file into a SQL file using mwdumper
java -server \
    -jar ./mwdumper-1.26.jar \
    --format=sql:1.5 \
    "$DUMP_NAME.xml" \
> wikipedia.sql

# Split the generated SQL file into chunks named chunk-aa.sql, chunk-ab.sql, ...
split --additional-suffix=".sql" --lines=1000 wikipedia.sql chunk-

# Compress the chunks and the pre/post import scripts
for partial in chunk-*.sql preimport.sql postimport.sql; do
    zstd -z "$partial"
done

# Fix a schema issue that may otherwise cause import bugs
docker-compose exec -T database \
    mysql --user=wikipedia --password=wikipedia --database=wikipedia \
        --execute="ALTER TABLE page ADD page_counter bigint unsigned NOT NULL default 0;"

# Import the compressed chunks into the database
# (-T disables the TTY so docker-compose can read the SQL from the pipe)
for partial in chunk-*.sql.zst; do
    zstd -dc preimport.sql.zst "$partial" postimport.sql.zst \
    | docker-compose exec -T database \
        mysql --force --user=wikipedia --password=wikipedia --database=wikipedia
done

Credit for these steps goes to https://github.com/wayneworkman/wikipedia-importing-tools.
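Because the import can run for days, it's handy to be able to resume after a crash or reboot without re-importing finished chunks. The sketch below is one way to do that, not part of the credited tooling: it records each chunk in a log file after it imports successfully and skips those on the next run. `import_chunk` and `import_pending` are hypothetical names, and the log file path is up to you.

```shell
#!/bin/sh
# Import one compressed chunk, wrapped in the pre/post import scripts.
# -T disables TTY allocation so docker-compose can read SQL from the pipe.
import_chunk() {
    zstd -dc preimport.sql.zst "$1" postimport.sql.zst \
    | docker-compose exec -T database \
        mysql --force --user=wikipedia --password=wikipedia --database=wikipedia
}

# Hypothetical helper: import every chunk not yet listed in the log file,
# recording each one after it succeeds so a rerun skips finished chunks.
import_pending() {
    done_log="$1"; shift
    touch "$done_log"
    for chunk in "$@"; do
        if grep -qxF -- "$chunk" "$done_log"; then
            continue                       # already imported on an earlier run
        fi
        import_chunk "$chunk" || return 1  # stop at the first failure
        echo "$chunk" >> "$done_log"
    done
}

# Usage (pass only the data chunks, not the pre/post import scripts):
# import_pending imported.log FIRST_CHUNK.sql.zst SECOND_CHUNK.sql.zst
```

Note that a chunk is only logged when the whole pipeline exits cleanly, so a chunk that was interrupted halfway will be retried from its start on the next run.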

Optional Nginx Reverse Proxy

Set the following options in your /opt/wiki/.env config file:

UPSTREAM_HOST=localhost:8888
UPSTREAM_WIKI=localhost:8888
UPSTREAM_MEDIA=upload.wikimedia.org

Then run all the setup steps below under Nginx Reverse Proxy to set up Nginx. To run nginx inside docker-compose next to MediaWiki, see the Run Nginx via docker-compose section below.

Your mirror should now be running and proxying requests to your wiki server!

Visit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).


Nginx Reverse Proxy

You can optionally set up an Nginx reverse proxy in front of kiwix-serve, Wikipedia.org, or a MediaWiki server to add caching and HTTPS support.

Make sure the options in /opt/wiki/.env are configured correctly for the type of setup you're trying to achieve.

  • To run nginx in front of kiwix-serve on localhost, set: UPSTREAM_HOST=localhost:8888 UPSTREAM_WIKI=localhost:8888 UPSTREAM_MEDIA=upload.wikimedia.org
  • To run nginx in front of Wikipedia.org, set: UPSTREAM_HOST=wikipedia.org UPSTREAM_WIKI=en.wikipedia.org UPSTREAM_MEDIA=upload.wikimedia.org
  • To run nginx in front of a MediaWiki server on localhost, set: UPSTREAM_HOST=localhost:8888 UPSTREAM_WIKI=localhost:8888 UPSTREAM_MEDIA=upload.wikimedia.org
  • To run nginx in front of a docker container via docker-compose: See Run Nginx via docker-compose section below.

Install LetsEncrypt and Nginx

# Install the dependencies: nginx and certbot
add-apt-repository -y -n universe
add-apt-repository -y -n ppa:certbot/certbot
add-apt-repository -y -n ppa:nginx/stable
apt update -qq
apt install -y nginx-extras certbot python3-certbot-nginx
systemctl enable nginx
systemctl start nginx

Obtain an SSL certificate via LetsEncrypt

# Load your config values from step 4 into the environment, and create dirs
source /opt/wiki/.env
mkdir -p "$CONFIG_DIR" "$CACHE_DIR" "$CERTS_DIR" "$LOGS_DIR" 

# Get an SSL certificate and generate the Diffie-Hellman parameters file
certbot certonly \
    --nginx \
    --agree-tos \
    --non-interactive \
    -m "ssl@$LISTEN_HOST" \
    --domain "$LISTEN_HOST,$LISTEN_WIKI,$LISTEN_MEDIA"
openssl dhparam -out "$SSL_DH" 2048

# Link the certs into your project directory
ln -s "/etc/letsencrypt/live/$LISTEN_HOST/fullchain.pem" "$SSL_CRT"
ln -s "/etc/letsencrypt/live/$LISTEN_HOST/privkey.pem" "$SSL_KEY"

LetsEncrypt certs must be renewed every 90 days or they'll expire and you'll get "Invalid Certificate" errors. To have certs automatically renewed periodically, add a systemd timer or cron job to run certbot renew. Here's an example tutorial on how to do that: https://gregchapple.com/2018/02/16/auto-renew-lets-encrypt-certs-with-systemd-timers/
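If you go the cron route, a minimal sketch looks like the following. The file name `/etc/cron.d/certbot-renew`, the 03:17 schedule, and the `install_renew_cron` helper are all illustrative assumptions. Check first whether your certbot package already installed its own timer, in which case this isn't needed.

```shell
#!/bin/sh
# Hypothetical helper: write a cron.d entry that runs `certbot renew` once a
# day and reloads nginx whenever a certificate was actually renewed.
install_renew_cron() {
    cron_file="${1:-/etc/cron.d/certbot-renew}"
    printf '%s\n' \
        '17 3 * * * root certbot renew --quiet --deploy-hook "systemctl reload nginx"' \
        > "$cron_file"
}

# Usage (as root):
# install_renew_cron
```

`certbot renew` only touches certificates that are close to expiry, so running it daily is harmless.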

Populate the nginx.conf template with your config

# Load your config options into the environment (exported, so envsubst can see them)
set -a; source /opt/wiki/.env; set +a


# Download the nginx config template
curl --silent --location \
    "https://github.com/pirate/wikipedia-mirror/raw/master/etc/nginx/nginx.conf.template" \
    > "$CONFIG_DIR/nginx.conf.template"

# Fill your config options into nginx.conf.template to create nginx.conf
envsubst \
    "$(printf '${%s} ' $(bash -c "compgen -A variable"))"\
    < "$CONFIG_DIR/nginx.conf.template" \
    > "$CONFIG_DIR/nginx.conf"

Run Nginx via systemd

# Link your nginx.conf into the system's default nginx config location
ln -s -f "$CONFIG_DIR/nginx.conf" "/etc/nginx/nginx.conf"

# Restart nginx to load the new config
systemctl restart nginx

Now you can visit https://en.yourdomainhere.com to see it in action with HTTPS!

For troubleshooting, you can find the nginx logs at /opt/wiki/data/logs/nginx.err and /opt/wiki/data/logs/nginx.out.
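If nginx refuses to start, one thing worth ruling out is a placeholder that envsubst left unfilled in the generated config (for example because a variable wasn't exported). The sketch below greps for leftover `${UPPER_CASE}` placeholders. `find_unfilled` is a hypothetical helper, and it assumes the template uses upper-case names like the ones in the .env file.

```shell
#!/bin/sh
# Hypothetical helper: list any ${UPPER_CASE} placeholders still present in a
# generated config. Returns 0 if leftovers were found, 1 if the file is clean.
find_unfilled() {
    grep -n '\${[A-Z_][A-Z0-9_]*}' "$1"
}

# Usage:
# if find_unfilled "$CONFIG_DIR/nginx.conf"; then
#     echo "some placeholders were not substituted" >&2
# fi
# nginx -t -c "$CONFIG_DIR/nginx.conf"   # let nginx validate the syntax too
```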

Run Nginx via docker-compose

Set the config values in your /opt/wiki/.env file to correspond to the docker container's hostname that you want to proxy, and tweak the directory paths to be the paths inside the container. e.g. for mediawiki:

UPSTREAM_HOST=mediawiki:80
UPSTREAM_WIKI=mediawiki:80
UPSTREAM_MEDIA=upload.wikimedia.org

CERTS_DIR=/certs
CACHE_DIR=/cache
LOGS_DIR=/logs

Then regenerate your nginx.conf file with envsubst as described in Nginx Reverse Proxy above.

Then add the nginx service to your existing /opt/wiki/docker-compose.yml file:

version: '3'
services:
    
  ...

  nginx:
    image: nginx:latest
    volumes:
      - ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
      - ./data/certs:/certs
      - ./data/cache:/cache
      - ./data/logs:/logs
    ports:
      - 80:80
      - 443:443

Further Reading