Pywb Proxy Mode Usage
In addition to replaying web archive content by rewriting urls to point to the archive (known as 'archival mode'), pywb also supports 'proxy mode' replay where pywb acts as a proxy server. Replay in proxy mode poses a few challenges, particularly with https support, as well as collection and date selection. This page lists the latest efforts for supporting proxy mode replay.
To use proxy mode, ensure that the enable_http_proxy: true setting is present in config.yaml.
Configure the browser to use pywb_path/proxy.pac as the Proxy Auto-Configuration (PAC) script.
For example, if pywb is running on http://localhost:8080/, set the browser to http://localhost:8080/proxy.pac
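For scripted access, a client can also point at the proxy directly instead of using the PAC file. The following is a minimal sketch using only the Python standard library. It assumes pywb is running on http://localhost:8080/ with enable_http_proxy: true; the helper name build_proxy_opener is illustrative and not part of pywb.

```python
import urllib.request

# Assumption: pywb is listening locally on port 8080 with proxy mode enabled.
PYWB_PROXY = "http://localhost:8080"


def build_proxy_opener(proxy_url=PYWB_PROXY):
    """Return an opener that sends plain-http requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


# Example use (needs a running pywb instance with archived content):
#   opener = build_proxy_opener()
#   with opener.open("http://example.com/", timeout=10) as resp:
#       print(resp.status)
```

Only the http scheme is routed through the proxy here; https replay additionally requires the certificate setup described below.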
Currently available only in https-proxy branch: https://github.com/ikreymer/pywb/tree/https-proxy
To also enable proxy mode with https support, ensure the following is present in the config:
enable_http_proxy: true
proxy_options:
    enable_https_proxy: true
    unaltered_replay: true
    # optional settings with defaults
    # root_ca_file: ./pywb_ca.pem
    # root_ca_name: pywb https proxy replay CA
    # certs_dir: ./pywb-certs/
The unaltered_replay option will ensure the replay is performed with no rewriting, which is optimal for proxy mode use. (TODO: Add support for banner insert but no url rewriting).
To support https replay, pywb signs a certificate for each host using its own root certificate. As a one-time setup, the browser must be configured to trust that root certificate. This is a necessary limitation of https proxy replay. The root certificate can be created as a one-time operation using the proxy-cert-auth tool:
proxy-cert-auth ./pywb-root-ca.pem -n "Sample Proxy Replay Certificate"
This will write the new certificate to ./pywb-root-ca.pem with the specified name. This cert can then be set in a browser to trust https proxy requests. Be sure to set the root_ca_file property above to match the new certificate. (If the root certificate doesn't exist, it will automatically be created using the root_ca_file and root_ca_name settings. However, it is recommended to create the certificate before starting pywb).
Once the certificate has been imported, the browser should accept HTTPS requests to pywb. (Note that from the perspective of pywb, the protocol scheme is ignored when performing replay, so http and https requests should yield the same results).
HTTPS support depends on being able to access the underlying socket and wrap it in an SSL socket, which in turn depends on the WSGI container. Currently, HTTPS support is available only when running under uWSGI, Gunicorn, or wsgiref, although other containers may work as well or could be supported in the future.
pywb is able to support non-proxy, http and https proxy on the same port by routing the distinct HTTP requests:
Non-Proxy (Normal) HTTP request for: http://localhost:8080/pywb/example.com/
GET /pywb/example.com/
...
Host: localhost:8080
HTTP Proxy Request for: http://example.com/
GET http://example.com/
...
HTTPS Proxy Request for: https://example.com/
CONNECT example.com:443
...
GET /
The proxy handler in pywb reads the CONNECT request and unwraps the underlying request from an SSL/TLS tunnel.
The SSL tunnel is created using a certificate generated on the fly for the host (stored in certs_dir) and signed with the specified root_ca_file.
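The three request shapes above can be told apart from the request line alone. The following is a minimal sketch of that routing decision, not pywb's actual handler; the function name classify_request and the returned labels are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def classify_request(request_line):
    """Classify a request line as 'https-proxy', 'http-proxy' or 'non-proxy'."""
    parts = request_line.split()
    if len(parts) < 2:
        raise ValueError("malformed request line: %r" % request_line)
    method, target = parts[0].upper(), parts[1]
    if method == "CONNECT":
        # CONNECT host:port asks the proxy to open a tunnel (used for https).
        return "https-proxy"
    if urlsplit(target).scheme in ("http", "https"):
        # An absolute URL in the request line indicates a proxy request.
        return "http-proxy"
    # A path-only target is an ordinary request addressed to pywb itself.
    return "non-proxy"
```

For a CONNECT request, the real request (GET / in the example above) only becomes visible after the socket has been wrapped in TLS, which is why access to the underlying socket is needed.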
In archival mode, a replay collection can be selected simply by altering the path, e.g. /A/http://example.com or /B/http://example.com can be used to view the contents of http://example.com/ in collections A or B, respectively.
In proxy mode, the collection needs to be specified in different ways. Obviously, this only applies if there is more than one collection being used in pywb.
When multiple collections are involved, it is possible to specify a default collection:
collections:
    coll_A: ...
    coll_B: ...
proxy_options:
    ...
    use_default_coll: coll_B
This will ensure that proxy requests default to coll_B unless a collection has been specified in another way.
The 'ideal' way to specify the collection is via some user setting, e.g. a cookie. However, a cookie cannot be shared across multiple domains in proxy mode (there is no proxy-cookie, unfortunately), so the cookie must somehow be passed to each domain.
This can be accomplished by using a magic proxy host, like pywb.proxy, and propagating the cookie to individual hosts via a series of redirects.
To allow changing the collection, it turns out that the collection itself cannot be stored in the cookie; an indirect session id is required instead. The downside is that this requires some state on the server.
The request flow may proceed as follows:
http://example.com
-> select.pywb.proxy/http://example.com
-> user selects collA
-> collA-set.pywb.proxy/http://example.com/
-> seshId-sethost.pywb.proxy.example.com/
-> http://example.com/ (with seshId cookie, which resolves to collA)
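The server-side state implied by the session id can be sketched as a small lookup table. This is an illustrative sketch only: the class SessionStore and its methods are hypothetical and not taken from pywb. One plausible reading of the indirection is that each host holds its own copy of the cookie, so changing the collection is easier if only the server-side mapping has to change.

```python
import secrets


class SessionStore:
    """In-memory map from an opaque session id to a collection name."""

    def __init__(self):
        self._sessions = {}

    def create(self, collection):
        """Create a new session id that resolves to the given collection."""
        sesh_id = secrets.token_hex(8)
        self._sessions[sesh_id] = collection
        return sesh_id

    def set_collection(self, sesh_id, collection):
        """Point an existing session at another collection.

        Cookies already set on individual hosts keep the same session id,
        so they do not need to be rewritten.
        """
        if sesh_id not in self._sessions:
            raise KeyError(sesh_id)
        self._sessions[sesh_id] = collection

    def resolve(self, sesh_id, default=None):
        """Return the collection for a session id, or the default if unknown."""
        return self._sessions.get(sesh_id, default)
```

A persistent deployment would need shared or durable storage in place of the dictionary, which is the server-side state cost mentioned above.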
TODO: improve explanation.
Another way to specify the collection is to overload the Proxy-Authentication feature, which provides a consistent user-supplied username/password that is set with each proxy request.
If there is no default collection, or use_default_coll: false is set, the first request to a proxy resource results in a 407 Proxy Authentication Required response, prompting the user to enter a username/password.
The username is the collection name, e.g. coll_A or coll_B, and the password is ignored.
Once set, the user will not be asked again for the collection for the remainder of the session, and this should work for both http and https requests (if enabled).
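The username-as-collection scheme can be sketched with standard HTTP Basic proxy credentials. The following is a hedged illustration rather than pywb's implementation: the function name collection_from_proxy_auth, the plain dict of headers, and the (status, collection) return shape are all assumptions.

```python
import base64
import binascii


def collection_from_proxy_auth(headers, default_coll=None):
    """Return (status, collection): 200 with a name, or 407 when none is known."""
    value = headers.get("Proxy-Authorization", "")
    scheme, _, encoded = value.partition(" ")
    if scheme.lower() == "basic" and encoded:
        try:
            decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            decoded = ""
        # Basic credentials are "username:password"; the password is ignored.
        username = decoded.partition(":")[0]
        if username:
            return 200, username
    if default_coll:
        # Fall back to the configured default collection, if any.
        return 200, default_coll
    # The caller should answer 407 with a Proxy-Authenticate: Basic challenge.
    return 407, None
```

For example, a header built from base64.b64encode(b"coll_A:ignored") resolves to coll_A, while a request with no credentials and no default yields 407.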
The main downside of Proxy-Auth is that switching collections requires reauthentication, which brings up an ugly browser popup window; there is no way to change collections via a link, for instance.