Tufts University - Fall 2019
Benjamin Auerbach, Andrew Gross, Trung Truong
A Python tool that crawls and indexes Wikipedia sites concurrently
Clone this repository. pip3 and python3 are required to install dependencies
and run the program. Navigate to the root folder and run the following
command to install dependencies.
pip3 install -r requirements.txt
To run the concurrent crawler (-t sets the number of crawler threads, default 20;
-s sets the number of sites to crawl, default 1000),
python3 src/concurrent-spider.py <starting wikipedia url> -t <number of crawler threads> -s <number of sites to crawl>
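The crawler's internals are not shown here, but a minimal sketch of one common
design follows: a shared frontier queue consumed by worker threads, using the
requests and beautifulsoup4 libraries. The function and parameter names below
are illustrative and are not taken from src/concurrent-spider.py.

import json
import threading
from queue import Queue
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Illustrative sketch, not the project's actual crawler.
def crawl(start_url, num_threads=20, max_sites=1000):
    frontier = Queue()          # URLs waiting to be fetched
    frontier.put(start_url)
    seen = {start_url}          # every URL ever enqueued
    graph = {}                  # url -> list of outgoing wiki links
    lock = threading.Lock()

    def worker():
        while True:
            url = frontier.get()
            try:
                html = requests.get(url, timeout=10).text
                soup = BeautifulSoup(html, "html.parser")
                # Collect internal wiki links and resolve them to full URLs.
                links = {urljoin(url, a["href"])
                         for a in soup.select('a[href^="/wiki/"]')}
                with lock:
                    graph[url] = sorted(links)
                    for link in links:
                        if link not in seen and len(seen) < max_sites:
                            seen.add(link)
                            frontier.put(link)
            except requests.RequestException:
                pass            # skip pages that fail to download
            finally:
                frontier.task_done()

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()             # returns once every queued URL is processed
    return graph

A run would then dump the result to the JSON file the other tools consume:

with open("graph.json", "w") as f:
    json.dump(crawl("https://en.wikipedia.org/wiki/Businessperson"), f)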
To run the breadth-first-search algorithm on a graph,
python3 src/bfs.py graph.json <start> <end>
For example:
python3 src/bfs.py graph.json Businessperson California
================================================================================
Path between Businessperson and California
================================================================================
> https://en.wikipedia.org/wiki/Businessperson
> https://en.wikipedia.org/wiki/National_capitalism
> https://en.wikipedia.org/wiki/File:A_coloured_voting_box.svg
> https://en.wikipedia.org/wiki/Abraham_Lincoln
> https://en.wikipedia.org/wiki/Democratic_Party_(United_States)
> https://en.wikipedia.org/wiki/California
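The search itself is standard breadth-first search; a minimal sketch follows,
assuming graph.json stores an adjacency list mapping each page URL to the URLs
it links to, and that <start> and <end> are article titles resolved against
https://en.wikipedia.org/wiki/. The actual src/bfs.py may differ.

import json
import sys
from collections import deque

def bfs(graph, start, end):
    # Record each node's predecessor so the shortest path can be rebuilt.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None  # no path between start and end

if __name__ == "__main__":
    graph_file, start, end = sys.argv[1:4]
    with open(graph_file) as f:
        graph = json.load(f)
    base = "https://en.wikipedia.org/wiki/"
    path = bfs(graph, base + start, base + end)
    if path is None:
        print(f"No path between {start} and {end}")
    else:
        print(f"Path between {start} and {end}")
        for url in path:
            print(">", url)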
To run the visualization program,
python3 src/viz.py graph_small.json
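One plausible implementation of the visualization, assuming the networkx and
matplotlib packages and the same adjacency-list graph format as above; the
actual src/viz.py may differ.

import json
import sys

import matplotlib.pyplot as plt
import networkx as nx

with open(sys.argv[1]) as f:
    adjacency = json.load(f)

# Build a directed graph from the adjacency list.
G = nx.DiGraph()
for node, neighbors in adjacency.items():
    for neighbor in neighbors:
        G.add_edge(node, neighbor)

# A spring layout spaces nodes by connectivity; small graphs read best,
# which is why the example above uses graph_small.json.
nx.draw(G, pos=nx.spring_layout(G), node_size=20, with_labels=False)
plt.show()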
This project is licensed under the MIT License - see the LICENSE.txt file for details.