Concurrent Wikipedia Web Crawlers

Tufts University - Fall 2019

Benjamin Auerbach, Andrew Gross, Trung Truong

Description

A Python tool that crawls and indexes Wikipedia sites concurrently

Installing

Clone this repository. Installing dependencies and running the program requires pip3 and python3. Navigate to the root folder and run the following command to install dependencies:

pip3 install -r requirements.txt

To run the concurrent crawler (-t defaults to 20 threads, -s defaults to 1000 sites):

python3 src/concurrent-spider.py <starting wikipedia url> -t <number of crawler threads> -s <number of sites to crawl>
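
For a sense of how such a crawler can work: a pool of worker threads drains a shared frontier queue, and a lock-guarded visited set caps the crawl at the -s limit. The sketch below is illustrative only and is not the project's actual concurrent-spider.py; the function names, the requests/BeautifulSoup usage, and the link-filtering rule are all assumptions.

# sketch_spider.py - an illustrative outline, not src/concurrent-spider.py
import queue
import threading
import requests
from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org"

def worker(frontier, seen, lock, limit):
    # Each thread pops a URL, fetches it, and pushes unseen
    # /wiki/ links back onto the shared frontier.
    while True:
        try:
            url = frontier.get(timeout=5)
        except queue.Empty:
            return  # frontier drained; let the thread exit
        try:
            html = requests.get(url, timeout=10).text
            links = BeautifulSoup(html, "html.parser").select('a[href^="/wiki/"]')
            for a in links:
                link = BASE + a["href"]
                with lock:
                    if link not in seen and len(seen) < limit:
                        seen.add(link)
                        frontier.put(link)
        except requests.RequestException:
            pass  # skip unreachable pages
        frontier.task_done()

def crawl(start_url, n_threads=20, n_sites=1000):
    frontier = queue.Queue()
    frontier.put(start_url)
    seen, lock = {start_url}, threading.Lock()
    for _ in range(n_threads):
        threading.Thread(target=worker, args=(frontier, seen, lock, n_sites),
                         daemon=True).start()
    frontier.join()  # returns once every queued URL has been processed
    return seen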

To run the breadth-first-search algorithm on a graph:

python3 src/bfs.py graph.json <start> <end>
For example:

python3 src/bfs.py graph.json Businessperson California
================================================================================
Path between Businessperson and California
================================================================================
> https://en.wikipedia.org/wiki/Businessperson
> https://en.wikipedia.org/wiki/National_capitalism
> https://en.wikipedia.org/wiki/File:A_coloured_voting_box.svg
> https://en.wikipedia.org/wiki/Abraham_Lincoln
> https://en.wikipedia.org/wiki/Democratic_Party_(United_States)
> https://en.wikipedia.org/wiki/California
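
Internally, a shortest path like the one above can be found with a standard breadth-first search over the adjacency data. Below is a minimal sketch, assuming graph.json maps each page to a list of linked pages; the project's actual schema and the internals of src/bfs.py may differ.

# sketch_bfs.py - an illustrative sketch, not src/bfs.py
import json
import sys
from collections import deque

def shortest_path(graph, start, end):
    # Standard BFS: expand outward from start, recording each node's
    # parent so the path can be reconstructed once end is reached.
    parents = {start: None}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                frontier.append(neighbor)
    return None  # no path between start and end

if __name__ == "__main__":
    graph_file, start, end = sys.argv[1:4]
    with open(graph_file) as f:
        graph = json.load(f)  # assumed schema: {"PageA": ["PageB", ...], ...}
    path = shortest_path(graph, start, end)
    print("\n".join(path) if path else "No path found")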

To run the visualization program:

python3 src/viz.py graph_small.json
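
As a rough illustration of such a script, the sketch below draws the graph with networkx and matplotlib. Both the library choice and the adjacency-list schema are assumptions here; src/viz.py may render the graph differently.

# sketch_viz.py - an illustrative sketch, not src/viz.py
import json
import sys
import networkx as nx
import matplotlib.pyplot as plt

def draw(graph_file):
    with open(graph_file) as f:
        adjacency = json.load(f)  # assumed schema: {"PageA": ["PageB", ...], ...}
    g = nx.DiGraph()
    for page, links in adjacency.items():
        for link in links:
            g.add_edge(page, link)
    nx.draw(g, with_labels=True, node_size=50, font_size=6)
    plt.show()

if __name__ == "__main__":
    draw(sys.argv[1])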

Sample graphs

[Sample graph images]

Authors

Benjamin Auerbach, Andrew Gross, Trung Truong

License

This project is licensed under the MIT License; see the LICENSE.txt file for details.
