
Commands

The commands below require a list of websites/seeds to harvest data from; in these examples, the seeds are listed in a seeds.txt file. These are sample commands and can be further tailored to your requirements through the CLI. See LDSpider CLI.
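A seeds file is typically just a plain-text list of URIs to start crawling from, one per line. A minimal sketch of a seeds.txt (the URIs below are only illustrative):

http://dbpedia.org/resource/Berlin
http://dbpedia.org/resource/Leipzig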

Full RDF dump to hard drive, target/dump/archive0.zip

java -jar ldspider-1.3-with-dependencies.jar -d dump -a log.log -s seeds.txt

N-Quads dump to hard drive, target/dump/crawl.nq

java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -o dump/crawl.nq -s seeds.txt

N-Quads output to SPARQL endpoint using main memory

java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -oe http://<sparql-endpoint-update-address> -s seeds.txt
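As a concrete illustration, assuming a local Apache Jena Fuseki server with a dataset named crawl (both the server address and dataset name are hypothetical; substitute your own SPARQL Update endpoint):

java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -oe http://localhost:3030/crawl/update -s seeds.txt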

N-Quads output to SPARQL endpoint, keeping the crawler's data structures on the hard drive instead of in main memory

java -jar ldspider-1.3-with-dependencies.jar -m frontier -ds seen -c 1000000 -oe http://<sparql-endpoint-update-address> -s seeds.txt

Useful additions

Depending on the size of the seed list and the crawling strategy, the crawler can take a long time to run. The examples above all use a load-balanced strategy and crawl up to 1,000,000 URIs. Note: the crawl can exceed 1,000,000 URIs, as the crawler only checks this count periodically. If possible, use the -t option to increase the number of threads available to the crawler (the default is 2), as in the example below.
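For instance, repeating the N-Quads dump from above with 8 threads (the thread count here is an arbitrary illustration; choose a value suited to your machine and bandwidth):

java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -t 8 -o dump/crawl.nq -s seeds.txt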