Run Options
Commands
The commands below require a list of websites/seeds to harvest data from; in the following examples this list is contained in the seeds.txt file. These are sample commands and can be further tailored to user requirements through the CLI. See LDSpider CLI.
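For reference, a seeds file is just a plain-text list of URIs to start crawling from, one per line. A minimal illustrative seeds.txt (the URIs here are only examples) could contain:
http://dbpedia.org/resource/Berlin
http://dbpedia.org/resource/Leipzig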
Full RDF dump to hard drive, target/dump/archive0.zip
java -jar ldspider-1.3-with-dependencies.jar -d dump -a log.log -s seeds.txt
n-Quad dump to hard drive, target/dump/crawl.nq
java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -o dump/crawl.nq -s seeds.txt
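Since N-Quads is a line-based format (one quad per line), a quick sanity check of the resulting dump is to count its lines; this assumes a Unix-like shell:
wc -l dump/crawl.nq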
n-Quad output to SPARQL endpoint using main memory
java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -oe http://(Insert SPARQL endpoint update address) -s seeds.txt
n-Quad output to SPARQL endpoint using hard drive storage
java -jar ldspider-1.3-with-dependencies.jar -m frontier -ds seen -c 1000000 -oe http://(Insert SPARQL endpoint update address) -s seeds.txt
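The placeholder must be replaced with the SPARQL Update address of your triple store, not its query address. For example, assuming a local Apache Jena Fuseki instance with a dataset named ldspider (your endpoint address will differ), the memory-based command might look like:
java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -oe http://localhost:3030/ldspider/update -s seeds.txt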
Useful additions
This can be a slow-running application depending on the size of the seed list and the crawling strategy; the above examples all use a load-balanced strategy and crawl up to 1,000,000 URIs. Note: the count can exceed 1,000,000 because the crawler only checks it periodically. If possible, use the -t option to increase the number of threads available to the crawler (the default is 2).
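For example, the n-Quad dump command from above, run with eight worker threads (the thread count is only illustrative; tune it to your machine):
java -jar ldspider-1.3-with-dependencies.jar -c 1000000 -o dump/crawl.nq -s seeds.txt -t 8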