Use enwiki data set #4

tonysun83 · 2019-05-08T07:17:40Z

This performance test loads the enwiki data set specified by the docs.file property in the content-source.properties file.

DocMaker (https://lucene.apache.org/core/8_0_0/benchmark/org/apache/lucene/benchmark/byTask/feeds/DocMaker.html) parses the XML files and creates ready-to-index documents. Any Wikipedia data set downloaded from https://dumps.wikimedia.org/enwiki/20190501/ can be used, as long as it is specified correctly in the properties file.
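
Since the benchmark depends on docs.file being set correctly, a small pre-flight check can catch a missing or mistyped path before a long run starts. The following is only a minimal sketch using the JDK's java.util.Properties: the class ContentSourceCheck, the method resolveDocsFile, and the default location of content-source.properties are hypothetical names chosen for illustration and are not part of this PR or of Lucene.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper for illustration; not part of this PR or of Lucene.
final class ContentSourceCheck {

    /** Reads docs.file from the given properties text; throws if it is missing or blank. */
    static Path resolveDocsFile(Reader propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(propertiesText);
        String docsFile = props.getProperty("docs.file");
        if (docsFile == null || docsFile.trim().isEmpty()) {
            throw new IllegalArgumentException("docs.file is not set");
        }
        return Paths.get(docsFile.trim());
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real path as the first argument.
        Path propsPath = Paths.get(args.length > 0 ? args[0] : "content-source.properties");
        try (Reader reader = Files.newBufferedReader(propsPath)) {
            Path dump = resolveDocsFile(reader);
            if (!Files.isReadable(dump)) {
                throw new IllegalStateException("docs.file is not readable: " + dump);
            }
            System.out.println("docs.file OK: " + dump);
        }
    }
}
```

This only verifies that the property is present and points at a readable file; it does not check that the file is a valid Wikipedia dump.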

We can override getContentSource to use other content sources (https://lucene.apache.org/core/7_3_1/benchmark/org/apache/lucene/benchmark/byTask/feeds/ContentSource.html) for different benchmarking needs.

I left the lucene-test-framework library in our pom.xml because older performance benchmarks still rely on it. We should remove it once we have fully transitioned to the Lucene benchmark data set.

Next steps:

- Use a different content source
- Run queries with https://lucene.apache.org/core/7_4_0/benchmark/org/apache/lucene/benchmark/byTask/feeds/EnwikiQueryMaker.html

rnewson and others added 3 commits May 8, 2019 00:10

use enwiki docs

21b01c7

whitespace

87cb2e2

add reuters benchmark

d9a3bd8
