Indexing corpora

Indexing is the step that reads the source data of a corpus and loads it into Elasticsearch. Elasticsearch builds an index of the data, which makes it available for efficient searching and aggregation.

This step is necessary to make a dataset available in the I-analyzer interface. Note that indexing can take a significant amount of time (depending on the amount of data).

You can start indexing once you have:

Created a definition for the corpus
If it is a Python corpus: added the necessary settings to your project, such as the source data directory.
Imported the definition into the database. For Python corpora, run yarn django loadcorpora to do this.

The basic indexing command is:

yarn django index my-corpus

Use yarn django index --help to see all possible flags. Some useful options are highlighted below.
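The prerequisites and the basic command can be combined into a short script. The following is a minimal sketch, assuming a Python corpus with the placeholder name my-corpus and a shell opened in the directory where you normally run yarn django. The CORPUS and DRY_RUN variables and the run helper are illustrative only and are not part of I-analyzer.

```shell
#!/bin/sh
# Sketch: import corpus definitions, then index one corpus.
# CORPUS, DRY_RUN and run() are hypothetical helpers for this example.
set -eu

CORPUS="${CORPUS:-my-corpus}"   # placeholder corpus name
DRY_RUN="${DRY_RUN:-1}"         # default: only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"               # show the command without executing it
    else
        "$@"                    # execute the command as given
    fi
}

run yarn django loadcorpora       # import Python corpus definitions into the database
run yarn django index "$CORPUS"   # read the source data and load it into Elasticsearch
```

With the default DRY_RUN=1 the script only prints the two commands. Setting DRY_RUN=0 executes them, which is useful because indexing can take a long time and it helps to check the commands first.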

Development

For development environments, we usually maintain a single index per corpus, rather than creating versioned indices. New indices are also created with number_of_replicas set to 0, which keeps index creation lightweight.

Some options that may be useful for development:

Delete index before starting

--delete / -d deletes an existing index of this name, if there is one. Without this flag, you will add your data to the existing index.
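For example, to rebuild a development index from scratch, the flag can be appended to the basic command. This is a small sketch in which my-corpus is a placeholder name; it prints the command instead of executing it, because running it removes the existing index.

```shell
# Rebuild the development index for "my-corpus" (placeholder name) from scratch.
# --delete removes the existing index first, so the command is printed here
# rather than executed; run the printed line once you are sure.
cmd="yarn django index my-corpus --delete"
echo "$cmd"
```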

Date selection

--start / -s and --end / -e respectively give a start and end date to select source files. Note that this only works if the sources function in your corpus definition makes use of these options; not all corpora have this defined. (It is not always possible to infer dates from source file metadata without parsing the file.)
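As an illustration, the sketch below builds a command that restricts indexing to one decade of source files. The corpus name my-corpus is a placeholder, and the YYYY-MM-DD date format is an assumption; check yarn django index --help for the format the command actually expects.

```shell
# Index only source files dated 1900-1909 for "my-corpus" (placeholder name).
# Assumption: dates are written as YYYY-MM-DD; verify with `yarn django index --help`.
START="1900-01-01"
END="1909-12-31"
cmd="yarn django index my-corpus --start $START --end $END"
echo "$cmd"   # printed for inspection; run the printed line to start indexing
```

If the corpus definition does not use these options in its sources function, the date range is not applied to the selection of source files.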

Production

See Indexing on server for more information about production-specific settings.
