Concept for programmatically generation of elasticsearch index mapping based on COLID Editor metadata ontology.
Leverage available metadata from COLID Editor to determine how the elasticsearch index should look like. With the information which describes the data from COLID Editor and some mapping rules it is possible to generate automatically the elasticsearch index mapping.
The big advantage of automatization of this process is the time which can be saved. Changes in the data from COLID Editor does not require manually changes by a developer in the elasticsearch index mapping. Fields can be added, changed or removed. Index will be automatically generated based on the current metadata. The Data Marketplace application gets a high flexibility and changes in the COLID Editor data can be quickly applied in Data Marketplace without any manual changes.
A schema is a description of one or more fields that describes how to handle different fields of a document. In elasticsearch the schema is a mapping that defines how a document, and the fields it contains, are stored and indexed. Because of this the schema is called “mapping” in elasticsearch.
Elasticsearch support two different concepts for mapping. The first one is dynamic mapping. Fields and mapping type do not to be defined before being used. Thanks to dynamic mapping, new field names will be added automatically, just by indexing a document. This concept is useful to get started, but for Data Marketplace rich features not sufficient. Fields have to be analyzed with special analyzers and stored in different datatypes to meet the requirements for the application. Instead the concept of specifying an explicit mapping is used.
An explicit mapping is provided to elasticsearch as hierarchically structured JSON during index creation. This file contains a list of fields and properties pertinent to the document which should be indexed.
Below are all steps listed for the creation of a new index.
-
Create Document Index
Create a new index on elasticsearch to store and search resources. Index will be created with static index settings. Index settings are described futher down on this page.
After successful creation of the index the
DocumentUpdateAlias
will be updated to the just created index.Rollback Mechanism:
- Delete Index
- Undo update on
DocumentUpdateAlias
-
Create Index Mapping for search index
Based on metadata the mapping for the created index in step 1 is dynamically created.
Rollback Mechanism:
- Not needed, because in case of an exception the index will be deleted.
-
Create Metadata Index
Creates a new index without a mapping to store the metadata.
After successful creation of the index the
MetadataUpdateAlias
will be updated to the just created index.Rollback Mechanism:
- Delete Index
- Undo update on
MetadataUpdateAlias
-
Index Metadata
Indexes the current metadata set to the update metadata index.
Rollback Mechanism:
- Not needed, because in case of an exception the index will be deleted.
-
Update Aliases
Updates the
MetadataSearchAlias
andResourceSearchAlias
to currently created indices.Rollback Mechanism:
- Undo update on
MetadataSearchAlias
- Undo update on
ResourceSearchAlias
- Undo update on
-
Delete old indices
Deletes all old indices after successful creation of all indices and update of aliases.
Rollback Mechanism:
- It is ignored whether the deletion was successful or not.
This section contains relevant information and documents related to the implementation.
Rules for creating elasticsearch mapping based on metadata.
First of all turn off dynamic mapping. Any unknown fields will not be added to the mapping and will not be searchable. Mapping explosion will be avoided.
PUT <index>/_mapping/_doc
{
"dynamic":false
}
Based on the available metadata of COLID Editor resources the index mapping should be generated dynamically. Therefore the metadata will be used as input and the following mapping rules will determine how the resources will be indexed as documents in elasticsearch.
-
Add resourceID as unique identifer of a document in the index.
"resourceId": { "type": "keyword" }
-
Add fields for autocompletion functionality.
"autocomplete_filter": { "type": "text", "analyzer": "autocomplete_ngram", "search_analyzer": "search_autocomplete" }, "autocomplete_terms": { "type": "text", "fielddata": true, "analyzer": "autocomplete_shingle" }, "dym_trigram": { "type": "text", "analyzer": "dym_trigram" }
-
Mapping of URI fields
If field is of type URI, add it as keyword.
Check metadata:
Key Value http://www.w3.org/2000/01/rdf-schema#range http://www.w3.org/2001/XMLSchema#anyURI Definition:
"properties": { "value": { "type": "keyword", "normalizer": "lowercase", "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } }
-
Mapping of dates
Check metadata
Key Value http://www.w3.org/2000/01/rdf-schema#range http://www.w3.org/2001/XMLSchema#dateTime "properties": { "value": { "type": "date" } }
-
Mapping of version
Currently only for the version field.
Check metadata:
Key Value http://www.w3.org/2000/01/rdf-schema#range http://www.w3.org/2001/XMLSchema#float "properties": { "value": { "type": "keyword" } }
-
Mapping of boolean
Check metadata:
Key Value http://www.w3.org/2000/01/rdf-schema#range http://www.w3.org/2001/XMLSchema#boolean "properties": { "value": { "type": "keyword" } }
-
Mapping of persons
Check metadata
Key Value http://www.w3.org/2000/01/rdf-schema#range http://COLID Editor.bayer.com/kos/19014/Person "properties": { "value": { "type": "keyword", "normalizer": "lowercase", "copy_to": ["dym_trigram"], "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } }
-
Mapping of literals/text
Check Metadata:
Key Value "http://www.w3.org/ns/shacl#nodeKind" http://www.w3.org/ns/shacl#Literal AND NOT
Key Value http://www.w3.org/2000/01/rdf-schema#range <http://COLID Editor.bayer.com/kos/19014/Person> "properties": { "value": { "type": "keyword", "copy_to": ["dym_trigram", "autocomplete_filter", "autocomplete_terms"], "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } }
-
Mapping of IRIs
Check metadata:
Key Value http://www.w3.org/ns/shacl#nodeKind http://www.w3.org/ns/shacl#IRI "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "copy_to": ["dym_trigram"], "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } }
-
Mapping of distribution endpoints
Check Metadata
Key Value <http://COLID Editor.bayer.com/kos/19014/hasCOLID Editor> <https://COLID Editor.bayer.com/kos/19050/distribution> "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } }, "nested": { "type": "nested", "include_in_parent": true, "properties": { "http://www.w3.org/1999/02/22-rdf-syntax-ns#type":{ "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } } }, "https://COLID Editor.bayer.com/kos/19050/hasNetworkedResourceLabel":{ "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } } }, "http://COLID Editor.bayer.com/kos/19014/hasNetworkAddress":{ "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } } }, "http://COLID Editor.bayer.com/kos/19014/hasCOLID Editor":{ "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } } }, "https://COLID Editor.bayer.com/kos/19050/hasContactPerson":{ "properties": { "uri": { "type": "keyword" }, "value": { "type": "keyword", "normalizer": "lowercase", "copy_to": ["dym_trigram"], "fields": { "text": { "type": "text", "analyzer": "dmp_standard_english" }, "ngrams":{ "type": "text", "analyzer": "dmp_ngram_analyzer", "search_analyzer": "dmp_standard_english" } } } } } } } }
Index settings will determine the behaviour of the index. For example custom analyzers will be defined in the settings. Index settings are static and can read from a file.
Current index settings:
filter
filter_autocomplete_ngram
filter_autocomplete_shingle
english_stopwords
normalizer
lowercase
analyzer
autocomplete_ngram
Tokenizers: lowercase
Filters: lowercase, asciifolding, filter_autocomplete_ngram
search_autocomplete
Tokenizers: lowercase
Filters: lowercase, asciifolding
dym_trigram
Tokenizers: lowercase
Filters: standard, lowercase, asciifolding, filter_autocomplete_shingle
autocomplete_shingle
Tokenizers: lowercase
Filters: lowercase, asciifolding, filter_autocomplete_shingle
dmp_ngram_analyzer
Tokenizer: dmp_ngram_tokenizer
Filters: lowercase, asciifolding
dmp_standard_english
Tokenizers: standard
Filters: lowercase, english_stopword, asciifolding
tokenizer
dmp_ngram_tokenizer