
Hybrid Search #3549

Open
t83714 opened this issue Jul 1, 2024 · 0 comments
t83714 commented Jul 1, 2024

Hybrid Search

Power the existing search APIs with hybrid search, combining semantic/vector search with full-text keyword-based search for better search results.

This is the first step in developing an LLM-powered search engine. The motivation for this work is to:

  • Enhance the existing search API / interface with hybrid search, without breaking changes
  • Kick off the development of supporting infrastructure (e.g. an embedding API)
  • Offer an interface powered by semantic search (part of the hybrid search query) that can be used for agent-driven user interaction

Acceptance Criteria

  • No changes to existing search API interfaces
  • Reasonable memory requirements
  • Offer performance comparable to current versions
  • Vector search covers the following fields for now (see the mapping sketch after this list):
    • Dataset title
    • Dataset description
    • Distribution title
    • Distribution description
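
A minimal sketch of what the index mapping could look like, using the OpenSearch JavaScript client. The index name, field names, and HNSW parameters are illustrative assumptions rather than the final design; the single `embedding` field anticipates the aggregated-text approach discussed under Technical Notes below.

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://localhost:9200" });

// Hypothetical index layout: the four covered text fields for keyword
// search, plus one knn_vector field holding the embedding of their
// aggregated text (gte-base-en-v1.5 outputs 768-dimensional vectors).
await client.indices.create({
  index: "datasets",
  body: {
    settings: { "index.knn": true },
    mappings: {
      properties: {
        title: { type: "text" },
        description: { type: "text" },
        distributionTitle: { type: "text" },
        distributionDescription: { type: "text" },
        embedding: {
          type: "knn_vector",
          dimension: 768,
          method: { name: "hnsw", space_type: "cosinesimil", engine: "lucene" },
        },
      },
    },
  },
});
```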

Technical Notes

Based on recent research & evaluation:

  • Instead of letting OpenSearch generate embeddings using local or remote models, we will create our own embedding service (a sketch follows the notes below). The deployment of this service has been moved here: https://github.com/magda-io/magda-embedding-api
    • Whether running local models or remote models via APIs, OpenSearch's model registry / deployment process involves a few async API calls, which is hard to manage.
    • OpenSearch's local model serving is built on a Java solution that does not appear stable (as of v2.15.0); I have run into issues/errors with large models.
    • Performance considerations (based on my own rough tests):
      • A Node.js-based solution serving an ONNX-format model (gte-base-en-v1.5) can produce embeddings for 3 short strings in around 40ms–60ms.
      • A Python solution (using SentenceTransformers): around 500ms.
      • OpenSearch local models: around 140ms–160ms.
    • It might also be easier to accommodate a customised search/indexing design with our own solution.
  • Recent models come with larger max sequence limits (8k to 32k) and can offer similar or better performance across different tasks with query-side instructions only (i.e. no index-side instructions). See the Hugging Face MTEB leaderboard.
    • No need to have different vector fields for different tasks.
    • For performance and deployment cost (especially RAM), a single vector field can store the embedding of a text that aggregates information from multiple fields (see the hybrid query sketch below).
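
A minimal sketch of the Node.js embedding path described above, using transformers.js to run the model via ONNX Runtime. Whether gte-base-en-v1.5 ships transformers.js-compatible ONNX weights is an assumption here, and the `embed` helper is illustrative:

```typescript
import { pipeline } from "@xenova/transformers";

// Load the ONNX model once at startup (runs on onnxruntime under the hood).
const extractor = await pipeline(
  "feature-extraction",
  "Alibaba-NLP/gte-base-en-v1.5"
);

// Hypothetical helper: embed a batch of short strings.
export async function embed(texts: string[]): Promise<number[][]> {
  // gte models use CLS pooling; normalizing lets cosine similarity
  // reduce to a dot product on the OpenSearch side.
  const output = await extractor(texts, { pooling: "cls", normalize: true });
  return output.tolist();
}
```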

[Diagram: embedding API architecture (draw.io)]
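
For the query side, a sketch of how the existing keyword query could be combined with a k-NN clause via OpenSearch's hybrid query and a normalization search pipeline (available since OpenSearch 2.10). The pipeline name, weights, and field names are assumptions for illustration, and `client` / `embed` are from the sketches above:

```typescript
// Register a search pipeline that min-max normalizes the keyword and
// vector scores and merges them with a weighted arithmetic mean.
await client.transport.request({
  method: "PUT",
  path: "/_search/pipeline/hybrid-pipeline",
  body: {
    phase_results_processors: [
      {
        "normalization-processor": {
          normalization: { technique: "min_max" },
          combination: {
            technique: "arithmetic_mean",
            parameters: { weights: [0.4, 0.6] },
          },
        },
      },
    ],
  },
});

const userQuery = "rainfall data";
const [queryVector] = await embed([userQuery]);

// The existing full-text query stays as-is; the knn clause is added
// alongside it, so the public search API surface does not change.
const result = await client.transport.request({
  method: "POST",
  path: "/datasets/_search?search_pipeline=hybrid-pipeline",
  body: {
    query: {
      hybrid: {
        queries: [
          {
            multi_match: {
              query: userQuery,
              fields: [
                "title",
                "description",
                "distributionTitle",
                "distributionDescription",
              ],
            },
          },
          { knn: { embedding: { vector: queryVector, k: 50 } } },
        ],
      },
    },
  },
});
```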

Blocked by magda-io/magda-embedding-api#1
