
Hybrid Search #3549

Open
t83714 opened this issue Jul 1, 2024 · 0 comments
t83714 commented Jul 1, 2024

Hybrid Search

Power the existing search APIs with hybrid search, combining semantic/vector search with full-text keyword-based search for better search results.

This is the first step in developing an LLM-powered search engine. The motivation for this work is to:

  • Enhance the existing search API / interface with hybrid search, without breaking changes
  • Kick off the development of supporting infrastructure (e.g. an embedding API)
  • Offer an interface powered by semantic search (part of the hybrid search query) that can be used for agent-driven user interaction

Acceptance Criteria

  • No changes to existing search API interfaces
  • Reasonable memory requirements
  • Offer performance comparable to current versions
  • Vector search covers the following fields for now (see the mapping sketch after this list):
    • Dataset title
    • Dataset description
    • Distribution title
    • Distribution description
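
A minimal sketch of what the index mapping could look like, using the OpenSearch JavaScript client. The index name, field names, and HNSW parameters are illustrative assumptions rather than the final design; the single `embedding` field anticipates the aggregated-text approach discussed under Technical Notes below.

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://localhost:9200" });

// Hypothetical index layout: the four covered text fields for keyword
// search, plus one knn_vector field holding the embedding of their
// aggregated text (gte-base-en-v1.5 outputs 768-dimensional vectors).
await client.indices.create({
  index: "datasets",
  body: {
    settings: { "index.knn": true },
    mappings: {
      properties: {
        title: { type: "text" },
        description: { type: "text" },
        distributionTitle: { type: "text" },
        distributionDescription: { type: "text" },
        embedding: {
          type: "knn_vector",
          dimension: 768,
          method: { name: "hnsw", space_type: "cosinesimil", engine: "lucene" },
        },
      },
    },
  },
});
```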

Technical Notes

Based on recent research & evaluation:

  • Instead of letting OpenSearch generate embeddings using local or remote models, we will create our own embedding service (a sketch follows the notes below). The deployment of this service has been moved here: https://github.com/magda-io/magda-embedding-api
    • Whether running local models or remote models via APIs, OpenSearch's model registry / deployment process involves a few async API calls, which is hard to manage.
    • OpenSearch's local model serving is built on a Java solution that does not appear stable (as of v2.15.0); I have run into issues/errors with large models.
    • Performance considerations (based on my own rough tests):
      • A Node.js-based solution serving an ONNX-format model (gte-base-en-v1.5) can produce embeddings for 3 short strings in around 40ms–60ms.
      • A Python solution (using SentenceTransformers): around 500ms.
      • OpenSearch local models: around 140ms–160ms.
    • It might also be easier to accommodate a customised search/indexing design with our own solution.
  • Recent models come with larger max sequence limits (8k to 32k) and can offer similar or better performance across different tasks with query-side instructions only (i.e. no index-side instructions). See the Hugging Face MTEB leaderboard.
    • No need to have different vector fields for different tasks.
    • For performance and deployment cost (especially RAM), a single vector field can store the embedding of a text that aggregates information from multiple fields (see the hybrid query sketch below).
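
A minimal sketch of the Node.js embedding path described above, using transformers.js to run the model via ONNX Runtime. Whether gte-base-en-v1.5 ships transformers.js-compatible ONNX weights is an assumption here, and the `embed` helper is illustrative:

```typescript
import { pipeline } from "@xenova/transformers";

// Load the ONNX model once at startup (runs on onnxruntime under the hood).
const extractor = await pipeline(
  "feature-extraction",
  "Alibaba-NLP/gte-base-en-v1.5"
);

// Hypothetical helper: embed a batch of short strings.
export async function embed(texts: string[]): Promise<number[][]> {
  // gte models use CLS pooling; normalizing lets cosine similarity
  // reduce to a dot product on the OpenSearch side.
  const output = await extractor(texts, { pooling: "cls", normalize: true });
  return output.tolist();
}
```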

[Diagram: embedding API architecture (draw.io)]
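
For the query side, a sketch of how the existing keyword query could be combined with a k-NN clause via OpenSearch's hybrid query and a normalization search pipeline (available since OpenSearch 2.10). The pipeline name, weights, and field names are assumptions for illustration, and `client` / `embed` are from the sketches above:

```typescript
// Register a search pipeline that min-max normalizes the keyword and
// vector scores and merges them with a weighted arithmetic mean.
await client.transport.request({
  method: "PUT",
  path: "/_search/pipeline/hybrid-pipeline",
  body: {
    phase_results_processors: [
      {
        "normalization-processor": {
          normalization: { technique: "min_max" },
          combination: {
            technique: "arithmetic_mean",
            parameters: { weights: [0.4, 0.6] },
          },
        },
      },
    ],
  },
});

const userQuery = "rainfall data";
const [queryVector] = await embed([userQuery]);

// The existing full-text query stays as-is; the knn clause is added
// alongside it, so the public search API surface does not change.
const result = await client.transport.request({
  method: "POST",
  path: "/datasets/_search?search_pipeline=hybrid-pipeline",
  body: {
    query: {
      hybrid: {
        queries: [
          {
            multi_match: {
              query: userQuery,
              fields: [
                "title",
                "description",
                "distributionTitle",
                "distributionDescription",
              ],
            },
          },
          { knn: { embedding: { vector: queryVector, k: 50 } } },
        ],
      },
    },
  },
});
```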

Blocked by magda-io/magda-embedding-api#1
