This custom runtime can be used to serve Sentence-Transformers models. It provides an OpenAI-compatible API, so you can call the model at `your-endpoint/v1/embeddings`.
The request format is:

```json
{
  "encoding_format": "float",
  "input": [
    "I am a sentence",
    "I am another sentence"
  ],
  "model": "model_name"
}
```
- "input" can be a string or an array of string.
- "model" can be any string, even empty, as the runtime serves only one model. It is there for OpenAI API compatibility.
- "encoding_format" is optional and defaults to "float".
Finally, Swagger documentation and a test interface are available at `your-endpoint/docs`.
You must first make sure that you have properly installed the necessary components of the Single-Model Serving stack, as documented here.
Once the stack is installed, adding the runtime is pretty straightforward:
- As an OpenShift AI admin, in the OpenShift AI Dashboard, open the menu `Settings -> Serving runtimes`.
- Click on `Add serving runtime`.
- For the type of model serving platforms this runtime supports, select `Single model serving platform`.
- Upload the file `sbert-runtime.yaml` from the current folder, or click `Start from scratch` and copy/paste its content. A CPU-only version of the runtime is also available in the corresponding file.
Two arguments are available in the runtime definition:

- `--model_path`: indicates where the model is stored. Defaults to `/mnt/models` for compatibility with OpenShift AI Model Serving.
- `--trust_remote_code`: may need to be set to `true` for some models. Defaults to `false`.
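For reference, these arguments would appear in the ServingRuntime definition along these lines. This is an illustrative sketch, not the actual content of `sbert-runtime.yaml`; the runtime name, image reference, and surrounding fields are placeholders:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: sbert-runtime          # placeholder name
spec:
  containers:
    - name: kserve-container
      image: your-registry/sbert-runtime:latest  # placeholder image
      args:
        - --model_path=/mnt/models      # default model location
        - --trust_remote_code=false     # set to true if the model requires it
```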
The runtime is now available when deploying a model.
This runtime can be used in exactly the same way as the out-of-the-box ones:
- Copy your model files to an object store bucket.
- Deploy the model from the Dashboard.
- Once the model is loaded, you can access the inference endpoint provided through the dashboard.
A notebook example is available here.
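Since the API is OpenAI-compatible, you can also query the deployed model with the official `openai` Python client. The base URL and API key below are placeholders; point the client at your inference endpoint:

```python
from openai import OpenAI

# Placeholder values: use your endpoint's URL; the API key can be any
# non-empty string if the endpoint does not enforce authentication.
client = OpenAI(base_url="https://your-endpoint/v1", api_key="none")

result = client.embeddings.create(
    model="model_name",  # any string; the runtime serves a single model
    input=["I am a sentence", "I am another sentence"],
)

print(len(result.data[0].embedding))  # dimension of the embedding vector
```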