Vector Search Benchmarking along with OLTP #716

wahajali · 2024-07-03T05:29:51Z

Recent work in the Postgres space has added multiple database extensions that enable vector search in Postgres; pgvector is one example. Although there are dedicated benchmarks for semantic workloads only (such as ann-benchmarks), I'm interested in using HammerDB and adding some vector queries to the mix, since one of the advantages of Postgres is getting OLTP/OLAP and vector capabilities in a single database. For example, I would use the TPC-C benchmark scripts and add a vector query to the mix.
The above seems doable. However, in addition to TPS, there is another metric that is important when benchmarking vector datasets: "recall". Recall requires post-processing the results of the SQL SELECT queries and comparing them to precomputed data for each query. This is where I'm unsure whether HammerDB has support for (or examples of) something similar having been done before, or even the option to add a new metric.

sm-shaw · 2024-07-03T09:16:26Z

Maintainer

So yes, HammerDB is scripted, so you can change the workloads to do anything you wish. Also yes, vector databases in general are becoming popular, and I agree that the use of vector queries in regular RDBMS databases rather than standalone databases is the direction things appear to be heading. There is a lot of existing data already in PostgreSQL and MySQL/MariaDB, e.g. https://mariadb.org/projects/mariadb-vector/

For recall metrics, it would be best to provide an example of what you are doing and what the metrics would look like, such as the percentage of relevant results returned, and latency. For latency, we already have the xtprof time profiler, which overloads proc. If you look in the module xtprof-1.0.tm, you can see that we limit this overload to the TPROC-C stored procedures (neword, payment, etc.). However, if you modified this and added, e.g., a proc called "recall" to the list in the xtprof module and to the driver script, then HammerDB would report the latencies for this proc as well. There is also an original, more basic timing module called etprof that can be explored.
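To make the idea of per-proc latency capture concrete, here is a minimal Tcl sketch. It does not reproduce xtprof, which overloads proc itself; it only wraps a call explicitly and records the elapsed time. The names timed_call and ::timings, and the stand-in body of the "recall" proc, are assumptions made for illustration.

```tcl
# Illustrative only: this is NOT the xtprof code.
# ::timings maps a proc name to a list of elapsed times in microseconds.
set ::timings [dict create]

# Run the named proc with its arguments and record how long it took.
proc timed_call {name args} {
    set start [clock microseconds]
    set result [uplevel 1 [list $name {*}$args]]
    dict lappend ::timings $name [expr {[clock microseconds] - $start}]
    return $result
}

# Hypothetical stand-in for a proc that would run a vector query.
proc recall {k} {
    return "top-$k"
}

puts [timed_call recall 10]
puts "samples recorded: [llength [dict get $::timings recall]]"
```

Percentiles or averages could then be derived from the recorded samples once the run has finished.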

It is worth noting that we still have an open issue, #123, to add the TPROC-CH workload, which is a mix of OLTP and OLAP, and it sounds like this would work well with it.

4 replies

wahajali Jul 3, 2024
Author

For semantic search, there is normally a ground truth data file, which holds a list of vectors (the set of query data) along with the K actual nearest neighbours of each. Recall is the fraction of the actual neighbours that appear among the approximate nearest neighbours returned. The other important metrics that vector database benchmarks report are QPS (similar to TPS) and latency.
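The recall calculation described above can be sketched in a few lines of Tcl. This is a minimal sketch under stated assumptions: the proc name recall_at_k and the sample ids are made up, and both arguments are assumed to be plain lists of row ids (one from the query, one from the ground truth file).

```tcl
# Recall for one query: the fraction of the true nearest neighbours
# that appear in the approximate result set.
# The proc name and the id lists are illustrative, not part of HammerDB.
proc recall_at_k {returned_ids true_ids} {
    if {[llength $true_ids] == 0} {
        error "ground truth list is empty"
    }
    # Build a lookup of the true neighbours for fast membership tests.
    set truth [dict create]
    foreach id $true_ids {
        dict set truth $id 1
    }
    # Count each returned id at most once.
    set hits 0
    foreach id [lsort -unique $returned_ids] {
        if {[dict exists $truth $id]} {
            incr hits
        }
    }
    return [expr {double($hits) / [llength $true_ids]}]
}

# 3 of the 4 true neighbours (3, 7 and 9) were returned.
puts [recall_at_k {7 3 9 12} {3 7 9 42}]   ;# prints 0.75
```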

Thanks for the pointers to xtprof, I will check this out.

sm-shaw Jul 4, 2024
Maintainer

Examples of how you are adding vector queries to the mix would be of interest, as would timings if you can add them with xtprof.
It seems like there is potential for #123 to develop TPROC-CHV, a combined OLTP/OLAP/Vector workload.

wahajali Jul 8, 2024
Author

@sm-shaw I'm interested in pursuing this further and contributing back, initially for the OLTP and Vector workloads. I wanted to bounce some ideas off you:

Loading vector data - I've looked at different vector database benchmarks, and they have datasets in either Parquet or HDF5 format. These are significantly smaller than the same dataset in, say, CSV or other text-based formats. There is also support in Python for these formats, and I would like to re-use that where possible. I've read (not tried it yet) that HammerDB provides a Python interface, and I'm thinking of going this route for building the vector dataset. The second option is using the "Datagen" capabilities; however, I'm unsure how I can use Parquet or other such binary formats with Tcl. Does the first approach sound like a reasonable route?
Creating Index - There are already multiple vector extensions within Postgres, and each supports different search and query parameters. Even within a single extension, pgvector, there are two index types (ivfflat and hnsw), each with its own set of parameters. pgvectorscale and pgvecto.rs are two other vector search extensions in the Postgres space. Index build and search parameters would need to be dynamic/configurable to support this diverse set of extensions and their configurations.
Vector Queries - The next task would be to add vector queries to the OLTP mix. At this point, I'm thinking of modifying the user query profile to incorporate vector queries alongside OLTP. The other option could be to have a separate user profile that only does vector queries and runs in parallel with the virtual users running OLTP queries. Essentially, we would have X virtual users doing OLTP and Y virtual users running a separate query profile limited to vector queries. I'm unsure if there is an example of the latter currently implemented, i.e. two different query sets being run in parallel.
Metrics - I'm talking especially about recall here, as this is a new metric. Let's say we add a new proc with a SQL query to fetch the nearest neighbours for a given vector. Per my understanding, the current implementation checks whether the query succeeded; however, now we need to do some computation on the results of the query to calculate the recall for each, based on some precomputed data (i.e. the actual neighbours of the search vector). This adds some "computation" which delays the next query if done "in-line" synchronously. Alternatively, we can store the results temporarily and do these computations after execution, or in a separate thread in parallel?
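As an illustration of the configurable index parameters raised under "Creating Index", here is a minimal Tcl sketch that builds a CREATE INDEX statement from a dict of settings. The dict keys (table, column, method, opclass, params), the proc name build_index_sql, and the table and column names are assumptions made for this example. The WITH options shown (m and ef_construction for hnsw, lists for ivfflat) are pgvector's index build options; other extensions would need their own settings.

```tcl
# Build a CREATE INDEX statement from a dict of settings.
# The dict layout is hypothetical, not a HammerDB dictionary.
proc build_index_sql {cfg} {
    # Turn the params dict into "name = value" pairs.
    set opts {}
    dict for {name value} [dict get $cfg params] {
        lappend opts "$name = $value"
    }
    set sql [format "CREATE INDEX ON %s USING %s (%s %s)" \
        [dict get $cfg table] [dict get $cfg method] \
        [dict get $cfg column] [dict get $cfg opclass]]
    # Only add a WITH clause when build options were supplied.
    if {[llength $opts] > 0} {
        append sql " WITH ([join $opts {, }])"
    }
    return $sql
}

set hnsw [dict create table items column embedding method hnsw \
    opclass vector_l2_ops params {m 16 ef_construction 64}]
set ivf [dict create table items column embedding method ivfflat \
    opclass vector_l2_ops params {lists 100}]

puts [build_index_sql $hnsw]
puts [build_index_sql $ivf]
```

Search-time settings, such as pgvector's hnsw.ef_search or ivfflat.probes, would be applied separately with SET before running the queries. The sketch does no quoting or validation of identifiers, so it assumes the settings come from trusted configuration.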

sm-shaw Jul 9, 2024
Maintainer

Loading vector data - Python is used as an alternative interface for the CLI; however, none of the database interactions run with Python because of the performance issues that the GIL creates. This is described here: https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-faster-than-python-for-database-benchmarking/. For file formats, there is an HDF5 interface for Tcl here, https://github.com/BessyHDFViewer/HDFpp ("Interface for Tcl (and optionally Python) to read HDF4 and HDF5 files"), which looks like the best approach. It would be necessary to add this to the compiled libraries in the ./libs directory.
Creating Index - Yes, this would need to be configurable, and HammerDB offers both a GUI and a CLI interface to all workloads. The GUI sets the dictionary values that you can see in the CLI with "print dict". In fact, you can set a configuration with the GUI and then run the workload with the CLI, and it will use the parameters you set.
Vector Queries - At the moment there is 1 script editor (although there is no technical limit to adding more; up to now the workloads have not needed them). The script in the script editor is sent to the virtual users (i.e. independent threads) to execute. Nevertheless, different virtual users can (and do) do different things. If you look in the timed driver script, for example, you can see the following:

set rema [ lassign [ findvuposition ] myposition totalvirtualusers ]
switch $myposition {
    1 {

So each virtual user knows its own position in the virtual user count and can do different things accordingly. The VUs can also interact with each other in a thread safe way.

Metrics - HammerDB already does asynchronous processing using the "promise" package to implement keying and thinking time. This way, HammerDB can scale up to 1000s of connections. When you select asynchronous, each virtual user runs multiple sessions, or async clients. The key and think time is handled asynchronously, and when it completes, the virtual user runs the transaction for the client that has woken up. This way, one VU can run many connections. Therefore, if there are additional calculations that need to be done, this could be a good way to do them, as long as they are not so computationally intensive that they prevent other work from happening at the same time. Another alternative is coroutines: https://wiki.tcl-lang.org/page/coroutine.
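One simple way to keep the extra calculation out of the query path is to record only the returned ids during the run and compute recall afterwards. The Tcl below is a minimal sketch of that deferred approach, not of the promise-based mechanism described above. The names record_result, mean_recall and ::results, and the shape of the ground truth dict (query id mapped to a list of true neighbour ids), are assumptions for illustration.

```tcl
# During the run: only append the returned ids, which is cheap.
# Each virtual user would keep its own list in this sketch.
set ::results {}
proc record_result {query_id returned_ids} {
    lappend ::results $query_id $returned_ids
}

# After the run: compare every stored result with the ground truth,
# a dict mapping query id -> list of true neighbour ids.
proc mean_recall {results ground_truth} {
    set total 0.0
    set n 0
    foreach {query_id returned_ids} $results {
        set truth [dict get $ground_truth $query_id]
        set hits 0
        foreach id $truth {
            if {$id in $returned_ids} { incr hits }
        }
        set total [expr {$total + double($hits) / [llength $truth]}]
        incr n
    }
    if {$n == 0} { return 0.0 }
    return [expr {$total / $n}]
}

record_result q1 {1 2 3 4}
record_result q2 {5 6 9 9}
set truth [dict create q1 {1 2 3 4} q2 {5 6 7 8}]
puts [mean_recall $::results $truth]   ;# prints 0.75
```

Here q1 finds all four true neighbours (recall 1.0) and q2 finds two of four (recall 0.5), giving a mean of 0.75. The trade-off is memory: the stored results grow with the number of vector queries executed.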

wahajali · 2024-09-30T17:37:30Z

Author

@sm-shaw As part of implementing the above for Postgres, I'm trying to find a good value for the ramp-up time. Since this is a mixed workload, there is naturally contention between the two workloads for memory (shared_buffers in the case of Postgres). I was looking for some guidance on what the criteria should be for selecting the ramp-up in such a scenario.

3 replies

sm-shaw Oct 1, 2024
Maintainer

There is no hard and fast answer. If we look at a (very old) screenshot I came across while doing the new dark theme, you can see an example of a ramp-up with Oracle; in this case it took 3 mins.

However, it depends on multiple factors, such as the hardware and whether the workload has been run before so that the data is already cached. The default rampup is 2 mins, with 5 mins of timing, which provides a compromise that has worked well so far, so that would be a good starting point.

wahajali Oct 3, 2024
Author

Thanks @sm-shaw. I am varying the ratio of OLTP to Vector workloads when running the mixed workload. Would it make sense to have different ramp-ups for each ratio?

sm-shaw Oct 3, 2024
Maintainer

I would suggest the best solution would be to have it user configurable. The rampup is very dependent on CPU, memory and I/O so each individual case has the potential to be different.

JoshInnis · 2024-10-03T05:08:07Z

Hello,

As a contributor to this repo and the core contributor to a Postgres extension repo that does ANN, I have a question: are the OLTP/OLAP and Vector runs isolated other than the fact that they happen on the same machine?

In other words, other than the fact that these runs happen on the same machine, is one benchmark expected to have an impact on the other benchmark?

4 replies

wahajali Oct 3, 2024
Author

Hi @JoshInnis
Currently, I've only done Vector and OLTP (no OLAP yet).

The workloads are isolated to the extent that the schemas for the OLTP and Vector workloads are independent (i.e. separate tables, but in the same DB). Since they share the same DB and underlying resources (such as shared_buffers), there is naturally some contention for resources. So yes, one benchmark is going to have an impact on the other (I'm looking at NOPM vs Vector QPS, for example).

From HammerDB's point of view, we have multiple VUs: some run the standard TPROC-C workload, and some run only Vector workloads. This ratio is configurable.
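A split like this could be driven by each VU's position, along the lines of the switch on $myposition shown earlier in the thread. The following Tcl is a minimal sketch and not the actual implementation: the proc name vu_role and the vector_pct parameter are made up, and it assumes position 1 is reserved for monitoring, with the remaining VUs acting as workers.

```tcl
# Decide what a virtual user should run, given its position.
# Assumption: position 1 monitors; positions 2..N are workers.
proc vu_role {myposition totalvirtualusers vector_pct} {
    if {$myposition == 1} {
        return monitor
    }
    set workers [expr {$totalvirtualusers - 1}]
    # Number of workers dedicated to vector queries, rounded to nearest.
    set vector_vus [expr {int(round($workers * $vector_pct / 100.0))}]
    # Worker index runs from 1 to $workers.
    set idx [expr {$myposition - 1}]
    if {$idx <= $vector_vus} {
        return vector
    }
    return oltp
}

# 9 VUs: 1 monitor plus 8 workers, 25% of the workers run vector queries.
for {set p 1} {$p <= 9} {incr p} {
    puts "VU $p: [vu_role $p 9 25]"
}
```

With these inputs, VU 1 is the monitor, VUs 2 and 3 run vector queries, and VUs 4 to 9 run OLTP.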

For now, I have used VectorDBBench (another benchmark, for Vector-only workloads) to insert the dataset and create the index, due to the lack of support for the Parquet file format in Tcl.

Later, it would be interesting to have a schema where there is a direct correlation between the OLTP data and the vector dataset.

sm-shaw Oct 3, 2024
Maintainer

Here is a Tcl module (BSD license) https://github.com/ray2501/DrillREST that could be added to the HammerDB modules directory. It says it supports "Querying a Parquet File". (We already have the json module).

Query the region.parquet and nation.parquet files in the sample-data directory on your local file system. To view the data in the region.parquet file, use the actual path to your Drill installation to construct this query:

package require DrillREST
package require json

# Connect to the REST endpoint of a running Drill instance
set mydrill [DrillREST new http://localhost:8047]
# The backslash-newline continues the SQL string onto the next line
set result [$mydrill query "SELECT * FROM \
    dfs.`/home/danilo/Programs/apache-drill-1.6.0/sample-data/region.parquet`"]

# The response is JSON; convert it to a Tcl dict and print each row
set parse_result [json::json2dict $result]
set rows [dict get $parse_result rows]
foreach row $rows {
    puts "=================="
    foreach {key value} $row {
        puts "$key - $value"
    }
}

There may be something here that can be used or modified.

sm-shaw Oct 3, 2024
Maintainer

Also, this comment from Sergei Golubchik of MariaDB in the article here does seem to validate the approach https://mariadb.com/resources/blog/how-fast-is-mariadb-vector/
At the end, I want to reiterate — do not use ann-benchmarks as your only criterion when choosing a vector database, it has a very limited scope. First, it inserts all vectors from the train dataset then it searches for vectors from the test dataset. It does not try updates and deletes, it does not try concurrent vector index modifications, it does not try ACID, does not try hybrid searches where nearest neighbor vector search is combined with conditions on other columns and joins. It does test two basic characteristics of the ANN algorithm — how fast it can build a vector index for a given recall and how fast it can search with a given recall. And it does that fairly well. But it is absolutely not enough to make a decision of what vector search algorithm (or vector search database) will work best in your real-life application.

JoshInnis Oct 5, 2024

Oh, I never saw that statement before, but I think it's right. Vector-based indices expect your dataset to be complete and done. That's kinda what makes vectorscale interesting. All vector index white papers at this point seem to start off with the premise "Okay, so we have our dataset and now we are going to make an index off of it", and the implementations have to deal with the realities of that statement.

The index that can give a search that gives good recall and performance, AND can handle incremental updates is the index that will win... at least that's what I think.
