Multicolumn ANN indexes on a hypertable #134

hamishc · 2024-09-23T09:09:04Z

Hi! I want to run ANN search on time-series data, so I'm trying to index my tables on multiple columns (the embedding column and the timestamp column) to take full advantage of Timescale hypertable functionality. I can't find any documentation on how to do this.

e.g. I would like something like

CREATE INDEX my_index ON my_hypertable USING diskann (timestamp, embedding);

It seems like pgvector supports conditional indexing only (e.g. CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WHERE (category_id = 123);) but for obvious reasons this isn't available for time-based partitions.

It would be a major advantage for us to be able to query on long-term timeseries data, so we'd love to see this added if it's not already available. If it isn't, is this functionality possible or on the roadmap as an enhancement at some point?

cevian · 2024-09-23T20:19:31Z

@hamishc Vector indexes cannot be multi-column right now. What you want to do instead is use time-based table partitioning with Timescale's hypertables and then create a regular diskann index on the embedding column. That way, query execution will go approximately as follows:

1. The query planner will exclude any chunks (partitions) that cannot contain matching data, based on the time constraints in your query.
2. For each chunk that matched, the index on that chunk will return the rows with the closest vectors.
3. The executor will then filter out any rows that don't match the time filter.

Step 1 makes sure most of the irrelevant data based on the time constraints are thrown away quickly. Step 2 uses the full power of the vector index. Step 3 does the final cleanup.
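
For illustration, here is a minimal sketch of that setup. The table name comes from the example above; the column types, the 3-dimensional vector, the `vector_cosine_ops` operator class, the index name, and the date range are all assumptions, so adjust them to your schema and to the pgvectorscale version you have installed.

```sql
-- Assumed setup: the timescaledb extension is already installed.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- CASCADE also installs pgvector

-- Hypothetical schema; "timestamp" is quoted because it is also a type name.
CREATE TABLE my_hypertable (
    id          bigint,
    "timestamp" timestamptz NOT NULL,
    embedding   vector(3)
);

-- Partition by time. Chunk exclusion relies on the partitioning itself,
-- not on an index over the time column.
SELECT create_hypertable('my_hypertable', 'timestamp', create_default_indexes => FALSE);

-- Single-column vector index; it is built per chunk.
CREATE INDEX my_hypertable_embedding_idx
    ON my_hypertable USING diskann (embedding vector_cosine_ops);

-- The time filter lets the planner skip chunks (step 1), each remaining chunk's
-- index returns its nearest rows (step 2), and the filter is applied to them (step 3).
EXPLAIN
SELECT id, "timestamp"
FROM my_hypertable
WHERE "timestamp" >= '2024-09-01' AND "timestamp" < '2024-09-08'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;
```

In the `EXPLAIN` output, only chunks overlapping the requested time range should appear, each scanned through its own diskann index.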

hamishc · 2024-09-24T02:52:01Z

Oh, so hypertables don't actually need indexes on the time column in order to use the partitions? When I created the hypertable I ran it with create_default_indexes => FALSE, so I assumed any indexing had to be on both the desired column and the time column (this is what the Timescale docs seem to suggest).

I've validated with the query planner that it's using the indexes and only running on the requested partition, so it's working either way! Thanks for your help!

cevian added the "question" label (Further information is requested) on Sep 23, 2024
