forked from pingcap/docs-cn
TiDB vector search doc (pingcap#18502)
1 parent 4ffae15, commit 65ad55d. Showing 22 changed files with 3,424 additions and 3 deletions.
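---
title: Vector Data Types
summary: Learn about the vector data types in TiDB.
---

# Vector Data Types

A vector is a sequence of floating-point numbers, such as `[0.3, 0.5, -0.1, ...]`. TiDB provides dedicated vector data types for efficiently storing and accessing the vector embeddings that are widely used in AI applications.

> **Warning:**
>
> This feature is experimental and is not recommended for production environments. It might change without prior notice. If you find a bug, report an [issue](https://github.com/pingcap/tidb/issues) on GitHub.

The following vector data types are currently available:

- `VECTOR`: stores a sequence of single-precision floating-point numbers (Float). The vector can have any number of dimensions.
- `VECTOR(D)`: stores a sequence of single-precision floating-point numbers (Float). The number of dimensions is fixed to `D`.

Compared with the [`JSON`](/data-type-json.md) type, the vector types have the following advantages:

- Vector index support. You can build a [vector search index](/vector-search-index.md) to speed up queries.
- Dimension enforcement. After you specify a fixed dimension, vectors with a different dimension are rejected when written to the table.
- Optimized storage format. The vector data types are optimized specifically for vector data and outperform `JSON` in both space utilization and performance.
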
## Syntax

You can use a string in the following format to represent a vector value:

```sql
'[<float>, <float>, ...]'
```

Example:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR(3)
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]');

INSERT INTO vector_table VALUES (2, NULL);
```

If you insert a string that does not conform to this syntax as vector data, TiDB returns an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (3, '[5, ]');
ERROR 1105 (HY000): Invalid vector text: [5, ]
```

In the following example, the dimension of the `embedding` column was defined as `3` when the table was created, so inserting a vector with a different dimension returns an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]');
ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3)
```

For all functions and operators that the vector data types support, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

For more information about vector search indexes, see [Vector Search Index](/vector-search-index.md).

## Store vectors with different dimensions

If you omit the dimension parameter of the `VECTOR` type, you can store vectors with different dimensions in the same column:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- Insert a 3-dimensional vector
INSERT INTO vector_table VALUES (2, '[0.3, 0.5]');       -- Insert a 2-dimensional vector
```
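
When a column holds vectors of mixed dimensions, it can help to check which dimensions are actually present. The following is a minimal, untested sketch that assumes the `vector_table` defined above and uses `VEC_DIMS`; the aliases `dims` and `row_count` are chosen here for illustration only:

```sql
-- Count how many rows exist for each vector dimension.
-- Rows where embedding is NULL are expected to fall into a NULL group.
SELECT VEC_DIMS(embedding) AS dims, COUNT(*) AS row_count
FROM vector_table
GROUP BY dims
ORDER BY dims;
```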
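
Note that a column storing vectors of different dimensions does not support a [vector search index](/vector-search-index.md), because vector distances can only be computed between vectors of the same dimension.
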
## Comparison

Vector data supports [comparison operators](/vector-search-functions-and-operators.md#扩展的内置函数和运算符) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For all functions and operators that the vector data types support, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

When comparing vectors, TiDB compares them element by element, in order. For example:

- `[1] < [12]`
- `[1,2,3] < [1,2,5]`
- `[1,2,3] = [1,2,3]`
- `[2,2,3] > [1,2,3]`

When two vectors have different dimensions, TiDB compares them in lexicographical order, according to the following rules:

- The elements of the two vectors are compared numerically, one by one.
- The first pair of elements that differ determines the result of the comparison between the two vectors.
- If one vector is a prefix of the other, the vector with fewer dimensions is **less than** the vector with more dimensions. For example, `[1,2,3] < [1,2,3,0]`.
- Two vectors of the same length with identical elements are **equal**.
- An empty vector is **less than** any non-empty vector. For example, `[] < [1]`.
- Two empty vectors are **equal**.

When comparing vectors, explicitly convert string data to the vector type (for example, with `VEC_FROM_TEXT` or `CAST(... AS VECTOR)`), so that TiDB does not compare the values as strings:

```sql
-- The given values are actually strings, so TiDB compares them as strings
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Explicitly convert to the vector type so that the vector comparison rules apply
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```
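
The rules for vectors of different dimensions can be tried out with statements like the following. This is an untested sketch: the results noted in the comments are what the rules above imply, not captured output, and it assumes that `'[]'` is accepted as the text form of an empty vector.

```sql
-- Prefix rule: [1,2,3] is a prefix of [1,2,3,0], so this should evaluate to 1.
SELECT VEC_FROM_TEXT('[1,2,3]') < VEC_FROM_TEXT('[1,2,3,0]');

-- The first differing element decides, regardless of length:
-- 2 < 3 at the second position, so this should evaluate to 1.
SELECT VEC_FROM_TEXT('[1,2,9,9]') < VEC_FROM_TEXT('[1,3]');

-- An empty vector is less than any non-empty vector, so this should evaluate to 1.
SELECT VEC_FROM_TEXT('[]') < VEC_FROM_TEXT('[1]');
```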

## Arithmetic

The vector data types support the arithmetic operations `+` and `-`, which perform element-wise addition and subtraction on two vectors. Arithmetic between vectors of different dimensions is not supported and returns an error.

Examples:

```sql
[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]');
+---------------------------------------------+
| VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]') |
+---------------------------------------------+
| [9]                                         |
+---------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]');
+-----------------------------------------------------+
| VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') |
+-----------------------------------------------------+
| [1,1,1]                                             |
+-----------------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[1,2,3]');
ERROR 1105 (HY000): vectors have different dimensions: 1 and 3
```
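
Because the result of `-` is itself a vector, it can be passed on to other vector functions. The sketch below is untested and assumes that `VEC_L2_NORM` and `VEC_L2_DISTANCE` are available as listed on the vector functions page. Mathematically, the L2 norm of a difference equals the L2 distance between the two operands, so both statements would be expected to return the same value:

```sql
-- L2 norm of the element-wise difference [1,1,1]
SELECT VEC_L2_NORM(VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]'));

-- L2 distance between the same two vectors
SELECT VEC_L2_DISTANCE(VEC_FROM_TEXT('[2,3,4]'), VEC_FROM_TEXT('[1,2,3]'));
```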

## Type conversion (Cast)

### Conversion between vectors and strings

You can use the following functions to convert between vectors and strings:

- `CAST(... AS VECTOR)`: converts a string to a vector
- `CAST(... AS CHAR)`: converts a vector to a string
- `VEC_FROM_TEXT`: converts a string to a vector
- `VEC_AS_TEXT`: converts a vector to a string

For ease of use, if a function accepts only vector data types (for example, the vector distance functions), you can pass a correctly formatted string directly, and TiDB converts it implicitly:

```sql
-- VEC_DIMS accepts only the vector type, so you can pass a string directly and TiDB implicitly converts it to a vector:
[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]');
+------------------------------+
| VEC_DIMS('[0.3, 0.5, -0.1]') |
+------------------------------+
|                            3 |
+------------------------------+
1 row in set (0.01 sec)

-- You can also use VEC_FROM_TEXT to explicitly convert the string to a vector before passing it to VEC_DIMS:
[tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]'));
+---------------------------------------------+
| VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) |
+---------------------------------------------+
|                                           3 |
+---------------------------------------------+
1 row in set (0.01 sec)

-- You can also use CAST(... AS VECTOR) for the explicit conversion:
[tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR));
+----------------------------------------------+
| VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) |
+----------------------------------------------+
|                                            3 |
+----------------------------------------------+
1 row in set (0.01 sec)
```

When an operator or function accepts multiple data types, TiDB does not perform implicit conversion. Explicitly convert the string to a vector before passing it to such an operator or function. For example, before a comparison, you need to explicitly convert the strings to vectors. Otherwise, TiDB compares the values as strings rather than as vectors:

```sql
-- The values passed in are strings, so TiDB compares them as strings:
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Convert to the vector type so that the vector comparison rules apply:
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```

Vectors can also be explicitly converted to strings. The following example uses the `VEC_AS_TEXT()` function:

```sql
-- The string is first implicitly converted to a vector and then explicitly converted back to a string, so a normalized string format is returned:
[tidb]> SELECT VEC_AS_TEXT('[0.3,     0.5,  -0.1]');
+--------------------------------------+
| VEC_AS_TEXT('[0.3,     0.5,  -0.1]') |
+--------------------------------------+
| [0.3,0.5,-0.1]                       |
+--------------------------------------+
1 row in set (0.01 sec)
```
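
The list of conversion functions above also mentions `CAST(... AS CHAR)`. As a brief, untested sketch, the following statements show how it might be used in place of `VEC_AS_TEXT`; the second statement assumes a table like the `vector_table` from the earlier examples:

```sql
-- Convert a vector value to its string form with CAST
SELECT CAST(VEC_FROM_TEXT('[0.3, 0.5, -0.1]') AS CHAR);

-- Read a stored vector column as a string
SELECT id, CAST(embedding AS CHAR) AS embedding_text
FROM vector_table;
```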
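
For other conversion functions, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

### Conversion between vectors and other data types

Currently, TiDB cannot convert directly between vectors and other data types (such as `JSON`). However, you can use a string as an intermediate type within the SQL statement.
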
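
Using a string as the intermediate type could look like the following. This is an untested sketch that assumes the text form of a JSON array of numbers is valid vector text, and that casting a string to `JSON` parses it as in MySQL:

```sql
-- JSON -> string -> vector
SELECT VEC_FROM_TEXT(CAST(JSON_ARRAY(0.3, 0.5, -0.1) AS CHAR));

-- vector -> string -> JSON
SELECT CAST(VEC_AS_TEXT(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) AS JSON);
```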
Note that a vector column stored in a table cannot be converted to another data type using `ALTER TABLE ... MODIFY COLUMN ...`.

## Restrictions

For restrictions on the vector types, see [Vector search limitations](/vector-search-limitations.md) and [Vector search index restrictions](/vector-search-index.md#使用限制).

## MySQL compatibility

The vector data types are specific to TiDB and are not supported in MySQL.

## See also

- [Vector Functions and Operators](/vector-search-functions-and-operators.md)
- [Vector Search Index](/vector-search-index.md)