
Product Quantization #79

Merged (26 commits, Feb 21, 2024)
Commits
59f17e3 Implement product quantization in lantern cli (var77, Feb 10, 2024)
3a6fd65 Process all splits in parallel (var77, Feb 11, 2024)
3a32b9f Add subvector-id argument and ability to horizontally scale the runni… (var77, Feb 11, 2024)
dfb41c7 Fix progress tracking for pq (var77, Feb 11, 2024)
6ccc001 Fix indexing bug (var77, Feb 11, 2024)
ffa9a49 Parallelize vector compression (var77, Feb 12, 2024)
41c12ec Parallelize data fetching and export (var77, Feb 12, 2024)
16011ba Refactor and separate code parts (var77, Feb 14, 2024)
d2fbdf8 Refactor code, pack arguments in a struct (var77, Feb 15, 2024)
606ac54 Add gcp batch job flow (var77, Feb 15, 2024)
dc6ef0f Add tests for lantern_pq (var77, Feb 16, 2024)
cfa1dab Add action to push cli image to GCR (var77, Feb 16, 2024)
4cc5977 Remove unnecessary arguments (var77, Feb 16, 2024)
1155b57 Rename codebook table and params to match lantern pq (var77, Feb 19, 2024)
50fdb62 Fix naming issues, add --dataset-limit argument (var77, Feb 20, 2024)
d3e4575 Conditionally publish latest tag for cli docker image (var77, Feb 20, 2024)
000e628 Use renamed lantern access method (Ngalstyan4, Feb 17, 2024)
92aeef0 Release v0.2.0 (Ngalstyan4, Feb 17, 2024)
2a342e6 Temporarily change lantern tag for testing before lantern is released (Ngalstyan4, Feb 17, 2024)
70f284c Implement pq-quantization in external index construction (Ngalstyan4, Feb 20, 2024)
19b7ba8 Fix codebook offset bug (Ngalstyan4, Feb 20, 2024)
5adcf10 Set pq parameter in index construction when importing (Ngalstyan4, Feb 20, 2024)
a467cb0 Fix codebook lifetime bug in rust<->C interface (Ngalstyan4, Feb 21, 2024)
d107f81 Prepare for release (Ngalstyan4, Feb 21, 2024)
dcdfcd9 Fix naming for uppercase table names, check if codebook table exists … (var77, Feb 21, 2024)
2dc901f Add pq argument for external index reindexing (var77, Feb 21, 2024)
37 changes: 34 additions & 3 deletions .github/workflows/publish-cli-docker.yaml
@@ -2,11 +2,16 @@ name: publish-cli-docker
on:
workflow_dispatch:
inputs:
LATEST:
type: boolean
description: "Publish as latest release"
required: false
default: false
VERSION:
type: string
description: "CLI version"
required: true
default: "0.0.38"
default: "0.0.39"
IMAGE_NAME:
type: string
description: "Container image name to tag"
@@ -34,12 +39,38 @@ jobs:
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build and push
- name: Login to GCR Container Registry
uses: docker/login-action@v3
with:
registry: ${{ secrets.GCP_REGION }}-docker.pkg.dev
username: _json_key_base64
password: ${{ secrets.GCP_CREDENTIALS_JSON_B64 }}
- name: Build and push without latest tags
uses: docker/build-push-action@v5
id: build_image
if: ${{ inputs.LATEST == false || inputs.LATEST == 'false' }}
with:
context: .
platforms: linux/amd64
file: Dockerfile.cli${{ (matrix.device == 'gpu' && '.cuda' || '') }}
push: true
tags: |
${{ inputs.IMAGE_NAME }}:${{ inputs.VERSION }}-${{ matrix.device }}
${{ secrets.GCP_REGION }}-docker.pkg.dev/${{ secrets.GCP_PROJECT_ID }}/${{ inputs.IMAGE_NAME }}:${{ inputs.VERSION }}-${{ matrix.device }}
- name: Build and push with latest tags
uses: docker/build-push-action@v5
id: build_image_latest
if: ${{ inputs.LATEST == true || inputs.LATEST == 'true' }}
with:
context: .
platforms: linux/amd64
file: Dockerfile.cli${{ (matrix.device == 'gpu' && '.cuda' || '') }}
push: true
# the :latest tag will refer to cpu version
tags: ${{ (matrix.device == 'cpu' && format('{0}:latest', inputs.IMAGE_NAME) || format('{0}:gpu', inputs.IMAGE_NAME)) }},${{ inputs.IMAGE_NAME }}:latest-${{ matrix.device }},${{ inputs.IMAGE_NAME }}:${{ inputs.VERSION }}-${{ matrix.device }}
tags: |
${{ (matrix.device == 'cpu' && format('{0}:latest', inputs.IMAGE_NAME) || format('{0}:gpu', inputs.IMAGE_NAME)) }}
${{ inputs.IMAGE_NAME }}:latest-${{ matrix.device }}
${{ inputs.IMAGE_NAME }}:${{ inputs.VERSION }}-${{ matrix.device }}
${{ (matrix.device == 'cpu' && format('{0}-docker.pkg.dev/{1}/{2}:latest', secrets.GCP_REGION, secrets.GCP_PROJECT_ID, inputs.IMAGE_NAME) || format('{0}-docker.pkg.dev/{1}/{2}:gpu', secrets.GCP_REGION, secrets.GCP_PROJECT_ID, inputs.IMAGE_NAME)) }}
${{ secrets.GCP_REGION }}-docker.pkg.dev/${{ secrets.GCP_PROJECT_ID }}/${{ inputs.IMAGE_NAME }}:latest-${{ matrix.device }}
${{ secrets.GCP_REGION }}-docker.pkg.dev/${{ secrets.GCP_PROJECT_ID }}/${{ inputs.IMAGE_NAME }}:${{ inputs.VERSION }}-${{ matrix.device }}
1 change: 1 addition & 0 deletions Cargo.toml
@@ -13,6 +13,7 @@ members = [
"lantern_cli",
"lantern_daemon",
"lantern_index_autotune",
"lantern_pq",
]

[profile.release]
53 changes: 53 additions & 0 deletions README.md
@@ -347,3 +347,56 @@ CREATE TABLE "public"."index_parameter_experiment_results" (
build_time DOUBLE PRECISION NULL
);
```

## Lantern PQ

### Description

Use external product quantization to compress table vectors using k-means clustering.
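
For example, with `--splits 32` and `--clusters 256`, a 128-dimensional vector is divided into 32 subvectors of 4 dimensions each; k-means finds 256 centroids for each subvector position, and each subvector is then stored as the id of its nearest centroid, so a vector is represented by roughly 32 one-byte codes instead of 128 floats.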

### Usage

Run `lantern-cli pq-table --help` to see the available CLI options.

The job can be run either on a local machine or using GCP Batch jobs, which parallelize the workload over hundreds of VMs to speed up clustering.

To run locally, use:

```bash
lantern-cli pq-table --uri 'postgres://[email protected]:5432/postgres' --table sift10k --column v --clusters 256 --splits 32
```

The job will run on the current machine, utilizing all available cores.

For large datasets (over 1M rows) it is more convenient to run the job using GCP Batch jobs.
Make sure you have GCP credentials set up before running this command:

```bash
lantern-cli pq-table --uri 'postgres://[email protected]:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --run-on-gcp
```

If you prefer to orchestrate the tasks yourself on your own on-premise servers, follow these 3 steps:

1. Run the setup job. This will create the necessary tables and add a `pqvec` column to the target table.

```bash
lantern-cli pq-table --uri 'postgres://[email protected]:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --skip-codebook-creation --skip-vector-compression
```

2. Run the clustering job. This will create a codebook for the table and export it to a Postgres table.

```bash
lantern-cli pq-table --uri 'postgres://[email protected]:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --skip-table-setup --skip-vector-compression --parallel-task-count 10 --subvector-id 0
```

In this case the command should be run 32 times, once for each subvector in the range [0-31]; `--parallel-task-count` means that at most 10 tasks will run in parallel, which keeps the job under the Postgres max connection limit (see the orchestration sketch after these steps).

3. Run the compression job. This will compress the vectors using the generated codebook and export the results into the `pqvec` column.

```bash
lantern-cli pq-table --uri 'postgres://[email protected]:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --skip-table-setup --skip-codebook-creation --parallel-task-count 10 --total-task-count 10 --compression-task-id 0
```

In this case the command should be run 10 times, once for each compression task id in the range [0-9]; `--parallel-task-count` again caps the number of tasks running in parallel at 10 to avoid exceeding the Postgres max connection limit.

The table should have a primary key for this job to work. If the primary key is not named `id`, provide it using the `--pk` argument.
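
As an illustration, below is a minimal orchestration sketch for steps 2 and 3, assuming all jobs are launched sequentially from a single machine with the same example flags as above (in practice they can be distributed across separate servers):

```bash
# Illustrative sketch only: one clustering job per subvector (step 2),
# then one compression job per task id (step 3).
URI='postgres://[email protected]:5432/postgres'

for i in $(seq 0 31); do
  lantern-cli pq-table --uri "$URI" --table sift10k --column v \
    --clusters 256 --splits 32 \
    --skip-table-setup --skip-vector-compression \
    --parallel-task-count 10 --subvector-id "$i"
done

for i in $(seq 0 9); do
  lantern-cli pq-table --uri "$URI" --table sift10k --column v \
    --clusters 256 --splits 32 \
    --skip-table-setup --skip-codebook-creation \
    --parallel-task-count 10 --total-task-count 10 --compression-task-id "$i"
done
```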
2 changes: 1 addition & 1 deletion ci/scripts/build.sh
@@ -48,7 +48,7 @@ function setup_postgres() {
}

function setup_lantern() {
LANTERN_VERSION=v0.1.1
LANTERN_VERSION=main
git clone --recursive https://github.com/lanterndata/lantern.git /tmp/lantern
pushd /tmp/lantern
git checkout ${LANTERN_VERSION} && \
3 changes: 2 additions & 1 deletion lantern_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "lantern_cli"
version = "0.0.38"
version = "0.0.39"
edition = "2021"

[[bin]]
@@ -16,3 +16,4 @@ lantern_embeddings = { path = "../lantern_embeddings" }
lantern_daemon = { path = "../lantern_daemon" }
lantern_logger = { path = "../lantern_logger" }
lantern_index_autotune = { path = "../lantern_index_autotune" }
lantern_pq = { path = "../lantern_pq" }
3 changes: 3 additions & 0 deletions lantern_cli/src/cli.rs
@@ -3,6 +3,7 @@ use lantern_daemon::cli::DaemonArgs;
use lantern_embeddings::cli::{EmbeddingArgs, MeasureModelSpeedArgs, ShowModelsArgs};
use lantern_external_index::cli::CreateIndexArgs;
use lantern_index_autotune::cli::IndexAutotuneArgs;
use lantern_pq::cli::PQArgs;

#[derive(Subcommand, Debug)]
pub enum Commands {
@@ -18,6 +19,8 @@ pub enum Commands {
MeasureModelSpeed(MeasureModelSpeedArgs),
/// Autotune index
AutotuneIndex(IndexAutotuneArgs),
/// Quantize table
PQTable(PQArgs),
/// Start in daemon mode
StartDaemon(DaemonArgs),
}
9 changes: 9 additions & 0 deletions lantern_cli/src/main.rs
@@ -1,8 +1,11 @@
use std::process;

use clap::Parser;
use lantern_daemon;
use lantern_embeddings;
use lantern_external_index;
use lantern_logger::{LogLevel, Logger};
use lantern_pq;
mod cli;

fn main() {
@@ -46,6 +49,11 @@ fn main() {
_main_logger = Some(logger.clone());
lantern_index_autotune::autotune_index(&args, None, None, Some(logger))
}
cli::Commands::PQTable(args) => {
let logger = Logger::new("Lantern PQ", LogLevel::Debug);
_main_logger = Some(logger.clone());
lantern_pq::quantize_table(args, None, None, Some(logger))
}
cli::Commands::StartDaemon(args) => {
let logger = Logger::new("Lantern Daemon", args.log_level.value());
_main_logger = Some(logger.clone());
@@ -56,5 +64,6 @@
let logger = _main_logger.unwrap();
if let Err(e) = res {
logger.error(&e.to_string());
process::exit(1);
}
}
27 changes: 27 additions & 0 deletions lantern_pq/Cargo.toml
@@ -0,0 +1,27 @@
[package]
name = "lantern_pq"
version = "0.0.1"
edition = "2021"

[lib]
crate-type = ["lib"]

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
clap = { version = "4.4.0", features = ["derive"] }
anyhow = "1.0.75"
postgres = "0.19.7"
lantern_logger = { path = "../lantern_logger" }
lantern_utils = { path = "../lantern_utils" }
rand = "0.8.5"
linfa-clustering = { version = "0.7.0", features = ["ndarray-linalg"] }
linfa = "0.7.0"
ndarray = { version = "0.15.6", features = ["rayon"] }
rayon = "1.8.1"
md5 = "0.7.0"
isahc = "1.7.2"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0.111"
gcp_auth = "0.10.0"
tokio = { version = "1.36.0", features = ["rt", "rt-multi-thread"] }
125 changes: 125 additions & 0 deletions lantern_pq/src/cli.rs
@@ -0,0 +1,125 @@
use clap::Parser;

#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
pub struct PQArgs {
/// Fully associated database connection string including db name
#[arg(short, long)]
pub uri: String,

/// Table name
#[arg(short, long)]
pub table: String,

/// Schema name
#[arg(short, long, default_value = "public")]
pub schema: String,

/// Column name to quantize
#[arg(short, long)]
pub column: String,

/// Name for codebook table
#[arg(long)]
pub codebook_table_name: Option<String>,

/// Dataset limit. Limit should be greater or equal to cluster count
#[arg(long)]
pub dataset_limit: Option<usize>,

/// Cluster count for kmeans
#[arg(long, default_value_t = 256)]
pub clusters: usize,

/// Subvector count to split vector
#[arg(long, default_value_t = 1)]
pub splits: usize,

/// Subvector part to process
#[arg(long)]
pub subvector_id: Option<usize>,

/// If true, codebook table will not be created and pq column will not be added to table. So
/// they should be set up externally
#[arg(long, default_value_t = false)]
pub skip_table_setup: bool,

/// If true vectors will not be quantized and exported to the table
#[arg(long, default_value_t = false)]
pub skip_vector_quantization: bool,

/// If true codebook will not be created
#[arg(long, default_value_t = false)]
pub skip_codebook_creation: bool,

/// Primary key of the table, needed for quantization job
#[arg(long, default_value = "id")]
pub pk: String,

/// Number of total tasks running (used in gcp batch jobs)
#[arg(long)]
pub total_task_count: Option<usize>,

/// Number of tasks running in parallel (used in gcp batch jobs)
#[arg(long)]
pub parallel_task_count: Option<usize>,

/// Task id of currently running quantization job (used in gcp batch jobs)
#[arg(long)]
pub quantization_task_id: Option<usize>,

// GCP ARGS
/// If true job will be submitted to gcp
#[arg(long, default_value_t = false)]
pub run_on_gcp: bool,

/// Image tag to use for GCR. example: 0.0.38-cpu
#[arg(long)]
pub gcp_cli_image_tag: Option<String>,

/// GCP project ID
#[arg(long)]
pub gcp_project: Option<String>,

/// GCP region. Default: us-central1
#[arg(long)]
pub gcp_region: Option<String>,

/// Full GCR image name. default: {gcp_region}-docker.pkg.dev/{gcp_project_id}/lanterndata/lantern-cli:{gcp_cli_image_tag}
#[arg(long)]
pub gcp_image: Option<String>,

/// Task count for quantization. default: calculated automatically based on dataset size
#[arg(long)]
pub gcp_quantization_task_count: Option<usize>,

/// Parallel tasks for quantization. default: calculated automatically based on
/// max connections
#[arg(long)]
pub gcp_quantization_task_parallelism: Option<usize>,

/// Parallel tasks for quantization. default: calculated automatically based on
/// max connections and dataset size
#[arg(long)]
pub gcp_clustering_task_parallelism: Option<usize>,

/// If image is hosted on GCR this will speed up the VM startup time
#[arg(long, default_value_t = true)]
pub gcp_enable_image_streaming: bool,

/// CPU count for one VM in clustering task. default: calculated based on dataset size
#[arg(long)]
pub gcp_clustering_cpu: Option<usize>,

/// Memory GB for one VM in clustering task. default: calculated based on CPU count
#[arg(long)]
pub gcp_clustering_memory_gb: Option<usize>,

/// CPU count for one VM in quantization task. default: calculated based on dataset size
#[arg(long)]
pub gcp_quantization_cpu: Option<usize>,

/// Memory GB for one VM in quantization task. default: calculated based on CPU count
#[arg(long)]
pub gcp_quantization_memory_gb: Option<usize>,
}