Excessive data transfer when querying S3 .duckdb files (including INDEX queries)
#1575
Replies: 1 comment
I am by no means an expert in DuckDB, S3, or web development in general, but from my understanding, reading directly from S3 will require you to load the whole file before you can query it, because there's no way for DuckDB to know in advance how to download just the parts of the file you need for the query. My strategy for dealing with DuckDB and other client-side application data transfers is the Origin Private File System (OPFS). This is a fairly new browser technology: a bucket on the client where you can store and persist files exclusive to the origin (your site or application), so you can use simple if/else JavaScript checks to see whether a file is already in the bucket before making a new request to the remote storage, and so on. I can share the basic TypeScript functions I have created to deal with this; they have worked so far:

```ts
// This is the main function you have to call inside your application; edit it accordingly.
export async function handleOpfs() {
  const opfsRoot = await navigator.storage.getDirectory();
  const files: string[] = [];
  for await (const name of opfsRoot.keys()) {
    files.push(name);
  }
  // Only fetch and persist the file in the OPFS if it doesn't already exist there.
  if (!files.includes("MY_PARQUET.parquet")) {
    const fileBufferArray = await FetchFromS3(); // your function to fetch from S3
    const fileHandle = await loadFileHandle("MY_PARQUET.parquet");
    if (fileBufferArray) {
      await writeFile(fileHandle, fileBufferArray);
    }
  }
}
export async function loadFileHandle(file_name: string) {
  const opfsRoot = await navigator.storage.getDirectory();
  const fileHandle = await opfsRoot.getFileHandle(file_name, {
    create: true,
  });
  return fileHandle;
}
export async function writeFile(
  fileHandle: FileSystemFileHandle,
  contents: ArrayBuffer
) {
  // Create a FileSystemWritableFileStream to write to.
  const writable = await fileHandle.createWritable();
  // Write the contents of the file to the stream.
  await writable.write(contents);
  // Close the stream, which commits the contents to disk.
  await writable.close();
}
export async function readFile(fileHandle: FileSystemFileHandle) {
  // Get the File object backing this handle.
  const file = await fileHandle.getFile();
  return file;
}
export async function FetchFromS3() {
  try {
    const response = await fetch(
      "https://my-s3-storage-or-something.com/MY_PARQUET.parquet",
      {
        method: "GET",
        cache: "no-cache",
      }
    );
    // Check if the request was successful.
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    // Get an ArrayBuffer from the response body.
    const buffer = await response.arrayBuffer();
    return buffer;
  } catch (error) {
    console.error("There was a problem with the fetch operation: ", error);
  }
}
```
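
For completeness, here is a rough sketch (my addition, not one of the functions above) of how the persisted OPFS file could then be handed to duckdb-wasm for querying. It assumes you already have an instantiated `AsyncDuckDB` instance, and it uses `registerFileBuffer`, which is the duckdb-wasm API I know of for exposing local bytes to the engine; check your version's docs before relying on it.

```ts
import * as duckdb from "@duckdb/duckdb-wasm";

// Sketch: read the file persisted by handleOpfs() above and register it with
// duckdb-wasm under a virtual filename, so queries never touch the network.
export async function queryFromOpfs(db: duckdb.AsyncDuckDB) {
  const fileHandle = await loadFileHandle("MY_PARQUET.parquet");
  const file = await readFile(fileHandle);
  await db.registerFileBuffer(
    "MY_PARQUET.parquet",
    new Uint8Array(await file.arrayBuffer())
  );
  const conn = await db.connect();
  const result = await conn.query(
    "select count(*) from read_parquet('MY_PARQUET.parquet')"
  );
  console.log(result.toArray());
  await conn.close();
}
```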
I'm seeing unexpected amounts of data transfer when querying `.duckdb` files on S3:

Methods

For each `size` in (100k, 200k, 500k, 1MM, 2MM, 4MM, 6MM), I made:
- a `.duckdb` file with `size` rows (blue dots above).
- a `.duckdb` file with the same rows, plus a `UNIQUE INDEX` on column `id` (red dots above).

The plots show total data transfer when running these queries against each `.duckdb` file (a rough sketch of running them from duckdb-wasm follows the list):
- `select * from crashes where id=50000` (all `.duckdb` files generated above contain one row with this `id`)
- `select * from crashes limit 1`
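
For reference, running these queries from duckdb-wasm looks roughly like the sketch below. The bundle setup and `open()` config follow my understanding of the `@duckdb/duckdb-wasm` async API and may need adjusting for your version; the S3 URL and function name are placeholders.

```ts
import * as duckdb from "@duckdb/duckdb-wasm";

async function queryRemoteDuckDb() {
  // Standard duckdb-wasm setup: pick a bundle and spin up the async engine.
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  // Wrap the worker script in a same-origin blob URL, as in the duckdb-wasm docs.
  const workerUrl = URL.createObjectURL(
    new Blob([`importScripts("${bundle.mainWorker!}");`], { type: "text/javascript" })
  );
  const worker = new Worker(workerUrl);
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

  // Open the remote .duckdb file read-only over HTTPS; ideally only the
  // byte ranges needed by each query would be fetched.
  await db.open({
    path: "https://my-bucket.s3.amazonaws.com/crashes_200k_indexed.duckdb", // placeholder
    accessMode: duckdb.DuckDBAccessMode.READ_ONLY,
  });

  const conn = await db.connect();
  const byId = await conn.query("select * from crashes where id=50000");
  const firstRow = await conn.query("select * from crashes limit 1");
  console.log(byId.numRows, firstRow.numRows);
  await conn.close();
}
```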
Full repo is at duckdb-wasm-test, including the code that generates the `.duckdb` files.

Questions

- Querying one row by `id`, given a `UNIQUE INDEX` on `id`, transfers >10MB for all sizes ≥200k rows.
- Fetching one row (`limit 1`) transfers between 6MB and 16MB, also surely higher than necessary.

Motivation
I want to randomly access specific rows from `.duckdb` files in S3. I realize that's outside DuckDB's normal OLAP focus, but it should be possible to support efficiently, especially with indexes. sql.js-httpvfs does a good job of this for SQLite, but my `.sqlite` files are ≈4x the size of the equivalent `.duckdb`s (due to lack of columnar compression). I'm working on a similar benchmark of data transferred with sql.js-httpvfs, though, and will follow up here.

If duckdb-wasm supported this well, it would enable static webapps to query huge remote datasets, which would be very powerful. I'm interested to hear whether it's possible today, or what it would take to enable it.
xref: this is similar to #407, but I'd expect `.duckdb`s (especially with indices) to be able to perform much better than the raw `.parquet`s discussed there.

xref: #1577 ("Reading a remote parquet file with a simple WHERE clause results in loading more than twice its size.")