Excessive data transfer when querying S3 .duckdb files (including INDEX queries)
#1575
Replies: 1 comment
I am by no means an expert in DuckDB, S3, or web development in general, but from my understanding, reading directly from S3 will require you to load the whole file before you can query it, because there's no way for DuckDB to know in advance how to download just the parts of the file you need for the query. My strategy for dealing with DuckDB and other client-side application data transfers is the Origin Private File System (OPFS). This is a fairly new browser technology: a bucket on the client where you can store and persist files exclusive to the origin (your site or application), so you can use simple if/else JavaScript checks to see whether a file is already in the bucket before making a new request to the remote storage, and so on. I can share the basic TypeScript functions I have created to deal with this; they have worked so far:

```ts
// This is the main function you have to call inside your application; edit it accordingly.
export async function handleOpfs() {
  const opfsRoot = await navigator.storage.getDirectory();
  const files: string[] = [];
  for await (const name of opfsRoot.keys()) {
    files.push(name);
  }
  // Only fetch and persist the file in the OPFS if it doesn't already exist there.
  if (!files.includes("MY_PARQUET.parquet")) {
    const fileBufferArray = await FetchFromS3(); // your function to fetch from S3
    const fileHandle = await loadFileHandle("MY_PARQUET.parquet");
    if (fileBufferArray) {
      await writeFile(fileHandle, fileBufferArray);
    }
  }
}
export async function loadFileHandle(file_name: string) {
  const opfsRoot = await navigator.storage.getDirectory();
  const fileHandle = await opfsRoot.getFileHandle(file_name, {
    create: true,
  });
  return fileHandle;
}
export async function writeFile(
  fileHandle: FileSystemFileHandle,
  contents: ArrayBuffer
) {
  // Create a FileSystemWritableFileStream to write to.
  const writable = await fileHandle.createWritable();
  // Write the contents of the file to the stream.
  await writable.write(contents);
  // Close the stream, which commits the contents to disk.
  await writable.close();
}
export async function readFile(fileHandle: FileSystemFileHandle) {
  // Get the File object backing this handle.
  const file = await fileHandle.getFile();
  return file;
}
export async function FetchFromS3() {
  try {
    const response = await fetch(
      "https://my-s3-storage-or-something.com/MY_PARQUET.parquet",
      {
        method: "GET",
        cache: "no-cache",
      }
    );
    // Check if the request was successful.
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    // Get an ArrayBuffer from the response body.
    const buffer = await response.arrayBuffer();
    return buffer;
  } catch (error) {
    console.error("There was a problem with the fetch operation: ", error);
  }
}
```
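
For completeness, here is a rough sketch (my addition, not one of the functions above) of how the persisted OPFS file could then be handed to duckdb-wasm for querying. It assumes you already have an instantiated `AsyncDuckDB` instance, and it uses `registerFileBuffer`, which is the duckdb-wasm API I know of for exposing local bytes to the engine; check your version's docs before relying on it.

```ts
import * as duckdb from "@duckdb/duckdb-wasm";

// Sketch: read the file persisted by handleOpfs() above and register it with
// duckdb-wasm under a virtual filename, so queries never touch the network.
export async function queryFromOpfs(db: duckdb.AsyncDuckDB) {
  const fileHandle = await loadFileHandle("MY_PARQUET.parquet");
  const file = await readFile(fileHandle);
  await db.registerFileBuffer(
    "MY_PARQUET.parquet",
    new Uint8Array(await file.arrayBuffer())
  );
  const conn = await db.connect();
  const result = await conn.query(
    "select count(*) from read_parquet('MY_PARQUET.parquet')"
  );
  console.log(result.toArray());
  await conn.close();
}
```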
I'm seeing unexpected amounts of data transfer when querying `.duckdb` files on S3:

Methods

For each `size` in (100k, 200k, 500k, 1MM, 2MM, 4MM, 6MM), I made:
- a `.duckdb` file with `size` rows (blue dots above).
- a `.duckdb` file with the same rows, plus a `UNIQUE INDEX` on column `id` (red dots above).

The plots show total data transfer when running these queries against each `.duckdb` file (a rough sketch of running them from duckdb-wasm follows the list):
- `select * from crashes where id=50000` (all `.duckdb` files generated above contain one row with this `id`)
- `select * from crashes limit 1`
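
For reference, running these queries from duckdb-wasm looks roughly like the sketch below. The bundle setup and `open()` config follow my understanding of the `@duckdb/duckdb-wasm` async API and may need adjusting for your version; the S3 URL and function name are placeholders.

```ts
import * as duckdb from "@duckdb/duckdb-wasm";

async function queryRemoteDuckDb() {
  // Standard duckdb-wasm setup: pick a bundle and spin up the async engine.
  const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
  // Wrap the worker script in a same-origin blob URL, as in the duckdb-wasm docs.
  const workerUrl = URL.createObjectURL(
    new Blob([`importScripts("${bundle.mainWorker!}");`], { type: "text/javascript" })
  );
  const worker = new Worker(workerUrl);
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

  // Open the remote .duckdb file read-only over HTTPS; ideally only the
  // byte ranges needed by each query would be fetched.
  await db.open({
    path: "https://my-bucket.s3.amazonaws.com/crashes_200k_indexed.duckdb", // placeholder
    accessMode: duckdb.DuckDBAccessMode.READ_ONLY,
  });

  const conn = await db.connect();
  const byId = await conn.query("select * from crashes where id=50000");
  const firstRow = await conn.query("select * from crashes limit 1");
  console.log(byId.numRows, firstRow.numRows);
  await conn.close();
}
```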
Full repo is at duckdb-wasm-test, including the code that generates the `.duckdb` files.

Questions

- Querying one row by `id`, given a `UNIQUE INDEX` on `id`, transfers >10MB for all sizes ≥200k rows.
- Fetching one row (`limit 1`) transfers between 6MB and 16MB, also surely higher than necessary.

Motivation
I want to randomly access specific rows from `.duckdb` files in S3. I realize that's outside DuckDB's normal OLAP focus, but it should be possible to support efficiently, especially with indexes. sql.js-httpvfs does a good job of this for SQLite, but my `.sqlite` files are ≈4x the size of the equivalent `.duckdb`s (due to lack of columnar compression). I'm working on a similar benchmark of data transferred with sql.js-httpvfs, though, and will follow up here.

If duckdb-wasm supported this well, it would enable static webapps to query huge remote datasets, which would be very powerful. I'm interested to hear whether it's possible today, or what it would take to enable it.
xref: this is similar to #407, but I'd expect `.duckdb`s (especially with indices) to be able to perform much better than the raw `.parquet`s discussed there.

xref: #1577 ("Reading a remote parquet file with a simple WHERE clause results in loading more than twice its size.")