Querying Hive partitioning parquets is slow #173

xqe2011 · 2024-11-07T11:32:53Z

What happens?

Recently, we tried this extension instead of a standalone DuckDB instance. When we run a simple SELECT query on Parquet files, it is 2-20 times slower than DuckDB.

Profiling method
Run SELECT duckdb_execute($$SET enable_profiling='query_tree'$$); and then watch the logs.

To Reproduce

Query one field: SELECT name FROM public.table1 where code1 = 3261 and code2 = '001' and code3 = '5204' and code4 = '1'
code1 and code2 are partition fields.

DuckDB run on the CLI: 0.0190s
DuckDB run inside this extension: 0.0291s
Total time using this extension: 0.513s

Query multiple fields: SELECT name, level, detxlen, detylen, downid FROM public.table1 where code1 = 3261 and code2 = '001' and code3 = '5204' and code4 = '1'
code1 and code2 are partition fields.

DuckDB run on the CLI: 0.0413s
DuckDB run inside this extension: 0.037s
Total time using this extension: 0.552s
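
To make the gap easier to see, here is a rough sketch of the arithmetic behind these timings. It assumes the "total time" figure includes the DuckDB time measured inside the extension, so the difference is time spent elsewhere (planning, setup, or transfer; the report does not say which). The dictionary keys and the overhead helper are names made up for this illustration.

```python
# Timings in seconds, copied from the measurements in this report.
timings = {
    "one field": {"cli": 0.0190, "in_extension": 0.0291, "total": 0.513},
    "multiple fields": {"cli": 0.0413, "in_extension": 0.037, "total": 0.552},
}


def overhead(t):
    """Time not accounted for by DuckDB execution inside the extension.

    Assumption: the total includes the in-extension DuckDB time.
    """
    return t["total"] - t["in_extension"]


for name, t in timings.items():
    extra = overhead(t)
    share = extra / t["total"]          # fraction of total spent outside DuckDB
    slowdown = t["total"] / t["cli"]    # end-to-end time relative to the CLI run
    print(f"{name}: overhead {extra:.3f}s, {share:.0%} of total, {slowdown:.1f}x CLI")
```

Under that assumption, DuckDB execution itself is close to the CLI numbers in both cases, and most of the total time is spent outside it.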

OS:

Ubuntu Server 22.04.3

ParadeDB Version:

paradedb/paradedb:16-v0.11.1

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Helm Chart

Full Name:

Liu Qijie

Affiliation:

Dongguan University of Technology

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

Yes, I have

philippemnoel · 2024-11-07T20:02:19Z

Thanks for opening this! We would love your help debugging it, or help from anyone else willing to assist here :)

xqe2011 added the bug label Nov 7, 2024

philippemnoel added the good first issue, help wanted, priority-medium, and user-request labels Nov 7, 2024
