[GLUTEN-7905][CH] Implement window's topk by aggregation #7976

Merged
merged 9 commits into from
Nov 28, 2024

@lgbo-ustc (Contributor) commented Nov 18, 2024

What changes were proposed in this pull request?

Fixes: #7905

This PR automatically uses aggregation to compute a window's top-k when the partition keys have low cardinality.
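
As a rough illustration of the idea (not the code in this PR), a query of the shape `row_number() over (partition by ... order by ...) ... where r <= k` can be answered by a per-key aggregation that keeps only the k best rows for each partition key, instead of sorting every partition in full. The Python sketch below compares both strategies on in-memory rows. The names `topk_by_window` and `topk_by_aggregation` are hypothetical and do not come from Gluten or ClickHouse.

```python
import bisect
from collections import defaultdict
from itertools import groupby


def topk_by_window(rows, part_key, order_key, k):
    """Baseline: full sort, then row_number() per partition, keep r <= k."""
    # Python's sort is stable, so ties keep their input order.
    ordered = sorted(rows, key=lambda r: (part_key(r), order_key(r)))
    result = []
    for _, group in groupby(ordered, key=part_key):
        for row_number, row in enumerate(group, start=1):
            if row_number > k:
                break
            result.append(row)
    return result


def topk_by_aggregation(rows, part_key, order_key, k):
    """Aggregation: one pass, keeping at most k rows per partition key."""
    state = defaultdict(list)  # partition key -> sorted list of (order, seq, row)
    for seq, row in enumerate(rows):
        best = state[part_key(row)]
        # seq breaks ties by input order and keeps rows from being compared.
        bisect.insort(best, (order_key(row), seq, row))
        if len(best) > k:
            best.pop()  # drop the current worst candidate
    return [row for key in sorted(state) for _, _, row in state[key]]
```

In this sketch the aggregation state grows with the number of distinct partition keys times k, so it only pays off when the keys have low cardinality, which matches the motivation stated above.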

How was this patch tested?

unit tests

Run Gluten Clickhouse CI on x86

@lgbo-ustc (Contributor Author) commented Nov 20, 2024

A benchmark on the following queries:

low cardinality partition keys

insert overwrite table dump_line 
select l_orderkey, l_partkey, l_suppkey, l_linenumber from (
  select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey order by l_orderkey, l_partkey) as r from tpch_pq.lineitem
) where r = 1;
  • before
0: jdbc:hive2://localhost:10000> insert overwrite table dump_line select l_orderkey, l_partkey, l_suppkey, l_linenumber from (select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey order by l_orderkey, l_partkey) as r from tpch_pq.lineitem) where r = 1;
+---------+
| Result  |
+---------+
+---------+
No rows selected (28.467 seconds)

  • after
0: jdbc:hive2://localhost:10000> insert overwrite table dump_line select l_orderkey, l_partkey, l_suppkey, l_linenumber from (select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey order by l_orderkey, l_partkey) as r from tpch_pq.lineitem) where r = 1;
+---------+
| Result  |
+---------+
+---------+
No rows selected (14.3 seconds)

high cardinality partition keys

  • before
0: jdbc:hive2://localhost:10000> insert overwrite table dump_line select l_orderkey, l_partkey, l_suppkey, l_linenumber from (select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey, l_orderkey  order by  l_partkey) as r from tpch_pq.lineitem) where r = 1;
+---------+
| Result  |
+---------+
+---------+
No rows selected (41.441 seconds)

  • after
0: jdbc:hive2://localhost:10000> insert overwrite table dump_line select l_orderkey, l_partkey, l_suppkey, l_linenumber from (select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey, l_orderkey  order by  l_partkey) as r from tpch_pq.lineitem) where r = 1;
+---------+
| Result  |
+---------+
+---------+
No rows selected (50.714 seconds)
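
To put the four timings above side by side, the short Python snippet below computes the before/after ratio for each case. The helper name `speedup` is made up for this illustration. The inputs are the wall-clock times reported above; a ratio above 1 means the new path is faster.

```python
def speedup(before_s, after_s):
    """Ratio of old to new wall-clock time; > 1 means the change is faster."""
    return before_s / after_s


# Timings copied from the benchmark output above (seconds).
low_cardinality = speedup(28.467, 14.3)
high_cardinality = speedup(41.441, 50.714)

print(f"low cardinality:  {low_cardinality:.2f}x")
print(f"high cardinality: {high_cardinality:.2f}x")
```

The low-cardinality query runs roughly twice as fast, while the high-cardinality query regresses to about 0.82x of its previous speed.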

@lgbo-ustc (Contributor Author) commented Nov 26, 2024

For high-cardinality partition keys, the implementation now falls back to window. We have the following results:

  • before
0: jdbc:hive2://localhost:10000> insert overwrite table dump_line select l_orderkey, l_partkey, l_suppkey, l_linenumber from (select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey, l_orderkey  order by  l_partkey) as r from tpch_pq.lineitem) where r = 1;
+---------+
| Result  |
+---------+
+---------+
No rows selected (26.549 seconds)
  • after
0: jdbc:hive2://localhost:10000> insert overwrite table dump_line select l_orderkey, l_partkey, l_suppkey, l_linenumber from (select l_orderkey, l_partkey, l_suppkey, l_linenumber, row_number() over (partition by l_suppkey, l_orderkey  order by  l_partkey) as r from tpch_pq.lineitem) where r = 1;
+---------+
| Result  |
+---------+
+---------+
No rows selected (25.58 seconds)
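
The comment above does not say how the fallback decision is made. One plausible approach, purely an assumption here, is to sample the input, estimate the ratio of distinct partition keys to rows, and keep the aggregation path only when that ratio stays below a threshold. In the Python sketch below, `choose_topk_strategy`, `sample_size`, and `max_distinct_ratio` are hypothetical names and values, not taken from the PR.

```python
from itertools import islice


def choose_topk_strategy(rows, part_key, sample_size=10_000, max_distinct_ratio=0.5):
    """Pick "aggregation" or "window" from a sample of the input (hypothetical heuristic)."""
    sample = list(islice(rows, sample_size))
    if not sample:
        return "window"  # nothing to measure; use the conservative default
    distinct_keys = len({part_key(row) for row in sample})
    ratio = distinct_keys / len(sample)
    # Few distinct keys per row means small aggregation state, so aggregation wins.
    return "aggregation" if ratio <= max_distinct_ratio else "window"
```

A real engine would likely base this on runtime statistics rather than a fixed sample; the threshold shown is arbitrary.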

@liuneng1994 (Contributor) left a comment

LGTM

* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <new>

Contributor:

Weird header file

Contributor Author:

It's auto-included by VS. So annoying.


#include <Poco/Logger.h>
#include <Common/logger_useful.h>
#include "base/defines.h"

Contributor:

Use <> instead of quotes for this include.

#include <Processors/IProcessor.h>
#include <Processors/QueryPlan/IQueryPlanStep.h>
#include <Processors/QueryPlan/ITransformingStep.h>
#include "Processors/Port.h"

Contributor:

Use <> here too.

#include <Processors/QueryPlan/QueryPlan.h>
#include <Poco/Logger.h>
#include <Common/logger_useful.h>
#include "Analyzer/IQueryTreeNode.h"

Contributor:

Use <> here too.

@liuneng1994 liuneng1994 merged commit 883d026 into apache:main Nov 28, 2024
8 checks passed
Successfully merging this pull request may close these issues.

[CH] topk of row_number