chore: Removing copying data from dictionary values into CometDictionary #490
Conversation
```java
switch (values.getValueVector().getMinorType()) {
  case BIT:
    booleans = new boolean[numValues];
    for (int i = 0; i < numValues; i++) {
      booleans[i] = values.getBoolean(i);
    }
```
The recent performance regression hints that this copying from the dictionary into CometDictionary is slow. It seems we don't need to copy the dictionary values at all.
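For context, a minimal sketch of the direction this PR takes (class and method names besides the getters visible in the diff are placeholders, not the actual CometDictionary API): rather than eagerly materializing every dictionary entry into a primitive array, the typed decode methods can read straight from the wrapped vector.

```java
// Illustrative only: a dictionary wrapper that skips the eager copy loop
// shown above and reads from the underlying vector on demand.
final class LazyDictionarySketch {
  interface TypedVector {            // stand-in for the wrapped Arrow-backed vector
    boolean getBoolean(int index);
    long getLong(int index);
  }

  private final TypedVector values;

  LazyDictionarySketch(TypedVector values) {
    this.values = values;
  }

  // No booleans[]/longs[] arrays to fill up-front; each decode call is a
  // direct read against the wrapped vector.
  boolean decodeToBoolean(int index) {
    return values.getBoolean(index);
  }

  long decodeToLong(int index) {
    return values.getLong(index);
  }
}
```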
cc @sunchao
```java
    binaries[i] = new ByteArrayWrapper(bytes);
  }
  break;

byte[] bytes = new byte[DECIMAL_BYTE_WIDTH];
```
Hmm, will this be more expensive? Originally we would copy binaries into ByteArrayWrapper only once up-front, but now we are doing it for each decodeToBinary call?
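To make the concern concrete, here is a self-contained sketch (hypothetical, not the actual Comet classes) of the two read paths being compared: the old eager copy into per-entry buffers versus a fresh byte[] allocation and copy on every decode.

```java
import java.util.Arrays;

// Illustrative trade-off only: eager pays the binary copy once per dictionary
// entry at construction time, lazy pays it on every decodeToBinary-style call.
final class BinaryDictionaryTradeoffSketch {
  private static final int DECIMAL_BYTE_WIDTH = 16; // mirrors the constant in the diff

  private final byte[][] source; // stand-in for the Arrow dictionary values
  private final byte[][] eager;  // old style: copied once, up-front

  BinaryDictionaryTradeoffSketch(byte[][] source) {
    this.source = source;
    this.eager = new byte[source.length][];
    for (int i = 0; i < source.length; i++) {
      eager[i] = Arrays.copyOf(source[i], source[i].length); // one copy per entry
    }
  }

  // Old read path: plain array lookup, no copy per call.
  byte[] decodeEager(int index) {
    return eager[index];
  }

  // New read path: allocates and fills a fresh buffer on every call, which is
  // the extra per-call cost the comment above is asking about.
  byte[] decodeLazy(int index) {
    byte[] bytes = new byte[DECIMAL_BYTE_WIDTH];
    System.arraycopy(source[index], 0, bytes, 0, Math.min(source[index].length, DECIMAL_BYTE_WIDTH));
    return bytes;
  }
}
```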
I think for whole-stage codegen, the code that accesses values in rows is collapsed as a whole, and the value is held in a temporary variable in the generated Java code. One value should need only one decodeToBinary call.
For interpreted execution, though, each access to a value in a row calls these accessors again.
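Roughly, the shape being described might look like this (a sketch assuming a simple per-row loop; this is not actual Spark-generated code): each column value is pulled into a local once per row, so downstream expressions in the stage reuse that local instead of re-invoking the accessor, whereas interpreted evaluation calls the accessor again for every expression that touches the field.

```java
// Illustrative only: the whole-stage-codegen shape, where one decode per row
// feeds all downstream uses through a local variable.
final class CodegenShapeSketch {
  interface BinaryColumn {          // stand-in for a column vector backed by a dictionary
    byte[] getBinary(int rowId);    // may call decodeToBinary internally
    int numRows();
  }

  static long consume(BinaryColumn column) {
    long acc = 0;
    for (int rowId = 0; rowId < column.numRows(); rowId++) {
      byte[] value = column.getBinary(rowId); // one decode per row, held in a local
      acc += value.length;                    // downstream expressions reuse `value`
      acc ^= value.length;                    // ...rather than calling getBinary again
    }
    return acc;
  }
}
```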
We probably can benchmark it. cc @andygrove
Btw, except for the case where only CometScan is triggered, if there is any Comet native operator these copied dictionary values are not used at all. They are only there for row-based access by Spark operators.
> I think for whole-stage codegen, the code that accesses values in rows is collapsed as a whole, and the value is held in a temporary variable in the generated Java code. One value should need only one decodeToBinary call.

Hmm, I'm not sure. Suppose we have a Comet dictionary vector whose indices are one hundred 1's. With the current approach we'd need to call decodeToBinary on index 1 a hundred times?

> Btw, except for the case where only CometScan is triggered, if there is any Comet native operator these copied dictionary values are not used at all. They are only there for row-based access by Spark operators.

Yes, this is true.

I'm basically trying to understand whether the changes in this PR are necessary. Are they trying to address some perf regression we saw recently?
> I'm basically trying to understand whether the changes in this PR are necessary. Are they trying to address some perf regression we saw recently?

This is motivated by the recent regression on CometDictionary after re-initializing it when re-importing the array, though it is not directly related. It made me suspect that this copying might be bad for performance.

> Hmm, I'm not sure. Suppose we have a Comet dictionary vector whose indices are one hundred 1's. With the current approach we'd need to call decodeToBinary on index 1 a hundred times?

That's a good point, but it's an orthogonal concern. What I was talking about is access to the same field within a row.

Maybe we just need to do the copying for binary types? For other types like int, long, etc., I don't see a good reason to copy the values.
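A sketch of that suggestion (field and getter names are placeholders mirroring the diff above, not the actual CometDictionary code): keep the one-time up-front copy only for binary-backed minor types, and let fixed-width primitives be read directly from the wrapped vector.

```java
// Illustrative only: copy up-front only for binary-like minor types.
switch (values.getValueVector().getMinorType()) {
  case VARBINARY:
  case FIXEDSIZEBINARY:
    binaries = new ByteArrayWrapper[numValues];
    for (int i = 0; i < numValues; i++) {
      binaries[i] = new ByteArrayWrapper(values.getBinary(i)); // one-time copy
    }
    break;
  default:
    // INT, BIGINT, FLOAT4, FLOAT8, BIT, ...: no copy here; decodeToInt,
    // decodeToLong, etc. simply forward to values.getInt(index)/values.getLong(index).
    break;
}
```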
Yes, I think it should be OK to avoid copying for other types.
> We probably can benchmark it. cc @andygrove

I ran the TPC-H benchmark comparing main with this PR and did not see any significant difference. I think the changes in this PR make sense, though, and reduce some memory pressure.
Merged. Thanks @sunchao @andygrove
## Which issue does this PR close?

Part of #679 and #670. Related #490.

## Rationale for this change

For dictionary decimal vectors, it was unpacking even for Int and Long decimals, which used more memory than necessary.

## What changes are included in this PR?

Unpack only for Decimal 128.

## How are these changes tested?

Existing test.
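A rough sketch of what "unpack only for Decimal 128" could look like (hypothetical names and helper; the precision threshold follows the usual rule that up to 18 decimal digits fit in a long): decimals backed by int or long skip the byte[] unpacking entirely.

```java
// Illustrative only: restrict the eager byte[] unpacking to 16-byte decimals.
private static final int DECIMAL_BYTE_WIDTH = 16;
private static final int MAX_LONG_DECIMAL_DIGITS = 18; // precision <= 18 fits in a long

private void unpackDecimalsIfNeeded(int precision) {
  if (precision <= MAX_LONG_DECIMAL_DIGITS) {
    // Int- or long-backed decimal: decodeToInt/decodeToLong read directly,
    // no per-entry byte arrays are materialized.
    return;
  }
  // Decimal128: keep the up-front unpack into fixed-width byte arrays.
  binaries = new byte[numValues][];
  for (int i = 0; i < numValues; i++) {
    byte[] bytes = new byte[DECIMAL_BYTE_WIDTH];
    copyDecimalBytes(i, bytes); // hypothetical helper that fills the 16-byte value
    binaries[i] = bytes;
  }
}
```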
…ary (apache#490)
* chore: Removing copying data from dictionary values into CometDictionary
* For review
(cherry picked from commit 8c532be)