
chore: Removing copying data from dictionary values into CometDictionary #490

Merged
merged 2 commits into apache:main from remove_copying_dictionary_values
May 30, 2024

Conversation

viirya
Member

@viirya viirya commented May 29, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Comment on lines 103 to 108
switch (values.getValueVector().getMinorType()) {
  case BIT:
    booleans = new boolean[numValues];
    for (int i = 0; i < numValues; i++) {
      booleans[i] = values.getBoolean(i);
    }
Member Author

@viirya viirya May 29, 2024


The recent performance regression hints that this copying from the dictionary into CometDictionary is slow. It seems we don't need to copy the dictionary values at all.
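
A rough sketch of the direction (the class and accessor names below are simplified stand-ins, not the exact Comet code): instead of eagerly copying every dictionary entry into per-type arrays, CometDictionary can just hold the dictionary vector and decode each value on demand.

```java
// Sketch only: keep a reference to the dictionary values and decode lazily,
// instead of looping over all entries and copying them up front.
public class CometDictionarySketch {
  // stand-in for the wrapped Arrow dictionary vector used by CometDictionary
  public interface DictionaryValues {
    boolean getBoolean(int index);
    long getLong(int index);
    byte[] getBinary(int index);
  }

  private final DictionaryValues values;

  public CometDictionarySketch(DictionaryValues values) {
    this.values = values; // no boolean[]/long[]/ByteArrayWrapper[] copies here
  }

  public boolean decodeToBoolean(int index) {
    return values.getBoolean(index); // read directly from the vector
  }

  public long decodeToLong(int index) {
    return values.getLong(index);
  }

  public byte[] decodeToBinary(int index) {
    return values.getBinary(index);
  }
}
```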

@viirya viirya force-pushed the remove_copying_dictionary_values branch from 560d3d1 to c8d57ea on May 29, 2024 05:50
@viirya viirya changed the title from "fix: Removing copying data from dictionary values into CometDictionary" to "chore: Removing copying data from dictionary values into CometDictionary" on May 29, 2024
@viirya
Member Author

viirya commented May 29, 2024

cc @sunchao

@viirya viirya force-pushed the remove_copying_dictionary_values branch from c8d57ea to 4d088a2 on May 29, 2024 06:27
binaries[i] = new ByteArrayWrapper(bytes);
}
break;
byte[] bytes = new byte[DECIMAL_BYTE_WIDTH];
Member


Hmm, will this be more expensive? Originally we would copy binaries into ByteArrayWrapper only once up front, but now we are doing it for each decodeToBinary call?
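
To make the trade-off concrete, here is a hypothetical before/after; the ByteArrayWrapper and the byte[][] dictionary below are local stand-ins for the real Comet classes.

```java
// Hypothetical contrast of the two approaches; not the actual Comet code.
class DecodeToBinarySketch {
  static final class ByteArrayWrapper {
    final byte[] bytes;
    ByteArrayWrapper(byte[] bytes) { this.bytes = bytes; }
  }

  // Before: every dictionary entry is wrapped once, up front;
  // decodeToBinary(i) is then just binaries[i].
  static ByteArrayWrapper[] copyUpFront(byte[][] dictValues) {
    ByteArrayWrapper[] binaries = new ByteArrayWrapper[dictValues.length];
    for (int i = 0; i < dictValues.length; i++) {
      binaries[i] = new ByteArrayWrapper(dictValues[i]);
    }
    return binaries;
  }

  // After: a new wrapper is allocated on every decodeToBinary call.
  static ByteArrayWrapper decodeToBinary(byte[][] dictValues, int index) {
    return new ByteArrayWrapper(dictValues[index]);
  }
}
```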

Member Author


I think for whole-stage codegen, the code accessing values in rows is collapsed as a whole and the value is held in a temporary variable in the generated Java code. One value should have only one decodeToBinary call.

For interpreted execution, though, each access to a value in a row goes through these accessors again.
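
Roughly the difference, sketched in plain Java (illustrative only, not actual Spark-generated code; the Dict interface is a stand-in for the dictionary accessor):

```java
// Hypothetical sketch of codegen vs. interpreted access patterns.
class CodegenVsInterpretedSketch {
  interface Dict { byte[] decodeToBinary(int index); }

  // Whole-stage codegen style: the decoded value is held in a local,
  // so one row reference means one decodeToBinary call.
  static boolean codegenStyle(Dict dict, int index) {
    byte[] value = dict.decodeToBinary(index);
    return value.length > 0 && value[0] == 'x';
  }

  // Interpreted style: each reference to the field re-invokes the accessor.
  static boolean interpretedStyle(Dict dict, int index) {
    return dict.decodeToBinary(index).length > 0
        && dict.decodeToBinary(index)[0] == 'x';
  }
}
```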

Member Author


We probably can benchmark it. cc @andygrove

Member Author

@viirya viirya May 29, 2024


Btw, except for the cases where only CometScan is triggered, if there is any Comet native operator these copied dictionary values are not used at all. They are only there for row-based access by Spark operators.

Member


> I think for whole-stage codegen, the code accessing values in rows is collapsed as a whole and the value is held in a temporary variable in the generated Java code. One value should have only one decodeToBinary call.

Hmm, I'm not sure. Suppose we have a Comet dictionary vector where the indices are 100 copies of 1. With the current approach we'd need to call decodeToBinary on index 1 100 times?
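
A toy illustration of that worry (not Comet code):

```java
import java.util.Arrays;

// 100 row indices all reference dictionary entry 1, so without an up-front
// copy the same entry would be decoded once per row instead of once total.
class RepeatedDecodeSketch {
  public static void main(String[] args) {
    int[] indices = new int[100];
    Arrays.fill(indices, 1);          // every row points at entry 1
    int decodeCalls = 0;
    for (int index : indices) {
      // each row access would trigger dictionary.decodeToBinary(index)
      decodeCalls++;
    }
    System.out.println(decodeCalls + " decodeToBinary calls for 1 distinct value");
  }
}
```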

> Btw, except for the cases where only CometScan is triggered, if there is any Comet native operator these copied dictionary values are not used at all. They are only there for row-based access by Spark operators.

Yes this is true.

I'm basically trying to understand whether the changes in this PR are necessary. Is it trying to address some perf regression we saw recently?

Member Author

@viirya viirya May 29, 2024


> I'm basically trying to understand whether the changes in this PR are necessary. Is it trying to address some perf regression we saw recently?

This is motivated by the recent regression on CometDictionary, where CometDictionary was re-initialized after re-importing the array, though it is not directly related. It made me suspect that this copying step might not be good for performance.

> Hmm, I'm not sure. Suppose we have a Comet dictionary vector where the indices are 100 copies of 1. With the current approach we'd need to call decodeToBinary on index 1 100 times?

That's a good point, but it is an orthogonal concern. What I was talking about is accessing the same field in a row.

Maybe we just need to do the copying for binary types? For other types like int, long, etc., I don't actually see a good reason to copy the values.
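
A rough sketch of that idea, reusing the switch-on-minor-type pattern from the diff above (the binary minor types and the values.getBinary accessor are assumptions, not the final code):

```java
// Sketch only: copy into ByteArrayWrapper just for binary-like types; other
// types (BIT, INT, BIGINT, FLOAT8, ...) are decoded on demand from the vector.
switch (values.getValueVector().getMinorType()) {
  case VARBINARY:
  case FIXEDSIZEBINARY:
    binaries = new ByteArrayWrapper[numValues];
    for (int i = 0; i < numValues; i++) {
      binaries[i] = new ByteArrayWrapper(values.getBinary(i));
    }
    break;
  default:
    break; // no copy for fixed-width primitive types
}
```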

Member


Yes, I think it should be OK to avoid copying for the other types.

Member


> We probably can benchmark it. cc @andygrove

I ran the TPC-H benchmark comparing main with this PR and did not see any significant difference. I think the changes in this PR make sense, though, and reduce some memory pressure.

@viirya viirya closed this May 30, 2024
@viirya viirya reopened this May 30, 2024
@viirya viirya merged commit 8c532be into apache:main May 30, 2024
86 of 87 checks passed
@viirya
Member Author

viirya commented May 30, 2024

Merged. Thanks @sunchao @andygrove

@viirya viirya deleted the remove_copying_dictionary_values branch May 30, 2024 04:05
kazuyukitanimura added a commit that referenced this pull request Jul 24, 2024
## Which issue does this PR close?

Part of #679 and #670
Related #490

## Rationale for this change

For dictionary decimal vectors, it was unpacking even Int and Long decimals, which used more memory than necessary.

## What changes are included in this PR?

Unpack only for Decimal 128

## How are these changes tested?

Existing test
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…ary (apache#490)

* chore: Removing copying data from dictionary values into CometDictionary

* For review

(cherry picked from commit 8c532be)
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
## Which issue does this PR close?

Part of apache#679 and apache#670
Related apache#490

## Rationale for this change

For dictionary decimal vectors, it was unpacking even Int and Long decimals, which used more memory than necessary.

## What changes are included in this PR?

Unpack only for Decimal 128

## How are these changes tested?

Existing test

(cherry picked from commit c1b7c7d)