
Partitioned hashing support on multiple GPUs for trace and constraint commitments #335

Open
TheMenko opened this issue Oct 16, 2024 · 4 comments


@TheMenko

To support trace commitments with multiple GPUs, we need partitioned hashing in both the prover and the verifier, so that the output of the GPU code matches the CPU output when computing on multiple devices.

Partition size

The GPU code uses a "partition size" to determine the number of columns processed per GPU.
Since a trace commitment has a minimum of 64 columns, a partition size of 16 is usually a good choice.
But we could come up with a formula for calculating it automatically.
For example, 16 - (num_columns % 16) / (num_columns / 16) would in most cases work quite well for any number of GPUs.
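As a hedged sketch, the heuristic above could look like the following in Rust. The function name and the fallback for column counts at or below the base size are illustrative assumptions, not part of any existing API; the adjustment formula is taken verbatim from the comment, using integer division throughout.

```rust
// Hypothetical sketch of the partition-size heuristic described above.
// The base size of 16 comes from the comment; everything else is illustrative.
fn partition_size(num_columns: usize) -> usize {
    const BASE: usize = 16;
    if num_columns <= BASE {
        // assumption: with few columns, a single partition covering them all
        return num_columns;
    }
    // 16 - (num_columns % 16) / (num_columns / 16), with integer division
    BASE - (num_columns % BASE) / (num_columns / BASE)
}

fn main() {
    assert_eq!(partition_size(64), 16); // evenly divisible: keep the base size
    assert_eq!(partition_size(72), 14); // 16 - 8/4 = 14
    println!("ok");
}
```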

Per device partitioning

The number of partitions is calculated as num_partitions = (num_columns + partition_size - 1) / partition_size, and then the number of devices used for computation is num_devices = min(num_devices, num_partitions). This means that for an 8-column matrix with partition_size = 8 (e.g. in constraint commitments), only a single GPU will be used, since num_partitions would be 1.
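The partitioning math above can be sketched directly (function and variable names are illustrative):

```rust
// Ceiling division gives the number of partitions; the device count is
// then capped so we never use more GPUs than there are partitions.
fn plan(num_columns: usize, partition_size: usize, available_devices: usize) -> (usize, usize) {
    let num_partitions = (num_columns + partition_size - 1) / partition_size;
    let num_devices = available_devices.min(num_partitions);
    (num_partitions, num_devices)
}

fn main() {
    // 8-column matrix with partition_size = 8: one partition, so one GPU is used
    assert_eq!(plan(8, 8, 4), (1, 1));
    // 64-column trace with partition_size = 16 spreads over up to 4 GPUs
    assert_eq!(plan(64, 16, 4), (4, 4));
    println!("ok");
}
```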

Hashing process

We need a buffer for storing the partitioned linear hash output. The required buffer size is num_partitions * num_rows * blowup * CAPACITY, which is enough space to store all the partitioned outputs.
Each GPU takes partition_size columns and hashes 8 columns at a time, up to partition_size. So if we have, for example, 14 columns, one GPU will hash 8 columns and then 6 columns, and write the result to the correct position in the buffer. If we set our partition size to 16 on a 64-column trace, num_partitions would be 4, so we allocate space for 4 * lde_domain_size * CAPACITY, and when hashing we would write the linear hash of 16 columns 4 times.
If we set partition_size to 8 instead, we would need double the allocated space, and 8 linear hash outputs would be written to the buffer.
After the linear hash step, each partition's linear hash output (CAPACITY columns) is taken and a final linear hash is applied over them, from which we later build a Merkle tree.
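The buffer sizing above can be checked with a small sketch. Here CAPACITY = 4 is an assumed hash capacity (in field elements) chosen only for the example, and lde_domain_size stands for num_rows * blowup as in the text:

```rust
// Illustrative buffer-size calculation for the partitioned linear-hash output.
// CAPACITY = 4 is an assumption for this example, not a value from the source.
const CAPACITY: usize = 4;

fn buffer_len(num_columns: usize, partition_size: usize, lde_domain_size: usize) -> usize {
    let num_partitions = (num_columns + partition_size - 1) / partition_size;
    num_partitions * lde_domain_size * CAPACITY
}

fn main() {
    let lde = 1 << 10; // lde_domain_size = num_rows * blowup
    // 64 columns, partition_size = 16 -> 4 partition outputs per row
    assert_eq!(buffer_len(64, 16, lde), 4 * lde * CAPACITY);
    // halving partition_size to 8 doubles the required space
    assert_eq!(buffer_len(64, 8, lde), 8 * lde * CAPACITY);
    println!("ok");
}
```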

@irakliyk
Collaborator

I think what we need from the Winterfell side to support this is adding a num_partitions parameter to ProofOptions. The default value for num_partitions would be 1, in which case everything would work as it does now.

But if the user sets num_partitions to, say, 4, we'd compute row commitments as follows:

  1. Split the row into 4 equal segments and hash each segment using a linear hash.
  2. Compute a linear hash of the 4 values we got in the previous step.
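The two steps above can be sketched structurally. Note that linear_hash below is a stand-in placeholder, not a real cryptographic hash; a real prover would use the protocol's sponge here. Everything in this block is illustrative:

```rust
// Placeholder "hash" used only to show the two-step structure.
fn linear_hash(values: &[u64]) -> u64 {
    values.iter().fold(0u64, |acc, v| acc.wrapping_mul(31).wrapping_add(*v))
}

// Step 1: hash each segment; step 2: hash the segment digests together.
fn partitioned_row_hash(row: &[u64], num_partitions: usize) -> u64 {
    let width = (row.len() + num_partitions - 1) / num_partitions; // ceil
    let digests: Vec<u64> = row.chunks(width).map(linear_hash).collect();
    linear_hash(&digests)
}

fn main() {
    let row: Vec<u64> = (0..16).collect();
    // 4 partitions of 4 columns each, hashed in two steps
    let commitment = partitioned_row_hash(&row, 4);
    println!("{commitment}");
}
```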

One question is what to do if the length of the row is not evenly divisible by num_partitions. Here, I think, we can just round the length up to the next multiple of num_partitions and use that as the row length. For example, if a row consists of 23 values and we have 4 partitions, then the first 3 partitions would have 6 columns each, and the last partition would have 5 columns. @TheMenko - would this work for the GPU?
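The uneven-split rule above can be expressed as a small sketch (names are illustrative): round the row length up to the next multiple of num_partitions to get a fixed partition width, with the last partition simply holding fewer real values.

```rust
// Compute how many real columns land in each partition under the
// round-up rule described in the comment.
fn partition_widths(row_len: usize, num_partitions: usize) -> Vec<usize> {
    let width = (row_len + num_partitions - 1) / num_partitions; // ceil
    (0..num_partitions)
        .map(|i| {
            let start = i * width;
            // last partition may be short of the padded width
            row_len.saturating_sub(start).min(width)
        })
        .collect()
}

fn main() {
    // 23 values across 4 partitions: 6, 6, 6, and 5 columns
    assert_eq!(partition_widths(23, 4), vec![6, 6, 6, 5]);
    println!("ok");
}
```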

@TheMenko
Author

TheMenko commented Oct 17, 2024

I based my issue on the current implementation in Polygon's miden-gpu, which is currently unreleased for CUDA,
but as I've said, it takes partition_size columns at a time and does a linear hash.
If we have 8 partitions and 4 GPUs, it will do 8 linear hashes.
Each GPU takes its own portion first, and the first GPU that finishes starts on the next portion of columns if any are left.
Edit: So the approach of splitting into num_partitions might work for even column counts, but I'm not sure it would work for odd ones. It should be easy to compare the outputs.

@irakliyk
Collaborator

> I based my issue on the current implementation in Polygon's miden-gpu, which is currently unreleased for CUDA,
> but as I've said, it takes partition_size columns at a time and does a linear hash.
> If we have 8 partitions and 4 GPUs, it will do 8 linear hashes.
> Each GPU takes its own portion first, and the first GPU that finishes starts on the next portion of columns if any are left.

Could we make it the user's responsibility to set num_partitions to the number of available devices? Is there any advantage to having the number of partitions differ from the number of GPUs?

@TheMenko
Author

I'm not the original author of the CUDA code, so I can't currently tell whether there are advantages to having num_partitions greater than the number of devices. The current code could be adapted to calculate the correct partition_size, from which num_partitions is derived. If we do num_columns / num_devices, we should get the desired effect of num_partitions = number of devices.
In miden-gpu, this could be calculated by querying the number of GPUs present on the system, so the user would not need to set num_partitions. In Winterfell, it can be exposed as an option set by the user or via config.
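A sketch of deriving partition_size from the device count, as suggested above. Note that the comment says num_columns / num_devices; the sketch uses ceiling division, an assumption added here so that column counts not divisible by the device count still yield num_partitions == num_devices (plain floor division can overshoot the partition count):

```rust
// Derive partition_size so that num_partitions equals the device count.
// Ceiling division is an assumption added for non-divisible column counts.
fn derive_partition_size(num_columns: usize, num_devices: usize) -> usize {
    (num_columns + num_devices - 1) / num_devices
}

fn num_partitions(num_columns: usize, partition_size: usize) -> usize {
    (num_columns + partition_size - 1) / partition_size
}

fn main() {
    // 64 columns on 4 devices: partition_size = 16, one partition per device
    let ps = derive_partition_size(64, 4);
    assert_eq!(ps, 16);
    assert_eq!(num_partitions(64, ps), 4);

    // 23 columns on 4 devices: partition_size = 6 still gives 4 partitions
    let ps = derive_partition_size(23, 4);
    assert_eq!(num_partitions(23, ps), 4);
    println!("ok");
}
```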
