
Partitioned hashing support on multiple GPUs for trace and constraint commitments #335

Open
TheMenko opened this issue Oct 16, 2024 · 4 comments


@TheMenko

To support trace commitments with multiple GPUs, we need partitioned hashing in both the prover and the verifier, so that the output of the GPU code matches the CPU output when computing on multiple devices.

Partition size

The GPU code uses a "partition size" to determine the number of columns processed per GPU.
Since a trace commitment has a minimum of 64 columns, a partition size of 16 is usually a good choice.
But we could come up with a formula for calculating it automatically.
For example, 16 - (num_columns % 16) / (num_columns / 16) would in most cases work quite well for any number of GPUs.
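As a hedged sketch, the heuristic above could look like the following in Rust. The function name and the fallback for column counts at or below the base size are illustrative assumptions, not part of any existing API; the adjustment formula is taken verbatim from the comment, using integer division throughout.

```rust
// Hypothetical sketch of the partition-size heuristic described above.
// The base size of 16 comes from the comment; everything else is illustrative.
fn partition_size(num_columns: usize) -> usize {
    const BASE: usize = 16;
    if num_columns <= BASE {
        // assumption: with few columns, a single partition covering them all
        return num_columns;
    }
    // 16 - (num_columns % 16) / (num_columns / 16), with integer division
    BASE - (num_columns % BASE) / (num_columns / BASE)
}

fn main() {
    assert_eq!(partition_size(64), 16); // evenly divisible: keep the base size
    assert_eq!(partition_size(72), 14); // 16 - 8/4 = 14
    println!("ok");
}
```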

Per device partitioning

The number of partitions is calculated as num_partitions = (num_columns + partition_size - 1) / partition_size, and then the number of devices used for computation is num_devices = min(num_devices, num_partitions). This means that for an 8-column matrix with partition_size = 8 (e.g. in constraint commitments), only a single GPU will be used, since num_partitions would be 1.
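The partitioning math above can be sketched directly (function and variable names are illustrative):

```rust
// Ceiling division gives the number of partitions; the device count is
// then capped so we never use more GPUs than there are partitions.
fn plan(num_columns: usize, partition_size: usize, available_devices: usize) -> (usize, usize) {
    let num_partitions = (num_columns + partition_size - 1) / partition_size;
    let num_devices = available_devices.min(num_partitions);
    (num_partitions, num_devices)
}

fn main() {
    // 8-column matrix with partition_size = 8: one partition, so one GPU is used
    assert_eq!(plan(8, 8, 4), (1, 1));
    // 64-column trace with partition_size = 16 spreads over up to 4 GPUs
    assert_eq!(plan(64, 16, 4), (4, 4));
    println!("ok");
}
```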

Hashing process

We need a buffer for storing the partitioned linear hash output. The required buffer size is num_partitions * num_rows * blowup * CAPACITY, which is enough space to store all the partitioned outputs.
Each GPU takes partition_size columns and hashes 8 columns at a time, up to partition_size. So if we have, for example, 14 columns, one GPU will hash 8 columns and then 6 columns, and write the result to the correct position in the buffer. If we set our partition size to 16 on a 64-column trace, num_partitions would be 4, so we allocate space for 4 * lde_domain_size * CAPACITY, and when hashing we would write the linear hash of 16 columns 4 times.
If we set partition_size to 8 instead, we would need double the allocated space, and 8 linear hash outputs would be written to the buffer.
After the linear hash step, each partition's linear hash output (CAPACITY columns) is taken and a final linear hash is applied over them, from which we later build a Merkle tree.
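The buffer sizing above can be checked with a small sketch. Here CAPACITY = 4 is an assumed hash capacity (in field elements) chosen only for the example, and lde_domain_size stands for num_rows * blowup as in the text:

```rust
// Illustrative buffer-size calculation for the partitioned linear-hash output.
// CAPACITY = 4 is an assumption for this example, not a value from the source.
const CAPACITY: usize = 4;

fn buffer_len(num_columns: usize, partition_size: usize, lde_domain_size: usize) -> usize {
    let num_partitions = (num_columns + partition_size - 1) / partition_size;
    num_partitions * lde_domain_size * CAPACITY
}

fn main() {
    let lde = 1 << 10; // lde_domain_size = num_rows * blowup
    // 64 columns, partition_size = 16 -> 4 partition outputs per row
    assert_eq!(buffer_len(64, 16, lde), 4 * lde * CAPACITY);
    // halving partition_size to 8 doubles the required space
    assert_eq!(buffer_len(64, 8, lde), 8 * lde * CAPACITY);
    println!("ok");
}
```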

@irakliyk
Collaborator

I think what we need from the Winterfell side to support this is adding a num_partitions parameter to ProofOptions. The default value for num_partitions would be 1, in which case everything would work as it does now.

But if the user sets num_partitions to, say, 4, we'd compute row commitments as follows:

  1. Split the row into 4 equal segments and hash each segment using a linear hash.
  2. Compute a linear hash of the 4 values we got in the previous step.
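The two steps above can be sketched structurally. Note that linear_hash below is a stand-in placeholder, not a real cryptographic hash; a real prover would use the protocol's sponge here. Everything in this block is illustrative:

```rust
// Placeholder "hash" used only to show the two-step structure.
fn linear_hash(values: &[u64]) -> u64 {
    values.iter().fold(0u64, |acc, v| acc.wrapping_mul(31).wrapping_add(*v))
}

// Step 1: hash each segment; step 2: hash the segment digests together.
fn partitioned_row_hash(row: &[u64], num_partitions: usize) -> u64 {
    let width = (row.len() + num_partitions - 1) / num_partitions; // ceil
    let digests: Vec<u64> = row.chunks(width).map(linear_hash).collect();
    linear_hash(&digests)
}

fn main() {
    let row: Vec<u64> = (0..16).collect();
    // 4 partitions of 4 columns each, hashed in two steps
    let commitment = partitioned_row_hash(&row, 4);
    println!("{commitment}");
}
```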

One question is what to do if the length of the row is not evenly divisible by num_partitions. Here, I think, we can just round the length up to the next multiple of num_partitions and use that as the row length. For example, if a row consists of 23 values and we have 4 partitions, then the first 3 partitions would have 6 columns each, and the last partition would have 5 columns. @TheMenko - would this work for the GPU?
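The uneven-split rule above can be expressed as a small sketch (names are illustrative): round the row length up to the next multiple of num_partitions to get a fixed partition width, with the last partition simply holding fewer real values.

```rust
// Compute how many real columns land in each partition under the
// round-up rule described in the comment.
fn partition_widths(row_len: usize, num_partitions: usize) -> Vec<usize> {
    let width = (row_len + num_partitions - 1) / num_partitions; // ceil
    (0..num_partitions)
        .map(|i| {
            let start = i * width;
            // last partition may be short of the padded width
            row_len.saturating_sub(start).min(width)
        })
        .collect()
}

fn main() {
    // 23 values across 4 partitions: 6, 6, 6, and 5 columns
    assert_eq!(partition_widths(23, 4), vec![6, 6, 6, 5]);
    println!("ok");
}
```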

@TheMenko
Author

TheMenko commented Oct 17, 2024

I based my issue on the current implementation in Polygon's miden-gpu, which is currently unreleased for CUDA,
but as I've said, it takes partition_size columns at a time and does a linear hash.
If we have 8 partitions and 4 GPUs, it will do 8 linear hashes.
Each GPU takes its own portion first, and the first GPU that finishes starts on the next portion of columns if any are left.
Edit: So the approach of splitting into num_partitions might work for even column counts, but I'm not sure it would work for odd ones. It should be easy to compare the outputs.

@irakliyk
Collaborator

> I based my issue on the current implementation in Polygon's miden-gpu, which is currently unreleased for CUDA,
> but as I've said, it takes partition_size columns at a time and does a linear hash.
> If we have 8 partitions and 4 GPUs, it will do 8 linear hashes.
> Each GPU takes its own portion first, and the first GPU that finishes starts on the next portion of columns if any are left.

Could we make it the user's responsibility to set num_partitions to the number of available devices? Is there any advantage to having the number of partitions differ from the number of GPUs?

@TheMenko
Author

I'm not the original author of the CUDA code, so I can't currently tell whether there are advantages to having num_partitions greater than the number of devices. The current code could be adapted to calculate the correct partition_size, from which num_partitions is derived. If we do num_columns / num_devices, we should get the desired effect of num_partitions = number of devices.
In miden-gpu, this could be calculated by querying the number of GPUs present on the system, so the user would not need to set num_partitions. In Winterfell, it can be exposed as an option set by the user or via config.
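A sketch of deriving partition_size from the device count, as suggested above. Note that the comment says num_columns / num_devices; the sketch uses ceiling division, an assumption added here so that column counts not divisible by the device count still yield num_partitions == num_devices (plain floor division can overshoot the partition count):

```rust
// Derive partition_size so that num_partitions equals the device count.
// Ceiling division is an assumption added for non-divisible column counts.
fn derive_partition_size(num_columns: usize, num_devices: usize) -> usize {
    (num_columns + num_devices - 1) / num_devices
}

fn num_partitions(num_columns: usize, partition_size: usize) -> usize {
    (num_columns + partition_size - 1) / partition_size
}

fn main() {
    // 64 columns on 4 devices: partition_size = 16, one partition per device
    let ps = derive_partition_size(64, 4);
    assert_eq!(ps, 16);
    assert_eq!(num_partitions(64, ps), 4);

    // 23 columns on 4 devices: partition_size = 6 still gives 4 partitions
    let ps = derive_partition_size(23, 4);
    assert_eq!(num_partitions(23, ps), 4);
    println!("ok");
}
```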
