Add tensorOps benchmarks #180

corbett5 · 2020-07-10T04:24:46Z

Also examine using std::fma.

rrsettgast · 2020-07-12T00:55:08Z

What did you have in mind for std::fma for device kernels?

corbett5 · 2020-07-12T01:10:24Z

CUDA has an fma as well, just like cos and whatnot. I'm not sure it would be beneficial, but it's worth checking out.

rrsettgast · 2020-07-12T05:28:52Z

I suspect that it may force the compiler to recognize the fma operation when it might otherwise miss it? We are getting all sorts of DFMA instructions in our CUDA PTX, but I was pretty careful about checking that we are getting them when we expect.
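To make the "force the compiler" point concrete, here is a minimal sketch. The helper names are hypothetical and not part of tensorOps. The first function always performs a single-rounding fused multiply-add; the second leaves it to the compiler (and flags such as `-ffp-contract` in GCC/Clang) whether an FMA instruction is emitted.

```cpp
#include <cmath>

// Hypothetical helper: a * b + c with a single rounding, independent of
// the compiler's floating-point contraction settings.
inline double fusedMultiplyAdd( double a, double b, double c )
{
  return std::fma( a, b, c );
}

// Same expression written plainly. Whether this becomes one FMA instruction
// or a separate multiply and add depends on the compiler and its flags.
inline double plainMultiplyAdd( double a, double b, double c )
{
  return a * b + c;
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the fused version keeps the 2^-54 term that a separately rounded multiply would lose, so the two can differ in the last bits.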

corbett5 · 2020-07-12T05:37:32Z

Yeah, but it could be slower: https://stackoverflow.com/questions/34265982/automatically-generate-fma-instructions-in-msvc
It is very applicable to things like A_i B_i. But applying it to something like

dstSymMatrix[ 3 ] = matrixA[ 1 ][ 0 ] * symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 0 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 0 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ];

might harm performance even if std::fma is fast, because it limits the rearranging the compiler can do.
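For the simple AiBi case mentioned above, a minimal sketch might look like the following. It assumes AiBi means the sum over i of A[ i ] * B[ i ]; `dotFma` is a hypothetical name, not an existing tensorOps function. CUDA's math library also provides an fma for doubles in device code, so a similar loop should carry over to a kernel, though that is untested here.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of tensorOps): dot product of two
// fixed-size arrays where every multiply-add is an explicit std::fma.
template< std::size_t N >
double dotFma( double const ( &a )[ N ], double const ( &b )[ N ] )
{
  double sum = 0.0;
  for( std::size_t i = 0; i < N; ++i )
  {
    // sum = a[ i ] * b[ i ] + sum, rounded once.
    sum = std::fma( a[ i ], b[ i ], sum );
  }
  return sum;
}
```

Each iteration depends on the previous `sum`, so the chain is serial. That is the kind of ordering constraint that could limit what the compiler is free to rearrange.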

rrsettgast · 2020-07-12T05:50:26Z

Without fma I count 27 fp operations.

dstSymMatrix[ 3 ] = matrixA[ 1 ][ 0 ] * ( symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] ) +
                    matrixA[ 1 ][ 1 ] * ( symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] ) +
                    matrixA[ 1 ][ 2 ] * ( symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ] );

Rearranging and using fma I count 12.
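A sketch of how the factored form could be written with explicit std::fma calls, next to the unfactored form for comparison. The function names and signatures are hypothetical; only the index pattern is taken from the snippets above. Each inner sum costs one multiply plus two fma, and the outer sum costs another multiply plus two fma, which gives the 12 operations.

```cpp
#include <cmath>

// Hypothetical reference: the unfactored expression, written plainly.
inline double symEntry3Plain( double const ( &matrixA )[ 3 ][ 3 ],
                              double const ( &symMatrixB )[ 6 ] )
{
  return matrixA[ 1 ][ 0 ] * symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
         matrixA[ 1 ][ 0 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
         matrixA[ 1 ][ 0 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] +
         matrixA[ 1 ][ 1 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
         matrixA[ 1 ][ 1 ] * symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
         matrixA[ 1 ][ 1 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] +
         matrixA[ 1 ][ 2 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
         matrixA[ 1 ][ 2 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
         matrixA[ 1 ][ 2 ] * symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ];
}

// Hypothetical factored version: 4 multiplies and 8 fma, 12 operations total.
inline double symEntry3Fma( double const ( &matrixA )[ 3 ][ 3 ],
                            double const ( &symMatrixB )[ 6 ] )
{
  double const ( &row2 )[ 3 ] = matrixA[ 2 ];

  // Inner sums: one multiply followed by two fused multiply-adds each.
  double const t0 = std::fma( symMatrixB[ 4 ], row2[ 2 ],
                    std::fma( symMatrixB[ 5 ], row2[ 1 ], symMatrixB[ 0 ] * row2[ 0 ] ) );
  double const t1 = std::fma( symMatrixB[ 3 ], row2[ 2 ],
                    std::fma( symMatrixB[ 1 ], row2[ 1 ], symMatrixB[ 5 ] * row2[ 0 ] ) );
  double const t2 = std::fma( symMatrixB[ 2 ], row2[ 2 ],
                    std::fma( symMatrixB[ 3 ], row2[ 1 ], symMatrixB[ 4 ] * row2[ 0 ] ) );

  // Outer sum: again one multiply followed by two fused multiply-adds.
  return std::fma( matrixA[ 1 ][ 2 ], t2,
         std::fma( matrixA[ 1 ][ 1 ], t1, matrixA[ 1 ][ 0 ] * t0 ) );
}
```

The two versions agree exactly on small integer inputs, but in general they can differ in the last bits because the rounding points differ. Whether the explicit form is actually faster is what the benchmarks would need to show.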

corbett5 added the "effort: 1 week" and "type: benchmarks" labels on Jul 10, 2020

corbett5 self-assigned this Jul 10, 2020
