Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
It's not a purr fect implementation, but it is a start...
This patch implements the following:
CPU_ATTR
set. as I was lazy. I hope that inlining means they would be generated with the proper ISA limitataions. Regardless, so far only VNNI multiply is implemented anyways.Example: If we have A = 2x64 matrix and B = 64x9, we will perform a multiplication first, of 2x64 times 64x8 and then 1x64 times 64x1 (to produce the last column)
Unfortunately, now that we can have matrices that have non-multiple-of-eight columns, but we no longer write the columns consecutively, we get unaligned memory access when writing and we segfault. For this reason I have replaced the store routine with storeu.
Preliminary performance benchmarks with the builtin
intgemm/benchmarks/biasmultiply.cc
Line 267 in 6228d01
This branch (n=1)
Master (n=1)
Speed seems to be even better, but I don't trust that. Maybe some of the instruction reordering makes the benchmark perform better. I will have test it in a real world situation later on.