vulkan: Optimize contiguous copies (#10254)
* tests: Fix memory bandwidth calculation for perf tests
Add a flops calculation for flash attention.
Add one GGML_OP_CPY perf test.
* vulkan: Optimize contiguous copies
Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.
Apply similar changes to the scale shader, since scale is always contiguous.
Add a "progress bar" for shader compiles.