I did some experiments using the DirectDMA implementation, which is used for transferring data to and from PE-local memory (BRAM).
I used an ILA directly at the PCIe bridge on the FPGA to look into the AXI transactions generated when calling the copy_to (and copy_from) methods of DirectDMA for different sizes (64 B, 128 B, 192 B, 256 B, 320 B).
The results differ from what I expect.
Firstly, the AXI transaction sizes were always 32 B (on a 64 B-wide interface; only the upper or lower half of the strobe bits was set; no bursts). Further experiments suggest that this is the upper bound per transfer here; I am not sure exactly where this limitation comes from.
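To illustrate the half-strobe observation, here is a minimal sketch of how a single 32 B write maps onto the byte lanes of a 64 B-wide AXI data bus. The function and constant names are hypothetical and only model the generic AXI byte-lane rule; this is not code from the runtime or the bridge.

```python
BUS_BYTES = 64     # data-bus width observed at the PCIe bridge
ACCESS_BYTES = 32  # size of each observed write


def write_strobe(addr, nbytes=ACCESS_BYTES, bus_bytes=BUS_BYTES):
    """Return the strobe mask of a single-beat write (bit i = byte lane i)."""
    lane = addr % bus_bytes
    if lane + nbytes > bus_bytes:
        raise ValueError("access crosses a bus-width boundary")
    # nbytes consecutive ones, shifted up to the first active byte lane
    return ((1 << nbytes) - 1) << lane


if __name__ == "__main__":
    lower_half = (1 << ACCESS_BYTES) - 1
    for addr in (0x00, 0x20, 0x40, 0x60):
        strb = write_strobe(addr)
        half = "lower" if strb == lower_half else "upper"
        print(f"addr=0x{addr:03x} strb=0x{strb:016x} ({half} half)")
```

Under this model, writes to 0x00 and 0x40 enable the lower 32 strobe bits and writes to 0x20 and 0x60 enable the upper 32, which matches the alternating pattern seen in the ILA.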
But even disregarding this there were other peculiarities: For the transfers >= 128B there were more 32B transactions than required. Looking at the ILA I found that some 32B words were transmitted multiple times.
From my experiments this does not affect correctness, as the data is just transferred multiple times to the same address. However, this is of course still suboptimal, e.g. with regard to performance.
Details
The following table shows the exact transfers for copy_to calls of different sizes. For each size, the left column gives the actual transfers (in the order observed) and the right column (Exp) gives what I would have expected.
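| 64 Byte | Exp | 128 Byte | Exp | 192 Byte | Exp | 256 Byte | Exp | 320 Byte | Exp |
|---|---|---|---|---|---|---|---|---|---|
| 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x120 | 0x000 |
| 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x20 | 0x100 | 0x020 |
| | | 0x40 | 0x40 | 0x40 | 0x40 | 0x40 | 0x40 | 0x0e0 | 0x040 |
| | | 0x60 | 0x60 | 0x60 | 0x60 | 0x60 | 0x60 | 0x0c0 | 0x060 |
| | | 0x60 | | 0xa0 | 0x80 | 0xe0 | 0x80 | 0x0a0 | 0x080 |
| | | 0x40 | | 0x80 | 0xa0 | 0xc0 | 0xa0 | 0x080 | 0x0a0 |
| | | 0x20 | | 0x60 | | 0xa0 | 0xc0 | 0x060 | 0x0c0 |
| | | 0x00 | | 0x40 | | 0x80 | 0xe0 | 0x040 | 0x0e0 |
| | | | | | | | | 0x000 | 0x0100 |
| | | | | | | | | 0x020 | 0x120 |
| | | | | | | | | 0x040 | |
| | | | | | | | | 0x060 | |
| | | | | | | | | 0x120 | |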
I also looked into the copy_from method. It behaves identically for up to 256 Bytes. For 320 Bytes (and more) it behaves differently, producing even more read transactions (e.g. 17 read transactions vs. 13 write transactions for 320 Bytes).
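The write sequences listed above can be checked mechanically. The following is a minimal sketch that compares each observed copy_to address list against the expected one-write-per-32 B sequence and counts repeated addresses. The addresses are transcribed from the listing above; the helper names are hypothetical.

```python
from collections import Counter

WORD = 32  # bytes per observed AXI transaction

# Write addresses per copy_to size, in the order observed on the ILA.
OBSERVED = {
    64: [0x00, 0x20],
    128: [0x00, 0x20, 0x40, 0x60, 0x60, 0x40, 0x20, 0x00],
    192: [0x00, 0x20, 0x40, 0x60, 0xA0, 0x80, 0x60, 0x40],
    256: [0x00, 0x20, 0x40, 0x60, 0xE0, 0xC0, 0xA0, 0x80],
    320: [0x120, 0x100, 0xE0, 0xC0, 0xA0, 0x80, 0x60, 0x40,
          0x00, 0x20, 0x40, 0x60, 0x120],
}


def expected_addresses(size, word=WORD):
    """One write per 32 B word, in ascending address order."""
    return list(range(0, size, word))


def summarize(size, observed, word=WORD):
    """Compare an observed address sequence with the expected one."""
    expected = expected_addresses(size, word)
    counts = Counter(observed)
    return {
        "expected": len(expected),
        "observed": len(observed),
        "redundant": len(observed) - len(counts),
        "repeated": sorted(a for a, n in counts.items() if n > 1),
        "missing": sorted(set(expected) - set(observed)),
    }


if __name__ == "__main__":
    for size, addrs in OBSERVED.items():
        s = summarize(size, addrs)
        repeated = ", ".join(hex(a) for a in s["repeated"]) or "none"
        print(f"{size:>3} B: {s['observed']} writes, {s['expected']} expected, "
              f"{s['redundant']} redundant (repeated: {repeated})")
```

With the addresses as transcribed, every expected address is written at least once for all five sizes. The 128 B, 192 B and 320 B copies show 4, 2 and 3 redundant writes respectively, while the 64 B and 256 B copies show none.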
The runtime uses AVX/SSE when available. With AVX, those registers are 32 B (256 bit) wide on most machines. You could try an AVX-512 machine to see if you get 64 B requests. I'm not aware of a faster way to copy data from the CPU over PCIe if you don't want to use an on-device DMA engine.
As for the extra requests: No idea where those might come from.
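The register-width argument above can be turned into a quick estimate of the minimum number of requests per copy. This is only a sketch under the assumption that each vector store becomes exactly one PCIe/AXI request and that no store is repeated; the names are hypothetical.

```python
# Vector register widths in bytes for common x86 SIMD extensions.
VECTOR_BYTES = {"SSE": 16, "AVX": 32, "AVX-512": 64}


def min_requests(size, width):
    """Smallest number of width-byte stores that covers size bytes."""
    return -(-size // width)  # ceiling division


if __name__ == "__main__":
    for size in (64, 128, 192, 256, 320):
        row = ", ".join(
            f"{name}: {min_requests(size, width)}"
            for name, width in VECTOR_BYTES.items()
        )
        print(f"{size:>3} B -> {row}")
```

For the 320 B copy this gives 10 requests at 32 B per store, compared with the 13 writes seen on the ILA, and 5 requests if 64 B stores were issued.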