[Proposal] Deprecate SYCL Offset Accessors #569

hdelan · 2024-06-25T10:22:30Z

The SYCL spec allows for an extra offset argument to be used when constructing an accessor.

/* Available only when: (Dimensions > 0) */
  template <typename AllocatorT>
  accessor(buffer<DataT, Dimensions, AllocatorT>& bufferRef,
           handler& commandGroupHandlerRef, range<Dimensions> accessRange,
           id<Dimensions> accessOffset, const property_list& propList = {});

  /* Available only when: (Dimensions > 0) */
  template <typename AllocatorT, typename TagT>
  accessor(buffer<DataT, Dimensions, AllocatorT>& bufferRef,
           handler& commandGroupHandlerRef, range<Dimensions> accessRange,
           id<Dimensions> accessOffset, TagT tag,
           const property_list& propList = {});

This seems like a good idea to allow the programmer to offset into a mem arg, such that acc[0] refers not to the base of some native allocation.

However, this means that when calculating acc[arbitrary_idx], internally the SYCL implementation must add arbitrary_idx + accessOffset in order to generate the required index.

In 99% (guess) of cases, accessOffset is zero. However most compilers (DPC++, for one) lack the host-device optimizations necessary in order to propagate the zero value to the device compilation pass. This means that the access offset becomes a kernel argument, and the zero values are loaded at runtime and then added to each accessor base pointer, at least once at accessor initialization. This results in more kernel args, clock cycles wasted adding zero to ptrs, and increased register usage within the kernel.

As a quick demonstration:

A simple SYCL kernel just doing acc[0] = 1:

.weak .entry _ZTS6kernel(
    .param .u64 _ZTS6kernel_param_0,
    .param .align 8 .b8 _ZTS6kernel_param_1[8] // THIS IS THE OFFSET ARG WHICH IS USUALLY ZERO
)
{
    .reg .b32   %r<2>;
    .reg .b64   %rd<8>;

// %bb.0:                               // %entry
    ld.param.u32    %rd1, [_ZTS6kernel_param_0+4];
    shl.b64     %rd2, %rd1, 32;
    ld.param.u32    %rd3, [_ZTS6kernel_param_0];
    or.b64      %rd4, %rd2, %rd3;
    ld.param.u64    %rd5, [_ZTS6kernel_param_1]; // LOAD OFFSET ARG
    shl.b64     %rd6, %rd5, 2;
    add.s64     %rd7, %rd4, %rd6; // COMBINE OFFSET ARG WITH PTR
    mov.b32     %r1, 1;
    st.global.u32   [%rd7], %r1;
    ret;
                                        // -- End function
}

vs a simple SYCL kernel doing acc.get_multi_ptr()[0]

.weak .entry _ZTS6kernel(
    .param .u64 _ZTS6kernel_param_0
)
{
    .reg .b32   %r<2>;
    .reg .b64   %rd<5>;

// %bb.0:                               // %entry
    ld.param.u32    %rd1, [_ZTS6kernel_param_0+4];
    shl.b64     %rd2, %rd1, 32;
    ld.param.u32    %rd3, [_ZTS6kernel_param_0];
    or.b64      %rd4, %rd2, %rd3;
    mov.b32     %r1, 1;
    st.global.u32   [%rd4], %r1;
    ret;
                                        // -- End function
}

Note that when get_multi_ptr() is called, the compiler is able to see that the accessOffset member is not used in the kernel and it can remove a kernel argument.

This can be solved in two ways in the SYCL specification:

Add an extra template arg to the accessor class, which would describe whether the accessor was offset or not. This would allow the use of if constexpr in the accessor setup which would use the accessOffset or not.
Deprecating and eventually removing support for offset accessors, which presumably are not being used much anyway.

The text was updated successfully, but these errors were encountered:

TApplencourt · 2024-06-25T14:46:48Z

I vote for 2/; simplifying the accessor is a good thing, IMO.

I guess if one day we mandate that buffers be backed by USM, people will be able to writte accessor(buffer.get_ptr() + offset) or something like that. Right now, they will need to do the offset inside the kernel, so potentially mem-copying more memory than required but 🤷🏽.

victor-eds · 2024-06-26T09:39:02Z

Great, @hdelan. Agree with @TApplencourt. Anyway, any idea why this was included in the spec in the first place? What was the intended usage justifying the potential overhead?

TApplencourt · 2024-06-26T14:11:25Z

In theory, I can create an accesors on only half of the buffer, and the runtime may do this amount of data-transfert? But I guess people can always create a sub-buffer.
But all those buffer / sub buffer / accessor are full of subtility so maybe I'm wrong

victor-eds · 2024-06-26T14:16:42Z

In theory, I can create an accesors on only half of the buffer, and the runtime may do this amount of data-transfert? But I guess people can always create a sub-buffer.

Yes, that's what I mean. If that's the case, a sub-buffer sounds like a better approach due to the issue Hugh points out here.

gmlueck · 2024-06-26T18:38:41Z

However, this means that when calculating acc[arbitrary_idx], internally the SYCL implementation must add arbitrary_idx + accessOffset in order to generate the required index.

Not necessarily. Couldn't the implementation pass a base pointer as the kernel argument, where the base pointer already had the offset added? Things probably get more complicated for multi-dimensional accessors, though.

psalz · 2024-06-27T07:28:07Z

In theory, I can create an accesors on only half of the buffer, and the runtime may do this amount of data-transfert? But I guess people can always create a sub-buffer.

Yes, that's what I mean. If that's the case, a sub-buffer sounds like a better approach due to the issue Hugh points out here.

Sub-buffers are more limited though, as they can only be created on a contiguous region of the parent buffer.

One application that comes to mind for ranged accessors: handler::fill-ing only certain parts of a buffer.

hdelan · 2024-06-27T08:29:53Z

Not necessarily. Couldn't the implementation pass a base pointer as the kernel argument, where the base pointer already had the offset added? Things probably get more complicated for multi-dimensional accessors, though.

This is a good point. In theory this should work, although DPC++ would need to change some internals to make this happen. Thanks for suggestions and discussion.

I'll keep this open for the moment in case anyone else wants to chime in, but I will make a ticket on our side to change the impl so only already offset accessor ptrs are passed to device.

illuhad · 2024-07-09T23:53:58Z

AdaptiveCpp has recently solved this issue on the compiler side; however we already have an extension that basically does your point 1. This is possible in a fairly seamless way when CTAD is used: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/accessor-variants.md

tomdeakin added the Agenda To be discussed during a SYCL committee meeting label Jul 10, 2024

tomdeakin removed the Agenda To be discussed during a SYCL committee meeting label Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Proposal] Deprecate SYCL Offset Accessors #569

[Proposal] Deprecate SYCL Offset Accessors #569

hdelan commented Jun 25, 2024 •

edited

Loading

TApplencourt commented Jun 25, 2024

victor-eds commented Jun 26, 2024

TApplencourt commented Jun 26, 2024

victor-eds commented Jun 26, 2024 •

edited

Loading

gmlueck commented Jun 26, 2024

psalz commented Jun 27, 2024

hdelan commented Jun 27, 2024

illuhad commented Jul 9, 2024 •

edited

Loading

[Proposal] Deprecate SYCL Offset Accessors #569

[Proposal] Deprecate SYCL Offset Accessors #569

Comments

hdelan commented Jun 25, 2024 • edited Loading

TApplencourt commented Jun 25, 2024

victor-eds commented Jun 26, 2024

TApplencourt commented Jun 26, 2024

victor-eds commented Jun 26, 2024 • edited Loading

gmlueck commented Jun 26, 2024

psalz commented Jun 27, 2024

hdelan commented Jun 27, 2024

illuhad commented Jul 9, 2024 • edited Loading

hdelan commented Jun 25, 2024 •

edited

Loading

victor-eds commented Jun 26, 2024 •

edited

Loading

illuhad commented Jul 9, 2024 •

edited

Loading