Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Deprecate SYCL Offset Accessors #569

Open
hdelan opened this issue Jun 25, 2024 · 8 comments
Open

[Proposal] Deprecate SYCL Offset Accessors #569

hdelan opened this issue Jun 25, 2024 · 8 comments

Comments

@hdelan
Copy link
Contributor

hdelan commented Jun 25, 2024

The SYCL spec allows for an extra offset argument to be used when constructing an accessor.

/* Available only when: (Dimensions > 0) */
  template <typename AllocatorT>
  accessor(buffer<DataT, Dimensions, AllocatorT>& bufferRef,
           handler& commandGroupHandlerRef, range<Dimensions> accessRange,
           id<Dimensions> accessOffset, const property_list& propList = {});

  /* Available only when: (Dimensions > 0) */
  template <typename AllocatorT, typename TagT>
  accessor(buffer<DataT, Dimensions, AllocatorT>& bufferRef,
           handler& commandGroupHandlerRef, range<Dimensions> accessRange,
           id<Dimensions> accessOffset, TagT tag,
           const property_list& propList = {});

This seems like a good idea to allow the programmer to offset into a mem arg, such that acc[0] refers not to the base of some native allocation.

However, this means that when calculating acc[arbitrary_idx], internally the SYCL implementation must add arbitrary_idx + accessOffset in order to generate the required index.

In 99% (guess) of cases, accessOffset is zero. However most compilers (DPC++, for one) lack the host-device optimizations necessary in order to propagate the zero value to the device compilation pass. This means that the access offset becomes a kernel argument, and the zero values are loaded at runtime and then added to each accessor base pointer, at least once at accessor initialization. This results in more kernel args, clock cycles wasted adding zero to ptrs, and increased register usage within the kernel.

As a quick demonstration:

A simple SYCL kernel just doing acc[0] = 1:

.weak .entry _ZTS6kernel(
    .param .u64 _ZTS6kernel_param_0,
    .param .align 8 .b8 _ZTS6kernel_param_1[8] // THIS IS THE OFFSET ARG WHICH IS USUALLY ZERO
)
{
    .reg .b32   %r<2>;
    .reg .b64   %rd<8>;

// %bb.0:                               // %entry
    ld.param.u32    %rd1, [_ZTS6kernel_param_0+4];
    shl.b64     %rd2, %rd1, 32;
    ld.param.u32    %rd3, [_ZTS6kernel_param_0];
    or.b64      %rd4, %rd2, %rd3;
    ld.param.u64    %rd5, [_ZTS6kernel_param_1]; // LOAD OFFSET ARG
    shl.b64     %rd6, %rd5, 2;
    add.s64     %rd7, %rd4, %rd6; // COMBINE OFFSET ARG WITH PTR
    mov.b32     %r1, 1;
    st.global.u32   [%rd7], %r1;
    ret;
                                        // -- End function
}

vs a simple SYCL kernel doing acc.get_multi_ptr()[0]

.weak .entry _ZTS6kernel(
    .param .u64 _ZTS6kernel_param_0
)
{
    .reg .b32   %r<2>;
    .reg .b64   %rd<5>;

// %bb.0:                               // %entry
    ld.param.u32    %rd1, [_ZTS6kernel_param_0+4];
    shl.b64     %rd2, %rd1, 32;
    ld.param.u32    %rd3, [_ZTS6kernel_param_0];
    or.b64      %rd4, %rd2, %rd3;
    mov.b32     %r1, 1;
    st.global.u32   [%rd4], %r1;
    ret;
                                        // -- End function
}

Note that when get_multi_ptr() is called, the compiler is able to see that the accessOffset member is not used in the kernel and it can remove a kernel argument.

This can be solved in two ways in the SYCL specification:

  1. Add an extra template arg to the accessor class, which would describe whether the accessor was offset or not. This would allow the use of if constexpr in the accessor setup which would use the accessOffset or not.
  2. Deprecating and eventually removing support for offset accessors, which presumably are not being used much anyway.
@TApplencourt
Copy link
Contributor

I vote for 2/; simplifying the accessor is a good thing, IMO.

I guess if one day we mandate that buffers be backed by USM, people will be able to writte accessor(buffer.get_ptr() + offset) or something like that. Right now, they will need to do the offset inside the kernel, so potentially mem-copying more memory than required but 🤷🏽.

@victor-eds
Copy link
Contributor

Great, @hdelan. Agree with @TApplencourt. Anyway, any idea why this was included in the spec in the first place? What was the intended usage justifying the potential overhead?

@TApplencourt
Copy link
Contributor

In theory, I can create an accesors on only half of the buffer, and the runtime may do this amount of data-transfert? But I guess people can always create a sub-buffer.
But all those buffer / sub buffer / accessor are full of subtility so maybe I'm wrong

@victor-eds
Copy link
Contributor

victor-eds commented Jun 26, 2024

In theory, I can create an accesors on only half of the buffer, and the runtime may do this amount of data-transfert? But I guess people can always create a sub-buffer.

Yes, that's what I mean. If that's the case, a sub-buffer sounds like a better approach due to the issue Hugh points out here.

@gmlueck
Copy link
Contributor

gmlueck commented Jun 26, 2024

However, this means that when calculating acc[arbitrary_idx], internally the SYCL implementation must add arbitrary_idx + accessOffset in order to generate the required index.

Not necessarily. Couldn't the implementation pass a base pointer as the kernel argument, where the base pointer already had the offset added? Things probably get more complicated for multi-dimensional accessors, though.

@psalz
Copy link
Contributor

psalz commented Jun 27, 2024

In theory, I can create an accesors on only half of the buffer, and the runtime may do this amount of data-transfert? But I guess people can always create a sub-buffer.

Yes, that's what I mean. If that's the case, a sub-buffer sounds like a better approach due to the issue Hugh points out here.

Sub-buffers are more limited though, as they can only be created on a contiguous region of the parent buffer.

One application that comes to mind for ranged accessors: handler::fill-ing only certain parts of a buffer.

@hdelan
Copy link
Contributor Author

hdelan commented Jun 27, 2024

Not necessarily. Couldn't the implementation pass a base pointer as the kernel argument, where the base pointer already had the offset added? Things probably get more complicated for multi-dimensional accessors, though.

This is a good point. In theory this should work, although DPC++ would need to change some internals to make this happen. Thanks for suggestions and discussion.

I'll keep this open for the moment in case anyone else wants to chime in, but I will make a ticket on our side to change the impl so only already offset accessor ptrs are passed to device.

@illuhad
Copy link
Contributor

illuhad commented Jul 9, 2024

AdaptiveCpp has recently solved this issue on the compiler side; however we already have an extension that basically does your point 1. This is possible in a fairly seamless way when CTAD is used: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/accessor-variants.md

@tomdeakin tomdeakin added the Agenda To be discussed during a SYCL committee meeting label Jul 10, 2024
@tomdeakin tomdeakin removed the Agenda To be discussed during a SYCL committee meeting label Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants