-
Hi, I'm trying to tune grain size in my application that uses Is there a way to set the grain size at runtime? Thank you very much! Tuan |
Beta Was this translation helpful? Give feedback.
Replies: 4 comments 1 reply
-
Hey Tuan, Essentially, OpenCilk's Here's some more context behind the Cilk programs achieve good parallel speedup when they expose ample parallelism to the runtime system. (Typically, one wants the program's parallelism to exceed the number of Cilk workers by roughly a factor of 10.) Intuitively, that means that Cilk programs should aim to expose as much parallel work as possible. Of course, there are practical reasons why one doesn't always want to expose all of the logical parallelism in a program. In the case of a To amortize the constant overhead of spawning a parallel loop iteration, the (Aside: The OpenCilk compiler and runtime will both try to determine an appropriate grainsize for any One not-so-great use of the All that being said, suppose you need to implement a
cilk_for (int i = 0; i < n; ++i) {
loop_body(i);
} To stripmine the loop to use an arbitrary grainsize #pragma cilk grainsize 1
cilk_for (int ii = 0; ii < n; ii += G) {
for (int i = ii; i < min(ii + G, n); ++i) {
loop_body(i);
}
} (Note: There are a couple of different ways to write this stripmined loop, and those different implementations can have different performance. In this example, I'm using a compact implementation, which you will hopefully find easier to work with.) Hope that helps. Let me know if you have more questions. Cheers, |
Beta Was this translation helpful? Give feedback.
-
Thanks for your detailed answer! I'm trying to explore OpenCilk to manage parallel tasks in a heterogeneous multi-core system (e.g., big.LITTLE). Since not all cores perform at the same pace, a grain size picked by the runtime may not work best for that kind of systems. As the first step, I'm trying to sweep through multiple grain sizes to see which one would yield the optimal balance between parallelism and runtime overheads. Can you elaborate a bit more about how OpenCilk runtime picks the grain size? Is it always N/(KP) where K is a constant from empirical experiments? Does the runtime consider asymmetric systems? I will try your suggestion of strip-mining the loop. Thanks! |
Beta Was this translation helpful? Give feedback.
-
TB,
That could be a blog post when we’re set up!
Cheers,
Charles
On Thu, May 20, 2021 at 19:44 Tao B. Schardl ***@***.***> wrote:
Hey Tuan,
Essentially, OpenCilk's grainsize pragma requires constant grainsizes
because this restriction simplifies the OpenCilk design and implementation
while simultaneously accommodating and encouraging reasonable uses of the
grainsize pragma, specifically, to reduce scheduling overheads.
Here's some more context behind the grainsize pragma.
Cilk programs achieve good parallel speedup when they expose ample
parallelism to the runtime system. (Typically, one wants the program's
parallelism to exceed the number of Cilk workers by roughly a factor of
10.) Intuitively, that means that Cilk programs should aim to expose as
much parallel work as possible.
Of course, there are practical reasons why one doesn't always want to
expose all of the logical parallelism in a program. In the case of a
cilk_for loop, there is a nonnegligible cost to spawning a loop
iteration, which means cilk_for loops could incur high scheduling
overheads if they always spawned off all loop iterations. (In addition,
spawning every loop iteration can inhibit compiler optimizations, such as
vectorization.) This is where the grainsize pragma come in.
To amortize the constant overhead of spawning a parallel loop iteration,
the grainsize pragma allows the programmer to easily specify a number of
iterations that will execute serially per spawned loop iteration. As a
result, the overhead of spawning a parallel-loop iteration can be amortized
against the time to execute multiple iterations of the serial loop. Because
the overhead of spawning a loop iteration is constant, for any given loop,
a constant-value grainsize is sufficient to amortize that cost.
(Aside: The OpenCilk compiler and runtime will both try to determine an
appropriate grainsize for any cilk_for loop that doesn't have a
user-defined grainsize specified by the pragma. This mechanism works well
for many loops, but because it can't always handle all loops perfectly, we
allow users to specify a grainsize when needed.)
One not-so-great use of the grainsize pragma is to specify a grainsize of
n/P, where n is the number of loop iterations and P is the number of Cilk
workers. Although many programmers will naturally think to use such a
grainsize, this grainsize tends not to work well in the Cilk model.
Intuitively, by using this grainsize, the parallelism of the resulting loop
becomes P, the number of workers. In contrast, the Cilk runtime works
best when the parallelism significantly exceeds the number of workers,
e.g., by a factor of 10.
All that being said, suppose you need to implement a cilk_for loop with a
nonconstant grainsize for some reason.
1. We'd like to hear about your use case!
2. Its worth noting that implementing a particular grainsize for a
cilk_for loop is equivalent to stripmining the loop. For example,
consider the following cilk_for loop:
cilk_for (int i = 0; i < n; ++i) {
loop_body(i);
}
To stripmine the loop to use an arbitrary grainsize G, one can manually
rewrite the loop as follows:
cilk_for (int ii = 0; ii < n; ii += G) {
for (int i = ii; i < min(ii + G, n); ++i) {
loop_body(i);
}
}
(Note: There are a couple of different ways to write this stripmined loop,
and those different implementations can have different performance. In this
example, I'm using a compact implementation, which you will hopefully find
easier to work with.)
Hope that helps. Let me know if you have more questions.
Cheers,
TB
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#54 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA45BONXHKEEJCWUNLCUCGLTOWNDFANCNFSM45H5RJNA>
.
--
Sent from my snart phone. ;^)
|
Beta Was this translation helpful? Give feedback.
-
Hey Tuan, For a Before this runtime-grainsize computation kicks in, the OpenCilk compiler will statically analyze Of course, if the user specifies a grainsize for the loop, then the OpenCilk compiler and runtime will use that user-specified grainsize instead. Hope that clarifies things. Let me know if you have more questions. Cheers, |
Beta Was this translation helpful? Give feedback.
Hey Tuan,
Essentially, OpenCilk's
grainsize
pragma requires constant grainsizes because this restriction simplifies the OpenCilk design and implementation while simultaneously accommodating and encouraging reasonable uses of thegrainsize
pragma, specifically, to reduce scheduling overheads.Here's some more context behind the
grainsize
pragma.Cilk programs achieve good parallel speedup when they expose ample parallelism to the runtime system. (Typically, one wants the program's parallelism to exceed the number of Cilk workers by roughly a factor of 10.) Intuitively, that means that Cilk programs should aim to expose as much parallel work as possible.
Of course, there are practical reaso…