This release contains new features, several notable changes, and some bug fixes.
Notable changes include:
New features:
- HIP compiler back-end added to support AMD GPUs. Usage is essentially the same as for CUDA; a brief sketch appears after this list. Note that this feature is considered a work-in-progress and not yet production ready. It is undocumented, but noted here for friendly users who would like to try it out.
- Updated version of the camp third-party library, which includes a variety of portability fixes. Most users should not need to concern themselves with the details of camp.
- Added new tutorial material and exercises.
- Documentation improvements.
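As noted above, HIP usage is expected to mirror CUDA. A minimal sketch, assuming the policy naming follows the CUDA pattern (RAJA::hip_exec with a block-size parameter) and that x and y point to device-accessible memory:

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch of HIP usage mirroring CUDA (hedged; this back-end is
// work-in-progress and undocumented in this release).
void daxpy_hip(double a, const double* x, double* y, int N)
{
  RAJA::forall<RAJA::hip_exec<256>>(RAJA::RangeSegment(0, N),
    [=] RAJA_DEVICE (int i) {
      y[i] += a * x[i];  // same kernel body one would write for CUDA
    });
}
```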
API Changes:
- None.
Build changes/improvements:
- The RAJA version number is now accessible as #define macro constants, so that users who need to parameterize their code to support multiple RAJA versions can do so more easily; a brief sketch appears after this list. See the file RAJA/include/RAJA/config.hpp for details. RAJA version numbers are also exported as CMake variables.
- Added support for linking to an external camp library. By default, the camp git submodule will be used. If you prefer to use a different version of camp, set the RAJA CMake variable 'EXTERNAL_CAMP_SOURCE_DIR' to the location of the desired camp directory.
- The BLT submodule (CMake-based build system) has been updated to the latest BLT release (v0.3.0). That release contains a new version of GoogleTest, which required us to modify our use of gtest macros and our own testing macros. For the most part, this change should be invisible to users. However, the new GoogleTest does not work with CUDA versions 9.1.x or earlier. Therefore, if you compile RAJA with CUDA enabled and also wish to enable RAJA tests, you must use CUDA 9.2.x or newer.
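As an illustration of the version macros mentioned above, a minimal sketch assuming the exported macro names RAJA_VERSION_MAJOR and RAJA_VERSION_MINOR (consult RAJA/include/RAJA/config.hpp for the authoritative names):

```cpp
#include "RAJA/config.hpp"

// Minimal sketch: guard version-dependent code paths on the RAJA version.
#if defined(RAJA_VERSION_MAJOR) && \
    (RAJA_VERSION_MAJOR > 0 || RAJA_VERSION_MINOR >= 9)
  // Code path for newer RAJA versions.
#else
  // Fallback for older RAJA versions.
#endif
```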
Bug fixes:
- Fixed various issues to make internal implementations more robust, resolved issues with non-fully-qualified types in some places, and added workarounds for some compiler issues.
This release contains new features, several notable changes, and some bug fixes.
Notable changes include:
New features:
- Added CUDA block direct execution policies, which can be used to map loop iterations directly to CUDA thread blocks. These are analogous to the pre-existing thread direct policies. The new block direct policies can provide better performance than the block loop policies for kernels where load balancing is not an issue; a brief sketch appears after this list. Please see the RAJA User Guide for a description of all available RAJA execution policies.
- Added a plugin registry feature that allows plugins to be linked into RAJA, which can act before and after kernel launches. One benefit is that RAJA no longer has an explicit CHAI dependency when used with CHAI. Future benefits will include integration with other tools for performance analysis, etc.
- Added a shift method to RAJA::View, which allows one to create a new view object from an existing one that is shifted in index space from the original. Please see the RAJA User Guide for details.
- Added support for RAJA::TypedView and RAJA::TypedOffsetLayout, so that the index type can be specified as a template parameter.
- Added helper functions to convert a RAJA::Layout object to a RAJA::OffsetLayout object and RAJA::TypedLayout to RAJA::TypedOffsetLayout. Please see the RAJA User Guide for details.
- Added a bounds-checking option to RAJA Layout types as a debugging feature. This is a compile-time option that reports an error when given View or Layout indices are out of bounds. See the View/Layout section in the RAJA User Guide for instructions on enabling this feature and how it works.
- We've added a RAJA Template Project on GitHub, which shows how to use RAJA in an application, either as a Git submodule or as an externally installed library that you link your application against. It is available here: https://github.com/LLNL/RAJA-project-template. It is also linked to the main RAJA project page on GitHub.
- Various user documentation improvements.
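To illustrate the block direct policies mentioned above, a minimal sketch of a RAJA::kernel policy mapping the outer loop directly to CUDA blocks and the inner loop directly to threads (policy names as described in the RAJA User Guide; the initialization kernel is illustrative):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: outer loop -> one block per iterate (direct mapping),
// inner loop -> one thread per iterate (direct mapping).
using EXEC_POL = RAJA::KernelPolicy<
  RAJA::statement::CudaKernel<
    RAJA::statement::For<1, RAJA::cuda_block_x_direct,
      RAJA::statement::For<0, RAJA::cuda_thread_x_direct,
        RAJA::statement::Lambda<0>
      >
    >
  >
>;

void init(double* a, int N, int M)  // a: device-accessible, size N*M
{
  RAJA::kernel<EXEC_POL>(
    RAJA::make_tuple(RAJA::RangeSegment(0, N), RAJA::RangeSegment(0, M)),
    [=] RAJA_DEVICE (int i, int j) {
      a[j*N + i] = 0.0;
    });
}
```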
API Changes:
- The type alias RAJA::IndexSet that was marked deprecated previously has been removed. Now, all index set usage must use the type RAJA::TypedIndexSet and specify all segment types (as template parameters) that the index set may potentially hold.
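For illustration, a minimal sketch of the required TypedIndexSet usage (the segment constructors shown reflect this release; signatures may differ in other versions):

```cpp
#include "RAJA/RAJA.hpp"
#include <vector>

// Minimal sketch: the index set declares every segment type it may hold.
using ISet = RAJA::TypedIndexSet<RAJA::RangeSegment, RAJA::ListSegment>;

void build(ISet& iset, const std::vector<RAJA::Index_type>& idx, int N)
{
  iset.push_back(RAJA::RangeSegment(0, N));                  // contiguous part
  iset.push_back(RAJA::ListSegment(idx.data(), idx.size())); // irregular part
}
```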
Bug fixes:
- Fix for issue in OpenMP target offload back-end that previously caused some RAJA Performance Suite kernels to seg fault when built with the XL compiler.
- Removed an internal RAJA class constructor to prevent users from doing potentially incorrect, and very difficult to track down, things in their code that are technically not supported in RAJA, such as inserting RAJA::statement::CudaSyncThreads() at arbitrary points inside a lambda expression.
Build changes/improvements:
- RAJA now enforces a minimum CUDA compute capability of sm_35. Users can set this with the CMake variable 'CUDA_ARCH'. If it is not specified, the value sm_35 will be used and an informational message will be emitted indicating this. If a user attempts to set the value lower than sm_35, CMake will error out with a message explaining why.
- Transitioned to using camp as a Git submodule following its open-source release (https://github.com/llnl/camp).
- The minimum required CMake version is now 3.9.
- Update BLT build system submodule to newer version (SHA-1 hash: 96419df).
- Cleaned up compiler warnings in OpenMP target back-end implementation.
This release contains feature enhancements, one breaking change, and some bug fixes.
Breaking change:
- The atomic namespace in RAJA has been removed. Now, use atomic operations as, for example, RAJA::atomicAdd(), not RAJA::atomic::atomicAdd(). This was done to make atomic usage consistent with other RAJA features, such as reductions, scans, etc.
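For example, a minimal sketch of the new spelling (the auto_atomic policy is shown as one option; an OpenMP build is assumed):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: atomic operations now live directly in namespace RAJA.
void histogram(int* bins, const int* values, int N)
{
  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N),
    [=](int i) {
      // Previously spelled RAJA::atomic::atomicAdd(...)
      RAJA::atomicAdd<RAJA::auto_atomic>(&bins[values[i]], 1);
    });
}
```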
Other notable changes include:
Features:
- The lambda statement interface in the RAJA::kernel API has been extended. Previously, when multiple lambda expressions were used in a kernel, they were all required to take the same arguments, although not every argument had to be used in each lambda expression. Now, lambda arguments may be specified in the RAJA::statement::Lambda type so that each lambda expression need only take the arguments it uses; a brief sketch appears after this list. The previous usage pattern continues to be supported. To support the new interface, new statement types have been introduced to indicate iteration space variables (Segs), local variable/array parameters (Params), and index offsets (Offsets). The Offsets can be used with a For statement as a replacement for the ForICount statement. The new API features are described in the RAJA User Guide.
- Minloc and maxloc reductions now support a tuple of index values. For example, in a nested loop kernel with i, j, k loops, you can retrieve the 'loc' value as an (i, j, k) triple.
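A minimal sketch of the extended Lambda interface described above (statement and tag names per the RAJA User Guide; the fill kernel is illustrative):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: the Lambda statement lists only the arguments the lambda
// uses; here, both segment indices via Segs<0, 1>. A lambda using a local
// parameter would list Params<...> instead.
using EXEC_POL = RAJA::KernelPolicy<
  RAJA::statement::For<1, RAJA::loop_exec,
    RAJA::statement::For<0, RAJA::loop_exec,
      RAJA::statement::Lambda<0, RAJA::Segs<0, 1>>
    >
  >
>;

void fill(double* a, int N, int M)
{
  RAJA::kernel<EXEC_POL>(
    RAJA::make_tuple(RAJA::RangeSegment(0, N), RAJA::RangeSegment(0, M)),
    [=](int i, int j) { a[j*N + i] = i + j; });
}
```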
Bug Fixes:
- Small change to make RAJA Views work properly with OpenMP target kernels.
- Changes to fix OpenMP target back-end for XL compilers.
- Fix build issue with older versions of GNU compiler.
- Fixes for corner cases where an improper number of threads per block or number of thread blocks was chosen for CUDA execution policies.
Build changes/improvements:
- A few minor portability improvements.
This release contains one major change and some minor improvements to compilation and performance.
Major changes include:
- Build system updated to use the latest version of BLT (or close to it). Depending on how one builds RAJA, this could require changes to how information is passed to CMake. Content has been added to the relevant sections of the RAJA User Guide describing how this is done.
Other notable changes include:
Features (these are not yet documented and should be considered experimental; documentation and usage examples will appear in the next RAJA release):
- New thread, warp, and bitmask policies for CUDA.
- Added an AtomicLocalArray type, which returns data elements wrapped in an AtomicRef object.
Bug Fixes:
- Fixed issue in RangeStrideSegment iteration.
- Fix 'align hint' macro to eliminate compile warning when XL compiler is used with nvcc.
- Fixed issues associated with the CUDA architecture level (i.e., sm_*) being set too low, which generated compiler warnings/errors. Caveats about RAJA features (mostly atomic operations) available at different CUDA architecture levels have been added to the User Guide.
Performance Improvements:
- Some performance improvements in RAJA::kernel usage with CUDA back-end.
This release contains several major changes, new features, a variety of bug fixes, and expanded user documentation and accompanying example codes. For more information and details about any of the changes listed below, please consult the RAJA documentation for the 0.7.0 release, which is linked from our GitHub project.
Major changes include:
- RAJA::forallN methods were marked deprecated in the 0.6.0 release. They have been removed. All applications that contain nested loops and have been using forallN methods should convert them to use the RAJA::kernel interface.
- RAJA::forall methods that take explicit loop bounds rather than segments (e.g., RAJA::forall(beg, end, ...)) were marked deprecated in the 0.6.0 release. They have been removed. Hopefully, this will result in faster compile times due to simpler template resolution. Users who have been passing loop bounds directly to forall methods should convert those cases to use RAJA segments instead; a brief sketch appears after this list.
- CUDA execution policies for use in RAJA::kernel policies have been significantly reworked and redefined. The new set of policies is much more flexible and provides improved run time performance.
- New, improved support is available for loop tiling algorithms, CPU cache blocking, CUDA GPU thread-local data, and shared memory. This includes RAJA::kernel policy statement types that make tile numbers and local tile indices available in user kernels (TileTCount and ForICount statement types), and a new RAJA::LocalArray type with various CPU and GPU memory policies. Due to these new features, RAJA 'shmem window' statements have been removed.
- This release contains expanded documentation and example codes for the RAJA::kernel interface, including loop tiling algorithms and support for CPU cache blocking, CUDA GPU thread local data and shared memory.
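As an example of the forall conversion mentioned above, a minimal sketch (the daxpy body is illustrative):

```cpp
#include "RAJA/RAJA.hpp"

// Removed form (sketch):  RAJA::forall<RAJA::seq_exec>(0, N, [=](int i) {...});
// Current form passes a segment that describes the iteration space:
void daxpy(double a, const double* x, double* y, int N)
{
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=](int i) {
      y[i] += a * x[i];
    });
}
```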
Other notable changes include:
Features:
- Initial support for OpenMP target execution policies with RAJA::kernel added.
- The RAJA::AtomicRef interface is now consistent with the C++20 std::atomic_ref interface.
- Atomic compare-exchange operations added.
- CUDA reduce policies no longer require a thread-block size parameter.
- New features considered preliminary, with no significant documentation or examples available yet:
- RAJA::statement::Reduce type for use in RAJA::kernel execution policies. This enables the ability to perform reductions and access reduced values inside user kernels.
- Warp-level execution policies added for CUDA.
Performance improvements:
- Better use of inline directives to improve likelihood of SIMD instruction generation with the Intel compiler.
Bug fixes:
- Several CHAI integration issues resolved.
- Resolve issue with alignx directive when using XL compiler as host compiler with CUDA.
- Fix issue associated with how XL compiler interprets OpenMP region definition.
- Various tweaks to camp implementation to improve robustness.
Build changes/improvements:
- The minimum required version of CMake has changed to 3.8 for all programming model back-ends, except CUDA. The minimum CMake version for CUDA support is 3.9.
- Improved support for clang-cuda compiler. Some features still do not work with that compiler.
- Update NVIDIA cub module to version 1.8.0.
- Enable use of 'BLT_SOURCE_DIR' CMake variable to help prevent conflicts with BLT versions in RAJA and other libraries used in applications.
This release contains two major changes, a variety of bug fixes and feature enhancements, and expanded user documentation and accompanying example codes.
Major changes include:
- RAJA::forallN methods are marked deprecated. They will be removed in the 0.7.0 release.
- RAJA::forall methods that take loop bounds rather than segments (e.g., RAJA::forall(beg, end, ...)) are marked deprecated. They will be removed in the 0.7.0 release.
- RAJA::nested has been replaced with RAJA::kernel. The RAJA::kernel interface is much more flexible and full-featured. Going forward, it will be the supported interface for nested loops and more complex kernels in RAJA; a brief sketch appears after this list.
- This release contains new documentation and example codes for the RAJA::kernel interface. The documentation describes key features and summarizes available 'statement' types. However, it remains a work in progress, and expanded documentation with more examples will be available in future releases.
- Documentation of other RAJA features has been expanded and improved in this release, along with additional example codes.
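For illustration, a minimal sketch of a doubly-nested loop expressed with RAJA::kernel (sequential policies shown; statement types per the RAJA User Guide):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: RAJA::kernel in place of the replaced RAJA::nested.
using EXEC_POL = RAJA::KernelPolicy<
  RAJA::statement::For<1, RAJA::seq_exec,    // outer loop: j
    RAJA::statement::For<0, RAJA::seq_exec,  // inner loop: i
      RAJA::statement::Lambda<0>
    >
  >
>;

void init(double* a, int N, int M)
{
  RAJA::kernel<EXEC_POL>(
    RAJA::make_tuple(RAJA::RangeSegment(0, N), RAJA::RangeSegment(0, M)),
    [=](int i, int j) { a[j*N + i] = 0.0; });
}
```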
Other notable changes include:
New or improved features:
- RAJA CUDA reductions now work with host/device lambdas.
- List segments now work with RAJA::kernel loops.
- New and expanded collection of build files for LC and ALCF machines. Hopefully, these will be helpful to folks getting started.
Performance improvements:
- Improved performance in some RAJA::View use cases.
- Removed unnecessary operations in min/max atomics.
Bug fixes:
- Issues in View with OffsetLayout fixed.
- Construction of a const View from a non-const View now works.
- CUDA kernels are no longer launched in RAJA::kernel loops when the iteration space has size zero.
This is a bugfix release that fixes bugs in the IndexSetBuilder methods. These methods now work correctly with the strongly-typed IndexSet.
This release fixes some small bugs, including compiler warnings issued for deprecated features, type narrowing, and the slice method for the RangeStrideSegment class.
It also adds a new CMake variable, RAJA_LOADED, that is used to determine whether RAJA's CMakeLists file has already been processed. This is useful when including RAJA as part of another CMake project.
This release contains fixes for compiler warnings with newer GCC and Clang compilers, and allows strongly-typed indices to work with RangeStrideSegment.
Additionally, the index type for all segments in an IndexSet needs to be the same. This requirement is enforced with a static_assert.
This release contains a variety of bug fixes, removal of nvcc compiler warnings, additional unit tests to expand coverage, and a variety of other code cleanup and improvements. The most notable changes in this version include:
- New RAJA User Guide and Tutorial, along with a set of example codes that illustrate basic usage of RAJA features and accompany the tutorial. The examples are in the RAJA/examples directory. The user guide is available online here: RAJA User Guide and Tutorial.
- RAJA::IndexSet is now deprecated. You may still use it until it is removed in a future release -- you will see a notification message at compile time that it is deprecated. Index set functionality is now available via RAJA::TypedIndexSet, where you specify all segment types as template parameters when you declare an instance of it. This change allows us to remove all virtual methods from the index set, pass index set objects to CUDA GPU kernels with all of their functionality, and support any arbitrary segment type, even user-defined ones. Please see the User Guide for details. Segment dependencies are being developed for the typed index set and will be available in a future release.
- RAJA::nested::forall changes:
  - Addition of CUDA and OpenMP collapse policies for nested loops. OpenMP collapse does what the OpenMP collapse clause does. CUDA collapse will collapse a loop nest into a single CUDA kernel based on how the nested policies specify that the loop levels should be distributed over blocks and threads.
  - Added new policy RAJA::cuda_loop_exec to enable inner loops to run sequentially inside a CUDA kernel with RAJA::nested::forall.
  - Fixed RAJA::nested::forall so it now works with RAJA's CUDA Reducer types.
  - Removed TypedFor policies. For type safety of nested loop iteration variables, it makes more sense to use TypedRangeSegment, since the variables are associated with the loop kernel and not the execution policy, which may be applied to multiple loops with different variables.
- Fixed OpenMP scans to calculate chunks of work based on the actual number of threads the OpenMP runtime makes available.
- Enhancements and fixes to RAJA/CHAI interoperability.
- Added aliases for several camp types in the RAJA namespace; e.g., camp::make_tuple can now be accessed as RAJA::make_tuple. This change makes the RAJA API more consistent and clear.
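For instance, a minimal sketch of the aliasing (tuple contents are illustrative):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: camp::make_tuple reached through the RAJA alias.
void tuple_example()
{
  auto t = RAJA::make_tuple(0, 3.14);  // same as camp::make_tuple(0, 3.14)
  double x = camp::get<1>(t);          // element access via camp::get
  (void)x;
}
```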
This release contains a bugfix for warnings when using the -Wpedantic flag.
This release contains minor fixes for issues in the previous v0.3.1 release, plus improvements to documentation and reduction performance, improved portability across a growing set of compilers and environments (e.g., Windows), namespace refactoring to avoid cyclic dependencies and leverage argument-dependent lookup, etc. In addition, the RAJA back-end for Intel TBB is now off by default, whereas previously it was on by default.
A few major changes are included in this version:
- Changes to the way RAJA is configured and built. We are now using the BLT build system, which is a Git submodule of RAJA. In addition to requiring the '--recursive' option to be passed to 'git clone', this introduces the following major change: RAJA_ENABLE_XXX options passed to CMake are now just ENABLE_XXX.
- A new API and implementation for nested-loop RAJA constructs has been added. It is still a work in progress, but users are welcome to try it out and provide feedback. Eventually, RAJA::nested::forall will replace RAJA::forallN.
This release contains some new RAJA features, plus a bunch of internal changes including more tests, conversion of nearly all unit tests to use Google Test, improved testing coverage, and compilation portability improvements (e.g., Intel, nvcc, msvc). Also, the extension for all RAJA source files has been changed from 'cxx' to 'cpp' for consistency with the header file extension conversion in the last release. The source file extension change should not require users to change anything.
New features included in this release:
- Execution policy modifications and additions:
  - seq_exec is now strictly sequential (no SIMD, etc.).
  - simd_exec will force SIMD vectorization.
  - loop_exec (new policy) will allow the compiler to optimize however it can, including SIMD.
  So, loop_exec is really what our previous simd_exec policy was, and 'no vector' pragmas have been added to all sequential implementations. A brief sketch contrasting the three policies appears after this list.
  NOTE: SIMD changes are still being evaluated with different compilers on different platforms. More information will be provided as we learn more.
- Added support for atomic operations (min, max, inc, dec, and, or, xor, exchange, and CAS) for all programming model back-ends. These appear in the RAJA::atomic namespace.
- Support added for an Intel Threading Building Blocks back-end (considered experimental at this point).
- Added macros that will be used to mark features for future deprecation (please watch for this, as we will be deprecating some features in the next release).
- Added support for C++17 if CMake knows about it.
- Removed the limit on the number of ordered OpenMP reductions that can be used in a kernel.
- Removed a compile-time error from memutils; added a portable aligned allocator.
- Improved ListSegment implementation.
- RAJA::Index_type is now ptrdiff_t instead of int.
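As promised in the execution-policy item above, a minimal sketch contrasting the three policies (spelled with the segment-based forall form; the copy body is illustrative):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: the same loop under the three policies discussed above.
void copy(double* y, const double* x, int N)
{
  // Strictly sequential: no SIMD.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=](int i) { y[i] = x[i]; });

  // Force SIMD vectorization.
  RAJA::forall<RAJA::simd_exec>(RAJA::RangeSegment(0, N),
    [=](int i) { y[i] = x[i]; });

  // Let the compiler optimize however it can, possibly including SIMD.
  RAJA::forall<RAJA::loop_exec>(RAJA::RangeSegment(0, N),
    [=](int i) { y[i] = x[i]; });
}
```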
Notable bug fixes included in this release:
- Fixed strided_numeric_iterator to apply stride sign in comparison.
- Fixed a bug in RangeStrideSegment when using CUDA.
- Fixed reducer logic for the openmp_ordered policy.
This release contains breaking changes and is not backward compatible with prior versions. The largest change is a re-organization of header files, and the switch to .hpp as a file extension for all headers.
New features included in this release:
- Re-organization of header files.
- Renaming of file extensions.
- Rudimentary OpenMP 4.5 support.
- CHAI support.
This release includes some small fixes, as well as an initial re-organization of the RAJA header files as we move towards a more flexible usage model.
This release includes the following changes:
- Initial support for the clang-cuda compiler.
- New, faster OpenMP reductions. N.B. The default OpenMP reductions are no longer performed in an ordered fashion, so results may not be reproducible. The old, ordered reductions are still available with the policy RAJA::omp_reduce_ordered; a brief sketch follows.
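A minimal sketch of opting back into the ordered, reproducible reductions mentioned above (spelled per later RAJA releases; an OpenMP build is assumed):

```cpp
#include "RAJA/RAJA.hpp"

// Minimal sketch: ordered OpenMP reduction for reproducible results.
double sum_ordered(const double* x, int N)
{
  RAJA::ReduceSum<RAJA::omp_reduce_ordered, double> sum(0.0);
  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N),
    [=](int i) { sum += x[i]; });
  return sum.get();
}
```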
Hotfix to update the URLs used for fetching clang during Travis builds.
Bugfix release that addresses an error when launching forall CUDA kernels with a 0-length range.
This release contains fixes for compiler warnings and removes the usage of the custom FindCUDA CMake package.
Includes internal changes for performance and code maintenance.
Initial release.