Develop stream 2024-06-26 (#575)

* feat(device_transform): add tuning benchmarks and config generation for device transform
* perf(device_transform): tuned device transform algorithm for better performance
* docs(changelog.md): add 'device_transform' improvements to changelog
* feat(ConfigAutotuneSettings.cmake): allow benchmark_device_transform to tune for more block sizes
* fix(benchmark_device_transform.cpp): fix unused type warning when compiling tuning benchmarks
* perf(device_transform.hpp): updated configs for device transform which uses a wider range of block sizes
* fix(transform_config_template): added missing '::type' in general case of transform config
* refactor(benchmark_device_transform.cpp): remove duplicated code with 'benchmark_device_transform.parallel.hpp'
* docs(changelog.md): removed 'slightly' in device transform performance improvements
* fix(benchmark_device_transform): fix various build errors and warnings
* test(test_device_batch_memcpy.cpp): add simple batch copy test
  This test can be more easily modified to find issues with batch (mem) copy.
* fix(device_batch_memcpy.hpp): use dereference instead of 'rocprim::thread_load/store'
  'thread_load/store' uses inline assembly prohibiting compiler optimization. This also bypasses an issue where 'thread_load' behaves oddly on debug builds.
* revert test(test_device_batch_memcpy.cpp): add simple batch copy test
  This reverts commit 6dafd1c66684e775eae07fe4fd50632a80ca1673.
* test(benchmark_device_adjacent_difference.cpp): increased the default size of input so that in place uint8 benchmarks don't fit in L3 cache on select architectures
* docs(changelog.md): update changelog with benchmark changes
* Added overload for match_any
* Replaced section with match_any() call
* Fixed copyright date
* Fixed formatting
* change match_any to runtime dispatch
* docs(intrinsics/warp): name the correct label_bits in match_any documentation
* unified wavefront definition
* build: Remove force-inline workaround on windows
  The problem mentioned there should be resolved by now.
* ci: enable debug builds on windows
  Supposedly the slowest jobs should now be resolved, so this should work.
* docs: Add CHANGELOG for removing force-inline workaround
* fix clang format
* fix(tests): Add saturating casts and use them for random data generation
  The `static_casts` can over / underflow making the maximum value smaller than the minimum. This was triggering an assert on the microsoft standard library. Technically this was undefined behaviour that went unnoticed on non-debug builds. Saturate the input value to the range of the distribution type instead to prevent this error.
* fix(benchmark_device_adjacent_difference): fix size in bytes instead of number of elements
* Update contributing guidelines
* specify benchmark seed via command line
* refactor lookback sleep dispatch
* add config to tests
* add config tuning for partition
* generic tuning
* add tuned configurations
* Fix "warning: loop not unrolled" with CMAKE_BUILD_TYPE=MinSizeRel (-Os)
  The compiler generates this warning when -Os is set:
  warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
  Using static values as both loop bounds fixes these warnings. For some reason, other optimization levels do not have this issue, the compiler is able to understand that the number of iterations of the loops is a compile-time value.
* fix(device_partition): re-added workaround for the device_partition family to properly limit block size for the base configuration
* Using .lint:clang-format
* refactor(intrinsics/thread.hpp): remove 'memory_fence_device' workaround for compiler bug on gfx10 and gfx11
* ci(.gitlab-ci.yml): disable debug builds in cmake-minimum due to excessive build times when targeting debug test
* fix(docs): Fixed documentation for thread subdir
* fix(docs): Fixed documentation for the types subdir
* fix undefined behavior in test data generation
* Deprecate thread_load/thread_store
* Ignore thread_load and thread_store deprecation warnings
* Deprecated raw_storage and replaced by uninitialized_array in a few locations
* unsigned char storage in raw_storage to prevent undefined aliasing
* Added ROCPRIM_DONT_SUPPRESS_DEPRECATIONS
* Resolve "Improve rocPRIM test logs"
* improve documentation for configuration tuning
* Refactor device_scan, use is_sleep_scan_state_used and with_scan_state as in other lookback algorithms
* Use device of the current stream in is_sleep_scan_state_used
* Do not build kernels with sleep in lookback state on devices that don't need it (!=gfx908)
* Resolve "Add thread headers to rocprim.hpp and document thread-level methods"
* Resolve "Batch memcpy: disable BENCHMARK_BATCH_MEMCPY_NAIVE"
* Resolve "Fix under- and overflow in minimum and maximum for input data for benchmarks"
* Resolve "CMake build consistency"
* Resolve "Benchmark utility for random segments generates segments of wrong size"
* Adapt device segmented_reduce for large indices within a segment
* Add large indices test
* Update CHANGELOG
* reduce by key tuning
* First commit nth element
* Tests nth element
* Simplified working version nth element on one block
* Added output check for correctness
* nth element sizes larger than 64
* Added equality buckets to nth_element logic
* Added multiple blocks for nth element
* Added test to see if elements did not change
* Debugging synchronization
* Nth element working version only for key with comparator greater and less
* Nth element implemented for key with tests
* Fixed issue for custom types in nth element and added tests
* Added input and output iterators for nth element
* Added some benchmarks for nth element
* Small optimizations nth element
* Debug code nth element
* Made separate kernel for block offset calculations nth element
* Small optimizations nth element
* Moved all block offset calculation to other kernel nth element
* Optimization nth element
* Make use of radix_rank instead of multiple scans
* Start of adding multiple items per thread nth element
* Nth element using less shared memory
* Nth element small optimizations and cleanup code
* Fixed benchmark break nth element after rebase
* nth element local oracle for buckets_store
* Cleanup nth element
* Nth element update tests with random nth element
* Addition of configs for nth element
* Add lookbackstates to nth element
* Cleanup and extra comments in nth_element
* Removed unnecessary test cases and choose nth_element based on seed_value
* Added nth_element to changelog
* Updated benchmark of nth_element based on feedback
* Nth_element updated tests and config based on review
* Documentation updated for nth_element
* Cleanup code nth element
* Nth element changes based on review
* Add documentation sphinx doc
* Changed nth element to a while loop
* Nth element asserts in device code
* Nth element documentation fixes
* nth element docs crash fix
* nth element lookback state reset
* Nth element changes based on review
* Replaced raw storage with uninitialized_array in nth element
* Changed Nth element to be able to be used with iterators
* nth element fix small mistakes
* Added config for in place nth element
* Changes based on review
* Added c++17 tests nth_element
* Make use of internal merge_path also fix bug with unsigned types for size
* Added test for public merge_path_search
* Fixed thread_load and thread_store bug with float and double
* Made review changes
* Add bug fixes to changelog
* ci: remove trailing newlines in gitlab-ci.yml
* ci: compress autotune artifacts using zstd
* Removed oracles array from nth element
* Remove constraint of 256 for number of buckets nth element
* Apply 1 suggestion(s) to 1 file(s)
  Co-authored-by: Nara Prasetya <[email protected]>
* clang-format: trick clang-format into always breaking after c-style function attributes
* add ctz intrinsic
  Counts the number of trailing zero bits. This is just a clean wrapper around __builtin_ctz(ll).
* lookback scan: remove HIP-CPU bits
  - memcpy() (without std) works for HIP
  - We don't really care about HIP-CPU anymore
  This cleans up the source file a bit, and it doesn't seem like this affects any benchmarks.
* lookback scan: reformat
  Makes formatting consistent with clang-format file.
* lookback scan: add reproducibility test
* test: print floats as hexfloat in assert_bit_eq
* add warp_readfirstlane and warp_readlane intrinsics
* lookback scan: add deterministic implementation
* scan: add deterministic overload
* scan_by_key: add deterministic overload
* reduce_by_key: add deterministic overload
* add char and short atomic load/store overloads
  It seems that these are just supported and work fine.
* lookback scan: change flag to be always one byte
  This slightly reduces the amount of memory required for a lookback scan. Also, changing the INVALID value from -1 to 0xFF fixes some sign issues there were before by using unsigned int as flag underlying type.
* lookback scan: swap flag and prefix, allow fast scan for values up to size 7
  Since the prefix flag is always one byte now, we can put it behind the value to get a smaller struct. This helps in some cases, for example, scan_by_key over sizeof(AccType) = 2 now fits in an int instead of a long.
* nara nit f32
* update changelog with mention of deterministic algorithms
* lookback reproducibility test: allocate temporary memory with the right scan operator
* lookback scan: avoid caching large types
  These types are stored in a separate buffer, so we don't need to or load them. Slightly speeds up deterministic scan algorithms when the lookback scan type is > 7 bytes.
* remove assertions in lookback scan, they don't compile properly in debug builds
* lookback reproducibility test: use same functor for both tests
  This enables the test to work with -ffast-math too.
* lookback scan: rotate prefix rather than block_prefix
* lookback scan: also test deterministic in normal tests
* naive implementation
* partial sort benchmark
* Made partial_sort in place and created partial_sort_copy
* Add and fix documentation partial_sort
* Test partial_sort with iterator
* Add partial_sort and partial_sort_copy to the changelog
* Moved partial sort to own file
* Added partial_sort_config
* Merge with nth_element_remove_oracle branch
* Created c++17 test for partial_sort
* Cleanup code based on nth_element review
* Review adaptations
* Added benchmark for partial_sort
* Fixed bug with implicit casting in partial sort
* add static_cast to fix compiler warning
* Restored tests for device histogram_even for half/bfloat16 types
* Removed unused variable and formatting
* ci: Enable debug builds excluding test_block_adjacent_difference/discontinuity
  These tests take extremely long time to build with clang from ROCm 6.1+.
* test(test_device_batch_memcpy.cpp): fix invalid calls being made to generate_random_data_n
* test(test_device_batch_memcpy.cpp): standardize test names
* test(test_intrinsics.cpp): fix invalid calls being made to test_utils::get_random_data
* ci(.gitlab-ci.yml): add hardened libc++ assertions when building tests with gitlab ci
* docs: update changelog
* docs: fix doxygen errors and warnings
* build(cmake/Dependencies.cmake): build rocm-cmake dependency during populate step when fetching it by source
* refactor(benchmark_config_dispatch.cpp): fix unused variable and function
* chore: bump version to 3.3.0
* Reduce items_per_thread for merge_sort to one for large types
* Reduce block_size for device_merge with large types

---------

Co-authored-by: Nara Prasetya <[email protected]>
Co-authored-by: Jaap Blok <[email protected]>
Co-authored-by: Gergely Meszaros <[email protected]>
Co-authored-by: Nol Moonen <[email protected]>
Co-authored-by: Bence Parajdi <[email protected]>
Co-authored-by: Beatriz Navidad Vilches <[email protected]>
Co-authored-by: Anton Gorenko <[email protected]>
Co-authored-by: Lőrinc Serfőző <[email protected]>
Co-authored-by: Nick Breed <[email protected]>
Co-authored-by: Arsalan Anwari <[email protected]>
Co-authored-by: Ivan <[email protected]>
ROCm · Aug 3, 2024 · dbb52d5
1 parent dc23a85
commit dbb52d5
Showing 241 changed files with 22,760 additions and 5,015 deletions.
diff --git a/.clang-format b/.clang-format
@@ -41,7 +41,6 @@ AllowShortLoopsOnASingleLine: false
 AlwaysBreakAfterReturnType: None
 AlwaysBreakBeforeMultilineStrings: false
 AlwaysBreakTemplateDeclarations: Yes
-AttributeMacros: ['ROCPRIM_DEVICE', 'ROCPRIM_HOST', 'ROCPRIM_HOST_DEVICE', 'ROCPRIM_SHARED_MEMORY', 'ROCPRIM_KERNEL', 'ROCPRIM_INLINE']
 BinPackArguments: false
 BinPackParameters: false
 BitFieldColonSpacing: Both
@@ -135,4 +134,40 @@ SpacesInConditionalStatement: false
 SpacesInContainerLiterals: true
 SpacesInParentheses: false
 SpacesInSquareBrackets: false
+
+AttributeMacros:
+    - __host__
+    - __device__
+    - __global__
+    - __forceinline__
+    - __shared__
+    - __launch_bounds__
+    - ROCPRIM_DEVICE
+    - ROCPRIM_HOST
+    - ROCPRIM_HOST_DEVICE
+    - ROCPRIM_SHARED_MEMORY
+    - ROCPRIM_KERNEL
+    - ROCPRIM_INLINE
+    - ROCPRIM_FORCE_INLINE
+    - ROCPRIM_LAUNCH_BOUNDS
+
+# Trick clang into thinking that our C-style attributes are C++-style attributes
+# Make sure that the sizes line up for linebreaks etc
+Macros:
+    - __host__=[[host]]
+    - __device__=[[device]]
+    - __global__=[[global]]
+    - __forceinline__=[[forceinline]]
+    - __shared__=[[shared]]
+    - __launch_bounds__(x)=[[launch_bounds(x)]]
+    - __attribute__(x)=[[attribute(x)]]
+    - ROCPRIM_DEVICE=[[DEVICE____]]
+    - ROCPRIM_HOST=[[HOST____]]
+    - ROCPRIM_HOST_DEVICE=[[HOST_DEVICE____]]
+    - ROCPRIM_SHARED_MEMORY=[[SHARED_MEMORY____]]
+    - ROCPRIM_KERNEL=[[KERNEL____]]
+    - ROCPRIM_INLINE=[[INLINE____]]
+    - ROCPRIM_FORCE_INLINE=[[FORCE_INLINE____]]
+    - ROCPRIM_LAUNCH_BOUNDS(x)=[[launch_bounds(x)____]]
+BreakAfterAttributes: Always
 ---
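The comment in the hunk above asks that "the sizes line up": each `Macros:` replacement should be exactly as long as the macro it stands in for, so that clang-format's line-break decisions are not skewed. A minimal sketch of such a check follows. It is not part of the repository; the list is copied by hand from the hunk, with `ROCPRIM_FORCE_INLINE` written with both opening brackets, and the helper name `length_mismatches` is hypothetical.

```python
# Illustrative check, not part of the repository: every `Macros:` entry
# should map a macro to a replacement of the same length.
MACROS = [
    "__host__=[[host]]",
    "__device__=[[device]]",
    "__global__=[[global]]",
    "__forceinline__=[[forceinline]]",
    "__shared__=[[shared]]",
    "__launch_bounds__(x)=[[launch_bounds(x)]]",
    "__attribute__(x)=[[attribute(x)]]",
    "ROCPRIM_DEVICE=[[DEVICE____]]",
    "ROCPRIM_HOST=[[HOST____]]",
    "ROCPRIM_HOST_DEVICE=[[HOST_DEVICE____]]",
    "ROCPRIM_SHARED_MEMORY=[[SHARED_MEMORY____]]",
    "ROCPRIM_KERNEL=[[KERNEL____]]",
    "ROCPRIM_INLINE=[[INLINE____]]",
    "ROCPRIM_FORCE_INLINE=[[FORCE_INLINE____]]",
    "ROCPRIM_LAUNCH_BOUNDS(x)=[[launch_bounds(x)____]]",
]


def length_mismatches(entries):
    """Return (macro, replacement) pairs whose lengths differ."""
    mismatches = []
    for entry in entries:
        # Split on the first '=' only; the replacement may contain anything.
        macro, replacement = entry.split("=", 1)
        if len(macro) != len(replacement):
            mismatches.append((macro, replacement))
    return mismatches


print(length_mismatches(MACROS))
```

A replacement that drops a bracket, such as `[FORCE_INLINE____]]`, is one character short and would be reported by this check.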
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -27,6 +27,7 @@ include:
       - /defaults.yaml
       - /deps-cmake.yaml
       - /deps-docs.yaml
+      - /deps-format.yaml
       - /deps-rocm.yaml
       - /deps-vcpkg.yaml
       - /deps-windows.yaml
@@ -46,20 +47,7 @@ variables:
 
 clang-format:
   extends:
-    - .deps:rocm
-  stage: lint
-  needs: []
-  tags:
-    - build
-  variables:
-    CLANG_FORMAT: "/opt/rocm/llvm/bin/clang-format"
-    GIT_CLANG_FORMAT: "/opt/rocm/llvm/bin/git-clang-format"
-  rules:
-    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
-  script:
-    - cd $CI_PROJECT_DIR
-    - git config --global --add safe.directory $CI_PROJECT_DIR
-    - scripts/code-format/check-format.sh $CI_MERGE_REQUEST_DIFF_BASE_SHA --binary "$CLANG_FORMAT"
+    - .lint:clang-format
 
 copyright-date:
   extends:
@@ -162,15 +150,22 @@ build:cmake-minimum-apt:
   extends:
     - .gpus:rocm-gpus
     - .rules:build
+  variables:
+    EXTRA_CMAKE_CXX_FLAGS: ""
   script:
     - mkdir -p $BUILD_DIR
     - cd $BUILD_DIR
+    - | # Add hardened libc++ assertions for tests only
+        if [[ $BUILD_TARGET == "TEST" ]]; then
+            echo "Configuring with hardened libc++!"
+            EXTRA_CMAKE_CXX_FLAGS+=" -D_GLIBCXX_ASSERTIONS=ON"
+        fi
     - cmake
       -G Ninja
       -D CMAKE_CXX_COMPILER="$AMDCLANG"
-      -D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
+      -D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror $EXTRA_CMAKE_CXX_FLAGS"
       -D CMAKE_BUILD_TYPE="$BUILD_TYPE"
-      -D BUILD_$BUILD_TARGET=ON 
+      -D BUILD_$BUILD_TARGET=ON
       -D BUILD_EXAMPLE=ON
       -D GPU_TARGETS=$GPU_TARGETS
       -D AMDGPU_TEST_TARGETS=$GPU_TARGETS
@@ -197,9 +192,11 @@ build:cmake-latest:
   extends:
     - .cmake-latest
     - .build:common
-  variables:
-    BUILD_TYPE: Release
-    BUILD_TARGET: TEST
+  parallel:
+    # Debug builds disabled due to excessive build times for debug test builds
+    matrix:
+      - BUILD_TYPE: Release
+        BUILD_TARGET: [BENCHMARK, TEST]
 
 build:cmake-minimum:
   needs: []
@@ -246,20 +243,19 @@ build:windows:
     - .deps:visual-studio-devshell
   parallel:
     matrix:
-      # Debug is disabled due to extensive link times, tracked in issue 679.
-      - BUILD_TYPE: [Release]
+      - BUILD_TYPE: [Debug, Release]
         BUILD_TARGET: [BENCHMARK, TEST]
   script:
     - mkdir -p $CI_PROJECT_DIR/build
-    - cmake -G Ninja 
-      -S $CI_PROJECT_DIR 
-      -B $CI_PROJECT_DIR/build 
-      -D BUILD_$BUILD_TARGET=ON 
+    - cmake -G Ninja
+      -S $CI_PROJECT_DIR
+      -B $CI_PROJECT_DIR/build
+      -D BUILD_$BUILD_TARGET=ON
       -D GPU_TARGETS=$GPU_TARGET
-      -D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe" 
-      -D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}" 
-      -D CMAKE_BUILD_TYPE="$BUILD_TYPE" 
-    - cmake --build "$CI_PROJECT_DIR/build" 
+      -D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
+      -D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
+      -D CMAKE_BUILD_TYPE="$BUILD_TYPE"
+    - cmake --build "$CI_PROJECT_DIR/build"
   artifacts:
     paths:
       - $CI_PROJECT_DIR/build/test/test_*
@@ -281,6 +277,10 @@ autotune:build:
     - .cmake-minimum
     - .gpus:rocm-gpus
     - .rules:benchmark
+  before_script:
+    - !reference [".cmake-minimum", before_script]
+    - $SUDO_CMD apt-get update -qq
+    - $SUDO_CMD apt-get install -qq -y zstd
   variables:
     BENCHMARK_TARGETS: benchmark_config_tuning
   script:
@@ -301,22 +301,13 @@ autotune:build:
       -D GPU_TARGETS=$GPU_TARGETS
     - cmake --build . --target $BENCHMARK_TARGETS
     - 'rm -rf $BUILD_DIR/benchmark/benchmark*.parallel'
-    # remove benchmark executables if their size together is too large for gitlab ci to handle
-    - |
-      total_size_bytes=0
-      while read -r file_size; do
-        total_size_bytes=$((total_size_bytes + file_size))
-      done < <(stat --format="%s" benchmark/benchmark*)
-      total_size_gib="$(numfmt --round=down --to-unit=Gi "$total_size_bytes")"
-      if [ "$total_size_gib" -ge 3 ]; then
-        printf "Total size: %s (%d bytes) > 3GiB, skipping benchmark executables from the artifact.\n" \
-          "$(numfmt --to=iec-i "$total_size_bytes")" "$total_size_bytes"
-        rm benchmark/benchmark*
-      fi
+    # The autotune benchmarks get very large, above GitLab's upload limit. Fortunately they compress well.
+    # We'll put them all in a single archive to compress them to a few hundred MB.
+    - find benchmark -type f -executable -print0 | tar -I zstd -cvf benchmarks.tar.zstd --null -T -
 
   artifacts:
     paths:
-      - $BUILD_DIR/benchmark/benchmark*
+      - $BUILD_DIR/benchmarks.tar.zstd
       - $BUILD_DIR/.ninja_log
       - $BUILD_DIR/deps/googlebenchmark/
     expire_in: 1 week
@@ -360,7 +351,7 @@ test:
     - cd $CI_PROJECT_DIR/build
     - ctest --output-on-failure
 
-# Disabled due to extensive link times. 
+# Disabled due to extensive link times.
 # This is tracked in issue 679
 # test-windows-debug:
 #   extends:
@@ -535,9 +526,13 @@ autotune:execute-tuning:
   artifacts:
     paths:
       - ${AUTOTUNE_RESULT_DIR}/*.json
+  before_script:
+    - !reference [".cmake-minimum", before_script]
+    - $SUDO_CMD apt-get update -qq
+    - $SUDO_CMD apt-get install -qq -y zstd
   script:
-    - >
-      cd "${CI_PROJECT_DIR}"
+    - cd "${CI_PROJECT_DIR}"
+    - tar -I zstd -xvf "${BUILD_DIR}/benchmarks.tar.zstd" -C "${BUILD_DIR}/"
     - |
       if [ ! -d "${BUILD_DIR}/benchmark" ]; then
           echo "There are no benchmark executables. Run the build job with a BUILD_TARGET."
@@ -553,6 +548,7 @@ autotune:execute-tuning:
       --benchmark_filter_regex="${AUTOTUNE_ALGORITHM_REGEX}"
       --size="${AUTOTUNE_SIZE}"
       --trials="${AUTOTUNE_TRIALS}"
+      --seed=82589933
 
 autotune:generate-config:
   image: python:3.10.5-buster

diff --git a/.gitlab/run_benchmarks.py b/.gitlab/run_benchmarks.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 
-# Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
+# Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
@@ -28,7 +28,7 @@
 import subprocess
 import sys
 
-BenchmarkContext = namedtuple('BenchmarkContext', ['gpu_architecture', 'benchmark_output_dir', 'benchmark_dir', 'benchmark_filename_regex', 'benchmark_filter_regex', 'size', 'trials'])
+BenchmarkContext = namedtuple('BenchmarkContext', ['gpu_architecture', 'benchmark_output_dir', 'benchmark_dir', 'benchmark_filename_regex', 'benchmark_filter_regex', 'size', 'trials', 'seed'])
 
 def run_benchmarks(benchmark_context):
     def is_benchmark_executable(filename):
@@ -61,6 +61,8 @@ def is_benchmark_executable(filename):
             args += ['--size', benchmark_context.size]
         if benchmark_context.trials:
             args += ['--trials', benchmark_context.trials]
+        if benchmark_context.seed:
+            args += ['--seed', benchmark_context.seed]
         try:
             subprocess.check_call(args)
         except subprocess.CalledProcessError as error:
@@ -97,6 +99,10 @@ def main():
         help='Controls the number of trial iterations for each benchmark case',
         default='',
         required=False)
+    parser.add_argument('--seed',
+        help='Controls the seed for random number generation for each benchmark case',
+        default='',
+        required=False)
 
     args = parser.parse_args()
 
@@ -107,7 +113,8 @@ def main():
         args.benchmark_filename_regex,
         args.benchmark_filter_regex,
         args.size,
-        args.trials)
+        args.trials,
+        args.seed)
 
     benchmark_run_successful = run_benchmarks(benchmark_context)
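The hunks above thread an optional `--seed` value from the command line through `BenchmarkContext` into the benchmark's argument list. The pattern can be sketched in isolation as below. This is a trimmed illustration, not the script itself: the tuple keeps only three of the fields, and the helper name `optional_args` is hypothetical.

```python
from collections import namedtuple

# Trimmed stand-in for the script's BenchmarkContext (assumption: only the
# optional fields are kept here for brevity).
BenchmarkContext = namedtuple('BenchmarkContext', ['size', 'trials', 'seed'])


def optional_args(benchmark_context):
    """Build the optional part of the argument list.

    Each option is forwarded only when its value is non-empty, which
    mirrors the empty-string defaults used by the argument parser.
    """
    args = []
    if benchmark_context.size:
        args += ['--size', benchmark_context.size]
    if benchmark_context.trials:
        args += ['--trials', benchmark_context.trials]
    if benchmark_context.seed:
        args += ['--seed', benchmark_context.seed]
    return args


print(optional_args(BenchmarkContext(size='', trials='', seed='82589933')))
```

Because the default is an empty string, omitting `--seed` leaves the benchmark's own default seeding in effect.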
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,11 +5,36 @@ Documentation for rocPRIM is available at
 
 ## Unreleased rocPRIM-3.3.0 for ROCm 6.3.0
 
-### Fixes
+### Additions
+
+* Added the `--seed` option to benchmarks to specify a seed for the generation of random inputs. The default behavior is to keep using a random seed per benchmark measurement.
+* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and device reduce by key (`rocprim::reduce_by_key`) for improved performance on selected architectures.
+* Added `rocprim::uninitialized_array` which provides uninitialized storage in local memory for user-defined types.
+* Added large segment support for `rocprim::segmented_reduce`.
+* Added a parallel `nth_element` device function similar to `std::nth_element`. This function rearranges the elements so that those smaller than the n-th element come before it and those larger come after it.
+* Added deterministic (bitwise reproducible) algorithm variants `rocprim::deterministic_inclusive_scan`, `rocprim::deterministic_exclusive_scan`, `rocprim::deterministic_inclusive_scan_by_key`, `rocprim::deterministic_exclusive_scan_by_key`, and `rocprim::deterministic_reduce_by_key`. These provide run-to-run stable results with non-associative operators such as floating-point operations, at the cost of reduced performance.
+* Added parallel `partial_sort` and `partial_sort_copy` device functions similar to `std::partial_sort` and `std::partial_sort_copy`. These functions rearrange the elements such that the elements up to and including the middle index are the same as in a sorted list.
+
+### Changes
 
+* Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
+* Changed the default seed for `device_benchmark_segmented_reduce`.
+
+### Fixes
 
 * Fixed an issue where running rtest.py on Windows with an absolute path passed to `--install_dir` caused a `FileNotFound` error.
+* rocPRIM functions are no longer forcefully inlined on Windows, significantly reducing the build
+  time in debug builds.
+* `block_load`, `block_store`, `block_shuffle`, `block_exchange` and `warp_exchange` now use placement `new` instead of copy
+  assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
+* Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
+* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would output wrong results.
+* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not cast to the correct type, producing wrong results.
+* Fixed tests failing when compiling with `-D_GLIBCXX_ASSERTIONS=ON`.
+
+### Deprecations
 
+* Deprecated `rocprim::thread_load` and `rocprim::thread_store`; use dereference instead. Not all of those functions are available on every device architecture, and their usage can hurt performance, because inline assembly inhibits optimizations.
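The `nth_element` and `partial_sort` entries in the 3.3.0 additions describe postconditions rather than a full sort. The host-side sketch below spells those postconditions out as plain Python predicates. It only illustrates the semantics described in the changelog; it does not use or model the rocPRIM API, and both function names are hypothetical.

```python
def satisfies_nth_element(values, n):
    """True if no element before index n is larger than values[n]
    and no element after index n is smaller than values[n]."""
    pivot = values[n]
    before_ok = all(v <= pivot for v in values[:n])
    after_ok = all(v >= pivot for v in values[n + 1:])
    return before_ok and after_ok


def satisfies_partial_sort(original, result, middle):
    """True if result holds the same elements as original and its prefix
    up to and including index `middle` matches the fully sorted order."""
    same_elements = sorted(original) == sorted(result)
    prefix_sorted = result[:middle + 1] == sorted(original)[:middle + 1]
    return same_elements and prefix_sorted


data = [5, 1, 4, 2, 3]
print(satisfies_nth_element([1, 2, 3, 5, 4], 2))         # True
print(satisfies_partial_sort(data, [1, 2, 3, 5, 4], 2))  # True
print(satisfies_partial_sort(data, [1, 3, 2, 5, 4], 2))  # False
```

In both cases the elements after the chosen index may appear in any order, which is what allows these operations to be cheaper than a full sort.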
 
 ## Unreleased rocPRIM-3.2.0 for ROCm 6.2.0
 
@@ -28,10 +53,10 @@ Documentation for rocPRIM is available at
 * New `rocprim::batch_copy` function added. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
 * Added more test cases, to better cover supported data types.
 * Updated some tests to work with supported data types.
-* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`. 
+* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
   To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
   of references to arithmetic types from the key.
-* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate. 
+* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
   It iterates over proxies that hold the references to the underlying values, but only allows reading and writing if the predicate is `true`.
   It can be instantiated with:
   * `rocprim::make_predicate_iterator`
@@ -43,6 +68,7 @@ Documentation for rocPRIM is available at
 
 * Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
 * Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warpsize is equal to the items per thread.
+* Improved the performance of `device_transform`.
 
 ### Fixes
 

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -45,6 +45,7 @@ endif()
 # Build options
 option(BUILD_TEST "Build tests (requires googletest)" OFF)
 option(BUILD_BENCHMARK "Build benchmarks" OFF)
+option(BUILD_NAIVE_BENCHMARK "Build naive benchmarks" OFF)
 option(BUILD_EXAMPLE "Build examples" OFF)
 option(BUILD_DOCS "Build documentation (requires sphinx)" OFF)
 option(USE_HIP_CPU "Prefer HIP-CPU runtime instead of HW acceleration" OFF)
@@ -130,7 +131,7 @@ if(USE_HIP_CPU)
 endif()
 
 # Setup VERSION
-set(VERSION_STRING "3.2.0")
+set(VERSION_STRING "3.3.0")
 rocm_setup_version(VERSION ${VERSION_STRING})
 
 # Print configuration summary