Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Implement max_num_work_groups from the launch queries extension #14333

Merged
merged 44 commits into from
Sep 11, 2024
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
aead3e3
[SYCL][Ext] Query kernel maximum active work-groups based on occupancy
GeorgeWeb May 31, 2024
e172c1e
Remove forgotten stale pi.h changes
GeorgeWeb Jun 27, 2024
a840be1
Fix query test
GeorgeWeb Jun 27, 2024
a9e17b4
Update UR cuda-adapter commit tag
GeorgeWeb Jun 27, 2024
4f81d0a
Fix formatting and add missing file
GeorgeWeb Jun 28, 2024
b2756c9
Rename the kernel_queue_specific traits definitions file
GeorgeWeb Jun 28, 2024
28b09f4
Add windows symbols
GeorgeWeb Jun 28, 2024
fd51cfb
Update include_deps tests
GeorgeWeb Jun 28, 2024
377cf3b
Rename the query to recommended_num_work_groups
GeorgeWeb Jul 4, 2024
3a8f3bf
Change return type to size_t from uint32_t
GeorgeWeb Jul 4, 2024
b5b3d43
Correct the namespace for the query type in the extension doc
GeorgeWeb Jul 4, 2024
594727e
Remove the list of queries since there is only one proposed at the mo…
GeorgeWeb Jul 4, 2024
efcf44f
Update SYCL specificaiton dependency to Revision 8 from 5
GeorgeWeb Jul 4, 2024
448b191
Update group occupancy test
GeorgeWeb Jul 4, 2024
54c1b57
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Jul 4, 2024
eb60b1c
Bump UR tag
GeorgeWeb Jul 4, 2024
b4b355e
Fix a typo in the extension doc
GeorgeWeb Jul 4, 2024
20aa7c5
Add backend support section to the extension doc
GeorgeWeb Jul 4, 2024
8fa7b09
Update the queue-only max_num_work_group_sync overload to use kernel_…
GeorgeWeb Jul 4, 2024
e5910fa
Fix formatting
GeorgeWeb Jul 4, 2024
a7411c8
Update UR tag
GeorgeWeb Jul 4, 2024
7de06c1
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Jul 5, 2024
e50e837
Update Linux and Windows symbols
GeorgeWeb Jul 5, 2024
dc2dde4
Manually select which kernel_queue_specific traits definitions to ove…
GeorgeWeb Jul 5, 2024
b7807f9
Rename the query to recommended_num_work_groups (dropping the explici…
GeorgeWeb Jul 5, 2024
4344064
Fix division per CUs and update Linux symbols
GeorgeWeb Jul 5, 2024
0ef4baa
Update test and removing printfs
GeorgeWeb Jul 5, 2024
2fa280b
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Aug 12, 2024
da8cde2
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Aug 13, 2024
27b2416
Implementation changes switching to per-device only semantics
GeorgeWeb Aug 13, 2024
b51f965
Implement max_num_work_groups launch query instead of recommended and…
GeorgeWeb Aug 30, 2024
ef4cd8b
Remove recommended_num_work_groups from the launch queries extension doc
GeorgeWeb Aug 30, 2024
6483da7
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Sep 2, 2024
4fc7353
Update UR cuda adapter tag, query tests and symbols
GeorgeWeb Sep 2, 2024
7636f78
Remove sycl.hpp from the test and update windows symbols
GeorgeWeb Sep 3, 2024
ea1e525
Address review comments
GeorgeWeb Sep 4, 2024
1698bb8
Address more review comments
GeorgeWeb Sep 10, 2024
529caa5
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Sep 10, 2024
2e190a2
Update queue argument as per review comment suggestion
GeorgeWeb Sep 10, 2024
762c7e1
Bump UR tag
GeorgeWeb Sep 10, 2024
2169579
Update symbols
GeorgeWeb Sep 10, 2024
c2788f6
Update max_num_work_groups query ext docs
GeorgeWeb Sep 10, 2024
6c5485f
Update UR merge-commit tag
GeorgeWeb Sep 11, 2024
cb5e47e
Merge remote-tracking branch 'upstream/sycl' into georgi/sycl_ext_occ…
GeorgeWeb Sep 11, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
= sycl_ext_oneapi_group_occupancy_queries

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2024 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 5 specification. All
GeorgeWeb marked this conversation as resolved.
Show resolved Hide resolved
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

This extension also depends on the following other SYCL extensions:

* link:../proposed/sycl_ext_oneapi_launch_queries.asciidoc[
sycl_ext_oneapi_launch_queries]


== Status

This is an experimental extension specification, intended to provide early
access to features and gather community feedback. Interfaces defined in this
specification are implemented in {dpcpp}, but they are not finalized and may
change incompatibly in future versions of {dpcpp} without prior notice.
*Shipping software products should not rely on APIs defined in this
specification.*


== Overview

This extension is based on the kernel-queue-specific specific querying mechanism
GeorgeWeb marked this conversation as resolved.
Show resolved Hide resolved
introduced by the sycl_ext_oneapi_launch_queries extension.

The purpose of queries the to be added is to aid occupancy based calculations
for kernel launches based on hardware occupancy per compute unit granularity.
The queries take in account the kernel resources and user-specified constraints,
such as, but not limited to, local (work-group) size and dynamic work-group
local memory (in bytes). The motivation behind is to aid the tuning of kernels,
by being able to design the algorithm's implementation to maintain the highest
possible occupancy in a portable way.

List of currently planned queries.
* max_num_work_group_occupancy_per_cu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't render well in HTML. Asciidoc requires a blank line before the bullet.

Are you planning to add more queries to this extension soon? This list of currently planned queries seems odd. I'd suggest removing it unless you have some specific plan to add more things.

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning on at least one more, being recommended_work_group_size. This would be useful in combination with the currently added one, to let the runtime assist in selecting a configuration for max HW occupancy. This is not super useful in most kernel launch configurations, but for small ones that are not the hot-path where sycl::nd_range is specified explicitly and manual fine-tuning is not required, it is a useful feature. However, I am not adding this yet.

Also, this is not much of a list with one addition at this point, so I am removing it.
Thank you for questioning this!


[source,c++]
----
sycl::queue q{};
auto bundle = sycl::get_kernel_bundle(q.get_context());
auto kernel = bundle.get_kernel<class KernelName>();

auto wgSizeRange = sycl::range{32, 1, 1};
size_t localMemorySize = 32;

namespace syclex = sycl::ext::oneapi::experimental;
uint32_t maxWGsPerCU = kernel.ext_oneapi_get_info<
syclex::info::kernel_queue_specific::max_num_work_group_occupancy_per_cu>(
GeorgeWeb marked this conversation as resolved.
Show resolved Hide resolved
q, wgSizeRange, localMemorySize);
----

NOTE: SYCL 2020 requires lambdas to be named in order to locate the associated
`sycl::kernel` object used to query information descriptors. Reducing the
verbosity of the queries shown above is left to a future extension.


== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_GROUP_OCCUPANCY_QUERIES` to one of the values defined in
the table below. Applications can test for the existence of this macro to
determine if the implementation supports this feature, or applications can test
the macro's value to determine which of the extension's features the
implementation supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===


=== Occupancy queries

[source, c++]
----
namespace ext::oneapi::experimental::info::kernel {

struct max_num_work_group_occupancy_per_cu;

}
----

[%header,cols="1,5,5,5"]
|===
|Kernel Descriptor
|Argument Types
|Return Type
|Description

|`max_num_work_group_occupancy_per_cu`
|`sycl::queue`, `sycl::range`, `size_t`
|`uint32_t`
GeorgeWeb marked this conversation as resolved.
Show resolved Hide resolved
|Returns the maximum number of actively executing work-groups per compute unit
granularity, when the kernel is submitted to the specified queue with the
specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to be a little more detailed about what counts as dynamic work-group local memory. I assume this is the sum of the sizes of all local accessors, right?

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamically allocated (SYCL) local memory, for which the size is know only at runtime and can change between kernel submissions, so yes, you are right. I will elaborate in a little more detail to make it clear.
I can see how it is not clear the way I phrased it.

the fundamental hardware unit responsible for the execution of work-groups in
parallel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that max_num_work_group_occupancy_per_cu returns a recommended work-group size? If that is the case, can we rename the query to something like recommended_num_work_groups?

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a recommendation for the maximum number of work-groups (or Cuda blocks) of specified block size etc. that will theoretically execute concurrently on the compute unit (Cuda SM) to achieve maximum occupancy.

I think I like the naming you suggest and it does sound to me like a recommendation. What do you think would be a good name based on my description? (I am sold based on the fact the original I came up with sounds a little weird.)

Thank you, @gmlueck !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to recommended_num_work_groups. Initially, I wanted to indicate that this is not per-device semantics but per compute unit (or whatever this maps to in the HW, i.e. SM for Cuda, EU for Intel Level-Zero or CU for AMD HIP), hence why I had the _per_cu in the name. However, the extension docs describe the semantics, so I think that's okay now.


|===

== Implementation notes

The implementation needs to define `sycl::kernel::ext_onapi_get_info` with the
extra `sycl::range` and `size_t` parameters in addition to the `sycl::queue`.

The Cuda, Hip and Level Zero backend adapters have the required infrastructure
required to implement the extension.
12 changes: 12 additions & 0 deletions sycl/include/sycl/detail/info_desc_helpers.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@ template <typename T> struct is_queue_info_desc : std::false_type {};
template <typename T> struct is_kernel_info_desc : std::false_type {};
template <typename T>
struct is_kernel_device_specific_info_desc : std::false_type {};
template <typename T>
struct is_kernel_queue_specific_info_desc : std::false_type {};
template <typename T> struct is_event_info_desc : std::false_type {};
template <typename T> struct is_event_profiling_info_desc : std::false_type {};
// Normally we would just use std::enable_if to limit valid get_info template
Expand Down Expand Up @@ -128,6 +130,16 @@ struct IsSubGroupInfo<info::kernel_device_specific::compile_sub_group_size>
#include <sycl/info/ext_intel_device_traits.def>
#include <sycl/info/ext_oneapi_device_traits.def>
#undef __SYCL_PARAM_TRAITS_SPEC

#define __SYCL_PARAM_TRAITS_SPEC(Namespace, DescType, Desc, ReturnT, PiCode) \
template <> \
struct is_##DescType##_info_desc<Namespace::info::DescType::Desc> \
: std::true_type { \
using return_type = Namespace::info::DescType::Desc::return_type; \
};
#include <sycl/info/ext_oneapi_kernel_queue_specific_traits.def>
#undef __SYCL_PARAM_TRAITS_SPEC

#define __SYCL_PARAM_TRAITS_SPEC(DescType, Desc, ReturnT, PiCode) \
template <> \
struct is_backend_info_desc<info::DescType::Desc> : std::true_type { \
Expand Down
4 changes: 1 addition & 3 deletions sycl/include/sycl/ext/oneapi/experimental/root_group.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,7 @@ namespace ext::oneapi::experimental {
namespace info::kernel_queue_specific {
// TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension once
// #7598 is merged.
gmlueck marked this conversation as resolved.
Show resolved Hide resolved
struct max_num_work_group_sync {
using return_type = size_t;
};
// Defined in 'sycl/info/kernel_device_specific_traits.def'
gmlueck marked this conversation as resolved.
Show resolved Hide resolved
} // namespace info::kernel_queue_specific

template <int Dimensions> class root_group {
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
__SYCL_PARAM_TRAITS_SPEC(ext::oneapi::experimental, kernel_queue_specific, max_num_work_group_occupancy_per_cu, uint32_t,)
__SYCL_PARAM_TRAITS_SPEC(ext::oneapi::experimental, kernel_queue_specific, max_num_work_group_sync, size_t,)
2 changes: 2 additions & 0 deletions sycl/include/sycl/info/info_desc.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -207,6 +207,8 @@ struct work_item_progress_capabilities;
#include <sycl/info/ext_codeplay_device_traits.def>
#include <sycl/info/ext_intel_device_traits.def>
#include <sycl/info/ext_oneapi_device_traits.def>
#include <sycl/info/ext_oneapi_kernel_queue_specific_traits.def>

#undef __SYCL_PARAM_TRAITS_SPEC
#undef __SYCL_PARAM_TRAITS_TEMPLATE_SPEC
} // namespace _V1
Expand Down
8 changes: 7 additions & 1 deletion sycl/include/sycl/kernel.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,13 @@ class __SYCL_EXPORT kernel : public detail::OwnerLessBase<kernel> {
// TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension
// once #7598 is merged.
template <typename Param>
typename Param::return_type ext_oneapi_get_info(const queue &q) const;
typename detail::is_kernel_queue_specific_info_desc<Param>::return_type
ext_oneapi_get_info(const queue &q) const;

template <typename Param>
typename detail::is_kernel_queue_specific_info_desc<Param>::return_type
ext_oneapi_get_info(const queue &Queue, const range<3> &WorkGroupSize,
size_t DynamicLocalMemorySize) const;

private:
/// Constructs a SYCL kernel object from a valid kernel_impl instance.
Expand Down
4 changes: 2 additions & 2 deletions sycl/plugins/unified_runtime/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -119,8 +119,8 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
)

fetch_adapter_source(cuda
${UNIFIED_RUNTIME_REPO}
${UNIFIED_RUNTIME_TAG}
https://github.com/GeorgeWeb/unified-runtime.git
3883c1139bd9d799159bac3ccf9e29d151f030d4
)

fetch_adapter_source(hip
Expand Down
72 changes: 65 additions & 7 deletions sycl/source/detail/kernel_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,11 @@ class kernel_impl {
template <typename Param>
typename Param::return_type ext_oneapi_get_info(const queue &q) const;

template <typename Param>
typename Param::return_type
ext_oneapi_get_info(const queue &Queue, const range<3> &MaxWorkGroupSize,
size_t DynamicLocalMemorySize) const;

/// Get a reference to a raw kernel object.
///
/// \return a reference to a valid PiKernel instance with raw kernel object.
Expand Down Expand Up @@ -269,22 +274,75 @@ kernel_impl::get_info(const device &Device,
getPlugin());
}

namespace syclex = ext::oneapi::experimental;

template <>
inline typename syclex::info::kernel_queue_specific::
max_num_work_group_occupancy_per_cu::return_type
kernel_impl::ext_oneapi_get_info<syclex::info::kernel_queue_specific::
max_num_work_group_occupancy_per_cu>(
const queue &Queue, const range<3> &WorkGroupSize,
size_t DynamicLocalMemorySize) const {
if (WorkGroupSize.size() == 0) {
throw runtime_error("The launch work-group size cannot be zero.",
PI_ERROR_INVALID_WORK_GROUP_SIZE);
}

const auto &Plugin = getPlugin();
const auto &Handle = getHandleRef();
const auto &Device = Queue.get_device();

// Calculate max number of work-groups per compute unit
const auto NumCUs = Device.get_info<info::device::max_compute_units>();

pi_uint32 GroupCount{0};
Plugin->call<PiApiKind::piextKernelSuggestMaxCooperativeGroupCount>(
Handle, WorkGroupSize.size(), DynamicLocalMemorySize, &GroupCount);
return GroupCount / NumCUs;
}

template <>
inline typename ext::oneapi::experimental::info::kernel_queue_specific::
max_num_work_group_sync::return_type
inline typename syclex::info::kernel_queue_specific::max_num_work_group_sync::
return_type
kernel_impl::ext_oneapi_get_info<
ext::oneapi::experimental::info::kernel_queue_specific::
max_num_work_group_sync>(const queue &Queue) const {
syclex::info::kernel_queue_specific::max_num_work_group_sync>(
[[maybe_unused]] const queue &Queue, const range<3> &WorkGroupSize,
AlexeySachkov marked this conversation as resolved.
Show resolved Hide resolved
size_t DynamicLocalMemorySize) const {
if (WorkGroupSize.size() == 0) {
throw runtime_error("The launch work-group size cannot be zero.",
PI_ERROR_INVALID_WORK_GROUP_SIZE);
}

const auto &Plugin = getPlugin();
const auto &Handle = getHandleRef();
const auto MaxWorkGroupSize =
Queue.get_device().get_info<info::device::max_work_group_size>();

pi_uint32 GroupCount = 0;
Plugin->call<PiApiKind::piextKernelSuggestMaxCooperativeGroupCount>(
Handle, MaxWorkGroupSize, /* DynamicSharedMemorySize */ 0, &GroupCount);
Handle, WorkGroupSize.size(), DynamicLocalMemorySize, &GroupCount);
return GroupCount;
}

template <>
inline typename syclex::info::kernel_queue_specific::max_num_work_group_sync::
return_type
kernel_impl::ext_oneapi_get_info<
syclex::info::kernel_queue_specific::max_num_work_group_sync>(
const queue &Queue) const {
const auto &Device = Queue.get_device();
// Prevent out of launch resources for Cuda if this is used for calculating
// the total work group size for kernel launches, by restricting the max size
// to the kernel_device_specific maximum.
const auto MaxWorkGroupSize =
(Device.get_backend() == backend::ext_oneapi_cuda)
? get_info<info::kernel_device_specific::work_group_size>(Device)
: Device.get_info<info::device::max_work_group_size>();
GeorgeWeb marked this conversation as resolved.
Show resolved Hide resolved

return ext_oneapi_get_info<
syclex::info::kernel_queue_specific::max_num_work_group_sync>(
Queue, sycl::range{MaxWorkGroupSize, 1, 1},
/* DynamicLocalMemorySize */ 0);
}

} // namespace detail
} // namespace _V1
} // namespace sycl
1 change: 1 addition & 0 deletions sycl/source/feature_test.hpp.in
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,7 @@ inline namespace _V1 {
#define SYCL_EXT_ONEAPI_FREE_FUNCTION_KERNELS 1
#define SYCL_EXT_ONEAPI_PROD 1
#define SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS 1
#define SYCL_EXT_ONEAPI_GROUP_OCCUPANCY_QUERIES 1

#ifndef __has_include
#define __has_include(x) 0
Expand Down
19 changes: 18 additions & 1 deletion sycl/source/kernel.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -106,17 +106,34 @@ kernel::get_info<info::kernel_device_specific::max_sub_group_size>(
const device &, const sycl::range<3> &) const;

template <typename Param>
typename Param::return_type
typename detail::is_kernel_queue_specific_info_desc<Param>::return_type
kernel::ext_oneapi_get_info(const queue &Queue) const {
return impl->ext_oneapi_get_info<Param>(Queue);
}

template <typename Param>
typename detail::is_kernel_queue_specific_info_desc<Param>::return_type
kernel::ext_oneapi_get_info(const queue &Queue, const range<3> &WorkGroupSize,
size_t DynamicLocalMemorySize) const {
return impl->ext_oneapi_get_info<Param>(Queue, WorkGroupSize,
DynamicLocalMemorySize);
}

template __SYCL_EXPORT typename ext::oneapi::experimental::info::
kernel_queue_specific::max_num_work_group_sync::return_type
kernel::ext_oneapi_get_info<
ext::oneapi::experimental::info::kernel_queue_specific::
max_num_work_group_sync>(const queue &Queue) const;

#define __SYCL_PARAM_TRAITS_SPEC(Namespace, DescType, Desc, ReturnT, PiCode) \
template __SYCL_EXPORT ReturnT \
kernel::ext_oneapi_get_info<Namespace::info::DescType::Desc>( \
const queue &, const range<3> &, size_t) const;

#include <sycl/info/ext_oneapi_kernel_queue_specific_traits.def>

#undef __SYCL_PARAM_TRAITS_SPEC

kernel::kernel(std::shared_ptr<detail::kernel_impl> Impl) : impl(Impl) {}

pi_native_handle kernel::getNative() const { return impl->getNative(); }
Expand Down
Loading
Loading