Lower latency associative scan option #10599
oliverdutton asked this question in Ideas
In one of my problems, the implementation was bottlenecked by a cumulative matmul. JAX has a handy implementation of a work-efficient associative scan for this, `lax.associative_scan`, which reduces the procedure from N sequential steps to 2 log2(N) - 2 steps. There is also a work-inefficient implementation that reduces this to log2(N) steps, shown below, which is faster for small problem sizes where the GPU is not saturated. Would anyone else be interested in having a work-inefficient option in `lax.associative_scan`? If so, I can put together a pull request.
See:
- https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
- http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
- https://en.wikipedia.org/wiki/Prefix_sum
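For context, the existing work-efficient path for the cumulative-matmul case looks roughly like this (array shapes are illustrative):

```python
import jax
import jax.numpy as jnp
from jax import lax

# A stack of N square matrices; the scan computes all running products.
key = jax.random.PRNGKey(0)
mats = jax.random.normal(key, (8, 4, 4)) / 4.0

# jnp.matmul broadcasts over the leading (batch) axis, so it serves as
# a binary associative operator for the scan; associativity (not
# commutativity) is all associative_scan requires, so the left-to-right
# product order of the matrices is preserved.
cumulative = lax.associative_scan(jnp.matmul, mats)
```

Here `cumulative[i]` holds `mats[0] @ mats[1] @ ... @ mats[i]`.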