Support (I)FFT calculations #42

Open
Jolanrensen opened this issue Mar 28, 2021 · 24 comments

Jolanrensen commented Mar 28, 2021

For my specific data engineering application, a fast Fast Fourier Transform implementation would be great to have. There are also many other applications that could benefit from FFT, like audio analysis.

Most math/vector libraries have some form of (inverse) FFT built in, like NumPy for Python, Apache Commons Math for Java (which is slow), and JTransforms (which is quite fast and multithreaded, but doesn't use SIMD).

I think FFT could benefit a lot from the SIMD capabilities of Viktor.

This is confirmed by several papers, of which this is just the first example I could find.

Intel MKL also has a form of FFT built in which uses SIMD (and possibly other Intel-specific optimizations) as well.

I don't know if boost.SIMD already has FFT built in, but I wouldn't be surprised if it does.

@dievsky dievsky self-assigned this Mar 29, 2021

dievsky commented Mar 29, 2021

Thank you for your interest! I will look into it. I didn't find any FFT methods in boost.SIMD, unfortunately.
Do you have some specific FFT variety in mind? AFAIR, there are some special cases for different array sizes, like for powers of two, for other composite numbers, for arbitrary numbers etc.

@daphil19

I'm coming at this fairly naively, but would wrapping FFTW with JNI calls be an effective way to implement this? The fallback for unsupported architectures could be JTransforms.


dievsky commented Mar 29, 2021

It might be effective, but it would also require either re-licensing viktor or purchasing a dedicated fftw license, since fftw is apparently GPL.

@Jolanrensen

> Thank you for your interest! I will look into it. I didn't find any FFT methods in boost.SIMD, unfortunately.
> Do you have some specific FFT variety in mind? AFAIR, there are some special cases for different array sizes, like for powers of two, for other composite numbers, for arbitrary numbers etc.

Powers of 2 are fine. I understand the FFT is more efficient that way (and Apache Commons Math also had this as a hard limitation). Other than that, here is what I'm specifically using right now:

1. A forward Fast Fourier Transform taking a real double array as input and outputting a complex array in interleaved form ([real0, imag0, real1, imag1, ...]).

2. A multiplication of two of these complex arrays.

3. An inverse Fast Fourier Transform with scaling on the product, from which I take just the real parts.
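
For illustration, the three steps above amount to a circular convolution via the frequency domain. Below is a minimal Java sketch under stated assumptions: a naive O(n²) DFT stands in for the real FFT so each step stays visible, and all class/method names are hypothetical, not viktor or JTransforms API.

```java
// Hypothetical illustration of the FFT -> multiply -> IFFT pipeline.
public class ConvolutionSketch {

    // Step 1: forward transform of a real input, returning the spectrum
    // in interleaved form [re0, im0, re1, im1, ...].
    static double[] dftReal(double[] x) {
        int n = x.length;
        double[] out = new double[2 * n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double ang = -2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(ang);
                im += x[t] * Math.sin(ang);
            }
            out[2 * k] = re;
            out[2 * k + 1] = im;
        }
        return out;
    }

    // Step 2: pointwise complex product, computed in place in a.
    static void complexMultiply(double[] a, double[] b) {
        for (int k = 0; k < a.length / 2; k++) {
            double aRe = a[2 * k], aIm = a[2 * k + 1];
            double bRe = b[2 * k], bIm = b[2 * k + 1];
            a[2 * k] = aRe * bRe - aIm * bIm;
            a[2 * k + 1] = aRe * bIm + aIm * bRe;
        }
    }

    // Step 3: inverse transform with 1/n scaling, keeping only the reals.
    static double[] idftReals(double[] c) {
        int n = c.length / 2;
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            double re = 0;
            for (int k = 0; k < n; k++) {
                double ang = 2 * Math.PI * k * t / n;
                re += c[2 * k] * Math.cos(ang) - c[2 * k + 1] * Math.sin(ang);
            }
            out[t] = re / n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Circular convolution with a delta impulse returns the input unchanged.
        double[] fx = dftReal(new double[]{1, 2, 3, 4});
        double[] fd = dftReal(new double[]{1, 0, 0, 0});
        complexMultiply(fx, fd);
        System.out.println(java.util.Arrays.toString(idftReals(fx)));
    }
}
```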

It might be possible to still wrap another library that uses SIMD. I'm currently trying to get Intel MKL to work for my project, but I can't get the linking to work in Gradle. There are, however, many libraries to consider: https://github.com/project-gemmi/benchmarking-fft


dievsky commented Mar 31, 2021

A couple insights:

  • MKL seems to be the most versatile, but it's also big, Intel-oriented, and closed-source;
  • PocketFFT is lightweight, but only does SIMD for multidimensional arrays.

We could also probably implement a SIMD-friendly decimation-in-frequency reduction ourselves, delegating the actual FFT for small arrays to Apache Commons Math (we have it as a dependency anyway).
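
For concreteness, a single radix-2 decimation-in-frequency step could look like the sketch below: it reduces a size-n transform on an interleaved complex array to two independent size-n/2 transforms (even-indexed and odd-indexed outputs), which could then be delegated to any existing FFT. This is an illustration of the textbook DIF reduction with hypothetical names, not viktor code.

```java
// One decimation-in-frequency step on an interleaved complex array
// [re0, im0, re1, im1, ...]. Hypothetical illustration.
public class DifStep {

    // Splits n complex values into two half-size subproblems:
    // top[k]    = x[k] + x[k + n/2]
    // bottom[k] = (x[k] - x[k + n/2]) * W^k, with W = exp(-2*pi*i/n).
    // The FFT of top yields the even-indexed outputs, the FFT of bottom
    // the odd-indexed ones.
    static double[][] decimate(double[] x) {
        int n = x.length / 2, h = n / 2;
        double[] top = new double[n], bottom = new double[n];
        for (int k = 0; k < h; k++) {
            double aRe = x[2 * k],       aIm = x[2 * k + 1];
            double bRe = x[2 * (k + h)], bIm = x[2 * (k + h) + 1];
            top[2 * k] = aRe + bRe;
            top[2 * k + 1] = aIm + bIm;
            double dRe = aRe - bRe, dIm = aIm - bIm;
            double ang = -2 * Math.PI * k / n;
            double wRe = Math.cos(ang), wIm = Math.sin(ang);
            bottom[2 * k] = dRe * wRe - dIm * wIm;     // complex multiply
            bottom[2 * k + 1] = dRe * wIm + dIm * wRe; // by the twiddle factor
        }
        return new double[][] { top, bottom };
    }

    public static void main(String[] args) {
        // A delta impulse: both subproblems become half-size deltas.
        double[][] s = decimate(new double[]{1, 0, 0, 0, 0, 0, 0, 0});
        System.out.println(java.util.Arrays.toString(s[0]));
        System.out.println(java.util.Arrays.toString(s[1]));
    }
}
```

Note that the twiddle-factor step is exactly a bulk complex multiplication, which is why the data layout question discussed below matters.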

@Jolanrensen

> A couple insights:
>
>   • MKL seems to be the most versatile, but it's also big, Intel-oriented, and closed-source;
>   • PocketFFT is lightweight, but only does SIMD for multidimensional arrays.
>
> We could also probably implement a SIMD-friendly decimation-in-frequency reduction ourselves, delegating the actual FFT for small arrays to Apache Commons Math (we have it as a dependency anyway).

I'd skip Apache Commons Math, actually. It converts its results to Complex class instances, and even when I rewrote it to use F64FlatArrays, it was still very slow compared to JTransforms.

I did find another list of libraries: https://community.vcvrack.com/t/complete-list-of-c-c-fft-libraries/9153


dievsky commented Mar 31, 2021

OK, thanks for the info! It's surprisingly hard to find a permissively licensed library on this list that does 1-D SIMD FFT on doubles, so the option of combining viktor's existing SIMD capabilities with a Java FFT implementation is still viable.


dievsky commented Mar 31, 2021

After some research, it looks doubtful that we'll be able to perform the decimation using the existing SIMD capabilities. The problem is that we require complex multiplication for this, and all the libraries I've seen expect the complex data in interleaved format (RIRI), while SIMD would only work well with split format (RRII).
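
To illustrate the layout issue, here is complex multiplication in both formats. The split (RRII) version is plain parallel-array arithmetic with no element shuffling, which is what makes it SIMD-friendly; the interleaved (RIRI) version needs its operands rearranged across lanes first. A sketch for comparison only, not code from viktor or any library mentioned here.

```java
// Complex multiplication in the two data layouts. Hypothetical names.
public class ComplexLayouts {

    // Interleaved (RIRI): a and b are [re0, im0, re1, im1, ...];
    // the result overwrites a. Adjacent lanes mix re and im, which
    // forces shuffles in a vectorized version.
    static void mulInterleaved(double[] a, double[] b) {
        for (int k = 0; k < a.length; k += 2) {
            double re = a[k] * b[k] - a[k + 1] * b[k + 1];
            double im = a[k] * b[k + 1] + a[k + 1] * b[k];
            a[k] = re;
            a[k + 1] = im;
        }
    }

    // Split (RRII): separate real and imaginary arrays; every access is a
    // straight element-wise stream, so the loop maps directly onto SIMD.
    static void mulSplit(double[] aRe, double[] aIm, double[] bRe, double[] bIm) {
        for (int k = 0; k < aRe.length; k++) {
            double re = aRe[k] * bRe[k] - aIm[k] * bIm[k];
            double im = aRe[k] * bIm[k] + aIm[k] * bRe[k];
            aRe[k] = re;
            aIm[k] = im;
        }
    }

    public static void main(String[] args) {
        // (1 + 2i) * (3 + 4i) = -5 + 10i, in both layouts.
        double[] a = {1, 2};
        ComplexLayouts.mulInterleaved(a, new double[]{3, 4});
        System.out.println(java.util.Arrays.toString(a));
        double[] re = {1}, im = {2};
        ComplexLayouts.mulSplit(re, im, new double[]{3}, new double[]{4});
        System.out.println(re[0] + " " + im[0]);
    }
}
```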

@Jolanrensen

> After some research, it looks doubtful that we'll be able to perform the decimation using existing SIMD capabilities. The problem is that we require complex multiplication for this, and all the libraries I've seen expect the complex data in interleaved format (RIRI), while SIMD would only work well with split format (RRII).

However, many of the libraries, like pffft, already use SIMD, so that should be plug-and-play, right? It's only "pretty fast" though, but since it does use SIMD, I guess that would make it at least "quite fast", haha.


dievsky commented Mar 31, 2021

Yes, the previous comment was related to the option where we would decimate in Kotlin using viktor's existing SIMD capabilities and delegate to JTransforms once the array size became small enough.


dievsky commented Mar 31, 2021

pffft only works with floats, unfortunately. PocketFFT only uses SIMD for arrays of two or more dimensions. FFTW is under the GPL.

@Jolanrensen

I see, you're right. Maybe we can convert pffft to doubles? I'm not sure how this would affect the program (but I don't consider myself an active low-level programmer anyway, so I might be wrong).

@Jolanrensen

Oh here we go, with double support: https://github.com/marton78/pffft


dievsky commented Mar 31, 2021

Hmm, might work. I'll have to look into it in more detail later.


dievsky commented Apr 6, 2021

Played a bit with the library. It seems that it doesn't support unaligned arrays, which is a problem, since the JVM's double arrays are not aligned to 32 bytes (the array object normally is, but its header takes 16 bytes, so the data is only aligned to 16 bytes, and there's no ironclad guarantee even of that). And by "doesn't support" I mean it crashes hard with a segfault.
I'll try to ask the author about it.
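
The alignment arithmetic behind this can be illustrated with a short sketch. The addresses below are made up, and the round-up helper shows the usual native-side workaround (over-allocate by alignment − 1 bytes and round the base pointer up), not necessarily what any actual fix here does.

```java
// Address-alignment arithmetic. Hypothetical values for illustration.
public class Alignment {

    // Rounds an address up to the next multiple of alignment
    // (alignment must be a power of two).
    static long alignUp(long addr, long alignment) {
        return (addr + alignment - 1) & ~(alignment - 1);
    }

    static boolean isAligned(long addr, long alignment) {
        return (addr & (alignment - 1)) == 0;
    }

    public static void main(String[] args) {
        // A made-up array data address: object base on a 32-byte boundary
        // plus a 16-byte header, so the data is 16- but not 32-byte aligned.
        long dataAddr = 0x7f0000001000L + 16;
        System.out.println(isAligned(dataAddr, 16)); // true
        System.out.println(isAligned(dataAddr, 32)); // false
        // Over-allocating and rounding up lands on a 32-byte boundary.
        System.out.println(Long.toHexString(alignUp(dataAddr, 32)));
    }
}
```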


dievsky commented Apr 6, 2021

Created an issue: marton78/pffft#22.


dievsky commented Apr 8, 2021

The maintainers closed the issue as "too much work to do".


dievsky commented Apr 9, 2021

OK, I think I've actually managed to fix the unaligned access problems.

Preliminary (really, really preliminary) benchmarks show very promising results: our wrapper outperforms JTransforms by an order of magnitude. This is in all likelihood an unfair comparison, but still.

@Jolanrensen by the way, how do you multiply the complex arrays? And what is the typical size of the array?


Jolanrensen commented Apr 9, 2021

That's very promising!
Typical sizes I'm working with at the moment are on the order of hundreds (currently around 250), but this might differ in the future, I'm not sure yet. The calculations do happen often.

To multiply the results from JTransforms (which are ordered [real, imaginary, real, imaginary, ...]), I wrote the following to multiply a in place with b:

for (k in 0 until a.size / 2) {
    val aRe = a[2 * k]
    val aIm = a[2 * k + 1]
    val bRe = b[2 * k]
    val bIm = b[2 * k + 1]

    a[2 * k] = aRe * bRe - aIm * bIm
    a[2 * k + 1] = aRe * bIm + aIm * bRe
}

(this might be able to be done more efficiently, not sure)


dievsky commented Apr 9, 2021

Hmm, 250 is a really small array, I'd say. Not sure we'll see much benefit from SIMD here. Doing a more fair comparison, I start to see the difference at about 1024 complex elements (2048 doubles).
The reason I'm asking about complex multiplication is that it looks easily SIMDizable as well.


daphil19 commented Apr 9, 2021

I have a use case that includes complex multiplication of up to 4M (2^22) complex values (so 8M doubles), so a speedup from SIMD would be welcome. That said, our FFT code is on the critical path, so any speedup there would trump optimizing the multiplication.


dievsky commented Apr 9, 2021

Mostly a side note for future me:
I have done quite a few benchmarks, and they generally show that the SIMD performance gain is most pronounced at mid-range array sizes (10K, 100K). I have no ironclad explanation for this, but my guess is that at smaller sizes (100, 1000) the JNI overhead is too expensive, while at larger sizes (1M, 10M) the long JNI call somehow interferes with normal JVM work.
This leads me to consider splitting large arrays into palatable chunks and SIMDizing each chunk with a separate call. It would be really easy to implement for element-wise operations and most reductions, but with FFT, we'd need to employ a decimation technique which, in turn, relies on complex multiplication. So we need efficient complex multiplication anyway.
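
For the element-wise case, the chunking idea could look like the following sketch, where a plain Java loop stands in for the per-chunk native SIMD call; the chunk size and all names are illustrative assumptions, not tuned or taken from viktor.

```java
// Splitting a large element-wise operation into fixed-size chunks,
// each of which would be handed to a separate native SIMD call.
// Hypothetical illustration; the "kernel" here is plain Java.
public class Chunked {

    static final int CHUNK = 1 << 16; // illustrative chunk size, not tuned

    // Stand-in for a per-chunk native SIMD kernel: dst[i] += src[i].
    static void addKernel(double[] dst, double[] src, int from, int to) {
        for (int i = from; i < to; i++) dst[i] += src[i];
    }

    // Drives the kernel over the whole array, one chunk per call, so no
    // single JNI call holds the JVM up for too long.
    static void addChunked(double[] dst, double[] src) {
        for (int from = 0; from < dst.length; from += CHUNK) {
            addKernel(dst, src, from, Math.min(from + CHUNK, dst.length));
        }
    }

    public static void main(String[] args) {
        double[] dst = {1, 2, 3};
        addChunked(dst, new double[]{4, 5, 6});
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```

As the comment above notes, the same trick does not transfer directly to the FFT: splitting a transform needs a decimation step, which is where the complex multiplication comes back in.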


dievsky commented Apr 15, 2021

So, my solution seems to be working on Linux, both for me locally and on TeamCity. The other OSes, well...
The macOS build is acting weird: it fails to assemble with the unaligned access workaround enabled (citing an undefined __m256d_u), but it also doesn't produce the expected SIGSEGV when assembled without the workaround.
The Windows build currently fails with multiple C syntax errors when compiling pffft. Investigation ongoing.


dievsky commented Apr 19, 2021

I must confess I'm a bit stuck on the Windows build. I can't even link against the original version (marton78/pffft) due to "unresolved external symbol" errors. If anyone participating in this issue has some experience with the Visual C++ toolchain, I'd appreciate the help.


3 participants