dpotrf + dpotri: Windows vs Linux #4886

Open

AllinCottrell opened this issue Aug 29, 2024 · 6 comments

@AllinCottrell
AllinCottrell commented Aug 29, 2024

I've come across what looks like an anomalous difference in performance when inverting a positive definite matrix using dpotrf() and dpotri() on Windows as compared with Linux. This is on a dual-boot SkylakeX laptop, using OpenBLAS 0.3.28, compiled with gcc 14.2.0 on Arch Linux and cross-compiled with x86_64-w64-mingw32-gcc 14.2.0 for Windows 11, in both cases using OpenMP for threading. The configuration flags are mostly the same for the two OpenBLAS builds, except that the Windows build uses DYNAMIC_ARCH=1 while the Linux one is left to auto-detect SkylakeX.

The context is a Gibbs sampling operation with many thousands of iterations, so the performance difference becomes very striking. My test rig iterates inversion of a sequence of p.d. matrices of moderate size, from dimension 4 to 64 by powers of 2. Given the moderate size, multi-threading is not really worthwhile. Best performance is achieved by setting OMP_NUM_THREADS=1; in that case the rig runs very fast on both platforms, with Windows marginally slower than Linux. But if I set the number of OMP threads to equal the number of physical cores (4), which is the default in the program I'm working with,

  • there's just a slight degradation of performance on Linux, but
  • the performance on Windows becomes really horrible, 10 or more times slower than Linux.

I'd be very grateful if anyone can offer insight into what might be going on here. I'd be happy to supply more details depending on what might be relevant.
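
For concreteness, here is a minimal sketch of the dpotrf() + dpotri() in-place inversion being timed. This is an illustration only, not the actual test rig, and it assumes OpenBLAS was built with its LAPACKE interface enabled:

/* Minimal sketch (illustration only): invert a symmetric positive
   definite matrix in place with dpotrf + dpotri, via LAPACKE. */
#include <stdio.h>
#include <lapacke.h>

/* Invert the n x n p.d. matrix 'a' in place (column-major, lda = n).
   Returns 0 on success, otherwise the LAPACK info code. */
static int invert_pd(double *a, int n)
{
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, n); /* A = L * L^T */
    if (info != 0)
        return info;
    /* compute A^{-1} from the Cholesky factor; only the 'L' triangle is filled */
    return LAPACKE_dpotri(LAPACK_COL_MAJOR, 'L', n, a, n);
}

int main(void)
{
    /* small symmetric, diagonally dominant (hence p.d.) test matrix */
    double a[9] = { 4, 1, 1,
                    1, 3, 0,
                    1, 0, 2 };
    int info = invert_pd(a, 3);
    printf("dpotrf/dpotri info = %d\n", info);
    return info != 0;
}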

@martin-frbg
Collaborator

Can you set OPENBLAS_VERBOSE=2 in the Windows environment please, just to be sure that it uses SKYLAKEX there too as expected? There may be a few places in the code where OpenMP is handled differently on the two platforms, and I guess the libgomp runtime on Windows may differ from the Linux implementation too... I'm currently at a conference with limited access to decent hardware, so it may take me a few days to investigate.

@AllinCottrell
Author

Thanks for looking into this, Martin. I can confirm that SKYLAKEX is detected on Windows.
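
For what it's worth, the detected kernel and threading backend can also be queried programmatically; a minimal sketch using the query functions exported by OpenBLAS's cblas.h:

/* Query OpenBLAS at run time for its build config, detected core and
   threading backend.  Link against the same libopenblas the application uses. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    printf("config:   %s\n", openblas_get_config());      /* version + build options */
    printf("corename: %s\n", openblas_get_corename());    /* e.g. "SkylakeX" */
    printf("parallel: %d\n", openblas_get_parallel());     /* 0=serial, 1=pthreads, 2=OpenMP */
    printf("threads:  %d\n", openblas_get_num_threads());
    return 0;
}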

@AllinCottrell
Author

Any more thoughts on this?

@martin-frbg
Collaborator

Thoughts have been few and far between as I caught covid in the meantime. Sorry, nothing obvious in the OpenBLAS codebase comes to mind even now. I guess you could try whether setting OMP_WAIT_POLICY=passive has any influence on this misbehaviour.
Did you perchance use earlier versions of OpenBLAS that did not show this? Otherwise, a small, self-contained test case would be helpful for tracking this down.

@AllinCottrell
Author

AllinCottrell commented Sep 17, 2024

Thanks, Martin. OMP_WAIT_POLICY=passive does have an influence: it makes the problem a good deal worse! We have used earlier versions of OpenBLAS; we noticed the problem only recently, by chance, so it may well have been there before, unnoticed. Anyway, I'm attaching a self-contained test case and I'll inline below the results from running it on a few systems.

Aside from the relatively extreme problem on Windows, it seems to me that in general the matrix size at which multi-threading kicks in is much too small for optimality. In lapack/potrf/potrf_L_parallel.c there's this clause:

if (n <= GEMM_UNROLL_N * 4) {
    info = POTRF_L_SINGLE(args, NULL, range_n, sa, sb, 0);
    return info;
}

In many cases the default value of DGEMM_UNROLL_N is 4, so this policy would start multi-threading at n = 17. From experimentation on several machines with various Intel and AMD processors I think the threshold should be much higher, in the range 100-150.
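
One possible application-level workaround, until the library-side threshold changes, is to cap the thread count for small matrices before calling dpotrf/dpotri. A rough sketch (not taken from the attached test case; the 128 cutoff is illustrative only, based on the 100-150 range mentioned above):

/* Sketch of an application-level workaround: force single-threaded
   dpotrf/dpotri for small matrices, restoring the previous setting after. */
#include <cblas.h>
#include <lapacke.h>

static int invert_pd_capped(double *a, int n)
{
    int info;
    int saved = openblas_get_num_threads();

    if (n < 128)
        openblas_set_num_threads(1);   /* small problem: skip threading overhead */

    info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, n);
    if (info == 0)
        info = LAPACKE_dpotri(LAPACK_COL_MAJOR, 'L', n, a, n);

    openblas_set_num_threads(saved);   /* restore for larger workloads */
    return info;
}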

Anyway, here are the results I have from the test case. The times are for 50000 replications of inversion of a p.d. matrix. "default" means letting OpenBLAS decide how many threads to use, and "single" means forcing use of a single thread. All the machines referenced below are quad-core.

Arch Linux, OpenBLAS 0.3.28, blascore HASWELL
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0456    0.0471    0.97
  8    0.0656    0.0652    1.01
 16    0.1386    0.1378    1.01
 17    0.5386    0.1498    3.60
 32    0.7247    0.4257    1.70
 64    2.4502    1.6720    1.47

Windows 11, OpenBLAS 0.3.28, blascore SKYLAKEX
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0400    0.0680    0.59
  8    2.2960    0.0940   24.43
 16    6.7490    0.1860   36.28
 17    8.9470    0.2000   44.73
 32   15.9350    0.5670   28.10
 64   36.5520    2.2400   16.32

Arch Linux, OpenBLAS 0.3.28, blascore SKYLAKEX
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0416    0.0425    0.98
  8    0.2349    0.0546    4.30
 16    0.6412    0.1092    5.87
 17    0.8500    0.1417    6.00
 32    1.6334    0.3540    4.61
 64    4.8043    1.3387    3.59

Fedora, OpenBLAS 0.3.21, blascore SANDYBRIDGE
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0872    0.0813    1.07
  8    0.1041    0.1049    0.99
 16    0.6036    0.1979    3.05
 17    0.8675    0.2146    4.04
 32    1.7614    0.6149    2.86
 64    6.5231    2.8875    2.26

Attachment: invpd.c.txt

@martin-frbg
Collaborator

Thank you very much. Unfortunately I did not manage to do much so far, but at least this does not appear to be a recent regression.
