Description
Summary
We are observing incorrect and non-deterministic results when performing matrix multiplication of the form t(X) %*% X in R when using OpenBLAS on Intel Xeon 8488C CPUs. The issue appears only when multi-threading. The bug does not occur on other CPU architectures (e.g., AMD EPYC, Intel 8259CL) and is not present when limiting OpenBLAS to a single thread or using R’s internal matrix multiplication routines.
System Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8488C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 8
BogoMIPS: 4800.00
OS: Debian Linux
BLAS: OpenBLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 )
Threading: pthreads
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
R version 4.4.1 (2024-06-14)
Reproducible Example:
In this sample data, column 2449 is all exactly zero. When calculating X'X the element at [2448,2449] should also be exactly zero:
R
X <- as.matrix(read.csv("Xsmall.csv"))
options(matprod = "blas")
(t(X) %*% X)[2448, 2449] # returns 0.002292634
# issue fixed with crossprod:
crossprod(X)[2448, 2449] # returns exactly 0
# Manual sum:
sum(X[, 2448] * X[, 2449]) # returns exactly 0
# issue fixed with internal R implementation
options(matprod = "internal")
(t(X) %*% X)[2448, 2449] # returns exactly 0
Only on affected hardware (Intel 8488C), when using more than 1 thread, we see an incorrect value (~0.00229). Setting OPENBLAS_NUM_THREADS=1 makes the issue go away.
We have only reproduced the issue so far with the attached matrix. Note this matrix is ill-conditioned. However, this should not affect the result to this magnitude.
Here is a chart of the result we get for element [2448,2449] with respect to the OPENBLAS_NUM_THREADS setting:
Observations
- The issue does not appear on AMD EPYC 7R13 or Intel 8259CL.
- Only observed with multithreaded OpenBLAS on AVX-512 capable hardware.
- Dot products involving entirely zero columns behave unexpectedly under multithreading.
- Manual dot product confirms result should be exactly 0.
- Disabling threading (OPENBLAS_NUM_THREADS=1) makes the issue disappear.
Hypothesis
This appears to be a bug in the multithreaded dgemm implementation when using AVX-512. Potential contributing factors include:
-
Accumulation errors across SIMD lanes when computing many dot products in parallel.
-
Internal tiling and block accumulation strategies may be introducing rounding errors even when one operand is all zero.
-
The result (~0.0023) is too large to be explained by standard floating-point noise, suggesting a flaw in reduction logic.