Skip to content

Incorrect result and non-deterministic hardware specific result with threaded OpenBLAS on Intel Xeon 8488C for matrix with sparse column #5267

Open
@jwasteneys-ccl

Description

@jwasteneys-ccl

Summary
We are observing incorrect and non-deterministic results when performing matrix multiplication of the form t(X) %*% X in R when using OpenBLAS on Intel Xeon 8488C CPUs. The issue appears only when multi-threading. The bug does not occur on other CPU architectures (e.g., AMD EPYC, Intel 8259CL) and is not present when limiting OpenBLAS to a single thread or using R’s internal matrix multiplication routines.

System Information:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8488C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 8
BogoMIPS: 4800.00

OS: Debian Linux

BLAS: OpenBLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 )

Threading: pthreads

LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

R version 4.4.1 (2024-06-14)

Reproducible Example:

Xsmall.csv

In this sample data, column 2449 is all exactly zero. When calculating X'X the element at [2448,2449] should also be exactly zero:

R
X <- as.matrix(read.csv("Xsmall.csv"))
options(matprod = "blas")
(t(X) %*% X)[2448, 2449]  # returns 0.002292634

# issue fixed with crossprod:
crossprod(X)[2448, 2449]  # returns exactly 0

# Manual sum:
sum(X[, 2448] * X[, 2449])  # returns exactly 0

# issue fixed with internal R implementation
options(matprod = "internal") 
(t(X) %*% X)[2448, 2449]  # returns exactly 0

Only on affected hardware (Intel 8488C), when using more than 1 thread, we see an incorrect value (~0.00229). Setting OPENBLAS_NUM_THREADS=1 makes the issue go away.

We have only reproduced the issue so far with the attached matrix. Note this matrix is ill-conditioned. However, this should not affect the result to this magnitude.

Here is a chart of the result we get for element [2448,2449] with respect to the OPENBLAS_NUM_THREADS setting:
Image

Observations

  • The issue does not appear on AMD EPYC 7R13 or Intel 8259CL.
  • Only observed with multithreaded OpenBLAS on AVX-512 capable hardware.
  • Dot products involving entirely zero columns behave unexpectedly under multithreading.
  • Manual dot product confirms result should be exactly 0.
  • Disabling threading (OPENBLAS_NUM_THREADS=1) makes the issue disappear.

Hypothesis
This appears to be a bug in the multithreaded dgemm implementation when using AVX-512. Potential contributing factors include:

  • Accumulation errors across SIMD lanes when computing many dot products in parallel.

  • Internal tiling and block accumulation strategies may be introducing rounding errors even when one operand is all zero.

  • The result (~0.0023) is too large to be explained by standard floating-point noise, suggesting a flaw in reduction logic.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions