Message-ID: <CACueBy7-1dMwPQ4mirrRjsOkKKyLchkBR+7qMVqxjo7Bbr1T=A@mail.gmail.com>
Date: Mon, 27 Oct 2025 15:50:22 +0800
From: chuang <nashuiliang@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, open list <linux-kernel@...r.kernel.org>
Subject: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status

Dear FPU/x86 Maintainers,

I am writing to report an issue concerning the accuracy of AVX-512
usage tracking, specifically when querying the information via
'/proc/<pid>/arch_status' on systems supporting the instruction set.

This report references the mechanism introduced by the following
patch: https://lore.kernel.org/all/20190117183822.31333-1-aubrey.li@intel.com/T/#u

I have validated the patch's effect in modern environments supporting
AVX-512 (e.g., Intel Xeon Gold, AMD Zen4) and found that the tracking
mechanism does not accurately reflect the actual AVX-512 instruction
usage by the process.

Test Environment:
- CPU: Intel Xeon Gold (AVX-512 supported)
- Test Program: periodic_wake.c (verified via objdump to contain no
  AVX-512 instructions)
- Test Goal: Compare the AVX-512 execution status reported by the perf
  PMU with the status reported by procfs arch_status.

perf PMU:

$ perf stat -e instructions,cycles,fp_arith_inst_retired.512b_packed_double,fp_arith_inst_retired.512b_packed_single,fp_arith_inst_retired.8_flops,fp_arith_inst_retired2.128bit_packed_bf16,fp_arith_inst_retired2.256bit_packed_bf16,fp_arith_inst_retired2.512bit_packed_bf16 ./periodic_wake > /dev/null
^C./periodic_wake: Interrupt

 Performance counter stats for './periodic_wake':

         2,329,116      instructions                                 #    2.86  insn per cycle    (33.57%)
           814,040      cycles                                                                    (56.61%)
                 0      fp_arith_inst_retired.512b_packed_double                                  (9.82%)
     <not counted>      fp_arith_inst_retired.512b_packed_single                                  (0.00%)
     <not counted>      fp_arith_inst_retired.8_flops                                             (0.00%)
     <not counted>      fp_arith_inst_retired2.128bit_packed_bf16                                 (0.00%)
     <not counted>      fp_arith_inst_retired2.256bit_packed_bf16                                 (0.00%)
     <not counted>      fp_arith_inst_retired2.512bit_packed_bf16                                 (0.00%)

       1.366220977 seconds time elapsed

       0.000000000 seconds user
       0.002253000 seconds sys


procfs arch_status:

$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms:      44
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms:      64
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms:      91
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms:      50
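
For repeated sampling, the same value can be read programmatically
instead of via cat. The following is only a minimal sketch: the helper
names parse_avx512_elapsed_ms() and read_avx512_elapsed_ms() are
hypothetical and not part of any existing tool, and the code assumes
the single "AVX512_elapsed_ms:" line format shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the AVX512_elapsed_ms value from the
 * text of /proc/<pid>/arch_status. Returns 0 and stores the value in
 * *out on success, or -1 if the key is missing or has no number.
 */
int parse_avx512_elapsed_ms(const char *text, long *out)
{
    static const char key[] = "AVX512_elapsed_ms:";
    const char *p;
    char *end;

    assert(text != NULL && out != NULL);

    p = strstr(text, key);
    if (p == NULL)
        return -1;

    p += sizeof(key) - 1;
    /* strtol() skips the whitespace between the key and the number. */
    *out = strtol(p, &end, 10);
    return (end == p) ? -1 : 0;
}

/*
 * Hypothetical helper: read and parse arch_status for one PID.
 * Returns -1 if the file cannot be opened or parsed.
 */
int read_avx512_elapsed_ms(int pid, long *out)
{
    char path[64];
    char buf[256];
    size_t n;
    FILE *f;

    assert(out != NULL);

    snprintf(path, sizeof(path), "/proc/%d/arch_status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    return parse_avx512_elapsed_ms(buf, out);
}
```

Calling read_avx512_elapsed_ms() in a loop with the PID of
periodic_wake would yield the same kind of samples as the cat
invocations above.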

Based on the observed behavior and a review of the referenced patch,
my hypothesis is:

On AVX-512 capable systems, the implementation appears to record the
current timestamp into 'task->thread.fpu.avx512_timestamp' upon any
task switch, irrespective of whether the task has actually executed an
AVX-512 instruction.

This continuous updating of the timestamp, even for non-AVX-512 tasks,
results in misleading non-zero values for AVX512_elapsed_ms, rendering
the mechanism ineffective for accurately determining if a task is
actively utilizing AVX-512.
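
To illustrate the hypothesis, here is a toy user-space model. It is not
kernel code: the struct, the function names and the millisecond clock
are all made up for illustration, and the -1 "never recorded" value
reflects my understanding of the arch_status documentation. The model
only shows that, if the timestamp were refreshed on every schedule-out,
a task waking every 100 ms would always report a value below 100 ms,
which is consistent with the samples above.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model, NOT kernel code. All times are in milliseconds. */
struct task_model {
    unsigned long avx512_timestamp;   /* 0 means "never recorded" */
};

/*
 * Model of a schedule-out at time now_ms.
 * used_avx512:   whether the task really had AVX-512 state in use.
 * unconditional: if non-zero, models the hypothesized behaviour in
 *                which the timestamp is refreshed on every switch.
 */
void model_sched_out(struct task_model *t, unsigned long now_ms,
                     int used_avx512, int unconditional)
{
    assert(t != NULL);
    if (unconditional || used_avx512)
        t->avx512_timestamp = now_ms;
}

/* Value a reader of arch_status would see at now_ms in this model. */
long model_elapsed_ms(const struct task_model *t, unsigned long now_ms)
{
    assert(t != NULL);
    if (t->avx512_timestamp == 0)
        return -1;
    return (long)(now_ms - t->avx512_timestamp);
}

/*
 * Simulate a task that wakes every period_ms, runs briefly, and is
 * scheduled out again without ever using AVX-512. Returns the value
 * observed when arch_status is read at query_ms.
 */
long simulate_periodic_task(unsigned long period_ms,
                            unsigned long query_ms, int unconditional)
{
    struct task_model t = { 0 };
    unsigned long now;

    assert(period_ms > 0);
    for (now = period_ms; now <= query_ms; now += period_ms)
        model_sched_out(&t, now, 0, unconditional);

    return model_elapsed_ms(&t, query_ms);
}
```

In this model, simulate_periodic_task(100, 1044, 1) returns 44, while
the same call with unconditional set to 0 returns -1, which is what I
would expect for a task that never executes AVX-512 instructions.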

Could you please confirm if this analysis is correct and advise on the
appropriate next steps to resolve this discrepancy?

'periodic_wake.c':

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

// Define wakeup interval as 100 milliseconds
#define INTERVAL_MS 100

int main() {
    // Convert milliseconds to nanoseconds
    long interval_ns = (long)INTERVAL_MS * 1000000L;

    // timespec struct used for nanosleep
    struct timespec requested;
    struct timespec remaining;

    // Initialize the requested time structure
    requested.tv_sec = 0;
    requested.tv_nsec = interval_ns;

    printf("C Periodic Wakeup Program started (Interval: %dms, "
           "%.9ldns). Press Ctrl+C to stop.\n",
           INTERVAL_MS, interval_ns);

    long long counter = 0;

    while (1) {
        counter++;

        // Print current wakeup information
        printf("Wakeup #%lld: Continuing execution.\n", counter);

        // Use nanosleep for high-precision sleep.
        // If nanosleep is interrupted by a signal (e.g., Ctrl+C), it
        // returns -1 and stores the remaining time in 'remaining'.
        // To maintain accurate periodicity, we re-sleep for the
        // remaining time if an interruption occurs.

        remaining.tv_sec = requested.tv_sec;
        remaining.tv_nsec = requested.tv_nsec;

        int result;

        do {
            // Sleep
            result = nanosleep(&remaining, &remaining);

            // Check return value
            if (result == -1) {
                if (errno == EINTR) {
                    // Interrupted by a signal (e.g., debugger or
                    // Ctrl+C), continue sleeping for remaining time
                    printf("[Interrupted] nanosleep was interrupted by "
                           "a signal, sleeping for remaining %.3fms\n",
                           (double)remaining.tv_nsec / 1000000.0);
                    // Loop continues, using the remaining time stored
                    // in 'remaining'
                } else {
                    // Other error, print error and exit
                    perror("nanosleep error");
                    return 1;
                }
            }
        } while (result == -1 && errno == EINTR);

        // If nanosleep returns 0 successfully, continue to the next
        // loop iteration
    }

    return 0; // Theoretically unreachable
}


Thank you for your time and assistance.

Best regards,