Message-ID: <CACueBy7-1dMwPQ4mirrRjsOkKKyLchkBR+7qMVqxjo7Bbr1T=A@mail.gmail.com>
Date: Mon, 27 Oct 2025 15:50:22 +0800
From: chuang <nashuiliang@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, open list <linux-kernel@...r.kernel.org>
Subject: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status
Dear FPU/x86 Maintainers,
I am writing to report an issue concerning the accuracy of AVX-512
usage tracking, specifically when querying the information via
'/proc/<pid>/arch_status' on systems supporting the instruction set.
This report references the mechanism introduced by the following
patch: https://lore.kernel.org/all/20190117183822.31333-1-aubrey.li@intel.com/T/#u
I have validated the patch's effect in modern environments supporting
AVX-512 (e.g., Intel Xeon Gold, AMD Zen4) and found that the tracking
mechanism does not accurately reflect the actual AVX-512 instruction
usage by the process.
Test Environment:
- CPU: Intel Xeon Gold (AVX-512 supported)
- Test Program: periodic_wake.c (Verified via objdump to not contain
any AVX-512 instructions.)
- Test Goal: To compare AVX-512 execution status as reported by perf
PMU versus procfs arch_status.
perf PMU:
$ perf stat -e instructions,cycles,fp_arith_inst_retired.512b_packed_double,fp_arith_inst_retired.512b_packed_single,fp_arith_inst_retired.8_flops,fp_arith_inst_retired2.128bit_packed_bf16,fp_arith_inst_retired2.256bit_packed_bf16,fp_arith_inst_retired2.512bit_packed_bf16 \
    ./periodic_wake > /dev/null
^C./periodic_wake: Interrupt
 Performance counter stats for './periodic_wake':

         2,329,116      instructions                                 #    2.86  insn per cycle    (33.57%)
           814,040      cycles                                                                    (56.61%)
                 0      fp_arith_inst_retired.512b_packed_double                                  (9.82%)
     <not counted>      fp_arith_inst_retired.512b_packed_single                                  (0.00%)
     <not counted>      fp_arith_inst_retired.8_flops                                             (0.00%)
     <not counted>      fp_arith_inst_retired2.128bit_packed_bf16                                 (0.00%)
     <not counted>      fp_arith_inst_retired2.256bit_packed_bf16                                 (0.00%)
     <not counted>      fp_arith_inst_retired2.512bit_packed_bf16                                 (0.00%)

       1.366220977 seconds time elapsed

       0.000000000 seconds user
       0.002253000 seconds sys
procfs arch_status:
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 44
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 64
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 91
$ cat /proc/$(pgrep -f "^./periodic_wake")/arch_status
AVX512_elapsed_ms: 50
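For repeated sampling, the same value can be read programmatically instead of with cat. The following is a minimal sketch, not part of the original test: the helper names parse_avx512_elapsed_ms() and read_avx512_elapsed_ms() are hypothetical, and the only assumption about the file is that it contains the "AVX512_elapsed_ms:" key followed by a decimal number, as in the output above. The kernel's procfs documentation describes a value of -1 as "no AVX-512 usage recorded", so the parser accepts negative values.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the AVX512_elapsed_ms value from the text of
 * /proc/<pid>/arch_status. Returns 0 and stores the value in *out on
 * success, -1 if the field is missing or malformed.
 */
int parse_avx512_elapsed_ms(const char *text, long *out)
{
    static const char key[] = "AVX512_elapsed_ms:";
    const char *p = strstr(text, key);

    if (p == NULL)
        return -1;
    /* %ld skips any whitespace (space or tab) between the key and the number. */
    if (sscanf(p + sizeof(key) - 1, "%ld", out) != 1)
        return -1;
    return 0;
}

/* Hypothetical helper: read and parse arch_status for one PID. */
int read_avx512_elapsed_ms(int pid, long *out)
{
    char path[64];
    char buf[256];
    size_t n;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/arch_status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_avx512_elapsed_ms(buf, out);
}
```

Calling read_avx512_elapsed_ms() in a loop for the PID of periodic_wake would reproduce the sequence of samples shown above.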
Based on the observed behavior and a review of the referenced patch,
my hypothesis is:
On AVX-512 capable systems, the implementation appears to record the
current timestamp into 'task->thread.fpu.avx512_timestamp' upon any
task switch, irrespective of whether the task has actually executed an
AVX-512 instruction.
This continuous updating of the timestamp, even for non-AVX-512 tasks,
results in misleading non-zero values for AVX512_elapsed_ms, rendering
the mechanism ineffective for accurately determining if a task is
actively utilizing AVX-512.
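To make the hypothesis concrete, here is a small userspace model of the two behaviors. It is only a sketch under stated assumptions: struct model_fpu and the model_* functions are hypothetical names, not kernel code, and the model keeps time in milliseconds directly rather than converting from kernel ticks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task bookkeeping; not the kernel's struct. */
struct model_fpu {
    uint64_t avx512_timestamp_ms; /* 0 means "never recorded" */
};

/* Behavior hypothesized in this report: stamp on every task switch. */
void model_switch_unconditional(struct model_fpu *fpu, uint64_t now_ms)
{
    fpu->avx512_timestamp_ms = now_ms;
}

/* Expected behavior: stamp only when the task actually used AVX-512. */
void model_switch_conditional(struct model_fpu *fpu, uint64_t now_ms,
                              bool task_used_avx512)
{
    if (task_used_avx512)
        fpu->avx512_timestamp_ms = now_ms;
}

/* Value a reader would see: -1 if nothing was recorded, else elapsed ms. */
long model_elapsed_ms(const struct model_fpu *fpu, uint64_t now_ms)
{
    if (fpu->avx512_timestamp_ms == 0)
        return -1;
    return (long)(now_ms - fpu->avx512_timestamp_ms);
}
```

In this model, a task that never uses AVX-512 and is switched out every 100 ms would always report between 0 and 100 under the unconditional variant, which matches the range of samples observed above, and -1 under the conditional variant.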
Could you please confirm if this analysis is correct and advise on the
appropriate next steps to resolve this discrepancy?
'periodic_wake.c':
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

// Define wakeup interval as 100 milliseconds
#define INTERVAL_MS 100

int main() {
    // Convert milliseconds to nanoseconds
    long interval_ns = (long)INTERVAL_MS * 1000000L;

    // timespec struct used for nanosleep
    struct timespec requested;
    struct timespec remaining;

    // Initialize the requested time structure
    requested.tv_sec = 0;
    requested.tv_nsec = interval_ns;

    printf("C Periodic Wakeup Program started (Interval: %dms, %.9ldns). Press Ctrl+C to stop.\n",
           INTERVAL_MS, interval_ns);

    long long counter = 0;
    while (1) {
        counter++;
        // Print current wakeup information
        printf("Wakeup #%lld: Continuing execution.\n", counter);

        // Use nanosleep for high-precision sleep.
        // If nanosleep is interrupted by a signal (e.g., Ctrl+C), it
        // returns -1 and stores the remaining time in 'remaining'.
        // To maintain accurate periodicity, we re-sleep for the
        // remaining time if an interruption occurs.
        remaining.tv_sec = requested.tv_sec;
        remaining.tv_nsec = requested.tv_nsec;

        int result;
        do {
            // Sleep
            result = nanosleep(&remaining, &remaining);

            // Check return value
            if (result == -1) {
                if (errno == EINTR) {
                    // Interrupted by a signal (e.g., debugger or
                    // Ctrl+C), continue sleeping for remaining time
                    printf("[Interrupted] nanosleep was interrupted by a signal, sleeping for remaining %.3fms\n",
                           (double)remaining.tv_nsec / 1000000.0);
                    // Loop continues, using the remaining time stored
                    // in 'remaining'
                } else {
                    // Other error, print error and exit
                    perror("nanosleep error");
                    return 1;
                }
            }
        } while (result == -1 && errno == EINTR);
        // If nanosleep returns 0 successfully, continue to the next
        // loop iteration
    }
    return 0; // Theoretically unreachable
}
Thank you for your time and assistance.
Best regards,