Message-ID: <CACueBy64u_11XgYeQJCXD1cNhDfhyy6vPGRLmC4H72H3cdu1RQ@mail.gmail.com>
Date: Sun, 9 Nov 2025 11:19:00 +0800
From: chuang <nashuiliang@...il.com>
To: Dave Hansen <dave.hansen@...el.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, open list <linux-kernel@...r.kernel.org>
Subject: Re: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status

Thank you for the ongoing discussion.

On Thu, Oct 30, 2025 at 11:00 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 10/29/25 23:56, chuang wrote:
> ...
> > I traced the code path within fpu_clone(): In fpu_clone() ->
> > save_fpregs_to_fpstate(), since my current Intel CPU supports XSAVE,
> > the call to os_xsave() results in the XFEATURE_Hi16_ZMM bit being
> > set/enabled in xsave.header.xfeatures. This then causes
> > update_avx_timestamp() to update fpu->avx512_timestamp. The same flow
> > occurs in __switch_to() -> switch_fpu_prepare().
>
> So that points more in the direction of the AVX-512 not getting
> initialized. fpu_flush_thread() either isn't getting called or isn't
> doing its job at execve(). *Or*, there's something subtle in your test
> case that's causing AVX-512 to get tracked as non-init after execve().

My analysis suggests the behavior depends on the glibc
implementation. Starting with glibc 2.24[1], AVX-512 instructions are
used to optimize memory functions such as memcpy/memmove[2] and
memset[3]. With earlier glibc versions, the /proc/<pid>/arch_status
mechanism therefore reflected genuine application-level AVX-512 usage
more accurately.
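
As an illustration of the mechanism (a minimal sketch with a
hypothetical file name, not my real workload): a program that
contains no AVX-512 code of its own can still end up in
__memcpy_avx512_no_vzeroupper simply by calling memcpy() through such
a glibc, which dirties ZMM state and is then reported via
/proc/<pid>/arch_status.

/*
 * Illustrative sketch only. This file has no AVX-512 intrinsics or
 * assembly, yet on an AVX-512 machine with a glibc that carries the
 * optimizations above, the memcpy() below may be dispatched to an
 * AVX-512 implementation such as __memcpy_avx512_no_vzeroupper.
 *
 *   gcc -O3 -static avx512_via_memcpy.c -o avx512_via_memcpy
 */
#include <string.h>
#include <unistd.h>

char src[4096], dst[4096];

int main(void)
{
    for (;;) {
        memcpy(dst, src, sizeof(dst)); /* large copy -> vector path */
        sleep(1);
    }
}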

The disassembly of my test binary (while_sleep_static[4]) confirms
that these glibc optimizations are linked in:

00000000004109d0 <__memcpy_avx512_no_vzeroupper>:
  4109d0:       f3 0f 1e fa             endbr64
  4109d4:       48 89 f8                mov    %rdi,%rax
  4109d7:       48 8d 0c 16             lea    (%rsi,%rdx,1),%rcx
  4109db:       4c 8d 0c 17             lea    (%rdi,%rdx,1),%r9
  4109df:       48 81 fa 00 02 00 00    cmp    $0x200,%rdx
  4109e6:       0f 87 5d 01 00 00       ja     410b49 <__memcpy_avx512_no_vzeroupper+0x179>
  4109ec:       48 83 fa 10             cmp    $0x10,%rdx
  4109f0:       0f 86 0f 01 00 00       jbe    410b05 <__memcpy_avx512_no_vzeroupper+0x135>
  4109f6:       48 81 fa 00 01 00 00    cmp    $0x100,%rdx
  4109fd:       72 6f                   jb     410a6e <__memcpy_avx512_no_vzeroupper+0x9e>
  4109ff:       62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
  410a05:       62 f1 7c 48 10 4e 01    vmovups 0x40(%rsi),%zmm1
  410a0c:       62 f1 7c 48 10 56 02    vmovups 0x80(%rsi),%zmm2
  410a13:       62 f1 7c 48 10 5e 03    vmovups 0xc0(%rsi),%zmm3
  410a1a:       62 f1 7c 48 10 61 fc    vmovups -0x100(%rcx),%zmm4
  410a21:       62 f1 7c 48 10 69 fd    vmovups -0xc0(%rcx),%zmm5
  410a28:       62 f1 7c 48 10 71 fe    vmovups -0x80(%rcx),%zmm6
  410a2f:       62 f1 7c 48 10 79 ff    vmovups -0x40(%rcx),%zmm7
  410a36:       62 f1 7c 48 11 07       vmovups %zmm0,(%rdi)
  410a3c:       62 f1 7c 48 11 4f 01    vmovups %zmm1,0x40(%rdi)
  410a43:       62 f1 7c 48 11 57 02    vmovups %zmm2,0x80(%rdi)
  410a4a:       62 f1 7c 48 11 5f 03    vmovups %zmm3,0xc0(%rdi)
  410a51:       62 d1 7c 48 11 61 fc    vmovups %zmm4,-0x100(%r9)

Specifically, I tested on an Intel(R) Xeon(R) Gold 6271C and an AMD
EPYC 9W24 96-Core processor. The Intel PMU failed to accurately
capture the AVX-512 usage, while the AMD CPU showed partial usage in
some scenarios. This likely relates to the PMU event definitions,
which appear not to cover data-movement instructions such as vmovups.
The descriptions of the relevant PMU events (using the Intel CPU as
an example) are as follows:

  fp_arith_inst_retired.512b_packed_double
       [Number of SSE/AVX computational 512-bit packed double precision
        floating-point instructions retired; some instructions will count
        twice as noted below. Each count represents 8 computation operations,
        one for each element. Applies to SSE* and AVX* packed double precision
        floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14
        SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice
        as they perform 2 calculations per element]
  fp_arith_inst_retired.512b_packed_single
       [Number of SSE/AVX computational 512-bit packed single precision
        floating-point instructions retired; some instructions will count
        twice as noted below. Each count represents 16 computation operations,
        one for each element. Applies to SSE* and AVX* packed single precision
        floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14
        SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice
        as they perform 2 calculations per element]
  fp_arith_inst_retired.8_flops
       [Number of SSE/AVX computational 256-bit packed single precision and
        512-bit packed double precision FP instructions retired; some
        instructions will count twice as noted below. Each count represents 8
        computation operations, 1 for each element. Applies to SSE* and AVX*
        packed single precision and double precision FP instructions: ADD SUB
        HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP
        FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2
        calculations per element]

Overall, the /proc/<pid>/arch_status mechanism reliably reflects
AVX-512 register usage.
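
For reference, this is roughly how we sample a single task today (a
minimal sketch, error handling trimmed; it relies on the
AVX512_elapsed_ms field documented in
Documentation/filesystems/proc.rst):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Returns the time in ms since AVX-512 use was last recorded for
 * <pid>, or -1 if it was never recorded (or the file is unreadable).
 */
long avx512_elapsed_ms(pid_t pid)
{
    char path[64], line[128];
    long ms = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/arch_status", (int)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "AVX512_elapsed_ms: %ld", &ms) == 1)
            break;
    }
    fclose(f);
    return ms;
}

int main(int argc, char **argv)
{
    pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();

    printf("pid %d AVX512_elapsed_ms: %ld\n", (int)pid,
           avx512_elapsed_ms(pid));
    return 0;
}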

We run AVX-512-capable infrastructure in a Kubernetes (k8s)
environment and need a way to monitor how heavily this instruction
set is used.
The current /proc/<pid>/arch_status file reliably indicates AVX-512
usage for a single process. In containerized environments (such as
Kubernetes Pods), however, this forces us to poll every single
process within the cgroup, which scales poorly and adds significant
monitoring overhead.
To solve this scaling issue, we are exploring ways to aggregate the
usage data. Would it be feasible to extend the AVX-512 activation
status tracking to the cgroup level?
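
To make the overhead concrete, the per-Pod scan effectively amounts
to something like the sketch below (cgroup v2 assumed; it reuses the
avx512_elapsed_ms() helper from the previous sketch). A cgroup-level
aggregate would replace this whole loop with a single read:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

long avx512_elapsed_ms(pid_t pid); /* from the sketch above */

/*
 * Returns 1 if any task in the cgroup has recorded AVX-512 use,
 * 0 if none has, -1 if cgroup.procs cannot be read.
 */
int cgroup_uses_avx512(const char *cgroup_dir)
{
    char path[512], line[64];
    int used = 0;
    FILE *f;

    snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        pid_t pid = (pid_t)strtol(line, NULL, 10);

        if (avx512_elapsed_ms(pid) >= 0)
            used = 1;
    }
    fclose(f);
    return used;
}

This would be called with a Pod's cgroup directory (for example
somewhere under /sys/fs/cgroup/kubepods.slice/ with the systemd
cgroup driver; the exact layout depends on the node configuration).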

[1]: https://sourceware.org/git/?p=glibc.git;a=log;h=refs/tags/glibc-2.24;pg=1
[2]: https://sourceware.org/git/?p=glibc.git;a=commit;h=c867597bff2562180a18da4b8dba89d24e8b65c4
[3]: https://sourceware.org/git/?p=glibc.git;a=commit;h=5e8c5bb1ac83aa2577d64d82467a653fa413f7ce
[4]: while_sleep_static
// gcc -O3 -static while_sleep.c -o while_sleep_static
// glibc > 2.24
#include <unistd.h>

int main(void)
{
    while (1) {
        sleep(1);
    }
}


>
> > Given this, is the issue related to my specific Intel Xeon Gold? Is
> > the CPU continuously indicating that the AVX-512 state is in use?
> As much as I love to blame the hardware, I don't think we're quite there
> yet. We've literally had software bugs in the past that had this exact
> same behavior: AVX-512 state was tracked as non-init when it was never used.
>
> Any chance you could figure out where you first see XFEATURE_Hi16_ZMM in
> xfeatures? The tracepoints in here might help:
>
>         /sys/kernel/debug/tracing/events/x86_fpu
>
> Is there any rhyme or reason for which tasks see avx512_timestamp
> getting set? Is it just your test program? Or other random tasks on the
> system?
