Message-ID: <CACueBy64u_11XgYeQJCXD1cNhDfhyy6vPGRLmC4H72H3cdu1RQ@mail.gmail.com>
Date: Sun, 9 Nov 2025 11:19:00 +0800
From: chuang <nashuiliang@...il.com>
To: Dave Hansen <dave.hansen@...el.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, open list <linux-kernel@...r.kernel.org>
Subject: Re: x86/fpu: Inaccurate AVX-512 Usage Tracking via arch_status

Thank you for the ongoing discussion.

On Thu, Oct 30, 2025 at 11:00 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 10/29/25 23:56, chuang wrote:
> ...
> > I traced the code path within fpu_clone(): In fpu_clone() ->
> > save_fpregs_to_fpstate(), since my current Intel CPU supports XSAVE,
> > the call to os_xsave() results in the XFEATURE_Hi16_ZMM bit being
> > set/enabled in xsave.header.xfeatures. This then causes
> > update_avx_timestamp() to update fpu->avx512_timestamp. The same flow
> > occurs in __switch_to() -> switch_fpu_prepare().
>
> So that points more in the direction of the AVX-512 not getting
> initialized. fpu_flush_thread() either isn't getting called or isn't
> doing its job at execve(). *Or*, there's something subtle in your test
> case that's causing AVX-512 to get tracked as non-init after execve().

My analysis suggests that the issue depends on the glibc
implementation. Since glibc 2.24[1], AVX-512 instructions have been
used to optimize memory functions such as memcpy/memmove[2] and
memset[3]. With earlier glibc versions, the /proc/<pid>/arch_status
mechanism reflected genuine application-level AVX-512 usage more
accurately.

The disassembly of my test binary (while_sleep_static[4]) confirms
that these glibc optimizations are present:
00000000004109d0 <__memcpy_avx512_no_vzeroupper>:
4109d0: f3 0f 1e fa endbr64
4109d4: 48 89 f8 mov %rdi,%rax
4109d7: 48 8d 0c 16 lea (%rsi,%rdx,1),%rcx
4109db: 4c 8d 0c 17 lea (%rdi,%rdx,1),%r9
4109df: 48 81 fa 00 02 00 00 cmp $0x200,%rdx
4109e6: 0f 87 5d 01 00 00 ja 410b49
<__memcpy_avx512_no_vzeroupper+0x179>
4109ec: 48 83 fa 10 cmp $0x10,%rdx
4109f0: 0f 86 0f 01 00 00 jbe 410b05
<__memcpy_avx512_no_vzeroupper+0x135>
4109f6: 48 81 fa 00 01 00 00 cmp $0x100,%rdx
4109fd: 72 6f jb 410a6e
<__memcpy_avx512_no_vzeroupper+0x9e>
4109ff: 62 f1 7c 48 10 06 vmovups (%rsi),%zmm0
410a05: 62 f1 7c 48 10 4e 01 vmovups 0x40(%rsi),%zmm1
410a0c: 62 f1 7c 48 10 56 02 vmovups 0x80(%rsi),%zmm2
410a13: 62 f1 7c 48 10 5e 03 vmovups 0xc0(%rsi),%zmm3
410a1a: 62 f1 7c 48 10 61 fc vmovups -0x100(%rcx),%zmm4
410a21: 62 f1 7c 48 10 69 fd vmovups -0xc0(%rcx),%zmm5
410a28: 62 f1 7c 48 10 71 fe vmovups -0x80(%rcx),%zmm6
410a2f: 62 f1 7c 48 10 79 ff vmovups -0x40(%rcx),%zmm7
410a36: 62 f1 7c 48 11 07 vmovups %zmm0,(%rdi)
410a3c: 62 f1 7c 48 11 4f 01 vmovups %zmm1,0x40(%rdi)
410a43: 62 f1 7c 48 11 57 02 vmovups %zmm2,0x80(%rdi)
410a4a: 62 f1 7c 48 11 5f 03 vmovups %zmm3,0xc0(%rdi)
410a51: 62 d1 7c 48 11 61 fc vmovups %zmm4,-0x100(%r9)
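
(For illustration only; the program below is not part of my original
test case. Based on the size checks visible in the disassembly above
(cmp $0x100 / cmp $0x200), a memcpy() of 256 bytes should be served by
the vmovups %zmm0..%zmm7 path on a CPU where glibc selects the AVX-512
variant, so even a program with no explicit AVX-512 code ends up with
XFEATURE_Hi16_ZMM tracked as non-init.)

// Hypothetical trigger, not the original while_sleep test.
// gcc -O3 -static memcpy_256.c -o memcpy_256_static
#include <string.h>
#include <unistd.h>

int main()
{
	/* volatile length keeps the compiler from expanding memcpy()
	 * inline, so the glibc __memcpy_avx512_no_vzeroupper variant
	 * shown above is actually called */
	volatile size_t n = 256;
	static char src[256], dst[256];

	memcpy(dst, src, n);
	while (1) {
		sleep(1);	/* keep the task alive for inspection */
	}
}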

Specifically, I tested on an Intel(R) Xeon(R) Gold 6271C and an AMD
EPYC 9W24 96-Core processor. The Intel PMU failed to capture the
AVX-512 usage, while the AMD CPU showed partial usage in some
scenarios. This appears to be because the PMU event definitions only
cover computational floating-point instructions and do not include
data-movement instructions such as vmovups.

The descriptions of the relevant PMU events (on the Intel CPU) are as
follows:
fp_arith_inst_retired.512b_packed_double
[Number of SSE/AVX computational 512-bit packed double precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 8 computation operations,
one for each element. Applies to SSE* and AVX* packed double precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14
SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice
as they perform 2 calculations per element]
fp_arith_inst_retired.512b_packed_single
[Number of SSE/AVX computational 512-bit packed single precision
floating-point instructions retired; some instructions will count
twice as noted below. Each count represents 16 computation operations,
one for each element. Applies to SSE* and AVX* packed single precision
floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14
SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice
as they perform 2 calculations per element]
fp_arith_inst_retired.8_flops
[Number of SSE/AVX computational 256-bit packed single precision and
512-bit packed double precision FP instructions retired; some
instructions will count twice as noted below. Each count represents 8
computation operations, 1 for each element. Applies to SSE* and AVX*
packed single precision and double precision FP instructions: ADD SUB
HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP
FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2
calculations per element]

Overall, unlike the PMU counters, the /proc/<pid>/arch_status
mechanism reliably reflects whether the AVX-512 registers have been
used.

We run AVX-512-enabled infrastructure in a Kubernetes (k8s)
environment and need a mechanism for monitoring the utilization of
this instruction set.

The current /proc/<pid>/arch_status file reliably indicates AVX-512
usage for a single process. In containerized environments (such as
Kubernetes Pods), however, this forces us to poll every single process
within the cgroup, which is inefficient and adds significant
monitoring overhead.
To solve this scaling issue, we are exploring the possibility of
aggregating this usage data. Would it be feasible to extend the
AVX-512 activation status tracking to the cgroup level?
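
For reference, our current per-process polling looks roughly like the
sketch below (simplified, error handling and the real pod cgroup paths
omitted): for every PID in a pod's cgroup.procs we open
/proc/<pid>/arch_status and pick out the AVX512_elapsed_ms field. With
many tasks per pod this has to be repeated constantly and does not
scale.

// Rough sketch of the current per-process polling; the cgroup path is
// only an example.
#include <stdio.h>
#include <string.h>

static void scan_cgroup(const char *cgroup_path)
{
	char path[256], line[256];
	FILE *procs, *st;
	int pid;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_path);
	procs = fopen(path, "r");
	if (!procs)
		return;

	while (fscanf(procs, "%d", &pid) == 1) {
		snprintf(path, sizeof(path), "/proc/%d/arch_status", pid);
		st = fopen(path, "r");
		if (!st)
			continue;	/* task may have exited */
		while (fgets(line, sizeof(line), st)) {
			if (!strncmp(line, "AVX512_elapsed_ms:", 18))
				printf("pid %d: %s", pid, line);
		}
		fclose(st);
	}
	fclose(procs);
}

int main()
{
	/* example cgroup v2 path; real pods sit deeper in the hierarchy */
	scan_cgroup("/sys/fs/cgroup/kubepods.slice");
	return 0;
}
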
[1]: https://sourceware.org/git/?p=glibc.git;a=log;h=refs/tags/glibc-2.24;pg=1
[2]: https://sourceware.org/git/?p=glibc.git;a=commit;h=c867597bff2562180a18da4b8dba89d24e8b65c4
[3]: https://sourceware.org/git/?p=glibc.git;a=commit;h=5e8c5bb1ac83aa2577d64d82467a653fa413f7ce
[4]: while_sleep_static
// gcc -O3 -static while_sleep.c -o while_sleep_static
// glibc > 2.24
#include <unistd.h>

int main()
{
	while (1) {
		sleep(1);
	}
}
>
> > Given this, is the issue related to my specific Intel Xeon Gold? Is
> > the CPU continuously indicating that the AVX-512 state is in use?
> As much as I love to blame the hardware, I don't think we're quite there
> yet. We've literally had software bugs in the past that had this exact
> same behavior: AVX-512 state was tracked as non-init when it was never used.
>
> Any chance you could figure out where you first see XFEATURE_Hi16_ZMM in
> xfeatures? The tracepoints in here might help:
>
> /sys/kernel/debug/tracing/events/x86_fpu
>
> Is there any rhyme or reason for which tasks see avx512_timestamp
> getting set? Is it just your test program? Or other random tasks on the
> system?