Message-ID: <368e7626-c9bd-47be-bb42-f542dc3d67b7@intel.com>
Date: Fri, 13 Jun 2025 08:15:02 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: kan.liang@...ux.intel.com, peterz@...radead.org, mingo@...hat.com,
 acme@...nel.org, namhyung@...nel.org, tglx@...utronix.de,
 dave.hansen@...ux.intel.com, irogers@...gle.com, adrian.hunter@...el.com,
 jolsa@...nel.org, alexander.shishkin@...ux.intel.com,
 linux-kernel@...r.kernel.org
Cc: dapeng1.mi@...ux.intel.com, ak@...ux.intel.com, zide.chen@...el.com
Subject: Re: [RFC PATCH 05/12] perf/x86: Support XMM register for non-PEBS and
 REGS_USER

> +static DEFINE_PER_CPU(void *, ext_regs_buf);

This should probably use one of the types in asm/fpu/types.h, not void*.
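
For instance, as a sketch (assuming struct xregs_state is the type the
buffer really holds):

	static DEFINE_PER_CPU(struct xregs_state *, ext_regs_buf);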

> +static void x86_pmu_get_ext_regs(struct x86_perf_regs *perf_regs, u64 mask)
> +{
> +	void *xsave = (void *)ALIGN((unsigned long)per_cpu(ext_regs_buf, smp_processor_id()), 64);

I'd just align the allocation to avoid having to align it at runtime
like this.
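
With an aligned allocation, the lookup would collapse to something
like (sketch):

	/* no ALIGN() needed if the buffer itself is 64-byte aligned: */
	struct xregs_state *xsave = per_cpu(ext_regs_buf, smp_processor_id());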

> +	struct xregs_state *xregs_xsave = xsave;
> +	u64 xcomp_bv;
> +
> +	if (WARN_ON_ONCE(!xsave))
> +		return;
> +
> +	xsaves_nmi(xsave, mask);
> +
> +	xcomp_bv = xregs_xsave->header.xcomp_bv;
> +	if (mask & XFEATURE_MASK_SSE && xcomp_bv & XFEATURE_SSE)
> +		perf_regs->xmm_regs = (u64 *)xregs_xsave->i387.xmm_space;
> +}

Could we please align the types on:

	perf_regs->xmm_regs
and
	xregs_xsave->i387.xmm_space

so that no casting is required?
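
For instance (a sketch only -- i387.xmm_space is a u32 array, so this
assumes the existing consumers of ->xmm_regs can be adjusted to a u32
view of the registers):

	/* with ->xmm_regs declared as u32 *, matching i387.xmm_space: */
	perf_regs->xmm_regs = xregs_xsave->i387.xmm_space;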

> +static void reserve_ext_regs_buffers(void)
> +{
> +	size_t size;
> +	int cpu;
> +
> +	if (!x86_pmu.ext_regs_mask)
> +		return;
> +
> +	size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
> +
> +	/* XSAVE feature requires 64-byte alignment. */
> +	size += 64;

Does this actually work? ;)

Take a look at your system when it boots. You should see some helpful
pr_info()'s:

> [    0.137276] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [    0.138799] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [    0.139681] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [    0.140576] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> [    0.141569] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> [    0.142804] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> [    0.143665] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
> [    0.144436] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> [    0.145290] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
> [    0.146238] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
> [    0.146803] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
> [    0.147397] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
> [    0.147986] x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.

Notice that we're talking about a buffer which is ~2.4k in size when
AVX-512 is in play. Is 'size' above anywhere near that big?
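
To spell out the arithmetic: FXSAVE_SIZE is 512 and XSAVE_HDR_SIZE is
64, so 'size' here works out to 512 + 64 + 64 = 640 bytes, while the
boot log above reports a 2440-byte context once AVX-512 state is
included.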

> +	for_each_possible_cpu(cpu) {
> +		per_cpu(ext_regs_buf, cpu) = kzalloc_node(size, GFP_KERNEL,
> +							  cpu_to_node(cpu));
> +		if (!per_cpu(ext_regs_buf, cpu))
> +			goto err;
> +	}

Right now, any kmalloc() of 256b or more gets rounded up to a
power-of-2 size and aligned to that size, which makes it 64b-aligned
as a side effect. But that rounding is just an implementation detail
today. What _is_ guaranteed is that power-of-2-sized kmalloc()s are
naturally aligned, and thus also 64b-aligned.

In other words, in practice, these kzalloc_node() calls already return
buffers that are 64b-aligned and rounded up to a power-of-2 size.

You can *guarantee* they'll be 64b aligned by just rounding size up to
the next power of 2. This won't increase the size because they're
already being rounded up internally.
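
That is, something along these lines (sketch; roundup_pow_of_two() is
in <linux/log2.h>):

	/* power-of-2 kmalloc()s are guaranteed naturally aligned: */
	size = roundup_pow_of_two(size);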

I can also grumble a little bit because this reinvents the wheel, and I
suspect it'll continue reinventing the wheel when it actually sizes the
buffer correctly.

We already have code in the kernel to dynamically allocate an fpstate:
fpstate_realloc(). It uses vmalloc() which wouldn't be my first choice
for this, but I also don't think it will hurt much. Looking at it, I'm
not sure how much of it you want to refactor and reuse, but you should
at least take a look.

There's also xstate_calculate_size(). That, you _definitely_ want to use
if you end up doing your own allocations.
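
Roughly like this, as a sketch (xstate_calculate_size() lives in the
FPU core's private xstate.h today, so it would need to be exposed;
'true' asks for the compacted format that XSAVES writes):

	size = xstate_calculate_size(x86_pmu.ext_regs_mask, true);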
