lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFULd4a2eOhZ4TPQRsLtN1yPYSfgqsXR_yOG+z3PedE-ZCMynw@mail.gmail.com>
Date: Sun, 30 Mar 2025 18:07:30 +0200
From: Uros Bizjak <ubizjak@...il.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, 
	Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with
 X86_NATIVE_CPU option

On Sun, Mar 30, 2025 at 11:56 AM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@...il.com> wrote:
>
> > > So a better optimization I think would be to declare and implement
> > > __sw_hweight32 with a different, less intrusive function call ABI
> > > that
> >
> > With an external function, the ABI specifies the location of input
> > argument and function result.
>
> This is all within the kernel, and __sw_hweight32() is implemented in
> the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros
> are fully under our control as well - so we have full control over the
> calling convention.

There is a minor issue with a generic prototype in <linux/bitops.h>,
where we declare:

extern unsigned int __sw_hweight32(unsigned int w);
extern unsigned long __sw_hweight64(__u64 w);

This creates a bit of mixup, so perhaps it is better to define and use
an x86 specific function name. With this issue out of the way, we can
use %eax/%rax as inout register for x86_32 and x86_64 targets, and
remove:

#ifdef CONFIG_64BIT
#define REG_IN "D"
#define REG_OUT "a"
#else
#define REG_IN "a"
#define REG_OUT "a"
#endif

from <arch/x86/include/asm/arch_hweight.h>

In parallel, we should remove x86_64 conditional compilation from
arch/x86/lib/hweight.S:

SYM_FUNC_START(__sw_hweight32)

#ifdef CONFIG_X86_64
movl %edi, %eax # w
#endif
__ASM_SIZE(push,) %__ASM_REG(dx)
movl %eax, %edx # w -> t
...

> Ie. in principle there's no need for the __sw_hweight32 function
> utilized by ALTERNATIVE() to be a C-call-ABI external function with all
> its call-clobbering constraints that disturbs register state affected
> by the C-call-ABI. (RSI RSI RDX RCX R8 R9)
>
> The calling convention used is the kernel's choice, which we can
> re-evaluate.
>
> For example, we could make a version of __sw_hweight32 that is a
> largely no-clobber function that only touches a single register, which
> receives its input in RAX and returns the result to RAX (as usual), and
> saves/restores everything else. This pushes overhead into the uncommon
> case (__sw_hweight32 users) and reduces register pressure on the
> calling site.
>
> I'm not saying it's *worth* it for POPCNTL emulation alone:
>
>  - The code generation benefits might or might not be there. Needs to
>    be examined.

Matching inputs with output will actually make the instruction
"destructive", so the compiler will have to copy the input argument
when it won't die in the instruction. This is not desirable. The code
generation benefits would come from relaxing input constraint to "rm",
which is not possible with a fallback function.

>
>  - There may be some trouble with on-stack red zones used by the
>    compiler, if the compiler doesn't know that a call was done.

The kernel is currently compiled with -mno-red-zone, gcc-15 introduces
special "redzone" clobber to disable red-zone in the function that
includes asm() when/if the kernel starts using redzone.

>  - Plus rolling a different calling convention down the alternatives
>    patching macros will have some maintenance overhead side effects.
>    Possibly other usecases need to be found as well for this to be
>    worth it.

I think that adding a __POPCNT__ version (similar to my original
patch) would bring the most benefit, because we could use "rm" input
and "=r" output registers, without any constraints, enforced by
fallback function call. This is only possible with a new -march=native
functionality.

> But I wanted to bust the false assumption you seem to be making about
> C-call-ABI constraints.

Thanks,
Uros.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ