linux-kernel - Re: [PATCH] x86: Optimize __get_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20251128103738.6e72c62c@pumpkin>
Date: Fri, 28 Nov 2025 10:37:38 +0000
From: david laight <david.laight@...box.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, Thomas Gleixner
 <tglx@...utronix.de>, Ingo Molnar <mingo@...nel.org>, Borislav Petkov
 <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin"
 <hpa@...or.com>
Subject: Re: [PATCH] x86: Optimize __get_user_asm() assembly

On Wed, 26 Nov 2025 21:29:22 +0100
Uros Bizjak <ubizjak@...il.com> wrote:

> On Wed, Nov 26, 2025 at 8:23 PM david laight <david.laight@...box.com> wrote:
> >
> > On Wed, 26 Nov 2025 17:22:36 +0100
> > Uros Bizjak <ubizjak@...il.com> wrote:
> >  
> > > On Wed, Nov 26, 2025 at 4:18 PM david laight <david.laight@...box.com> wrote:  
> > > >
> > > > On Wed, 26 Nov 2025 14:23:32 +0100
> > > > Uros Bizjak <ubizjak@...il.com> wrote:
> > > >  
> > > > > On x86, byte and word moves (MOVB, MOVW) write only to the low
> > > > > portion of a 32-bit register, leaving the upper bits unchanged.
> > > > > Modern compilers therefore prefer using MOVZBL and MOVZWL to load
> > > > > 8-bit and 16-bit values with zero-extension to the full register
> > > > > width.
> > > > >
> > > > > Update __get_user_asm() to follow this convention by explicitly
> > > > > zero-extending 8-bit and 16-bit loads from memory.
> > > > >
> > > > > An additional benefit of this change is that it enables the full
> > > > > integer register set to be used for 8-bit loads. Also, it
> > > > > eliminates the need for manual zero-extension of 8-bit values.
> > > > >
> > > > > There is only a minimal increase in code size:  
> > > >
> > > > Interesting, where does that come from.
> > > > I'd have thought it should shrink it.
> > > > The mov[sz]x is one byte longer.
> > > > Perhaps a lot of 8-bit values are never zero-extended?  
> > >
> > > bloat-o-meter says:
> > >
> > > add/remove: 0/0 grow/shrink: 4/0 up/down: 16/0 (16)
> > > Function                                     old     new   delta
> > > strncpy_from_kernel_nofault                  151     158      +7
> > > copy_from_kernel_nofault                     207     214      +7
> > > strncpy_from_user                            218     219      +1
> > > fault_in_readable                            150     151      +1
> > > Total: Before=23919018, After=23919034, chg +0.00%
> > >
> > > where the difference in copy_from_kernel_nofault() is expected:
> > >
> > > -       66 8b 45 00             mov    0x0(%rbp),%ax
> > > +       0f b7 45 00             movzwl 0x0(%rbp),%eax
> > > ...
> > > -       8a 45 00                mov    0x0(%rbp),%al
> > > +       0f b6 45 00             movzbl 0x0(%rbp),%eax
> > >
> > > and then some NOPs at the end due to function entry alignment.
> > >  
> > > > I'm trying to remember what the magic 'k' is for.
> > > > Does gas treat %krxx as %dxx ?
> > > > Which would matter if 'x' is 64bit when using "=r" for 32bit reads.
> > > > But, in that case, I think you need it on the 'movl' as well.  
> > >
> > > %k is gcc operand modifier:
> > >
> > > ‘k’        Print the SImode name of the register.
> > >
> > > that overrides operand width and prints a 32-bit register name.  
...
> > So it does just change %rxx to %exx.  
> 
> Actually it changes e.g ax to eax, bl to ebx (and rcx to ecx, but we
> don't usethis)
> 
> > But the u32 code is much the same - so I doubt you need it at all.  
> 
> Yes, the patch specifically changes 8-bit and 16-bit loads.
> 
> > Looks at the full file....
> > The 'x' parameter is __gu_val from __inttype(*(ptr)) __gu_val;
> > __inttype() is a 'horrid mess' that generates char/short/long/long long.
> >
> > (So the "k" are actually changing the 16bit %bx to the 32bit %ebx (etc).)  
> 
> Yes.
> 
> > But that isn't what you need for the 'movzbl' (etc).
> > At most you need to pick between 'long' and 'long long'.
> > Put the "k" in and you can use 'long long' all the time.  
> 
> No, putting "k" for the output register of long long move would
> *narrow* the read to 32 bits.

I wasn't suggesting that.
I was suggesting that 'x' could always be u64 provided you put a "k"
in the 32bit load.

> > But you lose at the final step:
> >         (x) = (__force __typeof__(*(ptr))_gu_val;
> > all your hard work goes away.  
> 
> Not really. By using MOVZBL *instead of* MOVB instruction, we can use
> all integer registers for its output.

That is only actually relevant in 32bit mode.
64bit mode can access the low 8 bits of 'si' quite happily.

It also makes no difference in 32bit mode.
The register selected depends on the "=r" and the type of the variable
(unsigned char) - so can only be one of %al, %bl, %cl or %dl.
The 'k' modifier makes the asm output contain %eax, %ebx, %ecx or %edx.

> Also, we don't need to
> zero-extend through _x_u8__ temporary, because we *know* that 32-bit
> (and 64-bit on x86_64) aliases of the original 8-bit register now hold
> zero-extended 8-bit value. (Side note: MOVW also performs the insert
> and zero-extend trick is currently missing for this case).

There is no zero-extension going on inside this code.
(There might be one that extends the result in the calling code.)
When the 'case 1' code is executed 'x' is unsigned char, the _x_u8__
probably exists to make that the code valid C when the size isn't 1
- the code is optimised away, but has to be valid enough to compile.

If there was a sign extension (and it went away) it would have been in the
bloat-o-meter output, but that just showed the one extra byte in the
'size 1' branch caused by the longer instruction.

Additionally although you've only told gcc to generate asm that uses %eax.
The compiler doesn't know what you've done with that register.
All it can look at is the C var - so if the value did need promoting
it would still need to sign/zero extend it.

This is why the only change you see is the extra byte added because
'movxbl' is a longer instruction.

I think you might be able to get the effect you want.
1) In unsafe_get_user() make __gu_val 'unsigned long' except for
   64bit read on 32bit (when it needs to be unsigned long long).
2) In __get_user_size() use movzbl and movzwl for the 8/16 bit reads
   and 'k' for all except the 64bit (on 64bit).
3) Back in unsafe_get_user() only do the cast for signed types.
   (They won't happen often enough to use movsx.)
4) Fixup get_kernel_nofault() so it still works.

In the 'not asm goto' variants, if you initialise the 'err' variable
in the caller, then it doesn't need to be initialised in get_user_size()
and then you only need one copy of get_user_size().
In fact, you can have a single copy of unsafe_get_user() if you pass
both 'err_label' and '__gu_err' though to the asm code.
If _gu_err is never referenced the final conditional test will be
optimised away.
The '.i' file will increase marginally, but the source will be better.

	David