Message-ID: <20250317223243.460e3bff@pumpkin>
Date: Mon, 17 Mar 2025 22:32:43 +0000
From: David Laight <david.laight.linux@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "Herton R. Krzesinski" <herton@...hat.com>, x86@...nel.org,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
 olichtne@...hat.com, atomasov@...hat.com, aokuliar@...hat.com
Subject: Re: [PATCH] x86: add back the alignment of the destination to 8
 bytes in copy_user_generic()

On Mon, 17 Mar 2025 14:29:05 -0700
Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Mon, 17 Mar 2025 at 06:16, David Laight <david.laight.linux@...il.com> wrote:
> >
> > You can also do something similar for any trailing bytes.
> > If you are feeling 'brave' copy the last 8 bytes first.  
> 
> I think that would be a mistake.
> 
> Not only does it cause bad patterns on page faults - we should recover
> ok from it (the exception will go back and do the copy in the right
> order one byte at a time in the "copy_user_tail" code) - but even in
> the absence of page faults it quite possibly messes with CPU
> prefetching and write buffer coalescing etc if you hop around like
> that.

I thought you might say that :-)

> It *might* be worth trying to do the last unaligned part the same way my
> patch does the first one - by just doing a full-word write at the end,
> offset backwards. That avoids the byte-at-a-time tail case.
> 
> I'm not convinced it's worth it, but if somebody spends the effort on
> a patch and on benchmarking...

After a 'rep movsq' you'll have %rsi and %rdi pointing to the first byte
left to copy and the remaining byte count in (say) %rax.
So something like:
	mov	-8(%rsi, %rax), %rsi	# load the last 8 source bytes
	mov	%rsi, -8(%rdi, %rax)	# store them over the last 8 destination bytes
will copy the last 8 bytes.
Ought to be faster than anything with a branch in it.

Whether it is worth leaving yourself with [1..8] bytes to copy
rather than [0..7] (and copying the last 8 bytes twice) might
be debatable.
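
In C the same overlapping-tail trick looks something like this (just a
user-space sketch with made-up names; the fixed-size memcpy()s are the
portable way to spell a single unaligned 8-byte load and store):

	#include <stdint.h>
	#include <string.h>

	/* Copy the last 8 bytes of an n-byte copy (n >= 8) with one
	 * unaligned load/store pair that overlaps whatever the bulk
	 * copy already wrote, instead of a byte-at-a-time tail loop.
	 */
	static void copy_tail8(void *dst, const void *src, size_t n)
	{
		uint64_t last;

		memcpy(&last, (const char *)src + n - 8, 8);
		memcpy((char *)dst + n - 8, &last, 8);
	}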

For Intel FSRM (fast short 'rep movsb') copying 32 bytes and then aligning
the destination will be worthwhile for longer copies.
But I've not tried to measure the cutoff for 'rep movsb' against
a copy loop - given that you'll get at least one mispredicted branch
for the copy loop and a roughly 50% mispredicted one to select between
the algorithms.
Add in a function call and the ~30 clocks [1] for a short 'rep movsb'
starts looking very good.

[1] I can't remember the actual number, but it isn't very many even on
Ivy Bridge - and you get 1-16 byte copies for the same cost.
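
Roughly what I mean by 'copy 32 bytes and then align the destination',
again just a user-space sketch (the 32-byte head and the names are
illustrative, and n is assumed to be >= 32):

	#include <stdint.h>
	#include <string.h>

	/* 'rep movsb': %rdi = dst, %rsi = src, %rcx = count */
	static void rep_movsb(void *dst, const void *src, size_t n)
	{
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (n)
			     : : "memory");
	}

	/* Copy the (possibly unaligned) first 32 bytes, then restart
	 * the 'rep movsb' at the next 32-byte destination boundary,
	 * re-copying at most 31 bytes.
	 */
	static void copy_align_dst(void *dst, const void *src, size_t n)
	{
		size_t head = 32 - ((uintptr_t)dst & 31);	/* 1..32 */

		memcpy(dst, src, 32);
		rep_movsb((char *)dst + head, (const char *)src + head,
			  n - head);
	}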

	David

> 
>             Linus

