Message-ID: <20250317223243.460e3bff@pumpkin>
Date: Mon, 17 Mar 2025 22:32:43 +0000
From: David Laight <david.laight.linux@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "Herton R. Krzesinski" <herton@...hat.com>, x86@...nel.org,
 tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
 olichtne@...hat.com, atomasov@...hat.com, aokuliar@...hat.com
Subject: Re: [PATCH] x86: add back the alignment of the destination to 8
 bytes in copy_user_generic()

On Mon, 17 Mar 2025 14:29:05 -0700
Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Mon, 17 Mar 2025 at 06:16, David Laight <david.laight.linux@...il.com> wrote:
> >
> > You can also do something similar for any trailing bytes.
> > If you are feeling 'brave' copy the last 8 bytes first.  
> 
> I think that would be a mistake.
> 
> Not only does it cause bad patterns on page faults - we should recover
> ok from it (the exception will go back and do the copy in the right
> order one byte at a time in the "copy_user_tail" code) - but even in
> the absence of page faults it quite possibly messes with CPU
> prefetching and write buffer coalescing etc if you hop around like
> that.

I thought you might say that :-)

> It *might* be worth trying to do the last unaligned part the same way my
> patch does the first one - by just doing a full-word write at the end,
> offset backwards. That avoids the byte-at-a-time tail case.
> 
> I'm not convinced it's worth it, but if somebody spends the effort on
> a patch and on benchmarking...

After a 'rep movsq' you'll have %rsi and %rdi pointing to the first byte
left to copy and the remaining byte count in (say) %rax.
So something like:
	mov	-8(%rsi, %rax), %rsi	# load the last 8 source bytes
	mov	%rsi, -8(%rdi, %rax)	# store them over the last 8 destination bytes
will copy the last 8 bytes.
Ought to be faster than anything with a branch in it.

Whether it is worth leaving yourself with [1..8] bytes to copy
rather than [0..7] (and copying the last 8 bytes twice) might
be debatable.
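
In C the same overlapping-tail trick looks something like this (just a
user-space sketch with made-up names; the fixed-size memcpy()s are the
portable way to spell a single unaligned 8-byte load and store):

	#include <stdint.h>
	#include <string.h>

	/* Copy the last 8 bytes of an n-byte copy (n >= 8) with one
	 * unaligned load/store pair that overlaps whatever the bulk
	 * copy already wrote, instead of a byte-at-a-time tail loop.
	 */
	static void copy_tail8(void *dst, const void *src, size_t n)
	{
		uint64_t last;

		memcpy(&last, (const char *)src + n - 8, 8);
		memcpy((char *)dst + n - 8, &last, 8);
	}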

For Intel FSRM (fast short 'rep movsb') copying 32 bytes and then aligning
the destination will be worthwhile for longer copies.
But I've not tried to measure the cutoff for 'rep movsb' against
a copy loop - given that you'll get at least one mispredicted branch
for the copy loop and a roughly 50% mispredicted one to select between
the algorithms.
Add in a function call and the ~30 clocks [1] for a short 'rep movsb'
starts looking very good.

[1] I can't remember the actual number, but it isn't very many even on
Ivy Bridge - and you get 1-16 byte copies for the same cost.
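
Roughly what I mean by 'copy 32 bytes and then align the destination',
again just a user-space sketch (the 32-byte head and the names are
illustrative, and n is assumed to be >= 32):

	#include <stdint.h>
	#include <string.h>

	/* 'rep movsb': %rdi = dst, %rsi = src, %rcx = count */
	static void rep_movsb(void *dst, const void *src, size_t n)
	{
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (n)
			     : : "memory");
	}

	/* Copy the (possibly unaligned) first 32 bytes, then restart
	 * the 'rep movsb' at the next 32-byte destination boundary,
	 * re-copying at most 31 bytes.
	 */
	static void copy_align_dst(void *dst, const void *src, size_t n)
	{
		size_t head = 32 - ((uintptr_t)dst & 31);	/* 1..32 */

		memcpy(dst, src, 32);
		rep_movsb((char *)dst + head, (const char *)src + head,
			  n - head);
	}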

	David

> 
>             Linus

