Date:   Wed, 13 Sep 2023 08:25:40 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Linus Torvalds' <torvalds@...ux-foundation.org>
CC:     Mateusz Guzik <mjguzik@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

From: Linus Torvalds
> Sent: 12 September 2023 21:48
> 
> On Tue, 12 Sept 2023 at 12:41, David Laight <David.Laight@...lab.com> wrote:
> >
> > What I found seemed to imply that 'rep movsq' used the same internal
> > logic as 'rep movsb' (pretty easy to do in hardware)
> 
> Christ.
> 
> I told you. It's pretty easy in hardware  AS LONG AS IT'S ALIGNED.
> 
> And if it's unaligned, "rep movsq" is FUNDAMENTALLY HARDER.

For cached memory it only has to appear to have used 8-byte
accesses.
So, in the same way that 'rep movsb' can be optimised to do
cache-line-sized reads and writes even when the addresses are
completely misaligned, 'rep movsq' could use exactly the same
hardware logic with a byte count that is 8 times larger.
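
To make that concrete, a minimal user-space sketch (my own
illustration, not the kernel's copy routine) of how the two forms
describe the same data movement when the length is a multiple of 8,
so one copy engine can serve both:

#include <stddef.h>

static void copy_movsb(void *dst, const void *src, size_t n)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (n)
		     : : "memory");
}

static void copy_movsq(void *dst, const void *src, size_t n)
{
	size_t qwords = n / 8;	/* n assumed to be a multiple of 8 */

	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");
}

For a given buffer the two are architecturally the same copy; the
only difference visible to the instruction is the count register.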

The only subtlety is that the read length would need masking down
to a multiple of 8 if there is a page fault on a misaligned read
side (so that a whole multiple of 8 bytes gets written to the
destination).
That wouldn't really be hard.
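
Roughly this (a toy model of mine, not the actual microcode):

#include <stddef.h>

/* If the source read faults part way through, only whole qwords may
 * appear to have been written, so the visible progress is the byte
 * count rounded down to a multiple of 8.
 */
static size_t movsq_visible_progress(size_t bytes_read_before_fault)
{
	return bytes_read_before_fault & ~(size_t)7;
}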

I definitely saw exactly the same number of bytes/clock
for 'rep movsb' and 'rep movsq' when the destination was
misaligned.
The alignment made no difference, except that a destination
aligned to a multiple of 32 ran (about) twice as fast.
I even double-checked the disassembly to make sure I was
running the right code.
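
Something like this user-space sketch shows the shape of that
measurement (the buffer size, iteration count and rdtsc timing here
are illustrative assumptions, not the exact harness used):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define LEN   16384
#define ITERS 100000

static unsigned char src_buf[LEN + 64], dst_buf[LEN + 64];

static uint64_t time_copy(int use_movsq, void *d, const void *s, size_t n)
{
	uint64_t t0 = __rdtsc();

	for (int i = 0; i < ITERS; i++) {
		void *dd = d;
		const void *ss = s;
		size_t nn = use_movsq ? n / 8 : n;

		if (use_movsq)
			asm volatile("rep movsq"
				     : "+D" (dd), "+S" (ss), "+c" (nn)
				     : : "memory");
		else
			asm volatile("rep movsb"
				     : "+D" (dd), "+S" (ss), "+c" (nn)
				     : : "memory");
	}
	return __rdtsc() - t0;
}

int main(void)
{
	memset(src_buf, 0xa5, sizeof(src_buf));

	/* Destination misaligned by one byte, length a multiple of 8. */
	uint64_t b = time_copy(0, dst_buf + 1, src_buf, LEN);
	uint64_t q = time_copy(1, dst_buf + 1, src_buf, LEN);

	printf("rep movsb: %.2f bytes/clock\n", (double)LEN * ITERS / b);
	printf("rep movsq: %.2f bytes/clock\n", (double)LEN * ITERS / q);
	return 0;
}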

So it looks like the Intel hardware engineers have solved
the 'FUNDAMENTALLY HARDER' problem.

	David

