Date:   Wed, 13 Sep 2023 08:25:40 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Linus Torvalds' <torvalds@...ux-foundation.org>
CC:     Mateusz Guzik <mjguzik@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

From: Linus Torvalds
> Sent: 12 September 2023 21:48
> 
> On Tue, 12 Sept 2023 at 12:41, David Laight <David.Laight@...lab.com> wrote:
> >
> > What I found seemed to imply that 'rep movsq' used the same internal
> > logic as 'rep movsb' (pretty easy to do in hardware)
> 
> Christ.
> 
> I told you. It's pretty easy in hardware  AS LONG AS IT'S ALIGNED.
> 
> And if it's unaligned, "rep movsq" is FUNDAMENTALLY HARDER.

For cached memory it only has to appear to have used 8-byte
accesses.
So, in the same way that 'rep movsb' can be optimised to do
cache-line-sized reads and writes even when the addresses are
completely misaligned, 'rep movsq' could use exactly the same
hardware logic with a byte count that is 8 times larger.
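
To make that concrete, a minimal user-space sketch (my own
illustration, not the kernel's copy routine) of how the two forms
describe the same data movement when the length is a multiple of 8,
so one copy engine can serve both:

#include <stddef.h>

static void copy_movsb(void *dst, const void *src, size_t n)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (n)
		     : : "memory");
}

static void copy_movsq(void *dst, const void *src, size_t n)
{
	size_t qwords = n / 8;	/* n assumed to be a multiple of 8 */

	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");
}

For a given buffer the two are architecturally the same copy; the
only difference visible to the instruction is the count register.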

The only subtlety is that the read length would need masking down
to a multiple of 8 if there is a page fault on a misaligned read
side (so that a whole multiple of 8 bytes gets written to the
destination).
That wouldn't really be hard.
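
Roughly this (a toy model of mine, not the actual microcode):

#include <stddef.h>

/* If the source read faults part way through, only whole qwords may
 * appear to have been written, so the visible progress is the byte
 * count rounded down to a multiple of 8.
 */
static size_t movsq_visible_progress(size_t bytes_read_before_fault)
{
	return bytes_read_before_fault & ~(size_t)7;
}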

I definitely saw exactly the same number of bytes/clock
for 'rep movsb' and 'rep movsq' when the destination was
misaligned.
The alignment made no difference, except that a destination
aligned to a multiple of 32 ran (about) twice as fast.
I even double-checked the disassembly to make sure I was
running the right code.
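
Something like this user-space sketch shows the shape of that
measurement (the buffer size, iteration count and rdtsc timing here
are illustrative assumptions, not the exact harness used):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define LEN   16384
#define ITERS 100000

static unsigned char src_buf[LEN + 64], dst_buf[LEN + 64];

static uint64_t time_copy(int use_movsq, void *d, const void *s, size_t n)
{
	uint64_t t0 = __rdtsc();

	for (int i = 0; i < ITERS; i++) {
		void *dd = d;
		const void *ss = s;
		size_t nn = use_movsq ? n / 8 : n;

		if (use_movsq)
			asm volatile("rep movsq"
				     : "+D" (dd), "+S" (ss), "+c" (nn)
				     : : "memory");
		else
			asm volatile("rep movsb"
				     : "+D" (dd), "+S" (ss), "+c" (nn)
				     : : "memory");
	}
	return __rdtsc() - t0;
}

int main(void)
{
	memset(src_buf, 0xa5, sizeof(src_buf));

	/* Destination misaligned by one byte, length a multiple of 8. */
	uint64_t b = time_copy(0, dst_buf + 1, src_buf, LEN);
	uint64_t q = time_copy(1, dst_buf + 1, src_buf, LEN);

	printf("rep movsb: %.2f bytes/clock\n", (double)LEN * ITERS / b);
	printf("rep movsq: %.2f bytes/clock\n", (double)LEN * ITERS / q);
	return 0;
}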

So it looks like the Intel hardware engineers have solved
the 'FUNDAMENTALLY HARDER' problem.

	David

