linux-kernel - RE: objtool clac/stac handling change..

Message-ID: <88850c316ed14c7b8391cea05d875406@AcuMS.aculab.com>
Date:   Mon, 13 Jul 2020 09:32:32 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Linus Torvalds' <torvalds@...ux-foundation.org>
CC:     Al Viro <viro@...iv.linux.org.uk>,
        Michael Ellerman <mpe@...erman.id.au>,
        Christophe Leroy <christophe.leroy@....fr>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "the arch/x86 maintainers" <x86@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: RE: objtool clac/stac handling change..

From: Linus Torvalds
> Sent: 10 July 2020 23:37
> On Tue, Jul 7, 2020 at 5:35 AM David Laight <David.Laight@...lab.com> wrote:
> >
> >
> > So separate copy and checksum passes should easily exceed 4 bytes/clock,
> > but I suspect that doing them together never does.
> > (Unless the buffer is too big for the L1 cache.)
> 
> It's the "touch the caches twice" that is the problem.
> 
> And it's not the "buffer is too big for L1", it's "the source, the
> destination and any incidentals are too big for L1" with the
> additional noise from replacement policies etc.

That's really what I meant.
L1D is actually (probably) only 32kB.
I guess that gives you 8k for the buffer.

It is a shame you can't use the AVX instructions in the kernel.
(Although saving them probably costs more than the gain.)
Then you could use something based on:
10:	load ymm,src+idx   // 32 bytes
	store ymm,tgt+idx
	addq sum0,ymm   // eight 32bit adds
	rotate ymm,16   // Pretty sure there is an instruction for this!
	addq sum1,ymm
	add idx,32
	jnz 10b
It is then possible to determine the correct result from sum0/sum1.
On a very recent Intel CPU that might even run at 1 iteration/clock!
(Probably needs an unroll and explicit interleave.)
At one iteration every 2 clocks it matches the ADDX[OC] loop
but includes the write.
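
One way the sum0/sum1 recovery could work, sketched in plain C for a single 32-bit lane (the vector loop would do eight of these in parallel). This is an illustration under assumptions, not code from this thread: the function names csum_reference() and csum_two_sums() are hypothetical, the recovery arithmetic is my own reconstruction, and it only holds while each lane has added at most 65536 words, so the per-half totals still fit in 32 bits. Byte order and the final complement are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide sum down to a 16-bit ones-complement sum. */
static uint16_t fold64(uint64_t s)
{
	while (s >> 16)
		s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/* Reference: add the two 16-bit halves of every word in a wide accumulator. */
uint16_t csum_reference(const uint32_t *w, size_t n)
{
	uint64_t s = 0;

	for (size_t i = 0; i < n; i++)
		s += (uint64_t)(w[i] & 0xffff) + (w[i] >> 16);
	return fold64(s);
}

/* Hypothetical scalar model of one lane of the sum0/sum1 loop. */
uint16_t csum_two_sums(const uint32_t *w, size_t n)
{
	uint32_t sum0 = 0, sum1 = 0;

	assert(n <= 65536);	/* assumption: per-half totals stay below 2^32 */

	for (size_t i = 0; i < n; i++) {
		uint32_t v = w[i];

		sum0 += v;			/* carries out of bit 31 are lost */
		v = (v << 16) | (v >> 16);	/* rotate by 16: swap the halves */
		sum1 += v;
	}

	/*
	 * Let A = total of the low halves and B = total of the high halves.
	 * Then sum0 = A + (B << 16) and sum1 = B + (A << 16), both mod 2^32.
	 * The low 16 bits of each sum are exact; the high 16 bits of one sum
	 * minus the low 16 bits of the other gives the missing carry count.
	 */
	uint32_t a_lo = sum0 & 0xffff;
	uint32_t b_lo = sum1 & 0xffff;
	uint32_t a_hi = ((sum0 >> 16) - b_lo) & 0xffff;
	uint32_t b_hi = ((sum1 >> 16) - a_lo) & 0xffff;

	uint64_t a = ((uint64_t)a_hi << 16) | a_lo;
	uint64_t b = ((uint64_t)b_hi << 16) | b_lo;

	return fold64(a + b);
}
```

For longer buffers the loop would have to fold the accumulators every 65536 iterations or so, which should be cheap compared with the copy itself.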

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)