linux-kernel - RE: how many memset(,0,) calls in kernel ?


Message-ID: <15cd0a8e72b3460db939060db25dd59a@AcuMS.aculab.com>
Date:   Tue, 14 Sep 2021 08:23:40 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Willy Tarreau' <w@....eu>
CC:     Douglas Gilbert <dgilbert@...erlog.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: RE: how many memset(,0,) calls in kernel ?

From: Willy Tarreau
> Sent: 13 September 2021 17:10
> 
> On Mon, Sep 13, 2021 at 04:03:09PM +0000, David Laight wrote:
> > >   36:   b9 06 00 00 00          mov    $0x6,%ecx
> > >   3b:   4c 89 e7                mov    %r12,%rdi
> > >   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> > >
> > > The last line does exactly "memset(%rdi, %eax, %ecx)". Just two bytes
> > > for some code that modern processors are even able to optimize.
> >
> > Hmmm I'd bet that 6 stores will be faster on ~everything.
> > 'modern' processors do better than some older ones [1], but 6
> > writes isn't enough to get into the really fast paths.
> > So you'll still take a few cycles of setup.
> 
> The exact point is, here it's up to the compiler to decide thanks to
> its builtin what it considers best for the target CPU. It already
> knows the fixed size and the code is emitted accordingly. It may
> very well be a call to the memset() function when the size is large
> and a power of two because it knows alternate variants are available
> for example.
> 
> The compiler might even decide to shrink that area if other bytes
> are written just after the memset(), leaving only holes touched by
> memset().

You might think the compiler will make sane choices for the target CPU.
But it often makes a complete pig's breakfast of it.
I'm pretty sure a 'rep stos' of 6 words is slower than 6 writes on absolutely
everything - with the possible exception of an 8088.
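
To make the comparison concrete, here is a minimal sketch of the two forms
being argued about. The struct and function names are hypothetical (they are
not from the code under discussion); the 24-byte size just matches the
'mov $0x6,%ecx' count of 32-bit words in the quoted disassembly. What either
function compiles to depends on the compiler, its flags and the target CPU.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 24-byte object: 6 x 32-bit words, as in the quoted 'rep stos'. */
struct hdr {
	uint32_t w[6];
};

/* Fixed-size memset(): the compiler's builtin decides how to expand it
 * (inline stores, 'rep stos', or a call to the out-of-line memset()). */
void clear_memset(struct hdr *h)
{
	memset(h, 0, sizeof(*h));
}

/* Six explicit stores: no string instruction is asked for, although the
 * compiler is still free to merge them into wider writes. */
void clear_stores(struct hdr *h)
{
	h->w[0] = 0;
	h->w[1] = 0;
	h->w[2] = 0;
	h->w[3] = 0;
	h->w[4] = 0;
	h->w[5] = 0;
}
```

Both functions leave the object in the same state; comparing the output of
'objdump -d' for each shows what the compiler actually picked.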

By far the worst ones are when the compiler decides to pessimise
a loop by using the simd (eg avx512) instructions to do 4 (or 8)
loop iterations in one pass.
It might be fine if the loop count is in the 100s - but not when it is 3.
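
A sketch of the sort of loop this applies to, assuming a hypothetical helper
whose trip count is only known at run time. An auto-vectorising compiler
typically has to emit a vector body plus a scalar tail and, without
'restrict', a run-time overlap check; when n is 3 only the checks and the
tail ever execute.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: n is not a compile-time constant, so the compiler
 * must generate code that is correct for any n, including very small ones. */
void add_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] += src[i];
}
```

Whether the vectorised version wins depends entirely on the typical value
of n, which the compiler cannot see.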

One compiler I've used nicely converted any byte copy loop
into a 'rep movsb' instruction.
That was contemporary with P4 netburst - where it was terribly slow.
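
For reference, the kind of loop meant is something like the sketch below
(the function name is hypothetical). With the direction flag clear,
'rep movsb' copies bytes in ascending address order, which is the same
behaviour as this loop, so a compiler is allowed to make the substitution
whether or not it is a good idea on the target CPU.

```c
#include <stddef.h>

/* Hypothetical plain byte copy, lowest address first. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
	while (n--)
		*dst++ = *src++;
}
```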

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)