Date:   Thu, 9 Dec 2021 17:00:18 +0000
From:   Mark Rutland <mark.rutland@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     will@...nel.org, boqun.feng@...il.com,
        linux-kernel@...r.kernel.org, x86@...nel.org, elver@...gle.com,
        keescook@...omium.org, hch@...radead.org,
        torvalds@...ux-foundation.org
Subject: Re: [RFC][PATCH 2/5] refcount: Use atomic_*_ofl()

On Thu, Dec 09, 2021 at 01:17:33PM +0000, Mark Rutland wrote:
> On Wed, Dec 08, 2021 at 07:36:57PM +0100, Peter Zijlstra wrote:
> > Use the shiny new atomic_*_ofl() functions in order to have better
> > code-gen.
> > 
> > Notably the refcount_inc() case no longer distinguishes between
> > inc-from-zero and inc-negative in the fast path, this improves
> > code-gen:
> > 
> >     4b88:       b8 01 00 00 00          mov    $0x1,%eax
> >     4b8d:       f0 0f c1 43 28          lock xadd %eax,0x28(%rbx)
> >     4b92:       85 c0                   test   %eax,%eax
> >     4b94:       74 1b                   je     4bb1 <alloc_perf_context+0xf1>
> >     4b96:       8d 50 01                lea    0x1(%rax),%edx
> >     4b99:       09 c2                   or     %eax,%edx
> >     4b9b:       78 20                   js     4bbd <alloc_perf_context+0xfd>
> > 
> > to:
> > 
> >     4768:       b8 01 00 00 00          mov    $0x1,%eax
> >     476d:       f0 0f c1 43 28          lock xadd %eax,0x28(%rbx)
> >     4772:       85 c0                   test   %eax,%eax
> >     4774:       7e 14                   jle    478a <alloc_perf_context+0xea>
> 
> For comparison, I generated the same for arm64 below with kernel.org crosstool
> GCC 11.1.0 and defconfig.
> 
> For arm64 there's an existing sub-optimality for inc/dec where the register
> for `1` or `-1` gets generated with a `MOV;MOV` chain or `MOV;NEG` chain rather
> than a single `MOV` in either case. I think that's due to the way we build
> LSE/LL-SC variants of add() and build a common inc() atop, and the compiler
> just loses the opportunity to constant-fold. I'll take a look at how to make
> that neater.

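For reference, a simplified sketch of the layering described above (illustrative
only; the kernel's generated atomic headers differ in detail):

    /*
     * inc()/dec() are thin wrappers around add()/sub(), so the constant
     * has to survive the underlying op's asm operand constraints for the
     * compiler to emit a single MOV.
     */
    #define arch_atomic_inc(v)	arch_atomic_add(1, (v))
    #define arch_atomic_dec(v)	arch_atomic_sub(1, (v))
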
With some improvements to arm64's LSE atomics, that becomes a comparable
sequence to x86's:

    2df8:       52800021        mov     w1, #0x1                        // #1
    ...
    2e20:       b8e10002        ldaddal w1, w2, [x0]
    2e24:       7100005f        cmp     w2, #0x0
    2e28:       5400012d        b.le    2e4c <alloc_perf_context+0xc8>
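
For comparison, a minimal userspace sketch (C11 atomics, illustrative only,
not the kernel's primitives or its exact memory-ordering rules) of the
collapsed fast-path check that both listings implement:

    #include <stdatomic.h>
    #include <stdio.h>

    /*
     * After a fetch-and-add of 1, a single signed test of the old value
     * covers both slow cases: old == 0 (inc-from-zero) and old < 0
     * (already saturated/negative), hence the single jle/b.le above.
     */
    static void refcount_inc_sketch(atomic_int *refs)
    {
        int old = atomic_fetch_add_explicit(refs, 1,
                                            memory_order_acq_rel);

        if (old <= 0)   /* slow path: warn and saturate */
            fprintf(stderr, "refcount_inc: saturated or inc-from-zero\n");
    }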

> > without sacrificing on functionality; the only thing that suffers is
> > the reported error condition, which might now 'spuriously' report
> > 'saturated' instead of 'inc-from-zero'.
> > 
> > refcount_dec_and_test() is also improved:
> > 
> >     aa40:       b8 ff ff ff ff          mov    $0xffffffff,%eax
> >     aa45:       f0 0f c1 07             lock xadd %eax,(%rdi)
> >     aa49:       83 f8 01                cmp    $0x1,%eax
> >     aa4c:       74 05                   je     aa53 <ring_buffer_put+0x13>
> >     aa4e:       85 c0                   test   %eax,%eax
> >     aa50:       7e 1e                   jle    aa70 <ring_buffer_put+0x30>
> >     aa52:       c3                      ret
> > 
> > to:
> > 
> >     a980:       b8 ff ff ff ff          mov    $0xffffffff,%eax
> >     a985:       f0 0f c1 07             lock xadd %eax,(%rdi)
> >     a989:       83 e8 01                sub    $0x1,%eax
> >     a98c:       78 20                   js     a9ae <ring_buffer_put+0x2e>
> >     a98e:       74 01                   je     a991 <ring_buffer_put+0x11>
> >     a990:       c3                      ret

Likewise I can get the arm64 equivalent down to:

    bebc:       12800001        mov     w1, #0xffffffff                 // #-1
    ...
    becc:       b8e10001        ldaddal w1, w1, [x0]
    bed0:       71000421        subs    w1, w1, #0x1
    bed4:       540000c4        b.mi    beec <ring_buffer_put+0x3c>  // b.first
    bed8:       54000120        b.eq    befc <ring_buffer_put+0x4c>  // b.none
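
The same shape in a userspace sketch (again C11 atomics, illustrative only):

    /*
     * Fetch-and-sub of 1, then one subtraction of the old value yields
     * both tests: new < 0 means underflow (the js/b.mi slow path), and
     * new == 0 means the last reference was dropped (the je/b.eq case).
     */
    static _Bool refcount_dec_and_test_sketch(atomic_int *refs)
    {
        int old = atomic_fetch_sub_explicit(refs, 1,
                                            memory_order_acq_rel);
        int new = old - 1;

        if (new < 0) {  /* underflow: saturate and warn */
            fprintf(stderr, "refcount: underflow\n");
            return 0;
        }
        return new == 0;
    }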

I've pushed my WIP patches to:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/atomics/improvements
  git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/atomics/improvements

... and I'll try to get those cleaned up and posted soon.

Thanks,
Mark.
