Date:   Wed, 1 Mar 2023 10:37:28 +0100
From:   Uros Bizjak <ubizjak@...il.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Masami Hiramatsu <mhiramat@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Joel Fernandes <joel@...lfernandes.org>
Subject: Re: [PATCH 3/3] ring_buffer: Use try_cmpxchg instead of cmpxchg

On Tue, Feb 28, 2023 at 10:43 PM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Tue, 28 Feb 2023 18:59:29 +0100
> Uros Bizjak <ubizjak@...il.com> wrote:
>
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old.
> > x86 CMPXCHG instruction returns success in ZF flag, so this change
> > saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).
> >
> > Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
> > fails. There is no need to re-read the value in the loop.
> >
> > No functional change intended.
>
> As I mentioned in the RCU thread, I have issues with some of the changes
> here.
>
> >
> > Cc: Steven Rostedt <rostedt@...dmis.org>
> > Cc: Masami Hiramatsu <mhiramat@...nel.org>
> > Signed-off-by: Uros Bizjak <ubizjak@...il.com>
> > ---
> >  kernel/trace/ring_buffer.c | 20 ++++++++------------
> >  1 file changed, 8 insertions(+), 12 deletions(-)
> >
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index 4188af7d4cfe..8f0ef7d12ddd 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1493,14 +1493,11 @@ static bool rb_head_page_replace(struct buffer_page *old,
> >  {
> >       unsigned long *ptr = (unsigned long *)&old->list.prev->next;
> >       unsigned long val;
> > -     unsigned long ret;
> >
> >       val = *ptr & ~RB_FLAG_MASK;
> >       val |= RB_PAGE_HEAD;
> >
> > -     ret = cmpxchg(ptr, val, (unsigned long)&new->list);
> > -
> > -     return ret == val;
> > +     return try_cmpxchg(ptr, &val, (unsigned long)&new->list);
>
> No, val should not be updated.

Please see the definition of try_cmpxchg. It is written in a way that
benefits loops as well as linear code; in the latter case it depends on
the compiler to eliminate the assignment to val as a dead assignment.

The above change was made under the assumption that val is unused after
try_cmpxchg and can therefore be treated as a temporary. [Alternatively,
the value could be copied to a local temporary and a pointer to that
temporary passed to try_cmpxchg instead; the compiler is smart enough to
eliminate the assignment in either case.]

Even in linear code, the change has a considerable effect.
rb_head_page_replace is inlined into rb_get_reader_page, and gcc-10.3.1
improves the code from:

     ef8:    48 8b 0e                 mov    (%rsi),%rcx
     efb:    48 83 e1 fc              and    $0xfffffffffffffffc,%rcx
     eff:    48 83 c9 01              or     $0x1,%rcx
     f03:    48 89 c8                 mov    %rcx,%rax
     f06:    f0 48 0f b1 3e           lock cmpxchg %rdi,(%rsi)
     f0b:    48 39 c1                 cmp    %rax,%rcx
     f0e:    74 3b                    je     f4b <rb_get_reader_page+0x13b>

to:

     ed8:    48 8b 01                 mov    (%rcx),%rax
     edb:    48 83 e0 fc              and    $0xfffffffffffffffc,%rax
     edf:    48 83 c8 01              or     $0x1,%rax
     ee3:    f0 48 0f b1 31           lock cmpxchg %rsi,(%rcx)
     ee8:    74 3b                    je     f25 <rb_get_reader_page+0x135>

Again, even in linear code the change eliminates the assignment to a
temporary reg and the compare. Please note that there is no move *from*
the %rax register after cmpxchg, so the compiler correctly eliminated
the unused assignment.

>
> >  }
> >
> >  /*
> > @@ -2055,7 +2052,7 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
> >       retries = 10;
> >       success = false;
> >       while (retries--) {
> > -             struct list_head *head_page, *prev_page, *r;
> > +             struct list_head *head_page, *prev_page;
> >               struct list_head *last_page, *first_page;
> >               struct list_head *head_page_with_bit;
> >
> > @@ -2073,9 +2070,8 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
> >               last_page->next = head_page_with_bit;
> >               first_page->prev = prev_page;
> >
> > -             r = cmpxchg(&prev_page->next, head_page_with_bit, first_page);
> > -
> > -             if (r == head_page_with_bit) {
> > +             if (try_cmpxchg(&prev_page->next,
> > +                             &head_page_with_bit, first_page)) {
>
> No. head_page_with_bit should not be updated.

As above, head_page_with_bit should be considered a temporary: it is
initialized a couple of lines above the cmpxchg and is unused
afterwards. gcc-10.3.1 even found some further optimization
opportunities and reordered the code from:

    1364:    4d 8b 86 38 01 00 00     mov    0x138(%r14),%r8
    136b:    48 83 ce 01              or     $0x1,%rsi
    136f:    48 89 f0                 mov    %rsi,%rax
    1372:    49 89 30                 mov    %rsi,(%r8)
    1375:    48 89 4f 08              mov    %rcx,0x8(%rdi)
    1379:    f0 48 0f b1 39           lock cmpxchg %rdi,(%rcx)
    137e:    48 39 c6                 cmp    %rax,%rsi
    1381:    74 78                    je     13fb <rb_insert_pages+0xdb>

to:

    1343:    48 83 c8 01              or     $0x1,%rax
    1347:    48 8b bb 38 01 00 00     mov    0x138(%rbx),%rdi
    134e:    48 89 07                 mov    %rax,(%rdi)
    1351:    48 89 4e 08              mov    %rcx,0x8(%rsi)
    1355:    f0 48 0f b1 31           lock cmpxchg %rsi,(%rcx)
    135a:    41 0f 94 c7              sete   %r15b
    135e:    75 2f                    jne    138f <rb_insert_pages+0x8f>

Please also note the SETE insn in the above code; this is how the
"success" variable is handled in the loop. So, besides the code size
improvement, other secondary improvements can be expected from the
change as well.

I think the above examples demonstrate the various improvements that can
be achieved with an effectively one-line, almost mechanical change to
the code, even in linear code. It would be unfortunate not to consider
them.

Uros.
