linux-kernel - Re: [PATCH 0/3] Improve trace/ring

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAFULd4Z3gpPUO7vQxinm1rQcndP9c_08arBNjNtjagOiuiiABQ@mail.gmail.com>
Date:   Wed, 1 Mar 2023 09:35:44 +0100
From:   Uros Bizjak <ubizjak@...il.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: [PATCH 0/3] Improve trace/ring_buffer.c

On Tue, Feb 28, 2023 at 8:35 PM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Tue, 28 Feb 2023 18:59:26 +0100
> Uros Bizjak <ubizjak@...il.com> wrote:
>
> > This series improves ring_buffer.c by changing the type of some
> > static functions to void or bool and
>
> Well, it's not really an improvement unless it has a functional change. But
> I'm OK with taking these.

"No functional changes intended" claim should be taken in the sense
that the same functionality is achieved in a more optimal way. As a
trivial example, changing some functions to bool eliminates many
unnecessary zero-extensions and results in smaller code:

size ring_buffer-*.o
  text    data     bss     dec     hex filename
  25599    1762       4   27365    6ae5 ring_buffer-vanilla.o
  25551    1762       4   27317    6ab5 ring_buffer-bool.o

Please also note that setting the output value with "SETcc r8" results
in a partial register stall on x86 when returning r32. The compiler
knows how to handle this issue (using register dependency breaking XOR
in front of SETcc), but it is better to avoid the issue altogether.
So, there are also secondary benefits of the above "No functional
changes intended" change, besides lower code size.

> > uses try_cmpxchg instead of
> > cmpxchg (*ptr, old, new) == old where appropriate.
>
> Now, the try_cmpxchg() could be an improvement on x86.

True, the concrete example is e.g. ring_buffer_record_off, where the
cmpxchg loop improves from:

      78:    8b 57 08                 mov    0x8(%rdi),%edx
      7b:    89 d6                    mov    %edx,%esi
      7d:    89 d0                    mov    %edx,%eax
      7f:    81 ce 00 00 10 00        or     $0x100000,%esi
      85:    f0 0f b1 31              lock cmpxchg %esi,(%rcx)
      89:    39 d0                    cmp    %edx,%eax
      8b:    75 eb                    jne    78 <ring_buffer_record_off+0x8>

to:

     1bb:    89 c2                    mov    %eax,%edx
     1bd:    81 ca 00 00 10 00        or     $0x100000,%edx
     1c3:    f0 0f b1 17              lock cmpxchg %edx,(%rdi)
     1c7:    75 f2                    jne    1bb <ring_buffer_record_off+0xb>

As can be demonstrated above, the move to a temporary reg and the
comparison with said temporary are both eliminated.

The above code is generated with gcc-10.3.1 to show the direct
benefits of the change. gcc-12+ is able to use likely/unlikely
annotations inside try_cmpxchg definition and generates fast paths
through the code (as mentioned by you in the RCU patch-set thread).

Please also note that try_cmpxchg generates better code also for
targets that do not define try_cmpxchg and use fallback code, as
reported in [1] and follow-up messages.

The API of try_cmpxhg is written in such a way that benefits loops and
also linear code. I will discuss this in the reply of PATCH 3/3.

[1] https://lore.kernel.org/lkml/871qwgmqws.fsf@mpe.ellerman.id.au/

Uros.