Message-ID: <20220714091026.GM2387@pengutronix.de>
Date:   Thu, 14 Jul 2022 11:10:26 +0200
From:   Sascha Hauer <sha@...gutronix.de>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        Ingo Molnar <mingo@...hat.com>, kernel@...gutronix.de
Subject: Re: Performance impact of CONFIG_FUNCTION_TRACER

On Tue, Jul 05, 2022 at 06:27:46PM -0400, Steven Rostedt wrote:
> On Tue, 5 Jul 2022 23:59:48 +0200
> Sascha Hauer <sha@...gutronix.de> wrote:
> 
> > > 
> > > I believe that, due to using a link register for function calls, ARM
> > > requires adding two 4-byte nops to every function, whereas x86 only
> > > adds a single 5-byte nop.
> > > 
> > > Although nops are very fast (they should not be processed in the
> > > CPU's pipeline, but I don't know if that's true for every arch), they
> > > also affect the instruction cache: adding 8 bytes to every function
> > > causes more cache misses than when those bytes are not there.
> > 
> > I just dug around a bit and saw that on ARM it's not even a real nop.
> > The compiler emits:
> > 
> > 	push    {lr}
> > 	bl      8010e7c0 <__gnu_mcount_nc>
> > 
> > Which is then turned into a nop by replacing the second instruction with
> > 
> > 	add   sp, sp, #4
> > 
> > to bring the stack pointer back to its original value. This must
> > indeed be processed by the CPU pipeline. I wonder whether that could
> > be optimized by replacing both instructions with nops. I have no
> > idea, though, whether that's feasible at all or whether the overhead
> > would even get smaller.
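> > 
> > So the two live states of the prologue are:
> > 
> > 	push    {lr}
> > 	bl      8010e7c0 <__gnu_mcount_nc>
> > 
> > with tracing enabled, and
> > 
> > 	push    {lr}
> > 	add     sp, sp, #4
> > 
> > with tracing disabled.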
> 
> The problem is that there's no easy way to do that, because a task
> could have been preempted after doing the 'push {lr}' and before the
> 'bl'. Thus, you create a race by changing either one to a nop first.
> 
> I wonder if it would have been better to change the first one to a
> jump past the second :-/

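Right. Replacing the pair with two plain nops would mean patching two
instructions that cannot be updated atomically. A made-up timeline to
illustrate (disabling tracing, patching the push first):

	task entering the function	patching CPU
	--------------------------	------------
					push {lr}  ->  nop
	nop    (lr is not pushed)
	bl     __gnu_mcount_nc		(second insn not patched yet)
	       -> __gnu_mcount_nc pops an lr that was never pushed

Patching the bl first breaks similarly: a task preempted right after
the push resumes at a nop and the pushed lr is never popped.
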
I gave the jump variant a try, but the performance was no better than
with the stack push/pop operations we have now. I also tried replacing
both instructions with nops (mov r0, r0), still with no improvement. I
guess we have to live with it then.
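
For reference, roughly what I tried (hand-written sketch, not the
actual patch), with the current scheme shown for comparison; all three
show the tracing-disabled state:

	@ current scheme: second instruction patched to a stack adjustment
		push	{lr}
		add	sp, sp, #4

	@ variant 1: first instruction patched to a branch over the call
		b	1f
		bl	__gnu_mcount_nc
	1:

	@ variant 2: both instructions patched to nops (would need both
	@ replaced safely, e.g. via stop_machine())
		mov	r0, r0
		mov	r0, r0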

Sascha

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
