Date:   Tue, 5 Jul 2022 18:27:46 -0400
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Sascha Hauer <sha@...gutronix.de>
Cc:     linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        Ingo Molnar <mingo@...hat.com>, kernel@...gutronix.de
Subject: Re: Performance impact of CONFIG_FUNCTION_TRACER

On Tue, 5 Jul 2022 23:59:48 +0200
Sascha Hauer <sha@...gutronix.de> wrote:

> > 
> > I believe that, due to using a link register for function calls, ARM
> > requires adding two 4-byte nops to every function, whereas x86 only
> > adds a single 5-byte nop.
> > 
> > Although nops are very fast (they should not be processed in the CPU's
> > pipeline, though I don't know if that's true for every arch), they still
> > affect instruction cache misses, as adding 8 bytes to every function
> > will cause more cache misses than when those bytes do not exist.  
> 
> Just dug around a bit and saw that on ARM it's not even a real nop.
> The compiler emits:
> 
> 	push    {lr}
> 	bl      8010e7c0 <__gnu_mcount_nc>
> 
> Which is then turned into a nop by replacing the second instruction with
> 
> 	add   sp, sp, #4
> 
> to bring the stack pointer back to its original value. This indeed must
> be processed by the CPU pipeline. I wonder if that could be optimized by
> replacing both instructions with real nops. I have no idea, though, whether
> that's feasible at all or whether the overhead would even get smaller.

The problem is that there's no easy way to do that, because a task
could have been preempted after doing the 'push {lr}' and before the
'bl'. Thus, you create a race by changing either one to a nop first.
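
Just to illustrate the race (a rough sketch of the instruction states,
not actual ftrace code):

	@ what the compiler emitted (or the disabled form with the add)
	push	{lr}
	add	sp, sp, #4		@ or: bl __gnu_mcount_nc

	@ nop the second instruction first, and a task preempted right after
	@ the push never gets those 4 bytes popped back off its stack:
	push	{lr}
	nop

	@ nop the first instruction first, and a task preempted just before
	@ it still executes the old second instruction and pops 4 bytes it
	@ never pushed:
	nop
	add	sp, sp, #4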

I wonder if it would have been better to change the first one to a jump
past the second :-/

Actually, if you don't mind setups that take a long time: if you change
the first to a jump past the second and then do synchronize_rcu_rude()
(which may take a while, possibly several seconds or more), then you know
that all users now only see the jump, and none will see the bl. Then
you could convert the bl to a nop, and then even change the jump to a nop
after that.
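
In instruction terms that would walk through something like these states
(just a sketch; the 1f/1: label is only for illustration):

	@ state 0: tracing enabled
	push	{lr}
	bl	__gnu_mcount_nc

	@ state 1: turn the push into a branch over the call, then run
	@ synchronize_rcu_rude() so nothing is still sitting between the two
	b	1f
	bl	__gnu_mcount_nc
1:	@ rest of the function

	@ state 2: nothing can reach the bl any more, so it can become a nop
	b	1f
	nop
1:	@ rest of the function

	@ state 3: finally the branch itself can become a nop
	nop
	nop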

To convert back, you would need to reverse it. Convert the first nop
back to a jump and run synchronize_rcu_rude(). Then convert the second nop
back to the bl, and finally convert the first to the push {lr}.
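
And the re-enable side in the same notation (again just a sketch):

	@ state 0: fully nopped out
	nop
	nop

	@ state 1: put the branch back first, then run synchronize_rcu_rude()
	@ so no task is still parked between the two old nops
	b	1f
	nop
1:	@ rest of the function

	@ state 2: nothing can reach the second slot now, so the call is safe
	b	1f
	bl	__gnu_mcount_nc
1:	@ rest of the function

	@ state 3: restoring the push makes the call live again
	push	{lr}
	bl	__gnu_mcount_nc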

> 
> > 
> > Also, there are some configurations that use the old mcount, which adds
> > some more code to handle the mcount case.
> > 
> > So if this is just to have us change the kconfig, I'm happy to do that.  
> 
> Yes, it would be good to make the kconfig text clear. The overhead itself
> is fine as long as people know that's the price to pay for getting the
> function tracer.

Agreed. I'll write up a patch.

-- Steve
