linux-kernel - Re: [PATCH 3/3] x86/ftrace: Use text

Message-ID: <7b4196a4-b6e1-7e55-c3e1-a02d97c262c7@redhat.com>
Date:   Fri, 4 Oct 2019 10:10:47 +0200
From:   Daniel Bristot de Oliveira <bristot@...hat.com>
To:     Steven Rostedt <rostedt@...dmis.org>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, x86@...nel.org,
        Nadav Amit <nadav.amit@...il.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Song Liu <songliubraving@...com>,
        Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: [PATCH 3/3] x86/ftrace: Use text_poke()

On 04/10/2019 00:10, Steven Rostedt wrote:
> On Wed, 2 Oct 2019 20:21:06 +0200
> Peter Zijlstra <peterz@...radead.org> wrote:
> 
>> On Wed, Oct 02, 2019 at 06:35:26PM +0200, Daniel Bristot de Oliveira wrote:
>>
>>> ftrace was already batching the updates, for instance, causing 3 IPIs to enable
>>> all functions. The text_poke() batching also works. But because of the limited
>>> buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during the
>>> operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a VM,
>>> when enabling the function tracer, I see 250+ intermediate text_poke_finish()
>>> because of a full buffer...
>>>
>>> Would this be the case of trying to use a dynamically allocated buffer?
>>>
>>> Thoughts?  
>>
>> Is it a problem? I tried growing the buffer (IIRC I made it 10 times
>> bigger) and didn't see any performance improvements because of it.
> 
> I'm just worried if people are going to complain about the IPI burst.
> Although, I just tried it out before applying this patch, and there's
> still a bit of a burst. Not sure why. I did:
> 
> # cat /proc/interrupts > /tmp/before; echo function > /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after
> # cat /proc/interrupts > /tmp/before1; echo nop > /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after1
> 
> Before this patch:
> 
> # diff /tmp/before /tmp/after
> < CAL:       2342       2347       2116       2175       2446       2030       2416       2222   Function call interrupts
> ---
>> CAL:       2462       2467       2236       2295       2446       2150       2536       2342   Function call interrupts
> 
> (Just showing the function call interrupts)
> 
> There appears to be 120 IPIs sent to all CPUS for enabling function tracer.
> 
> # diff /tmp/before1 /tmp/after1
> < CAL:       2462       2467       2236       2295       2446       2150       2536       2342   Function call interrupts
> ---
>> CAL:       2577       2582       2351       2410       2446       2265       2651       2457   Function call interrupts
> 
> And 115 IPIs for disabling it.
> 
> After applying this patch:
> 
> # diff /tmp/before /tmp/after
> < CAL:      66070      46620      59955      59236      68707      63397      61644      62742   Function call interrupts
> ---
>> CAL:      66727      47277      59955      59893      69364      64054      62301      63399   Function call interrupts
> 
> # diff /tmp/before1 /tmp/after1
> < CAL:      66727      47277      59955      59893      69364      64054      62301      63399   Function call interrupts
> ---
>> CAL:      67358      47938      59985      60554      70025      64715      62962      64060   Function call interrupts
> 
> 
> We get 657 IPIs for enabling function tracer, and 661 for disabling it.
> Funny how it's more on the disable than the enable with the patch but
> the other way without it.
> 
> But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> it's a problem, but something that we should note. Someone (those that
> don't like kernel interference) may complain.

That is the point I was raising.

When enabling ftrace, we have three different paths:

1) the enabling/disabling ftrace path
2) the int3 path - if a thread/irq is running a kernel function
3) the IPI path - it affects all CPUs, even those that are not "hitting" traced
code, e.g., CPUs running in user-space.

The first one is for sure a cold-path. The second one is a hot-path: any task
running kernel functions will hit it. But IMHO, the hottest one is the IPI path,
because it runs on all CPUs, e.g., even isolated CPUs that are running in
user-space.

Currently, ftrace does:

	for_ftrace_rec:
		Install all breakpoints
	send IPI

	for_ftrace_rec:
		write the end of the instruction
	send IPI

	for_ftrace_rec:
		write the first byte of the instruction
	send IPI
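
As an illustration of the three passes above, here is a minimal user-space
sketch in C. It is not the kernel code: struct patch_site, sync_all_cpus() and
patch_all() are hypothetical names, the "IPI" is only a counter, and the 5-byte
instruction length is an assumption about the ftrace call sites.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INSN_LEN 5      /* assumed length of the call/nop at a patch site */
#define INT3     0xcc   /* x86 breakpoint opcode */

/* Hypothetical stand-in for one patch site; not the kernel's ftrace record. */
struct patch_site {
	uint8_t *addr;                /* location of the instruction to rewrite */
	uint8_t new_insn[INSN_LEN];   /* bytes that should end up at addr */
};

static unsigned int nr_syncs;     /* counts the simulated "send IPI" steps */

/* Stand-in for the cross-CPU synchronization (the IPI in the text above). */
static void sync_all_cpus(void)
{
	nr_syncs++;
}

/* Rewrite all sites in three passes, with one sync after each pass. */
static void patch_all(struct patch_site *sites, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)         /* pass 1: install all breakpoints */
		sites[i].addr[0] = INT3;
	sync_all_cpus();

	for (i = 0; i < n; i++)         /* pass 2: write the end of the instruction */
		memcpy(sites[i].addr + 1, sites[i].new_insn + 1, INSN_LEN - 1);
	sync_all_cpus();

	for (i = 0; i < n; i++)         /* pass 3: write the first byte */
		sites[i].addr[0] = sites[i].new_insn[0];
	sync_all_cpus();
}
```

In this sketch the number of syncs stays at 3 no matter how many sites are
passed in, because each pass covers every site before synchronizing.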

That is the same sequence the text_poke batch mode performs, so it would be
better to integrate both.

The problem is that, with patch 2/3, the number of entries we can batch in
text_poke is limited, so we batch in chunks of TP_VEC_MAX entries. So, rather
than doing 3 IPIs to change the code, we do:

(ftrace_rec count / TP_VEC_MAX) * 3 IPIs.
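
The estimate above can be written as a small C helper. This is only a sketch:
nr_flushes() and nr_ipis() are hypothetical names, the buffer size is passed as
a parameter rather than taken from the real TP_VEC_MAX, and rounding the
division up (a partially filled last chunk still needs a flush) is an
assumption on top of the formula.

```c
#include <stddef.h>

#define IPIS_PER_FLUSH 3   /* one IPI per pass: int3, tail bytes, first byte */

/* Number of buffer flushes needed for nr_recs entries; vec_max must be > 0. */
static size_t nr_flushes(size_t nr_recs, size_t vec_max)
{
	return (nr_recs + vec_max - 1) / vec_max;   /* round up */
}

/* Estimated IPIs per CPU: every flush repeats the three-pass sequence. */
static size_t nr_ipis(size_t nr_recs, size_t vec_max)
{
	return nr_flushes(nr_recs, vec_max) * IPIS_PER_FLUSH;
}
```

With made-up numbers, 40000 records and a 256-entry buffer give 157 flushes and
471 IPIs, while a buffer that holds all 40000 records gives 3.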

One possible solution would be to allocate a buffer with "number of
ftrace_rec" entries and use it in the text_poke batch mode.

But to do that, we should keep the old interface (the one that patch 2/3 is
changing), or at least keep a per-use-case buffer.

[ In addition ]

Currently, ftrace_rec entries are ordered inside each group of functions, but
the "groups of functions" themselves are not ordered. So, the current int3
handler does (*):

for_each_group_of_functions:
	check if the ip is in the range    ----> O(n) in the number of groups.
		do a bsearch.		   ----> O(log n) in the number of entries
					         in the group.
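
The lookup described above can be sketched in user-space C as follows. This is
not ftrace_location_range(): struct rec_group, cmp_ip() and ip_in_groups() are
hypothetical names, and each group is assumed to be a plain array of addresses
sorted in ascending order.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical group of records: addresses sorted ascending within the group. */
struct rec_group {
	const unsigned long *ips;
	size_t n;
};

/* bsearch() comparator for two unsigned long addresses. */
static int cmp_ip(const void *key, const void *elem)
{
	unsigned long a = *(const unsigned long *)key;
	unsigned long b = *(const unsigned long *)elem;

	return (a > b) - (a < b);
}

/* Returns 1 if ip is a known record, 0 otherwise. */
static int ip_in_groups(const struct rec_group *groups, size_t nr_groups,
			unsigned long ip)
{
	size_t g;

	for (g = 0; g < nr_groups; g++) {       /* linear in the number of groups */
		const struct rec_group *grp = &groups[g];

		/* range check: skip groups that cannot contain ip */
		if (grp->n == 0 || ip < grp->ips[0] || ip > grp->ips[grp->n - 1])
			continue;
		/* logarithmic in the number of entries of this group */
		if (bsearch(&ip, grp->ips, grp->n, sizeof(ip), cmp_ip))
			return 1;
	}
	return 0;
}
```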

If, instead, it used a single ordered vector, the complexity would be O(log n)
in the total number of entries, which is better. So, how bad is the idea of:

	in the enabling ftrace code path, it:
		discover the number of entries
		alloc a buffer
		discover the order of the groups
		for each group in the correct order
			queue the entry in the buffer
		apply the changes using the text_poke...

In this way we would optimize the two hot-paths:
	int3 will be log(n)
	IPIs bounded to 3.
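
A minimal user-space sketch of that idea in C, under stated assumptions:
struct rec_group, flatten_groups() and ip_in_vec() are hypothetical names, the
groups are assumed to be internally sorted and not to overlap each other, and
malloc()/qsort() stand in for whatever the kernel would use.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical group of records: addresses sorted ascending within the group. */
struct rec_group {
	const unsigned long *ips;
	size_t n;
};

/* Order groups by their first address (empty groups sort first). */
static int cmp_group(const void *a, const void *b)
{
	const struct rec_group *ga = a, *gb = b;
	unsigned long x = ga->n ? ga->ips[0] : 0;
	unsigned long y = gb->n ? gb->ips[0] : 0;

	return (x > y) - (x < y);
}

static int cmp_ip(const void *key, const void *elem)
{
	unsigned long a = *(const unsigned long *)key;
	unsigned long b = *(const unsigned long *)elem;

	return (a > b) - (a < b);
}

/*
 * Build one ordered vector from all groups. Reorders the groups array in
 * place. Returns a malloc()ed buffer (caller frees) or NULL on failure.
 */
static unsigned long *flatten_groups(struct rec_group *groups, size_t nr_groups,
				     size_t *total)
{
	size_t g, n = 0, pos = 0;
	unsigned long *vec;

	for (g = 0; g < nr_groups; g++)         /* discover the number of entries */
		n += groups[g].n;

	vec = malloc(n ? n * sizeof(*vec) : 1); /* alloc a buffer */
	if (!vec)
		return NULL;

	/* discover the order of the groups */
	qsort(groups, nr_groups, sizeof(*groups), cmp_group);

	for (g = 0; g < nr_groups; g++) {       /* queue entries, group by group */
		if (groups[g].n)
			memcpy(vec + pos, groups[g].ips,
			       groups[g].n * sizeof(*vec));
		pos += groups[g].n;
	}

	*total = n;
	return vec;
}

/* Single binary search over all entries: O(log n) in the total count. */
static int ip_in_vec(const unsigned long *vec, size_t n, unsigned long ip)
{
	return bsearch(&ip, vec, n, sizeof(ip), cmp_ip) != NULL;
}
```

The same buffer could then be handed to the batch patching step in one go,
which is what would keep the IPIs bounded to 3 in this scheme.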

I am not saying we need to do it now. As Steve said, I am not sure this is a
big problem, but those that don't like kernel interference may complain. If we
keep the per-use-case vector in the text_poke_batch interface, things will be
easier to fix.

NOTE: the other IPIs are generated by hooking the tracepoints and switching the
code to RO/RW...
		
* as far as I understood ftrace_location_range().

-- Daniel

> -- Steve
>