[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <302cc15a-6805-9549-2de9-ffbb4707ae64@redhat.com>
Date: Mon, 15 Apr 2019 11:52:46 +0200
From: Daniel Bristot de Oliveira <bristot@...hat.com>
To: linux-kernel@...r.kernel.org
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
"Steven Rostedt (VMware)" <rostedt@...dmis.org>,
Jiri Kosina <jkosina@...e.cz>,
Josh Poimboeuf <jpoimboe@...hat.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Chris von Recklinghausen <crecklin@...hat.com>,
Jason Baron <jbaron@...mai.com>, Scott Wood <swood@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Clark Williams <williams@...hat.com>, x86@...nel.org
Subject: Re: [PATCH V5 0/7] x86/jump_label: Bound IPIs sent when updating a
static key
On 4/1/19 10:58 AM, Daniel Bristot de Oliveira wrote:
> While tuning a system with CPUs isolated as much as possible, we've
> noticed that isolated CPUs were receiving bursts of 12 IPIs, periodically.
> Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
> process - generating the IPIs when changing a static key, enabling
> network timestaping on sockets.
>
> For instance, the trace pointed:
>
> # trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
> # trace-cmde report...
> [...]
> chronyd-858 [000] 433.286406: function: smp_call_function_single
> chronyd-858 [000] 433.286408: kernel_stack: <stack trace>
> => smp_call_function_many (ffffffffbc10ab22)
> => smp_call_function (ffffffffbc10abaa)
> => on_each_cpu (ffffffffbc10ac0b)
> => text_poke_bp (ffffffffbc02436a)
> => arch_jump_label_transform (ffffffffbc020ec8)
> => __jump_label_update (ffffffffbc1b300f)
> => jump_label_update (ffffffffbc1b30ed)
> => static_key_slow_inc (ffffffffbc1b334d)
> => net_enable_timestamp (ffffffffbc625c44)
> => sock_enable_timestamp (ffffffffbc6127d5)
> => sock_setsockopt (ffffffffbc612d2f)
> => SyS_setsockopt (ffffffffbc60d5e6)
> => tracesys (ffffffffbc7681c5)
> <idle>-0 [001] 433.286416: call_function_single_entry: vector=251
> <idle>-0 [001] 433.286419: call_function_single_exit: vector=251
>
> [... The IPI takes place 12 times]
>
> The static key in case was the netstamp_needed_key. A static key
> change from enabled->disabled/disabled->enabled causes the code to be
> patched, and this is done in three steps:
>
> -- Pseudo-code #1 - Current implementation ---
> For each key to be updated:
> 1) add an int3 trap to the address that will be patched
> sync cores (send IPI to all other CPUs)
> 2) update all but the first byte of the patched range
> sync cores (send IPI to all other CPUs)
> 3) replace the first byte (int3) by the first byte of replacing opcode
> sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
>
> As the static key netstamp_needed_key has four entries (used in for
> places in the code) in our kernel, 3 IPIs were generated for each
> entry, resulting in 12 IPIs. The number of IPIs then is linear with
> regard to the number 'n' of entries of a key: O(n*3), which is O(n).
>
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry. For instance, the sched_schedstats jump label has 56 entries
> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> which the thread that is enabling is _not_ running.
>
> Rather than doing each updated at once, it is possible to queue all updates
> first, and then apply all updates at once, rewriting the pseudo-code #1
> in this way:
>
> -- Pseudo-code #2 - proposal ---
> 1) for each key in the queue:
> add an int3 trap to the address that will be patched
> sync cores (send IPI to all other CPUs)
>
> 2) for each key in the queue:
> update all but the first byte of the patched range
> sync cores (send IPI to all other CPUs)
>
> 3) for each key in the queue:
> replace the first byte (int3) by the first byte of replacing opcode
> sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - proposal ---
>
> Doing the update in this way, the number of IPI becomes O(3) with regard
> to the number of keys, which is O(1).
>
> Currently, the jump label of a static key is transformed via the arch
> specific function:
>
> void arch_jump_label_transform(struct jump_entry *entry,
> enum jump_label_type type)
>
> The new approach (batch mode) uses two arch functions, the first has the
> same arguments of the arch_jump_label_transform(), and is the function:
>
> int arch_jump_label_transform_queue(struct jump_entry *entry,
> enum jump_label_type type)
>
> Rather than transforming the code, it adds the jump_entry in a queue of
> entries to be updated.
>
> After queuing all jump_entries, the function:
>
> void arch_jump_label_transform_apply(void)
>
> Applies the changes in the queue.
>
> One easy way to see the benefits of this patch is switching the
> schedstats on and off. For instance:
>
> -------------------------- %< ----------------------------
> #!/bin/bash
>
> while [ true ]; do
> sysctl -w kernel.sched_schedstats=1
> sleep 2
> sysctl -w kernel.sched_schedstats=0
> sleep 2
> done
> -------------------------- >% ----------------------------
>
> while watching the IPI count:
>
> -------------------------- %< ----------------------------
> # watch -n1 "cat /proc/interrupts | grep Function"
> -------------------------- >% ----------------------------
>
> With the current mode, it is possible to see +- 168 IPIs each 2 seconds,
> while with the batch mode the number of IPIs goes to 3 each 2 seconds.
>
> Regarding the performance impact of this patch set, I made two measurements:
>
> The time to update a key (the task that is causing the change)
> The time to run the int3 handler (the side effect on a thread that
> hits the code being changed)
>
> The sched_schedstats static key was chosen as the key to being switched on and
> off. The reason being is that it is used in more than 56 places, in a hot path.
> The change in the schedstats static key will be done with the following
> command:
>
> while [ true ]; do
> sysctl -w kernel.sched_schedstats=1
> usleep 500000
> sysctl -w kernel.sched_schedstats=0
> usleep 500000
> done
>
> In this way, they key will be updated twice per second. To force the hit of the
> int3 handler, the system will also run a kernel compilation with two jobs per
> CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
> @2.27GHz.
>
> Regarding the update part, on average, the regular kernel takes 57 ms to update
> the static key, while the kernel with the batch updates takes just 1.4 ms on
> average. Although it seems to be too good to be true, it makes sense: the
> sched_schedstats key is used in 56 places, so it was expected that it would take
> around 56 times to update the keys with the current implementation, as the
> IPIs are the most expensive part of the update.
>
> Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
> the batch version takes around 180 ns. At first glance, it seems to be a high
> value. But it is not, considering that it is doing 56 updates, rather than one!
> It is taking four times more, only. This gain is possible because the patch
> uses a binary search in the vector: log2(56)=5.8. So, it was expected to have
> an overhead within four times.
>
> (voice of tv propaganda) But, that is not all! As the int3 handler keeps on for
> a shorter period (because the update part is on for a shorter time), the number
> of hits in the int3 handler decreased by 10%.
>
> The question then is: Is it worth paying the price of "135 ns" more in the int3
> handler?
>
> Considering that, in this test case, we are saving the handling of 53 IPIs,
> that takes more than these 135 ns, it seems to be a meager price to be paid.
> Moreover, the test case was forcing the hit of the int3, in practice, it
> does not take that often. While the IPI takes place on all CPUs, hitting
> the int3 handler or not!
>
> For instance, in an isolated CPU with a process running in user-space
> (nohz_full use-case), the chances of hitting the int3 handler is barely zero,
> while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
> this scenario a lot.
>
> Changes from v4:
> - Renamed jump_label_can_update_check() to jump_label_can_update()
> (Borislav Petkov)
> - Switch jump_label_can_update() return values to bool (Borislav Petkov)
> - Accept the fact that some lines will be > 80 characters (Borislav Petkov)
> - Local variables are now in the inverted Christmas three order
> (Borislav Petkov)
> - Removed superfluous helpers. (Borislav Petkov)
> - Renamed text_to_poke to text_patch_loc, and handler to detour
> (Borislav Petkov & Steven Rostedt)
> - Re-applied the suggestion of not using BUG() from steven (Masami Hiramatsu)
> - arch_jump_label_transform_queue() now returns 0 on success and
> -ENOSPC/EINVAL on errors (Masami Hiramatsu)
This is a gentle ping...
Thanks!
-- Daniel
Powered by blists - more mailing lists