[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20190201234759.639afe3533feb8838fc6f2b4@kernel.org>
Date: Fri, 1 Feb 2019 23:47:59 +0900
From: Masami Hiramatsu <mhiramat@...nel.org>
To: Daniel Bristot de Oliveira <bristot@...hat.com>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Steven Rostedt (VMware)" <rostedt@...dmis.org>,
Jiri Kosina <jkosina@...e.cz>,
Josh Poimboeuf <jpoimboe@...hat.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Chris von Recklinghausen <crecklin@...hat.com>,
Jason Baron <jbaron@...mai.com>, Scott Wood <swood@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Clark Williams <williams@...hat.com>, x86@...nel.org
Subject: Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
Hi Daniel,
On Fri, 1 Feb 2019 13:49:32 +0100
Daniel Bristot de Oliveira <bristot@...hat.com> wrote:
> On 1/28/19 2:52 PM, Masami Hiramatsu wrote:
> > On Sat, 26 Jan 2019 12:52:15 +0100
> > Daniel Bristot de Oliveira <bristot@...hat.com> wrote:
> >
> >> On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
> >>> Hi Daniel,
> >>>
> >>> On Fri, 21 Dec 2018 11:27:32 +0100
> >>> Daniel Bristot de Oliveira <bristot@...hat.com> wrote:
> >>>
> >>>> Currently, the patch of an address is done in three steps:
> >>>>
> >>>> -- Pseudo-code #1 - Current implementation ---
> >>>> 1) add an int3 trap to the address that will be patched
> >>>> sync cores (send IPI to all other CPUs)
> >>>> 2) update all but the first byte of the patched range
> >>>> sync cores (send IPI to all other CPUs)
> >>>> 3) replace the first byte (int3) by the first byte of replacing opcode
> >>>> sync cores (send IPI to all other CPUs)
> >>>> -- Pseudo-code #1 ---
> >>>>
> >>>> When a static key has more than one entry, these steps are called once for
> >>>> each entry. The number of IPIs then is linear with regard to the number 'n' of
> >>>> entries of a key: O(n*3), which is O(n).
> >>>>
> >>>> This algorithm works fine for the update of a single key. But we think
> >>>> it is possible to optimize the case in which a static key has more than
> >>>> one entry. For instance, the sched_schedstats jump label has 56 entries
> >>>> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> >>>> which the thread that is enabling the key is _not_ running.
> >>>>
> >>>> With this patch, rather than receiving a single patch to be processed, a vector
> >>>> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
> >>>> way:
> >>>>
> >>>> -- Pseudo-code #2 - This patch ---
> >>>> 1) for each patch in the vector:
> >>>> add an int3 trap to the address that will be patched
> >>>>
> >>>> sync cores (send IPI to all other CPUs)
> >>>>
> >>>> 2) for each patch in the vector:
> >>>> update all but the first byte of the patched range
> >>>>
> >>>> sync cores (send IPI to all other CPUs)
> >>>>
> >>>> 3) for each patch in the vector:
> >>>> replace the first byte (int3) by the first byte of replacing opcode
> >>>>
> >>>> sync cores (send IPI to all other CPUs)
> >>>> -- Pseudo-code #2 - This patch ---
> >>>>
> >>>> Doing the update in this way, the number of IPI becomes O(3) with regard
> >>>> to the number of keys, which is O(1).
> >>>>
> >>>> The batch mode is done with the function text_poke_bp_batch(), that receives
> >>>> two arguments: a vector of "struct text_to_poke", and the number of entries
> >>>> in the vector.
> >>>>
> >>>> The vector must be sorted by the addr field of the text_to_poke structure,
> >>>> enabling the binary search of a handler in the poke_int3_handler function
> >>>> (a fast path).
> >>>>
> >>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@...hat.com>
> >>>> Cc: Thomas Gleixner <tglx@...utronix.de>
> >>>> Cc: Ingo Molnar <mingo@...hat.com>
> >>>> Cc: Borislav Petkov <bp@...en8.de>
> >>>> Cc: "H. Peter Anvin" <hpa@...or.com>
> >>>> Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> >>>> Cc: Masami Hiramatsu <mhiramat@...nel.org>
> >>>> Cc: "Steven Rostedt (VMware)" <rostedt@...dmis.org>
> >>>> Cc: Jiri Kosina <jkosina@...e.cz>
> >>>> Cc: Josh Poimboeuf <jpoimboe@...hat.com>
> >>>> Cc: "Peter Zijlstra (Intel)" <peterz@...radead.org>
> >>>> Cc: Chris von Recklinghausen <crecklin@...hat.com>
> >>>> Cc: Jason Baron <jbaron@...mai.com>
> >>>> Cc: Scott Wood <swood@...hat.com>
> >>>> Cc: Marcelo Tosatti <mtosatti@...hat.com>
> >>>> Cc: Clark Williams <williams@...hat.com>
> >>>> Cc: x86@...nel.org
> >>>> Cc: linux-kernel@...r.kernel.org
> >>>> ---
> >>>> arch/x86/include/asm/text-patching.h | 15 ++++
> >>>> arch/x86/kernel/alternative.c | 108 +++++++++++++++++++++++++--
> >>>> 2 files changed, 117 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >>>> index e85ff65c43c3..42ea7846df33 100644
> >>>> --- a/arch/x86/include/asm/text-patching.h
> >>>> +++ b/arch/x86/include/asm/text-patching.h
> >>>> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> >>>> #define __parainstructions_end NULL
> >>>> #endif
> >>>>
> >>>> +/*
> >>>> + * Currently, the max observed size in the kernel code is
> >>>> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> >>>> + * Raise it if needed.
> >>>> + */
> >>>> +#define POKE_MAX_OPCODE_SIZE 5
> >>>> +
> >>>> +struct text_to_poke {
> >>>> + void *handler;
> >>>> + void *addr;
> >>>> + size_t len;
> >>>> + const char opcode[POKE_MAX_OPCODE_SIZE];
> >>>> +};
> >>>> +
> >>>> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>>>
> >>>> /*
> >>>> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>>> extern void *text_poke(void *addr, const void *opcode, size_t len);
> >>>> extern int poke_int3_handler(struct pt_regs *regs);
> >>>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >>>> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
> >>>> extern int after_bootmem;
> >>>>
> >>>> #endif /* _ASM_X86_TEXT_PATCHING_H */
> >>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >>>> index 6f5ad8587de0..8fa47e5ec709 100644
> >>>> --- a/arch/x86/kernel/alternative.c
> >>>> +++ b/arch/x86/kernel/alternative.c
> >>>> @@ -21,6 +21,7 @@
> >>>> #include <asm/tlbflush.h>
> >>>> #include <asm/io.h>
> >>>> #include <asm/fixmap.h>
> >>>> +#include <linux/bsearch.h>
> >>>>
> >>>> int __read_mostly alternatives_patched;
> >>>>
> >>>> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
> >>>> }
> >>>>
> >>>> static bool bp_patching_in_progress;
> >>>> +/*
> >>>> + * Single poke.
> >>>> + */
> >>>> static void *bp_int3_handler, *bp_int3_addr;
> >>>> +/*
> >>>> + * Batching poke.
> >>>> + */
> >>>> +static struct text_to_poke *bp_int3_tpv;
> >>>> +static unsigned int bp_int3_tpv_nr;
> >>>> +
> >>>> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> >>>> +{
> >>>> + struct text_to_poke *tp = (struct text_to_poke *) elt;
> >>>> +
> >>>> + if (key < tp->addr)
> >>>> + return -1;
> >>>> + if (key > tp->addr)
> >>>> + return 1;
> >>>> + return 0;
> >>>> +}
> >>>>
> >>>> int poke_int3_handler(struct pt_regs *regs)
> >>>> {
> >>>> + void *ip;
> >>>> + struct text_to_poke *tp;
> >>>> +
> >>>> /*
> >>>> * Having observed our INT3 instruction, we now must observe
> >>>> * bp_patching_in_progress.
> >>>> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
> >>>> if (likely(!bp_patching_in_progress))
> >>>> return 0;
> >>>>
> >>>> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> >>>> + if (user_mode(regs))
> >>>> return 0;
> >>>>
> >>>> - /* set up the specified breakpoint handler */
> >>>> - regs->ip = (unsigned long) bp_int3_handler;
> >>>> + /*
> >>>> + * Single poke first.
> >>>> + */
> >>>
> >>> I wonder why would you separate single poke and batch poke?
> >>> It seems a single poke is just a case that bp_int3_tpv_nr == 1.
> >>
> >> Hi Masami!
> >>
> >> The single poke is used only at the boot time, before the system is able to
> >> allocate memory. After that, the batch mode becomes the default.
> >
> > Hmm, what's the difference from text_poke_early()?
>
> text_poke_early(): before enabling interrupts at boot.
>
> text_poke_bp(): after enabling interrupts, before being able to allocate memory,
> or in the error handling with batch mode.
>
> task_poke_batch(): After enabling interrupts and being able to allocate memory.
OK, I got it. Maybe we should document this for future users.
> >> I was thinking to make one function to each method, but then I would have to
> >> change the do_int3() and manage how to switch between one and the other without
> >> further overhead. I was planing to do this in a second round of improvements.
> >
> > I didn't think such big change.
> > I just thought we could allocate single entry array on stack, something like
>
> Ah!
>
> > text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > {
> > struct text_to_poke tp = {.handler = handler, .addr = addr, .len = len};
> > if (len > POKE_MAX_OPCODE_SIZE)
> > return -E2BIG;
> > memcpy(tp.opcode, opcode, len);
> > return text_poke_bp_batch(&tp, 1);
> > }
>
> Good idea!
>
> >>
> >>> If so, you can remove bp_int3_addr and this block.
> >>>
> >>>> + if (bp_int3_addr) {
> >>>> + if (regs->ip == (unsigned long) bp_int3_addr) {
> >>>> + regs->ip = (unsigned long) bp_int3_handler;
> >>>> + return 1;
> >>>> + }
> >>>> + return 0;
> >>>> + }
> >>>>
> >>>> - return 1;
> >>>> + /*
> >>>> + * Batch mode.
> >>>> + */
> >>>> + if (bp_int3_tpv_nr) {
> >>>
> >>> if (unlikely(bp_int3_tpv_nr))
> >>>
> >>> Sorry about interrupting, but this is a "hot-path" when we use kprobes.
> >>
> >> No problem at all! :-)
> >
> > Thanks! :-)
> >
> >>
> >> I will change this function to better deal with the hot-path (the default mode
> >> after the system boots up).
> >>
> >> how about something like this:
> >> ------------------ %< ------------------
> >> int poke_int3_handler(struct pt_regs *regs)
> >> {
> >> void *ip;
> >> struct text_to_poke *tp;
> >>
> >> /*
> >> * Having observed our INT3 instruction, we now must observe
> >> * bp_patching_in_progress.
> >> *
> >> * in_progress = TRUE INT3
> >> * WMB RMB
> >> * write INT3 if (in_progress)
> >> *
> >> * Idem for bp_int3_handler.
> >> */
> >> smp_rmb();
> >>
> >> if (likely(!bp_patching_in_progress))
> >> return 0;
> >>
> >> if (user_mode(regs))
> >> return 0;
> >>
> >> /*
> >> * Single poke is only used at the boot.
> >> */
> >> if (unlikely(!bp_int3_tpv))
> >> goto single_poke;
> >>
> >> ip = (void *) regs->ip - sizeof(unsigned char);
> >> tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
> >> sizeof(struct text_to_poke),
> >> text_bp_batch_bsearch);
> >> if (tp) {
> >> /* set up the specified breakpoint handler */
> >> regs->ip = (unsigned long) tp->handler;
> >> return 1;
> >> }
> >>
> >> return 0;
> >>
> >> single_poke:
> >> if (regs->ip == (unsigned long) bp_int3_addr) {
> >> regs->ip = (unsigned long) bp_int3_handler;
> >> return 1;
> >> }
> >>
> >> return 0;
> >> }
> >> ------------- >% ----------
> >>
> >> In this way the default code is up, and the only 'if' I am using is a var of the
> >> batch mode (that will be used later). If are are still at the boot, we are
> >> jumping to the end of the function.
> >>
> >> look better?
> >
> > yeah, it looks much better. But I just wonder why don't you consolidate both by
> > just because reducing code.
> >
>
> and so I did. How about something like this?
OK, I just have a nitpick comment, but this version looks good to me.
> ---------- %< ---------
> ---
> arch/x86/include/asm/text-patching.h | 15 ++++
> arch/x86/kernel/alternative.c | 118 +++++++++++++++++++--------
> lib/bsearch.c | 2 +
> 3 files changed, 100 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..42ea7846df33 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> #define __parainstructions_end NULL
> #endif
>
> +/*
> + * Currently, the max observed size in the kernel code is
> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> + * Raise it if needed.
> + */
> +#define POKE_MAX_OPCODE_SIZE 5
> +
> +struct text_to_poke {
> + void *handler;
> + void *addr;
> + size_t len;
> + const char opcode[POKE_MAX_OPCODE_SIZE];
> +};
> +
> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>
> /*
> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
> extern int after_bootmem;
>
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 202af29c43c0..2196bb8bb924 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -11,6 +11,7 @@
> #include <linux/stop_machine.h>
> #include <linux/slab.h>
> #include <linux/kdebug.h>
> +#include <linux/kprobes.h>
> #include <asm/text-patching.h>
> #include <asm/alternative.h>
> #include <asm/sections.h>
> @@ -21,6 +22,7 @@
> #include <asm/tlbflush.h>
> #include <asm/io.h>
> #include <asm/fixmap.h>
> +#include <linux/bsearch.h>
>
> int __read_mostly alternatives_patched;
>
> @@ -738,10 +740,26 @@ static void do_sync_core(void *info)
> }
>
> static bool bp_patching_in_progress;
> -static void *bp_int3_handler, *bp_int3_addr;
> +static struct text_to_poke *bp_int3_tpv;
> +static unsigned int bp_int3_tpv_nr;
> +
> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> +{
> + struct text_to_poke *tp = (struct text_to_poke *) elt;
> +
> + if (key < tp->addr)
> + return -1;
> + if (key > tp->addr)
> + return 1;
> + return 0;
> +}
> +NOKPROBE_SYMBOL(text_bp_batch_bsearch);
>
> int poke_int3_handler(struct pt_regs *regs)
> {
> + void *ip;
> + struct text_to_poke *tp;
> +
> /*
> * Having observed our INT3 instruction, we now must observe
> * bp_patching_in_progress.
> @@ -757,21 +775,41 @@ int poke_int3_handler(struct pt_regs *regs)
> if (likely(!bp_patching_in_progress))
> return 0;
>
> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> + if (user_mode(regs))
> return 0;
>
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + ip = (void *) regs->ip - sizeof(unsigned char);
>
> - return 1;
> + /*
> + * Skip the binary search if there is a single member in the vector.
> + */
> + if (unlikely(bp_int3_tpv_nr == 1))
> + goto single_poke;
> +
> + tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
> + sizeof(struct text_to_poke),
> + text_bp_batch_bsearch);
> + if (tp) {
> + /* set up the specified breakpoint handler */
> + regs->ip = (unsigned long) tp->handler;
> + return 1;
> + }
> +
> + return 0;
> +
> +single_poke:
> + if (ip == (unsigned long) bp_int3_tpv->addr) {
> + regs->ip = (unsigned long) bp_int3_tpv->handler;
> + return 1;
> + }
>
> + return 0;
> }
> +NOKPROBE_SYMBOL(poke_int3_handler);
Ah, this will be covered by a series which currently I'm pinging Ingo.
https://lkml.org/lkml/2019/1/11/1480
>
> static void text_poke_bp_set_handler(void *addr, void *handler,
> unsigned char int3)
> {
> - bp_int3_handler = handler;
> - bp_int3_addr = (u8 *)addr + sizeof(int3);
> text_poke(addr, &int3, sizeof(int3));
> }
>
> @@ -790,32 +828,14 @@ static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
> text_poke(addr, opcode, sizeof(int3));
> }
>
> -/**
> - * text_poke_bp() -- update instructions on live kernel on SMP
> - * @addr: address to patch
> - * @opcode: opcode of new instruction
> - * @len: length to copy
> - * @handler: address to jump to when the temporary breakpoint is hit
> - *
> - * Modify multi-byte instruction by using int3 breakpoint on SMP.
> - * We completely avoid stop_machine() here, and achieve the
> - * synchronization using int3 breakpoint.
> - *
> - * The way it is done:
> - * - add a int3 trap to the address that will be patched
> - * - sync cores
> - * - update all but the first byte of the patched range
> - * - sync cores
> - * - replace the first byte (int3) by the first byte of
> - * replacing opcode
> - * - sync cores
> - */
> -void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
> {
> + unsigned int i;
> unsigned char int3 = 0xcc;
> + int patched_all_but_first = 0;
>
> - lockdep_assert_held(&text_mutex);
> -
> + bp_int3_tpv = tp;
> + bp_int3_tpv_nr = nr_entries;
> bp_patching_in_progress = true;
> /*
> * Corresponding read barrier in int3 notifier for making sure the
> @@ -823,12 +843,20 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> */
> smp_wmb();
>
> - text_poke_bp_set_handler(addr, handler, int3);
> + for (i = 0; i < nr_entries; i++)
> + text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
>
> on_each_cpu(do_sync_core, NULL, 1);
>
> - if (len - sizeof(int3) > 0) {
> - patch_all_but_first_byte(addr, opcode, len, int3);
> + for (i = 0; i < nr_entries; i++) {
> + if (tp[i].len - sizeof(int3) > 0) {
> + patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
> + tp[i].len, int3);
> + patched_all_but_first++;
> + }
> + }
> +
> + if (patched_all_but_first) {
> /*
> * According to Intel, this core syncing is very likely
> * not necessary and we'd be safe even without it. But
> @@ -837,15 +865,35 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> on_each_cpu(do_sync_core, NULL, 1);
> }
>
> - patch_first_byte(addr, opcode, int3);
> + for (i = 0; i < nr_entries; i++)
> + patch_first_byte(tp[i].addr, tp[i].opcode, int3);
>
> on_each_cpu(do_sync_core, NULL, 1);
> /*
> * sync_core() implies an smp_mb() and orders this store against
> * the writing of the new instruction.
> */
> + bp_int3_tpv_nr = 0;
> + bp_int3_tpv = NULL;
> bp_patching_in_progress = false;
> +}
> + XXX: paste the old comment here... I forgot.
> +void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +{
> + struct text_to_poke tp = {
> + .handler = handler,
> + .addr = addr,
> + .len = len
even the last field assignment, you'd better add "," here.
> + };
> +
> + if (len > POKE_MAX_OPCODE_SIZE) {
> + WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
> + return NULL;
> + }
> +
> + memcpy((void *)tp.opcode, opcode, len);
> +
> + text_poke_bp_batch(&tp, 1);
>
> return addr;
> }
> -
> diff --git a/lib/bsearch.c b/lib/bsearch.c
> index 18b445b010c3..82512fe7b33c 100644
> --- a/lib/bsearch.c
> +++ b/lib/bsearch.c
> @@ -11,6 +11,7 @@
>
> #include <linux/export.h>
> #include <linux/bsearch.h>
> +#include <linux/kprobes.h>
>
> /*
> * bsearch - binary search an array of elements
> @@ -53,3 +54,4 @@ void *bsearch(const void *key, const void *base, size_t num, size_t size,
> return NULL;
> }
> EXPORT_SYMBOL(bsearch);
> +NOKPROBE_SYMBOL(bsearch);
Actually, this part is already pointed by Andrea Righi, since ftrace
is using bsearch, see below.
https://lkml.org/lkml/2019/1/12/70
It depends on which patch series are merged first, but I would like to
separate NOKPROBE_SYMBOL() patch since it fixes (or prevents) a bug.
Anyway, this looks good to me.
Reviewed-by: Masami Hiramatsu <mhiramat@...nel.org>
Thank you,
> --
>
> If so, I will send a v4 with this ideia.
>
> Thanks!
>
> -- Daniel
--
Masami Hiramatsu <mhiramat@...nel.org>
Powered by blists - more mailing lists