linux-kernel - Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20190123141523.eb6aa6a17efc76881e89e130@kernel.org>
Date:   Wed, 23 Jan 2019 14:15:23 +0900
From:   Masami Hiramatsu <mhiramat@...nel.org>
To:     Daniel Bristot de Oliveira <bristot@...hat.com>
Cc:     linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        "Steven Rostedt (VMware)" <rostedt@...dmis.org>,
        Jiri Kosina <jkosina@...e.cz>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Chris von Recklinghausen <crecklin@...hat.com>,
        Jason Baron <jbaron@...mai.com>, Scott Wood <swood@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Clark Williams <williams@...hat.com>, x86@...nel.org
Subject: Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations

Hi Daniel,

On Fri, 21 Dec 2018 11:27:32 +0100
Daniel Bristot de Oliveira <bristot@...hat.com> wrote:

> Currently, the patch of an address is done in three steps:
> 
> -- Pseudo-code #1 - Current implementation ---
>         1) add an int3 trap to the address that will be patched
>             sync cores (send IPI to all other CPUs)
>         2) update all but the first byte of the patched range
>             sync cores (send IPI to all other CPUs)
>         3) replace the first byte (int3) by the first byte of replacing opcode
>             sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
> 
> When a static key has more than one entry, these steps are called once for
> each entry. The number of IPIs then is linear with regard to the number 'n' of
> entries of a key: O(n*3), which is O(n).
> 
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry. For instance, the sched_schedstats jump label has 56 entries
> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> which the thread that is enabling the key is _not_ running.
> 
> With this patch, rather than receiving a single patch to be processed, a vector
> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
> way:
> 
> -- Pseudo-code #2 - This patch  ---
> 1)  for each patch in the vector:
>         add an int3 trap to the address that will be patched
> 
>     sync cores (send IPI to all other CPUs)
> 
> 2)  for each patch in the vector:
>         update all but the first byte of the patched range
> 
>     sync cores (send IPI to all other CPUs)
> 
> 3)  for each patch in the vector:
>         replace the first byte (int3) by the first byte of replacing opcode
> 
>     sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - This patch  ---
> 
> Doing the update in this way, the number of IPI becomes O(3) with regard
> to the number of keys, which is O(1).
> 
> The batch mode is done with the function text_poke_bp_batch(), that receives
> two arguments: a vector of "struct text_to_poke", and the number of entries
> in the vector.
> 
> The vector must be sorted by the addr field of the text_to_poke structure,
> enabling the binary search of a handler in the poke_int3_handler function
> (a fast path).
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Borislav Petkov <bp@...en8.de>
> Cc: "H. Peter Anvin" <hpa@...or.com>
> Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> Cc: Masami Hiramatsu <mhiramat@...nel.org>
> Cc: "Steven Rostedt (VMware)" <rostedt@...dmis.org>
> Cc: Jiri Kosina <jkosina@...e.cz>
> Cc: Josh Poimboeuf <jpoimboe@...hat.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@...radead.org>
> Cc: Chris von Recklinghausen <crecklin@...hat.com>
> Cc: Jason Baron <jbaron@...mai.com>
> Cc: Scott Wood <swood@...hat.com>
> Cc: Marcelo Tosatti <mtosatti@...hat.com>
> Cc: Clark Williams <williams@...hat.com>
> Cc: x86@...nel.org
> Cc: linux-kernel@...r.kernel.org
> ---
>  arch/x86/include/asm/text-patching.h |  15 ++++
>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
>  2 files changed, 117 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..42ea7846df33 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>  #define __parainstructions_end	NULL
>  #endif
>  
> +/*
> + * Currently, the max observed size in the kernel code is
> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> + * Raise it if needed.
> + */
> +#define POKE_MAX_OPCODE_SIZE	5
> +
> +struct text_to_poke {
> +	void *handler;
> +	void *addr;
> +	size_t len;
> +	const char opcode[POKE_MAX_OPCODE_SIZE];
> +};
> +
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  
>  /*
> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>  extern int after_bootmem;
>  
>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 6f5ad8587de0..8fa47e5ec709 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -21,6 +21,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/io.h>
>  #include <asm/fixmap.h>
> +#include <linux/bsearch.h>
>  
>  int __read_mostly alternatives_patched;
>  
> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
>  }
>  
>  static bool bp_patching_in_progress;
> +/*
> + * Single poke.
> + */
>  static void *bp_int3_handler, *bp_int3_addr;
> +/*
> + * Batching poke.
> + */
> +static struct text_to_poke *bp_int3_tpv;
> +static unsigned int bp_int3_tpv_nr;
> +
> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> +{
> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
> +
> +	if (key < tp->addr)
> +		return -1;
> +	if (key > tp->addr)
> +		return 1;
> +	return 0;
> +}
>  
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> +	void *ip;
> +	struct text_to_poke *tp;
> +
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
>  	 * bp_patching_in_progress.
> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
>  	if (likely(!bp_patching_in_progress))
>  		return 0;
>  
> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> +	if (user_mode(regs))
>  		return 0;
>  
> -	/* set up the specified breakpoint handler */
> -	regs->ip = (unsigned long) bp_int3_handler;
> +	/*
> +	 * Single poke first.
> +	 */

I wonder why would you separate single poke and batch poke?
It seems a single poke is just a case that bp_int3_tpv_nr == 1.
If so, you can remove bp_int3_addr and this block.

> +	if (bp_int3_addr) {
> +		if (regs->ip == (unsigned long) bp_int3_addr) {
> +			regs->ip = (unsigned long) bp_int3_handler;
> +			return 1;
> +		}
> +		return 0;
> +	}
> 
> -	return 1;
> +	/*
> +	 * Batch mode.
> +	 */
> +	if (bp_int3_tpv_nr) {

if (unlikely(bp_int3_tpv_nr))

Sorry about interrupting, but this is a "hot-path" when we use kprobes.

Also, could you add NOKPROBE_SYMBOL(); for all symbols involved in this
process?
Recently I found I missed it for poke_int3_handler and sent a fix.
( https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898241.html )
If this increase the function-call-chain from poke_int3_handler, those
must be marked as NOKPROBE_SYMBOL().

But basically this series looks good to me.

Thank you, 

> +		ip = (void *) regs->ip - sizeof(unsigned char);
> +		tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
> +			     sizeof(struct text_to_poke),
> +			     text_bp_batch_bsearch);
> +		if (tp) {
> +			/* set up the specified breakpoint handler */
> +			regs->ip = (unsigned long) tp->handler;
> +			return 1;
> +		}
> +	}
>  
> +	return 0;
>  }
>  
>  static void text_poke_bp_set_handler(void *addr, void *handler,
>  				     unsigned char int3)
>  {
> -	bp_int3_handler = handler;
> -	bp_int3_addr = (u8 *)addr + sizeof(int3);
>  	text_poke(addr, &int3, sizeof(int3));
>  }
>  
> @@ -816,6 +859,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  
>  	lockdep_assert_held(&text_mutex);
>  
> +	bp_int3_handler = handler;
> +	bp_int3_addr = (u8 *)addr + sizeof(int3);
> +
>  	bp_patching_in_progress = true;
>  	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> @@ -846,6 +892,56 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	 */
>  	bp_patching_in_progress = false;
>  
> +	bp_int3_handler = bp_int3_addr = 0;
>  	return addr;
>  }
>  
> +void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
> +{
> +	unsigned int i;
> +	unsigned char int3 = 0xcc;
> +	int patched_all_but_first = 0;
> +
> +	bp_int3_tpv = tp;
> +	bp_int3_tpv_nr = nr_entries;
> +	bp_patching_in_progress = true;
> +	/*
> +	 * Corresponding read barrier in int3 notifier for making sure the
> +	 * in_progress and handler are correctly ordered wrt. patching.
> +	 */
> +	smp_wmb();
> +
> +	for (i = 0; i < nr_entries; i++)
> +		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
> +
> +	on_each_cpu(do_sync_core, NULL, 1);
> +
> +	for (i = 0; i < nr_entries; i++) {
> +		if (tp[i].len - sizeof(int3) > 0) {
> +			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
> +						 tp[i].len, int3);
> +			patched_all_but_first++;
> +		}
> +	}
> +
> +	if (patched_all_but_first) {
> +		/*
> +		 * According to Intel, this core syncing is very likely
> +		 * not necessary and we'd be safe even without it. But
> +		 * better safe than sorry (plus there's not only Intel).
> +		 */
> +		on_each_cpu(do_sync_core, NULL, 1);
> +	}
> +
> +	for (i = 0; i < nr_entries; i++)
> +		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
> +
> +	on_each_cpu(do_sync_core, NULL, 1);
> +	/*
> +	 * sync_core() implies an smp_mb() and orders this store against
> +	 * the writing of the new instruction.
> +	 */
> +	bp_int3_tpv_nr = 0;
> +	bp_int3_tpv = NULL;
> +	bp_patching_in_progress = false;
> +}
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@...nel.org>