linux-kernel - Re: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <97bb771a-2dfd-6980-5d25-9523a92a7711@akamai.com>
Date:   Mon, 8 Oct 2018 10:33:24 -0400
From:   Jason Baron <jbaron@...mai.com>
To:     Daniel Bristot de Oliveira <bristot@...hat.com>,
        linux-kernel@...r.kernel.org, x86@...nel.org
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Pavel Tatashin <pasha.tatashin@...cle.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        "Steven Rostedt (VMware)" <rostedt@...dmis.org>,
        Zhou Chengming <zhouchengming1@...wei.com>,
        Jiri Kosina <jkosina@...e.cz>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Chris von Recklinghausen <crecklin@...hat.com>,
        Scott Wood <swood@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Clark Williams <williams@...hat.com>
Subject: Re: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label
 transformations

On 10/08/2018 08:53 AM, Daniel Bristot de Oliveira wrote:
> A static key, changing from enabled->disabled/disabled->enabled causes
> the code to be changed, and this is done in three steps:
> 
> -- Pseudo-code #1 - Current implementation ---
> For each key to be updated:
> 	1)  add an int3 trap to the address that will be patched
>         	sync cores (send IPI to all other CPUs)
> 	2)  update all but the first byte of the patched range
> 	        sync cores (send IPI to all other CPUs)
> 	3)  replace the first byte (int3) by the first byte of replacing opcode
> 	        sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
> 
> The number of IPIs sent is then linear with regard to the number 'n' of
> entries of a key: O(n*3), which is O(n). For instance, as the static key
> netstamp_needed_key has four entries (used in for places in the code)
> in our kernel, 3 IPIs were generated for each entry, resulting in 12 IPIs.
> 
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry. For instance, the sched_schedstats jump label has 56 entries
> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> which the thread that is enabling is _not_ running.
> 
> In this patch, rather than doing each updated at once, it queue all
> updates first, and the, apply all updates at once, rewriting the
> pseudo-code #1 in this way:
> 
> -- Pseudo-code #2 - This patch  ---
> 1)  for each key in the queue:
>         add an int3 trap to the address that will be patched
>     sync cores (send IPI to all other CPUs)
> 
> 2)  for each key in the queue:
>         update all but the first byte of the patched range
>     sync cores (send IPI to all other CPUs)
> 
> 3)  for each key in the queue:
>         replace the first byte (int3) by the first byte of replacing opcode
> 
>     sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - This patch  ---
> 
> Doing the update in this way, the number of IPI becomes O(3) with regard
> to the number of keys, which is O(1).
> 
> Currently, the jump label of a static key is transformed via the arch
> specific function:
> 
>     void arch_jump_label_transform(struct jump_entry *entry,
>                                    enum jump_label_type type)
> 
> The new approach (batch mode) uses two arch functions, the first has the
> same arguments of the arch_jump_label_transform(), and is the function:
> 
>     void arch_jump_label_transform_queue(struct jump_entry *entry,
>                          enum jump_label_type type)
> 
> Rather than transforming the code, it adds the jump_entry in a queue of
> entries to be updated.
> 
> After queuing all jump_entries, the function:
> 
>     void arch_jump_label_transform_apply(void)
> 
> Applies the changes in the queue.
> 
> The batch of operations was:
> Suggested-by: Scott Wood <swood@...hat.com>

Hi,

We've discussed a 'batch' mode here before, and we had patches in the
past iirc, but they never quite reached a merge-able state. I think for
this patch, we want to separate it in 2 - the text patching code that
now takes a list, and the jump_label code consumer. Comments below.


> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Borislav Petkov <bp@...en8.de>
> Cc: "H. Peter Anvin" <hpa@...or.com>
> Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> Cc: Pavel Tatashin <pasha.tatashin@...cle.com>
> Cc: Masami Hiramatsu <mhiramat@...nel.org>
> Cc: "Steven Rostedt (VMware)" <rostedt@...dmis.org>
> Cc: Zhou Chengming <zhouchengming1@...wei.com>
> Cc: Jiri Kosina <jkosina@...e.cz>
> Cc: Josh Poimboeuf <jpoimboe@...hat.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@...radead.org>
> Cc: Chris von Recklinghausen <crecklin@...hat.com>
> Cc: Jason Baron <jbaron@...mai.com>
> Cc: Scott Wood <swood@...hat.com>
> Cc: Marcelo Tosatti <mtosatti@...hat.com>
> Cc: Clark Williams <williams@...hat.com>
> Cc: x86@...nel.org
> Cc: linux-kernel@...r.kernel.org
> ---
>  arch/x86/include/asm/jump_label.h    |  2 +
>  arch/x86/include/asm/text-patching.h |  9 +++
>  arch/x86/kernel/alternative.c        | 83 +++++++++++++++++++++++++---
>  arch/x86/kernel/jump_label.c         | 54 ++++++++++++++++++
>  include/linux/jump_label.h           |  5 ++
>  kernel/jump_label.c                  | 15 +++++
>  6 files changed, 161 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
> index 8c0de4282659..d61c476046fe 100644
> --- a/arch/x86/include/asm/jump_label.h
> +++ b/arch/x86/include/asm/jump_label.h
> @@ -15,6 +15,8 @@
>  #error asm/jump_label.h included on a non-jump-label kernel
>  #endif
>  
> +#define HAVE_JUMP_LABEL_BATCH
> +
>  #define JUMP_LABEL_NOP_SIZE 5
>  
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..a28230f09d72 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>  #define __parainstructions_end	NULL
>  #endif
>  
> +struct text_to_poke {
> +	struct list_head list;
> +	void *opcode;
> +	void *addr;
> +	void *handler;
> +	size_t len;
> +};
> +
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  
>  /*
> @@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_list(struct list_head *entry_list);
>  extern int after_bootmem;
>  
>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index a4c83cb49cd0..3bd502ea4c53 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -735,9 +735,12 @@ static void do_sync_core(void *info)
>  
>  static bool bp_patching_in_progress;
>  static void *bp_int3_handler, *bp_int3_addr;
> +struct list_head *bp_list;
>  
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> +	void *ip;
> +	struct text_to_poke *tp;
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
>  	 * bp_patching_in_progress.
> @@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
>  	if (likely(!bp_patching_in_progress))
>  		return 0;
>  
> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> +	if (user_mode(regs))
>  		return 0;
>  
> -	/* set up the specified breakpoint handler */
> -	regs->ip = (unsigned long) bp_int3_handler;
> +	/*
> +	 * Single poke.
> +	 */
> +	if (bp_int3_addr) {
> +		if (regs->ip == (unsigned long) bp_int3_addr) {
> +			regs->ip = (unsigned long) bp_int3_handler;
> +			return 1;
> +		}
> +		return 0;
> +	}
>  
> -	return 1;
> +	/*
> +	 * Batch mode.
> +	 */
> +	ip = (void *) regs->ip - sizeof(unsigned char);
> +	list_for_each_entry(tp, bp_list, list) {
> +		if (ip == tp->addr) {
> +			/* set up the specified breakpoint handler */
> +			regs->ip = (unsigned long) tp->handler;
> +			return 1;
> +		}
> +	}
>  
> +	return 0;
>  }
>  
>  static void text_poke_bp_set_handler(void *addr, void *handler,
>  				     unsigned char int3)
>  {
> -	bp_int3_handler = handler;
> -	bp_int3_addr = (u8 *)addr + sizeof(int3);
>  	text_poke(addr, &int3, sizeof(int3));
>  }
>  
> @@ -812,6 +832,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  
>  	lockdep_assert_held(&text_mutex);
>  
> +	bp_int3_handler = handler;
> +	bp_int3_addr = (u8 *)addr + sizeof(int3);
> +
>  	bp_patching_in_progress = true;
>  	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> @@ -841,7 +864,53 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	 * the writing of the new instruction.
>  	 */
>  	bp_patching_in_progress = false;
> -
> +	bp_int3_handler = bp_int3_addr = 0;
>  	return addr;
>  }
>  
> +void text_poke_bp_list(struct list_head *entry_list)
> +{
> +	unsigned char int3 = 0xcc;
> +	int patched_all_but_first = 0;
> +	struct text_to_poke *tp;
> +
> +	bp_list = entry_list;
> +	bp_patching_in_progress = true;
> +	/*
> +	 * Corresponding read barrier in int3 notifier for making sure the
> +	 * in_progress and handler are correctly ordered wrt. patching.
> +	 */
> +	smp_wmb();
> +
> +	list_for_each_entry(tp, entry_list, list)
> +		text_poke_bp_set_handler(tp->addr, tp->handler, int3);
> +
> +	on_each_cpu(do_sync_core, NULL, 1);
> +
> +	list_for_each_entry(tp, entry_list, list) {
> +		if (tp->len - sizeof(int3) > 0) {
> +			patch_all_but_first_byte(tp->addr, tp->opcode, tp->len, int3);
> +			patched_all_but_first++;
> +		}
> +	}
> +
> +	if (patched_all_but_first) {
> +		/*
> +		 * According to Intel, this core syncing is very likely
> +		 * not necessary and we'd be safe even without it. But
> +		 * better safe than sorry (plus there's not only Intel).
> +		 */
> +		on_each_cpu(do_sync_core, NULL, 1);
> +	}
> +
> +	list_for_each_entry(tp, entry_list, list)
> +		patch_first_byte(tp->addr, tp->opcode, int3);
> +
> +	on_each_cpu(do_sync_core, NULL, 1);
> +	/*
> +	 * sync_core() implies an smp_mb() and orders this store against
> +	 * the writing of the new instruction.
> +	 */
> +	bp_list = 0;
> +	bp_patching_in_progress = false;
> +}
> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
> index de588ff47f81..3da5af5de4d3 100644
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -12,6 +12,8 @@
>  #include <linux/list.h>
>  #include <linux/jhash.h>
>  #include <linux/cpu.h>
> +#include <linux/slab.h>
> +#include <linux/list.h>
>  #include <asm/kprobes.h>
>  #include <asm/alternative.h>
>  #include <asm/text-patching.h>
> @@ -139,6 +141,58 @@ void arch_jump_label_transform(struct jump_entry *entry,
>  	mutex_unlock(&text_mutex);
>  }
>  
> +LIST_HEAD(batch_list);
> +
> +void arch_jump_label_transform_queue(struct jump_entry *entry,
> +				     enum jump_label_type type)
> +{
> +	struct text_to_poke *tp;
> +
> +	/*
> +	 * Batch mode disabled at boot time.
> +	 */
> +	if (early_boot_irqs_disabled)
> +		goto fallback;
> +
> +	/*
> +	 * RFC Note: I put __GFP_NOFAIL, but I could also goto fallback;
> +	 * thoughts?
> +	 */
> +	tp = kzalloc(sizeof(struct text_to_poke), GFP_KERNEL | __GFP_NOFAIL);
> +	tp->opcode = kzalloc(sizeof(union jump_code_union),
> +			     GFP_KERNEL | __GFP_NOFAIL);


I wonder if we should just set aside a page here so that we can avoid
the allocation altogether. I think the size of the text_to_poke on
x86_64 is 44 bytes, so that's 93 or so entries, which I think covers the
use-case here. If we go over that limit, we would just do things in
batches of 93. I just think its nice to avoid memory allocations here to
avoid creating additional dependencies, although I'm not aware of any
specific ones.

> +
> +	__jump_label_set_jump_code(entry, type, 0, tp->opcode);
> +	tp->addr = (void *) entry->code;
> +	tp->len = JUMP_LABEL_NOP_SIZE;
> +	tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
> +
> +	list_add_tail(&tp->list, &batch_list);
> +
> +	return;
> +
> +fallback:
> +	arch_jump_label_transform(entry, type);
> +}
> +
> +void arch_jump_label_transform_apply(void)
> +{
> +	struct text_to_poke *tp, *next;
> +
> +	if (early_boot_irqs_disabled)
> +		return;
> +
> +	mutex_lock(&text_mutex);
> +	text_poke_bp_list(&batch_list);
> +	mutex_unlock(&text_mutex);
> +
> +	list_for_each_entry_safe(tp, next, &batch_list, list) {
> +		list_del(&tp->list);
> +		kfree(tp->opcode);
> +		kfree(tp);
> +	}
> +}
> +
>  static enum {
>  	JL_STATE_START,
>  	JL_STATE_NO_UPDATE,
> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> index cd3bed880ca0..2aca92e03494 100644
> --- a/include/linux/jump_label.h
> +++ b/include/linux/jump_label.h
> @@ -156,6 +156,11 @@ extern void jump_label_lock(void);
>  extern void jump_label_unlock(void);
>  extern void arch_jump_label_transform(struct jump_entry *entry,
>  				      enum jump_label_type type);
> +#ifdef HAVE_JUMP_LABEL_BATCH
> +extern void arch_jump_label_transform_queue(struct jump_entry *entry,
> +					    enum jump_label_type type);
> +extern void arch_jump_label_transform_apply(void);
> +#endif
>  extern void arch_jump_label_transform_static(struct jump_entry *entry,
>  					     enum jump_label_type type);
>  extern int jump_label_text_reserved(void *start, void *end);
> diff --git a/kernel/jump_label.c b/kernel/jump_label.c
> index 940ba7819c87..f534d9c4e07f 100644
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -377,6 +377,7 @@ bool jump_label_can_update_check(struct jump_entry *entry)
>  	return 0;
>  }
>  
> +#ifndef HAVE_JUMP_LABEL_BATCH
>  static void __jump_label_update(struct static_key *key,
>  				struct jump_entry *entry,
>  				struct jump_entry *stop)
> @@ -386,6 +387,20 @@ static void __jump_label_update(struct static_key *key,
>  			arch_jump_label_transform(entry, jump_label_type(entry));
>  	}
>  }
> +#else
> +static void __jump_label_update(struct static_key *key,
> +				struct jump_entry *entry,
> +				struct jump_entry *stop)
> +{
> +	for_each_label_entry(key, entry, stop) {
> +		if (jump_label_can_update_check(entry))
> +			arch_jump_label_transform_queue(entry,
> +							jump_label_type(entry));
> +	}

So this could be done in batches if there are more entries than
PAGE_SIZE / sizeof(struct text_to_poke)


> +	arch_jump_label_transform_apply();
> +
> +}
> +#endif
>  
>  void __init jump_label_init(void)
>  {
>