linux-kernel - Re: [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <527CD34A.30406@hitachi.com>
Date:	Fri, 08 Nov 2013 21:04:26 +0900
From:	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>
To:	Petr Mladek <pmladek@...e.cz>
Cc:	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Jiri Kosina <jkosina@...e.cz>, linux-kernel@...r.kernel.org,
	x86@...nel.org
Subject: Re: [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid
 write

(2013/11/08 18:12), Petr Mladek wrote:
> This change is inspired by the int3-based patching code used in
> ftrace. See the commit fd4363fff3d9 (x86: Introduce int3
> (breakpoint)-based instruction patching).
> 
> When trying to use text_poke_bp in ftrace, the result was slower than
> the original implementation.
> 
> It turned out that one difference was in text_poke vs. ftrace_write.
> text_poke did many extra operations to make sure that the change
> was atomic.

AFAIK, the main reason why text_poke is used is avoiding
RODATA protection (by alias mapping).

> In fact, we do not need this luxury in text_poke_bp because
> the modified code is guarded by the int3 handler. The int3
> interrupt itself is added and removed by an atomic operation
> because we modify only one byte.
> 
> This patch adds text_poke_part which is inspired by ftrace_write.
> Then it is used in text_poke_bp instead of the paranoid text_poke.
> 
> For example, I tried to switch between 7 tracers: blk, branch,
> function_graph, wakeup_rt, irqsoff, function, and nop. Every tracer
> has also been enabled and disabled. With 100 cycles, I got these
> times with text_poke:
> 
>     real    25m7.380s     25m13.44s     25m9.072s
>     user    0m0.004s      0m0.004s      0m0.004s
>     sys     0m3.480s      0m3.508s      0m3.420s
> 
> and with text_poke_part:
> 
>     real    3m23.335s     3m24.422s     3m26.686s
>     user    0m0.004s      0m0.004s	0m0.004s
>     sys     0m3.724s      0m3.772s	0m3.588s
> 
> Signed-off-by: Petr Mladek <pmladek@...e.cz>
> ---
>  arch/x86/kernel/alternative.c | 49 ++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index df94598..0586dc1 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -8,6 +8,7 @@
>  #include <linux/kprobes.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
> +#include <linux/uaccess.h>
>  #include <linux/memory.h>
>  #include <linux/stop_machine.h>
>  #include <linux/slab.h>
> @@ -586,6 +587,44 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
>  	return addr;
>  }
>  
> +/**
> + * text_poke_part - Update part of the instruction on a live kernel when using
> + * 		    int3 breakpoint
> + * @addr:   address to modify
> + * @opcode: source of the copy
> + * @len:    length to copy
> + *
> + * If we patch instructions using int3 interrupt, we do not need to be so
> + * careful about an atomic write. Adding and removing the interrupt is atomic
> + * because we modify only one byte. The rest of the instruction could be
> + * modified in several steps because it is guarded by the interrupt handler.
> + * Hence we could use faster code here. See also text_poke_bp below for more
> + * details.
> + *
> + * Note: Must be called under text_mutex.
> + */
> +static int text_poke_part(void *addr, const void *opcode, size_t len)
> +{
> +	/*
> +	 * On x86_64, kernel text mappings are mapped read-only with
> +	 * CONFIG_DEBUG_RODATA. So we use the kernel identity mapping instead
> +	 * of the kernel text mapping to modify the kernel text.

Hmm, I couldn't find the guarantee that __va(__pa_symbol(x)) gives us a
raw(writable) text pages always.
Even it is true, it seems wired that we can still rewrite such protected
page without any spacial page remapping operation.


> +	 *
> +	 * For 32bit kernels, these mappings are same and we can use
> +	 * kernel identity mapping to modify code.
> +	 */
> +	if (((unsigned long)addr >= (unsigned long)_text) &&
> +	    ((unsigned long)addr < (unsigned long)_etext))
> +		addr = __va(__pa_symbol(addr));

You should also warn or return error if the given addr is not in the kernel text.

> +
> +	if (probe_kernel_write(addr, opcode, len))
> +		return -EPERM;
> +
> +	sync_core();
> +
> +	return 0;
> +}
> +
>  static void do_sync_core(void *info)
>  {
>  	sync_core();
> @@ -648,15 +687,15 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	 */
>  	smp_wmb();
>  
> -	text_poke(addr, &int3, sizeof(int3));
> +	text_poke_part(addr, &int3, sizeof(int3));

Anyway, text_poke itself will cause a BUG if it hits an error, but text_poke_part() seems
silently failing. It should handle the error correctly (or call BUG()).

>  
>  	on_each_cpu(do_sync_core, NULL, 1);
>  
>  	if (len - sizeof(int3) > 0) {
>  		/* patch all but the first byte */
> -		text_poke((char *)addr + sizeof(int3),
> -			  (const char *) opcode + sizeof(int3),
> -			  len - sizeof(int3));
> +		text_poke_part((char *)addr + sizeof(int3),
> +			       (const char *) opcode + sizeof(int3),
> +			       len - sizeof(int3));

Here we should check an error too.

>  		/*
>  		 * According to Intel, this core syncing is very likely
>  		 * not necessary and we'd be safe even without it. But
> @@ -666,7 +705,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	}
>  
>  	/* patch the first byte */
> -	text_poke(addr, opcode, sizeof(int3));
> +	text_poke_part(addr, opcode, sizeof(int3));

Ditto.

>  
>  	on_each_cpu(do_sync_core, NULL, 1);

Thank you,


-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@...achi.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/