Message-ID: <alpine.LSU.2.21.1907171223540.4492@pobox.suse.cz>
Date: Wed, 17 Jul 2019 13:01:27 +0200 (CEST)
From: Miroslav Benes <mbenes@...e.cz>
To: Joe Lawrence <joe.lawrence@...hat.com>
cc: heiko.carstens@...ibm.com, gor@...ux.ibm.com,
borntraeger@...ibm.com, linux-s390@...r.kernel.org,
linux-kernel@...r.kernel.org, jpoimboe@...hat.com,
jikos@...nel.org, pmladek@...e.com, nstange@...e.de,
live-patching@...r.kernel.org
Subject: Re: [PATCH] s390/livepatch: Implement reliable stack tracing for
the consistency model
On Tue, 16 Jul 2019, Joe Lawrence wrote:
> On Wed, Jul 10, 2019 at 12:59:18PM +0200, Miroslav Benes wrote:
> > The livepatch consistency model requires reliable stack tracing
> > architecture support in order to work properly. In order to achieve
> > this, two main issues have to be solved. First, reliable and consistent
> > call chain backtracing has to be ensured. Second, the unwinder needs to
> > be able to detect stack corruptions and return errors.
> >
> > The "zSeries ELF Application Binary Interface Supplement" says:
> >
> > "The stack pointer points to the first word of the lowest allocated
> > stack frame. If the "back chain" is implemented this word will point to
> > the previously allocated stack frame (towards higher addresses), except
> > for the first stack frame, which shall have a back chain of zero (NULL).
> > The stack shall grow downwards, in other words towards lower addresses."
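As a rough illustration of the back-chain rule quoted above, here is a user-space sketch that walks a simulated stack. It assumes a simplified frame layout (struct sim_frame and walk_back_chain() are hypothetical names, not kernel code): the first word of each frame holds the address of the caller's frame at a higher address, and a zero back chain marks the first frame and ends the walk.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified frame: the back chain is the first word, as the
 * ABI requires; ret_addr stands in for the saved return address.
 */
struct sim_frame {
	uintptr_t back_chain;	/* caller's frame (higher address), 0 for the first frame */
	uintptr_t ret_addr;
};

/*
 * Start at the lowest allocated frame (the stack pointer) and follow back
 * chains towards higher addresses until a zero back chain ends the walk.
 * Stores up to max return addresses in out and returns how many were stored.
 */
size_t walk_back_chain(uintptr_t sp, uintptr_t *out, size_t max)
{
	size_t n = 0;

	while (sp && n < max) {
		const struct sim_frame *sf = (const struct sim_frame *)sp;

		out[n++] = sf->ret_addr;
		sp = sf->back_chain;
	}
	return n;
}
```

Since each frame only records where its caller's frame starts, a single function that skips the back chain store cuts the walk short.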
> >
> > "back chain" is optional. GCC option -mbackchain enables it. Quoting
> > Martin Schwidefsky [1]:
> >
> > "The compiler is called with the -mbackchain option; all normal C
> > functions will store the backchain in the function prologue. All
> > functions written in assembler code should do the same, if you find one
> > that does not we should fix that. The end result is that a task that
> > *voluntarily* called schedule() should have a proper backchain at all
> > times.
> >
> > Dependent on the use case this may or may not be enough. Asynchronous
> > interrupts may stop the CPU at the beginning of a function, if kernel
> > preemption is enabled we can end up with a broken backchain. The
> > production kernels for IBM Z are all compiled *without* kernel
> > preemption. So yes, we might get away without the objtool support.
> >
> > On a side-note, we do have a line item to implement the ORC unwinder for
> > the kernel, that includes the objtool support. Once we have that we can
> > drop the -mbackchain option for the kernel build. That gives us a nice
> > little performance benefit. I hope that the change from backchain to the
> > ORC unwinder will not be too hard to implement in the livepatch tools."
> >
> > Thus, reliable call chain backtracing should currently be ensured, and
> > objtool should not be necessary for livepatch purposes.
>
> Hi Miroslav,
>
> Should there be a CONFIG? dependency on -mbackchain and/or kernel
> preemption, or does the following ensure that we don't need an explicit
> build-time check?
I don't think we have to do anything explicit. -mbackchain is enabled by
default (arch/s390/Makefile) and the following should ensure the rest.
I'll make it clearer in v2.
> > Regarding the second issue, the unwinder has to recognize stack
> > corruptions and non-reliable states. In particular, it must detect
> > preemption and page faults, the end of the task stack must be reached,
> > return addresses must be valid text addresses, and hacks like function
> > graph tracing and kretprobes must be properly detected.
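The checks listed above can be sketched as a single pass over a simulated stack. This is a user-space model under stated assumptions, not the patch's code: struct sim_task, sim_check_reliable() and the verdict names are hypothetical, kernel text is modeled as one address range, and the pt_regs area that the real unwinder expects at the top of the stack is omitted, so a walk counts as complete only if its last frame sits exactly at the top. Preemption and page-fault detection are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified frame and task model (not kernel structures). */
struct sim_frame {
	uintptr_t back_chain;
	uintptr_t ret_addr;
};

struct sim_task {
	uintptr_t stack_begin;		/* lowest address of the task stack */
	uintptr_t stack_end;		/* one past the highest address */
	uintptr_t text_begin;		/* [text_begin, text_end) models kernel text */
	uintptr_t text_end;
	uintptr_t kretprobe_tramp;	/* models kretprobe_trampoline */
};

enum sim_verdict {
	SIM_RELIABLE = 0,
	SIM_BAD_SP = -1,	/* frame off the task stack, or chain not moving up */
	SIM_BAD_IP = -2,	/* return address outside kernel text */
	SIM_KRETPROBE = -3,	/* return address hijacked by a kretprobe */
	SIM_NO_END = -4,	/* walk stopped before the top of the task stack */
};

enum sim_verdict sim_check_reliable(const struct sim_task *t, uintptr_t sp)
{
	for (;;) {
		const struct sim_frame *sf;

		/* The whole frame must lie on the task stack. */
		if (sp < t->stack_begin ||
		    sp > t->stack_end - sizeof(struct sim_frame))
			return SIM_BAD_SP;
		sf = (const struct sim_frame *)sp;

		/* Text check first, then the kretprobe trampoline check. */
		if (sf->ret_addr < t->text_begin || sf->ret_addr >= t->text_end)
			return SIM_BAD_IP;
		if (sf->ret_addr == t->kretprobe_tramp)
			return SIM_KRETPROBE;

		if (!sf->back_chain)
			/* Simplification: the last frame must sit at the very top. */
			return sp + sizeof(struct sim_frame) == t->stack_end ?
				SIM_RELIABLE : SIM_NO_END;

		/* Back chains must point strictly upwards; this also stops loops. */
		if (sf->back_chain <= sp)
			return SIM_BAD_SP;
		sp = sf->back_chain;
	}
}
```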
> >
> > Unwinding the stack of a running task is not a concern, because
> > livepatch requires every checked task to be blocked, except for the
> > current task. Due to that, the implementation can be much simpler than
> > the existing non-reliable infrastructure. We can consider only a task's
> > kernel/thread stack and skip the other stacks.
> >
> > Idle tasks are a bit special. Their final back chains point to nodat
> > stacks. See CALL_ON_STACK() in the smp_start_secondary() callback used
> > in __cpu_up() for reference. The unwinding is stopped there, and this is
> > not considered a stack corruption.
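The idle-task exception comes down to a three-way decision on each back chain. The sketch below mirrors the branch structure of unwind_next_frame_reliable() in the patch, but uses plain integers and hypothetical names (sim_classify_back_chain(), sim_outside_of_stack()); the "outside of stack" rule used here (the chain must move strictly upwards and the whole frame must fit below the stack end) is an assumption of this model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sim_step {
	SIM_FOLLOW,		/* continue with the caller's frame */
	SIM_LOOK_FOR_REGS,	/* stop, expect pt_regs at the top of the stack */
	SIM_ERROR,		/* corrupted back chain */
};

/* A candidate frame is unusable if it does not move up or does not fit. */
static bool sim_outside_of_stack(uintptr_t cur_sp, uintptr_t sp,
				 uintptr_t stack_end, size_t frame_size)
{
	return sp <= cur_sp || sp > stack_end - frame_size;
}

/*
 * An idle task's final back chain may leave the task stack (it points to
 * the nodat stack); that is treated as the end of the walk, not an error.
 */
enum sim_step sim_classify_back_chain(uintptr_t cur_sp, uintptr_t back_chain,
				      bool is_idle, uintptr_t stack_end,
				      size_t frame_size)
{
	bool outside = sim_outside_of_stack(cur_sp, back_chain, stack_end,
					    frame_size);

	if (back_chain && !(is_idle && outside))
		return outside ? SIM_ERROR : SIM_FOLLOW;
	return SIM_LOOK_FOR_REGS;
}
```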
> >
> > Signed-off-by: Miroslav Benes <mbenes@...e.cz>
> > ---
> > - based on Linus' master
> > - passes livepatch kselftests
> > - passes tests from https://github.com/lpechacek/qa_test_klp, which
> > stress the consistency model and the unwinder a bit more
> >
> > arch/s390/Kconfig | 1 +
> > arch/s390/include/asm/stacktrace.h | 5 ++
> > arch/s390/include/asm/unwind.h | 19 ++++++
> > arch/s390/kernel/dumpstack.c | 28 +++++++++
> > arch/s390/kernel/stacktrace.c | 78 +++++++++++++++++++++++++
> > arch/s390/kernel/unwind_bc.c | 93 ++++++++++++++++++++++++++++++
> > 6 files changed, 224 insertions(+)
> >
> > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> > index fdb4246265a5..ea73e555063d 100644
> > --- a/arch/s390/Kconfig
> > +++ b/arch/s390/Kconfig
> > @@ -170,6 +170,7 @@ config S390
> >  	select HAVE_PERF_EVENTS
> >  	select HAVE_RCU_TABLE_FREE
> >  	select HAVE_REGS_AND_STACK_ACCESS_API
> > +	select HAVE_RELIABLE_STACKTRACE
> >  	select HAVE_RSEQ
> >  	select HAVE_SYSCALL_TRACEPOINTS
> >  	select HAVE_VIRT_CPU_ACCOUNTING
> > diff --git a/arch/s390/include/asm/stacktrace.h b/arch/s390/include/asm/stacktrace.h
> > index 0ae4bbf7779c..2b5c913c408f 100644
> > --- a/arch/s390/include/asm/stacktrace.h
> > +++ b/arch/s390/include/asm/stacktrace.h
> > @@ -23,6 +23,11 @@ const char *stack_type_name(enum stack_type type);
> >  int get_stack_info(unsigned long sp, struct task_struct *task,
> >  		   struct stack_info *info, unsigned long *visit_mask);
> >
> > +#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> > +int get_stack_info_reliable(unsigned long sp, struct task_struct *task,
> > +			    struct stack_info *info);
> > +#endif
> > +
> >  static inline bool on_stack(struct stack_info *info,
> >  			    unsigned long addr, size_t len)
> >  {
> > diff --git a/arch/s390/include/asm/unwind.h b/arch/s390/include/asm/unwind.h
> > index d827b5b9a32c..1cc96c54169c 100644
> > --- a/arch/s390/include/asm/unwind.h
> > +++ b/arch/s390/include/asm/unwind.h
> > @@ -45,6 +45,25 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
> >  bool unwind_next_frame(struct unwind_state *state);
> >  unsigned long unwind_get_return_address(struct unwind_state *state);
> >
> > +#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> > +void __unwind_start_reliable(struct unwind_state *state,
> > +			     struct task_struct *task, unsigned long sp);
> > +bool unwind_next_frame_reliable(struct unwind_state *state);
> > +
> > +static inline void unwind_start_reliable(struct unwind_state *state,
> > +					 struct task_struct *task)
> > +{
> > +	unsigned long sp;
> > +
> > +	if (task == current)
> > +		sp = current_stack_pointer();
> > +	else
> > +		sp = task->thread.ksp;
> > +
> > +	__unwind_start_reliable(state, task, sp);
> > +}
> > +#endif
> > +
> >  static inline bool unwind_done(struct unwind_state *state)
> >  {
> >  	return state->stack_info.type == STACK_TYPE_UNKNOWN;
> > diff --git a/arch/s390/kernel/dumpstack.c b/arch/s390/kernel/dumpstack.c
> > index ac06c3949ab3..b21ef2a766ff 100644
> > --- a/arch/s390/kernel/dumpstack.c
> > +++ b/arch/s390/kernel/dumpstack.c
> > @@ -127,6 +127,34 @@ int get_stack_info(unsigned long sp, struct task_struct *task,
> >  	return -EINVAL;
> >  }
> >
> > +#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> > +int get_stack_info_reliable(unsigned long sp, struct task_struct *task,
> > +			    struct stack_info *info)
> > +{
> > +	if (!sp)
> > +		goto error;
> > +
> > +	/* Sanity check: ABI requires SP to be aligned to 8 bytes. */
> > +	if (sp & 0x7)
> > +		goto error;
> > +
>
> Does SP alignment only need to be checked for the initial frame, or
> should it be verified every time it's moved in
> unwind_next_frame_reliable()?
Good catch. It should have been verified every time. The check got lost
during rebasing onto the new unwinding framework.
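A minimal sketch of what checking alignment on every step could look like, assuming a simplified unwind state; sim_advance() and struct sim_state are hypothetical, and SIM_FRAME_SIZE assumes the 160-byte standard frame of 64-bit s390. The candidate back chain is validated before the unwinder moves to it, so a misaligned value is rejected at any depth, not only for the initial frame.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed frame size: 160 bytes, the standard 64-bit s390 stack frame. */
#define SIM_FRAME_SIZE 160

struct sim_state {
	uintptr_t sp;		/* current frame */
	uintptr_t stack_end;	/* one past the top of the task stack */
	bool error;
};

/* Validate the next back chain before moving to it; flag an error if bad. */
bool sim_advance(struct sim_state *st, uintptr_t next_sp)
{
	if ((next_sp & 0x7) ||				/* ABI: SP is 8-byte aligned */
	    next_sp <= st->sp ||			/* must move towards higher addresses */
	    next_sp > st->stack_end - SIM_FRAME_SIZE) {	/* frame must fit */
		st->error = true;
		return false;
	}
	st->sp = next_sp;
	return true;
}
```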
> > +	if (!task)
> > +		goto error;
> > +
> > +	/*
> > +	 * The unwinding should not start on nodat_stack, async_stack or
> > +	 * restart_stack. The task is either current or must be inactive.
> > +	 */
> > +	if (!in_task_stack(sp, task, info))
> > +		goto error;
> > +
> > +	return 0;
> > +error:
> > +	info->type = STACK_TYPE_UNKNOWN;
> > +	return -EINVAL;
> > +}
> > +#endif
> > +
> >  void show_stack(struct task_struct *task, unsigned long *stack)
> >  {
> >  	struct unwind_state state;
> > diff --git a/arch/s390/kernel/stacktrace.c b/arch/s390/kernel/stacktrace.c
> > index f6a620f854e1..7d774a325163 100644
> > --- a/arch/s390/kernel/stacktrace.c
> > +++ b/arch/s390/kernel/stacktrace.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/export.h>
> >  #include <asm/stacktrace.h>
> >  #include <asm/unwind.h>
> > +#include <asm/kprobes.h>
> >
> >  void save_stack_trace(struct stack_trace *trace)
> >  {
> > @@ -60,3 +61,80 @@ void save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
> >  	}
> >  }
> >  EXPORT_SYMBOL_GPL(save_stack_trace_regs);
> > +
> > +#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> > +/*
> > + * This function returns an error if it detects any unreliable features of the
> > + * stack. Otherwise it guarantees that the stack trace is reliable.
> > + *
> > + * If the task is not 'current', the caller *must* ensure the task is inactive.
> > + */
> > +static __always_inline int
> > +__save_stack_trace_tsk_reliable(struct task_struct *tsk,
> > +				struct stack_trace *trace)
> > +{
> > +	struct unwind_state state;
> > +
> > +	for (unwind_start_reliable(&state, tsk);
> > +	     !unwind_done(&state) && !unwind_error(&state);
> > +	     unwind_next_frame_reliable(&state)) {
> > +
> > +		if (!__kernel_text_address(state.ip))
> > +			return -EINVAL;
> > +
> > +#ifdef CONFIG_KPROBES
> > +		/*
> > +		 * Mark stacktraces with kretprobed functions on them
> > +		 * as unreliable.
> > +		 */
> > +		if (state.ip == (unsigned long)kretprobe_trampoline)
> > +			return -EINVAL;
> > +#endif
> > +
> > +		if (trace->nr_entries >= trace->max_entries)
> > +			return -E2BIG;
> > +
> > +		if (!trace->skip)
> > +			trace->entries[trace->nr_entries++] = state.ip;
> > +		else
> > +			trace->skip--;
> > +	}
> > +
> > +	/* Check for stack corruption */
> > +	if (unwind_error(&state))
> > +		return -EINVAL;
> > +
> > +	/* Store kernel_thread_starter, null for swapper/0 */
> > +	if (tsk->flags & (PF_KTHREAD | PF_IDLE)) {
> > +		if (trace->nr_entries >= trace->max_entries)
> > +			return -E2BIG;
> > +
> > +		if (!trace->skip)
> > +			trace->entries[trace->nr_entries++] =
> > +				state.regs->psw.addr;
> > +		else
> > +			trace->skip--;
>
> An idea for a follow up patch: stuff this into a function like
> int save_trace_entry(struct stack_trace *trace, unsigned long entry);
> which could one day make the trace->entries[] code generic across arches.
Yes. I was thinking about it and then decided to postpone it a bit. Thomas
introduced more generic infrastructure with ARCH_STACKWALK. See the x86
implementation. There is consume_entry(), which is exactly what you are
proposing. So I thought it should be part of a bigger rework in the
future.
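For illustration, the helper suggested above might look roughly like this. It is a hypothetical sketch that mirrors only the struct stack_trace fields the patch uses and keeps the patch's semantics (the capacity check happens even while entries are still being skipped); it is not the ARCH_STACKWALK consume_entry() interface.

```c
#include <errno.h>

/* Local mirror of the struct stack_trace fields used in the patch. */
struct stack_trace {
	unsigned int nr_entries, max_entries;
	unsigned long *entries;
	int skip;
};

/*
 * Hypothetical helper: the bounds check, skip handling and store that
 * __save_stack_trace_tsk_reliable() open-codes twice.
 */
int save_trace_entry(struct stack_trace *trace, unsigned long entry)
{
	if (trace->nr_entries >= trace->max_entries)
		return -E2BIG;

	if (!trace->skip)
		trace->entries[trace->nr_entries++] = entry;
	else
		trace->skip--;

	return 0;
}
```

Each of the two open-coded blocks would then shrink to a call to save_trace_entry(trace, state.ip) followed by returning any non-zero result.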
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int save_stack_trace_tsk_reliable(struct task_struct *tsk,
> > +				  struct stack_trace *trace)
> > +{
> > +	int ret;
> > +
> > +	/*
> > +	 * If the task doesn't have a stack (e.g., a zombie), the stack is
> > +	 * "reliably" empty.
> > +	 */
> > +	if (!try_get_task_stack(tsk))
> > +		return 0;
> > +
> > +	ret = __save_stack_trace_tsk_reliable(tsk, trace);
> > +
> > +	put_task_stack(tsk);
> > +
> > +	return ret;
> > +}
> > +#endif
> > diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
> > index 3ce8a0808059..ada3a8538961 100644
> > --- a/arch/s390/kernel/unwind_bc.c
> > +++ b/arch/s390/kernel/unwind_bc.c
> > @@ -153,3 +153,96 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
> >  	state->reliable = reliable;
> >  }
> >  EXPORT_SYMBOL_GPL(__unwind_start);
> > +
> > +#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> > +void __unwind_start_reliable(struct unwind_state *state,
> > +			     struct task_struct *task, unsigned long sp)
> > +{
> > +	struct stack_info *info = &state->stack_info;
> > +	struct stack_frame *sf;
> > +	unsigned long ip;
> > +
> > +	memset(state, 0, sizeof(*state));
> > +	state->task = task;
> > +
> > +	/* Get current stack pointer and initialize stack info */
> > +	if (get_stack_info_reliable(sp, task, info) ||
> > +	    !on_stack(info, sp, sizeof(struct stack_frame))) {
> > +		/* Something is wrong with the stack pointer */
> > +		info->type = STACK_TYPE_UNKNOWN;
> > +		state->error = true;
> > +		return;
> > +	}
> > +
> > +	/* Get the instruction pointer from the stack frame */
> > +	sf = (struct stack_frame *) sp;
> > +	ip = READ_ONCE_NOCHECK(sf->gprs[8]);
> > +
> > +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> > +	/* Decode any ftrace redirection */
> > +	if (ip == (unsigned long) return_to_handler)
> > +		ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> > +					   ip, NULL);
> ^^^^
> double checking: we ignore the retp here and not in the next-frame case?
Frankly, I copy-pasted this from the non-reliable version and checked that
powerpc ignored it as well. I'll double-check.
It also calls for another cleanup: the #ifdef seems to be superfluous, and
so does checking ip against return_to_handler, because
ftrace_graph_ret_addr() already does that itself.
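To illustrate why the extra check is redundant, here is a loose user-space model of how a function-graph return address gets decoded. It is a hypothetical sketch, not ftrace's implementation: sim_graph_ret_addr() and struct sim_ret_stack are invented for illustration, and the retp argument of the real ftrace_graph_ret_addr() is left out. It shows that a non-trampoline address comes back unchanged, so a caller-side comparison adds nothing, while the index argument carries the decoding position from one frame to the next, much like state->graph_idx.

```c
#include <stdint.h>

/* Hypothetical model of the function graph tracer's shadow return stack. */
struct sim_ret_stack {
	const uintptr_t *orig;	/* saved real return addresses, oldest first */
	int depth;		/* number of valid entries */
};

/*
 * Returns ip unchanged unless it is the trampoline; each trampoline hit
 * consumes the next-newest shadow entry and advances *idx.
 */
uintptr_t sim_graph_ret_addr(const struct sim_ret_stack *rs, int *idx,
			     uintptr_t ip, uintptr_t trampoline)
{
	int i;

	if (ip != trampoline)
		return ip;

	i = rs->depth - 1 - *idx;
	if (i < 0)
		return ip;	/* nothing left to decode */

	(*idx)++;
	return rs->orig[i];
}
```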
> > +#endif
> > +
> > +	/* Update unwind state */
> > +	state->sp = sp;
> > +	state->ip = ip;
> > +}
> > +
> > +bool unwind_next_frame_reliable(struct unwind_state *state)
> > +{
> > +	struct stack_info *info = &state->stack_info;
> > +	struct stack_frame *sf;
> > +	struct pt_regs *regs;
> > +	unsigned long sp, ip;
> > +
> > +	sf = (struct stack_frame *) state->sp;
> > +	sp = READ_ONCE_NOCHECK(sf->back_chain);
> > +	/*
> > +	 * Idle tasks are special. The final back-chain points to nodat_stack.
> > +	 * See CALL_ON_STACK() in smp_start_secondary() callback used in
> > +	 * __cpu_up(). We just accept it, go to the else branch and look for
> > +	 * pt_regs.
> > +	 */
> > +	if (likely(sp && !(is_idle_task(state->task) &&
> > +			   outside_of_stack(state, sp)))) {
> > +		/* Non-zero back-chain points to the previous frame */
> > +		if (unlikely(outside_of_stack(state, sp)))
> > +			goto out_err;
> > +
> > +		sf = (struct stack_frame *) sp;
> > +		ip = READ_ONCE_NOCHECK(sf->gprs[8]);
> > +	} else {
> > +		/* No back-chain, look for a pt_regs structure */
> > +		sp = state->sp + STACK_FRAME_OVERHEAD;
> > +		regs = (struct pt_regs *) sp;
> > +		if ((unsigned long)regs != info->end - sizeof(struct pt_regs))
> > +			goto out_err;
> > +		if (!(state->task->flags & (PF_KTHREAD | PF_IDLE)) &&
> > +		     !user_mode(regs))
> > +			goto out_err;
> > +
> > +		state->regs = regs;
> > +		goto out_stop;
> > +	}
> > +
> > +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> > +	/* Decode any ftrace redirection */
> > +	if (ip == (unsigned long) return_to_handler)
> > +		ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> > +					   ip, (void *) sp);
> > +#endif
> > +
> > +	/* Update unwind state */
> > +	state->sp = sp;
> > +	state->ip = ip;
>
> minor nit: maybe the CONFIG_FUNCTION_GRAPH_TRACER and "Update unwind
> state" logic could be combined into a function? (Not a big deal either
> way.)
I think it is better to open code it here, but it is a matter of taste for
sure.
> > +	return true;
> > +
> > +out_err:
> > +	state->error = true;
> > +out_stop:
> > +	state->stack_info.type = STACK_TYPE_UNKNOWN;
> > +	return false;
> > +}
> > +#endif
> > --
> > 2.22.0
> >
>
> I've tested the patch with positive results, however I didn't stress it
> very hard (basically only selftests). The code logic seems
> straightforward and correct by inspection.
>
> On a related note, do you think it would be feasible to extend (in
> another patchset) the reliable stack unwinding code a bit so that we
> could feed it pre-baked stacks ... then we could verify that the code
> was finding interesting scenarios. That was a passing thought I had
> back when Nicolai and I were debugging the ppc64le exception frame
> marker bug, but didn't think it worth the time/effort at the time.
That is an interesting thought. It would help the testing a lot. I will
make a note in my todo list.
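One possible shape for such a harness, sketched in user space under stated assumptions: a stack is described as frame slots plus the caller slot of each frame, baked into real memory, and fed to a minimal unwinder so that each scenario's verdict can be asserted. All names (struct baked_stack, bake(), unwind_reliable(), run_baked()) are hypothetical, and the unwinder is a stand-in for the one under test, not the s390 code.

```c
#include <stdint.h>

#define MAX_FRAMES 8

/* Hypothetical test format: a stack described by frame slot indices. */
struct baked_stack {
	int nr;				/* frames laid out at slots 0..nr-1 (nr >= 1) */
	int start;			/* slot the unwind starts from */
	int chain[MAX_FRAMES];		/* caller slot per frame, -1 = zero back chain */
	uintptr_t ip[MAX_FRAMES];	/* saved return address per frame */
};

struct frame {
	uintptr_t back_chain;
	uintptr_t ip;
};

/* Turn the description into real memory so the unwinder sees real addresses. */
static void bake(const struct baked_stack *b, struct frame *mem)
{
	for (int i = 0; i < b->nr; i++) {
		mem[i].back_chain = b->chain[i] < 0 ? 0 : (uintptr_t)&mem[b->chain[i]];
		mem[i].ip = b->ip[i];
	}
}

/* Minimal stand-in unwinder: 0 if the trace is reliable, -1 otherwise. */
static int unwind_reliable(const struct frame *mem, int nr, uintptr_t sp,
			   uintptr_t text_lo, uintptr_t text_hi)
{
	const uintptr_t lo = (uintptr_t)&mem[0];
	const uintptr_t top = (uintptr_t)&mem[nr - 1];

	for (;;) {
		const struct frame *f;

		if (sp < lo || sp > top)
			return -1;			/* off the stack */
		f = (const struct frame *)sp;
		if (f->ip < text_lo || f->ip >= text_hi)
			return -1;			/* not a text address */
		if (!f->back_chain)
			return sp == top ? 0 : -1;	/* must end at the top */
		if (f->back_chain <= sp)
			return -1;			/* loop or downward chain */
		sp = f->back_chain;
	}
}

/* Bake one scenario and run the unwinder over it. */
int run_baked(const struct baked_stack *b)
{
	struct frame mem[MAX_FRAMES];

	bake(b, mem);
	return unwind_reliable(mem, b->nr, (uintptr_t)&mem[b->start],
			       0x1000, 0x2000);
}
```

A test suite would then be a table of such descriptions (a good chain, a loop, a bad return address, a chain that ends early) with the expected verdict for each.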
> One more note: Using READ_ONCE_NOCHECK is probably correct here, but
> s390 happens to define a READ_ONCE_TASK_STACK macro which calls
> READ_ONCE_NOCHECK when task != current. According to the code comments,
> this "disables KASAN checking when reading a value from another task's
> stack". Is there any scenario here where we would want to use that
> wrapper macro?
s/READ_ONCE_TASK_STACK/READ_ONCE_NOCHECK/ was a last-minute change. s390
no longer defines READ_ONCE_TASK_STACK. See 20955746320e ("s390/kasan:
avoid false positives during stack unwind") and da1776733617
("s390/unwind: cleanup unused READ_ONCE_TASK_STACK").
Thanks for the review and testing!
Miroslav