linux-kernel - Re: [PATCH 10/19] x86/dumpstack: add get_stack

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160723140439.36bhw5aslcxlkf3f@treble>
Date:	Sat, 23 Jul 2016 09:04:39 -0500
From:	Josh Poimboeuf <jpoimboe@...hat.com>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	"H . Peter Anvin" <hpa@...or.com>, X86 ML <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Brian Gerst <brgerst@...il.com>,
	Kees Cook <keescook@...omium.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Byungchul Park <byungchul.park@....com>
Subject: Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 05:15:03PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> >> > +                            unsigned long *visit_mask)
> >> > +{
> >> > +       unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
> >> > +       unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> >> > +
> >> > +       if (stack < begin || stack >= end)
> >> > +               return false;
> >> > +
> >> > +       if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> >> > +               return false;
> >> > +
> >> > +       info->type      = STACK_TYPE_IRQ;
> >> > +       info->begin     = begin;
> >> > +       info->end       = end;
> >> > +       info->next      = (unsigned long *)*begin;
> >>
> >> This works, but it's a bit magic.  I don't suppose we could get rid of
> >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> >> move from stack to stack by querying the stack type of whatever the
> >> frame base address is if the frame base address ends up being out of
> >> bounds for the current stack?  Or maybe the unwinder could even do
> >> this by itself.
> >
> > I'm not quite sure what you mean here.  The ->next stack pointer is
> > quite useful and it abstracts that ugliness away from the callers of
> > get_stack_info().  I'm open to any specific suggestions.
> 
> So far I've found two users of this thing.  One is
> show_stack_log_lvl(), and it makes sense there, but maybe
> info->heuristic_next_stack would be a better name.  The other is the
> unwinder itself, and I think that walking from stack to stack using
> this heuristic is the wrong approach there, at least in the long term.
> I'd rather we just followed the bp chain wherever it leads us, as long
> as it leads us to a valid stack that we haven't visited before.
>
> As a concrete example of what I think is wrong with the current
> approach, ISTM it would be totally valid to implement stack switching
> like this:
> 
> some_func:
>  push %rbp
>  mov %rsp, %rbp
>  ...
>  mov [next stack], %rsp
>  call some_other_func
>  mov %rbp, %rsp
>  pop %rbp
>  ret
> 
> With the current approach, you can't unwind out of that function,
> because there's no way to populate info->next.  I'm not actually
> suggesting that the kernel should ever do such a thing on x86, and my
> proposed rewrite of the IRQ stack code [1] will be fully compatible
> with your approach, but it seems odd to me that the unwinder should
> depend on idea that the stacks in use are chained together in a way
> that can be decoded without .  (But maybe some of the Go compilers do
> work this way -- I've never looked at their precise stack layout.)

I don't think relying on frame pointers to switch between stacks is
necessarily a good idea:

- It requires CONFIG_FRAME_POINTER, which makes it unwinder-specific.
  The current approach is unwinder-agnostic.

- Instead of relying on a single correct "next stack" pointer, it
  requires relying on potentially dozens of correct frame pointers,
  across multiple stacks.  So a lot of things have to go right, instead
  of just one.  And then show_trace_log_lvl() becomes more dependent on
  the unwinder not screwing things up.

> Also, if you ever intend to port this thing to other architectures, I
> think there are architectures that have separate exception stacks and
> that track the next available slot on those stacks dynamically.  I
> think that x86_32 is an example of this if task gates are used in a
> back-and-forth manner, although Linux doesn't do that.  (x86_64 should
> have done this for IST, but it didn't.)  On those architectures, you
> can have two separate switches onto the same stack live at the same
> time, and your current approach won't work.  (Even if you make the
> change I'm suggesting, visit_mask will break, too, but fixing that
> would be a much less invasive change.)
>
> Am I making any sense?  This is a suggestion for making it better, not
> something I see as a requirement for getting the x86 code upstream.

I think porting these interfaces to other architectures could eventually
be a good idea, and you're right that the current approach might need to
be tweaked in order to work everywhere.  (But I agree this needs more
thought and this discussion can wait until later.)

> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> >> > +                              unsigned long *visit_mask)
> >> >  {
> >> >         unsigned long stack = (unsigned long)s;
> >> >         unsigned long begin, end;
> >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long *s, char **name,
> >> >                 if (stack < begin || stack >= end)
> >> >                         continue;
> >> >
> >> > -               if (test_and_set_bit(k, visit_mask))
> >> > +               if (visit_mask &&
> >> > +                   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> >> >                         return false;
> >> >
> >> > -               *name = exception_stack_names[k];
> >> > -               return (unsigned long *)end;
> >> > +               info->type      = STACK_TYPE_EXCEPTION + k;
> >> > +               info->begin     = (unsigned long *)begin;
> >> > +               info->end       = (unsigned long *)end;
> >> > +               info->next      = (unsigned long *)info->end[-2];
> >>
> >> This is so magical that I don't immediately see why it's correct.
> >> Presumably it's because the thing two slots down from the top of the
> >> stack is regs->sp?  If so, that needs a comment.
> >
> > Heck if I know, I just stole it from dump_trace() ;-)
> >
> > I'll figure it out and add a comment.
> 
> If you can write it as:
> 
> struct pt_regs *regs = (struct pt_regs *)end - 1;
> info->next = regs->sp;
> 
> and it still works, then no comment required :)

Yeah.  in_irq_stack() does something similar, though it uses end[-1].
And its regs are actually stored on the thread stack.  So something
doesn't quite add up for irqs.  I still need to do some homework there.

> >> But again, couldn't we use the fact that we now know how to decode
> >> pt_regs to avoid needing this?  I can imagine it being useful as a
> >> fallback in the event that the unwinder fails, but this is just a
> >> fallback.
> >
> > Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
> > fallback: the stack dump code *needs* to be able to still traverse the
> > stacks if frame pointers fail.
> >
> >> Also, NMI is weird and I'm wondering whether this works at
> >> all when trying to unwind from a looped NMI.
> >
> > Unless I'm missing something, I think it should be fine for nested NMIs,
> > since they're all on the same stack.  I can try to test it.  What in
> > particular are you worried about?
> >
> 
> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
> CS, IP) records.  Off the top of my head, the record that matters is
> the third one, so it should be regs[-15].  If an MCE hits while this
> mess is being set up, good luck unwinding *that*.  If you really want
> to know, take a deep breath, read the long comment in entry_64.S after
> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
> architecture.

I did read that comment.  Fortunately there's a big difference between
reading and understanding so I can go on being an ignorant x86 hacker!

For nested NMIs, it does look like the stack of the exception which
interrupted the first NMI would get skipped by the stack dump.  (But
that's a general problem, not specific to my patch set.)

Am I correct in understanding that there can only be one level of NMI
nesting at any given time?  If so, could we make it easier on the
unwinder by putting the nested NMI on a separate software stack, so the
"next stack" pointers are always in the same place?  Or am I just being
naive?

-- 
Josh