linux-kernel - Re: [PATCH 19/19] x86/dumpstack: print any pt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160723000053.2wufllifxj62rqw6@treble>
Date:	Fri, 22 Jul 2016 19:00:53 -0500
From:	Josh Poimboeuf <jpoimboe@...hat.com>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	"H . Peter Anvin" <hpa@...or.com>, X86 ML <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Brian Gerst <brgerst@...il.com>,
	Kees Cook <keescook@...omium.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Byungchul Park <byungchul.park@....com>
Subject: Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> >> >> >> > Now that we can find pt_regs registers in the middle of the stack due to
> >> >> >> >> > an interrupt or exception, we can print them.  Here's what it looks
> >> >> >> >> > like:
> >> >> >> >> >
> >> >> >> >> >    ...
> >> >> >> >> >    [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >> >> >> >> >    [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >> >> >> >> >   RIP: 0010:[<ffffffff814529e2>]  [<ffffffff814529e2>] __clear_user+0x42/0x70
> >> >> >> >> >   RSP: 0018:ffff88007876fd38  EFLAGS: 00010202
> >> >> >> >> >   RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >> >> >> >> >   RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >> >> >> >> >   RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >> >> >> >> >   R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >> >> >> >> >   R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >> >> >> >> >    [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >> >> >> >> >    [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >> >> >> >> >    [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >> >> >> >> >    ...
> >> >> >> >>
> >> >> >> >> This looks wrong.  Here are some theories:
> >> >> >> >>
> >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
> >> >> >> >> Then it's found again as an unreliable address as "?
> >> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
> >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> >> >> >> suppressed because it matched a reliable address?
> >> >> >> >>
> >> >> >> >> (b) You actually intended for all the addresses to be printed, in
> >> >> >> >> which case "? __clear_user+0x42/0x70" should have been
> >> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> >> >> >> >> plausible that your state machine got a bit lost leading to "?
> >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> >> >> >> it's a real frame and you didn't find it).
> >> >> >> >>
> >> >> >> >> (c) Something else and I'm confused.
> >> >> >> >
> >> >> >> > So there's a subtle difference between addresses reported by regs->ip
> >> >> >> > and normal return addresses.  For example:
> >> >> >> >
> >> >> >> >    ...
> >> >> >> >    [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
> >> >> >> >    [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
> >> >> >> >   RIP: 0010:[<ffffffff8129b350>]  [<ffffffff8129b350>] path_init+0x0/0x750
> >> >> >> >   RSP: 0018:ffff880036a3fd80  EFLAGS: 00000296
> >> >> >> >   RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
> >> >> >> >   RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
> >> >> >> >   RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
> >> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
> >> >> >> >   R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
> >> >> >> >    <EOI>
> >> >> >> >    [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
> >> >> >> >    [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
> >> >> >> >    [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
> >> >> >> >    ...
> >> >> >> >
> >> >> >> > In this case the irq hit right after path_lookupat() called into
> >> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is right.
> >> >> >> >
> >> >> >> > Note the backtrace reports the same address, but it instead describes it
> >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> >> >> >> > because normally, such an address after a call instruction at the end of
> >> >> >> > a function would indicate a tail call (e.g., to a noreturn function).
> >> >> >> > If that were the case, printing "path_init+0x0" would be completely
> >> >> >> > wrong, because path_init() just happens to be the function located
> >> >> >> > immediately after lookup_fast().
> >> >> >> >
> >> >> >> > Maybe I could add some special logic to say: "if this return address was
> >> >> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> >> >> >> > use printk_address()."  (The former prints the preceding function, the
> >> >> >> > latter prints the current function.)  Then we could remove the question
> >> >> >> > mark.
> >> >> >> >
> >> >> >> > There's also the question of whether or not the address should be
> >> >> >> > printed again, after it's already been printed by __show_regs().  I
> >> >> >> > don't have a strong opinion either way.
> >> >> >> >
> >> >> >>
> >> >> >> IIRC we don't show the actual faulting function in the call trace, so
> >> >> >> we probably shouldn't duplicate the entry after the show_regs.
> >> >> >
> >> >> > Just to clarify, that's true today for cases where the stack dump starts
> >> >> > from a handler which has regs.  It starts dumping based on regs->ip and
> >> >> > regs->bp, so the regs themselves aren't dumped.
> >> >> >
> >> >> > But for cases where regs are in the middle of the stack, they aren't
> >> >> > detected today, and you'll still see the value of regs->ip dumped with a
> >> >> > question mark.
> >> >> >
> >> >> > That said, with this patch, now that regs in the middle of the stack
> >> >> > *are* being printed, I can't think of a good reason to print the return
> >> >> > address twice: both in regs and the stack trace.  So removing it from
> >> >> > the stack trace is fine with me.
> >> >> >
> >> >> >> That being said, I'm still confused by the question marks.  What
> >> >> >> exactly is going on?  Is the code really doing the right thing wrt
> >> >> >> resuming the unwind?  Is there a git tree with these patches applied
> >> >> >> somewhere so I can look at it easily in context?
> >> >> >
> >> >> > show_trace_log_lvl() is doing two things in parallel: scanning all
> >> >> > kernel text addresses on the stack while simultaneously using the
> >> >> > unwinder to walk the frame pointers.  Only those scanned addresses which
> >> >> > are also found by the unwinder are printed without question marks.
> >> >> >
> >> >> > The pt_regs aren't in a frame of their own; they're just data inside of
> >> >> > a bigger frame.  (You may recall that you objected to my proposal to put
> >> >> > them in their own frame :-))  So that's why the address stored in
> >> >> > regs->ip was printed with a question mark: it's not in the header of a
> >> >> > real frame; it's just data.
> >> >>
> >> >> It wasn't the separate frame part I was objecting to -- it was their
> >> >> encoding on the stack.  Maybe they should unwind as though they're a
> >> >> separate frame.  For example, the unwind API could give the frame that
> >> >> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
> >> >> could literally list regs->ip as its retaddr (and maybe that frame or
> >> >> even the following one should be the one with non-NULL
> >> >> unwind_get_entry_regs).
> >> >
> >> > Having the unwinder treat the pt_regs as a "fake" frame is problematic:
> >> >
> >> > - As I described above, you can't treat regs->ip as a normal return
> >> >   value anyway.
> >> >
> >> > - Also, for exceptions and nested interrupts, the regs are stored on the
> >> >   interrupting stack.  But for non-nested interrupts, they're stored on
> >> >   the thread stack.  So the regs aren't always on the same stack as the
> >> >   corresponding encoded pt_regs pointer.  Another issue is that there's
> >> >   not always a frame after the regs.  For those reasons, creating a
> >> >   "fake" frame abstraction in the state machine is quite a bit trickier
> >> >   than just dealing with those details in the only place that cares
> >> >   about them: show_trace_log_lvl().
> >> >
> >> >> In some sense, the regs belong to the frame that got interrupted, not
> >> >> the frame that did the interrupting.  But maybe that's backwards -- if
> >> >> we have DWARF, then the regs correspond to the regs at the time of a
> >> >> call, and those regs are reasonably likely to contain the arguments to
> >> >> the called function.
> >> >>
> >> >> But regardless of which way this goes, it seems quite awkward to me
> >> >> that regs->ip never shows up as the return addr of any frame as
> >> >> exposed by the unwind API.
> >> >
> >> > Again, regs->ip is special.  It's not a call return address and we
> >> > shouldn't force it to be.
> >>
> >> This is only mostly true.  If the exception was a trap, then it is
> >> (e.g. if a function ends in int3, then regs->ip will be off the end).
> >> But that's just me being pedantic.
> >>
> >> More relevantly, regs->ip is a reliable address indicating a function
> >> that will be returned to if we ever return, and both
> >> show_trace_log_lvl() and the livepatch stuff should interpret it as
> >> such.
> >
> > Actually livepatch doesn't care; once it sees that there are regs, it
> > will bail because the stack is unreliable.
> 
> Would it be better for livepatch not to bail some day?

Not until we have a DWARF unwinder.

> >> Whether this means the unwinder should change or
> >> show_trace_log_lvl() should change isn't a big deal, but I think one
> >> of them should change so we get this right.
> >
> > I have no problem doing so, but can you clarify what you mean?  Earlier
> > you said:
> >
> >   "IIRC we don't show the actual faulting function in the call trace, so
> >   we probably shouldn't duplicate the entry after the show_regs."
> >
> > Maybe I'm misunderstanding, but that seems to contradict what you're
> > saying now.  So which is it?  Do you want the RIP address printed twice
> > (both in the regs printout and in the stack trace)?  Or not?
> 
> I don't have a stong preference as to how many times it's printed.
> But I think we need to get rid of the question mark.  I think that
> means there are two options:
> 
> a) Teach show_stack_log_lvl() that regs->ip is a "reliable" entry and
> print it again.  That will get confused if it's the first instruction
> in a function, so maybe it's not so great.

I proposed a fix for this above, so that it would print regs->ip one
way and a normal return address another way, to avoid the confusion:

> >> >> >> > Maybe I could add some special logic to say: "if this return address was
> >> >> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> >> >> >> > use printk_address()."  (The former prints the preceding function, the
> >> >> >> > latter prints the current function.)  Then we could remove the question
> >> >> >> > mark.

> b) Teach show_stack_log_lvl() that regs->ip is a thing that we just
> printed (via show_regs) and skip the ? entry.
> 
> Option b probably makes more sense.  I think I'm starting to
> understand all this, but maybe I'm still missing something.

I also think b is a good option.  I'll do it for v2 unless others
disagree.

-- 
Josh