[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160722233055.oyruuqfhwp4bjwdt@treble>
Date: Fri, 22 Jul 2016 18:30:55 -0500
From: Josh Poimboeuf <jpoimboe@...hat.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
"H . Peter Anvin" <hpa@...or.com>, X86 ML <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Steven Rostedt <rostedt@...dmis.org>,
Brian Gerst <brgerst@...il.com>,
Kees Cook <keescook@...omium.org>,
Peter Zijlstra <peterz@...radead.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Byungchul Park <byungchul.park@....com>
Subject: Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> >> >> >> > Now that we can find pt_regs registers in the middle of the stack due to
> >> >> >> > an interrupt or exception, we can print them. Here's what it looks
> >> >> >> > like:
> >> >> >> >
> >> >> >> > ...
> >> >> >> > [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
> >> >> >> > [<ffffffff8189f558>] async_page_fault+0x28/0x30
> >> >> >> > RIP: 0010:[<ffffffff814529e2>] [<ffffffff814529e2>] __clear_user+0x42/0x70
> >> >> >> > RSP: 0018:ffff88007876fd38 EFLAGS: 00010202
> >> >> >> > RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
> >> >> >> > RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
> >> >> >> > RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
> >> >> >> > R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
> >> >> >> > R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
> >> >> >> > [<ffffffff814529e2>] ? __clear_user+0x42/0x70
> >> >> >> > [<ffffffff814529c3>] ? __clear_user+0x23/0x70
> >> >> >> > [<ffffffff81452a7b>] clear_user+0x2b/0x40
> >> >> >> > ...
> >> >> >>
> >> >> >> This looks wrong. Here are some theories:
> >> >> >>
> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
> >> >> >> Then it's found again as an unreliable address as "?
> >> >> >> __clear_user+0x42/0x70" by scanning the stack. "?
> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> >> >> suppressed because it matched a reliable address?
> >> >> >>
> >> >> >> (b) You actually intended for all the addresses to be printed, in
> >> >> >> which case "? __clear_user+0x42/0x70" should have been
> >> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's
> >> >> >> plausible that your state machine got a bit lost leading to "?
> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> >> >> it's a real frame and you didn't find it).
> >> >> >>
> >> >> >> (c) Something else and I'm confused.
> >> >> >
> >> >> > So there's a subtle difference between addresses reported by regs->ip
> >> >> > and normal return addresses. For example:
> >> >> >
> >> >> > ...
> >> >> > [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
> >> >> > [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
> >> >> > RIP: 0010:[<ffffffff8129b350>] [<ffffffff8129b350>] path_init+0x0/0x750
> >> >> > RSP: 0018:ffff880036a3fd80 EFLAGS: 00000296
> >> >> > RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
> >> >> > RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
> >> >> > RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
> >> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
> >> >> > R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
> >> >> > <EOI>
> >> >> > [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
> >> >> > [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
> >> >> > [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
> >> >> > ...
> >> >> >
> >> >> > In this case the irq hit right after path_lookupat() called into
> >> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is right.
> >> >> >
> >> >> > Note the backtrace reports the same address, but it instead describes it
> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's
> >> >> > because normally, such an address after a call instruction at the end of
> >> >> > a function would indicate a tail call (e.g., to a noreturn function).
> >> >> > If that were the case, printing "path_init+0x0" would be completely
> >> >> > wrong, because path_init() just happens to be the function located
> >> >> > immediately after lookup_fast().
> >> >> >
> >> >> > Maybe I could add some special logic to say: "if this return address was
> >> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
> >> >> > use printk_address()." (The former prints the preceding function, the
> >> >> > latter prints the current function.) Then we could remove the question
> >> >> > mark.
> >> >> >
> >> >> > There's also the question of whether or not the address should be
> >> >> > printed again, after it's already been printed by __show_regs(). I
> >> >> > don't have a strong opinion either way.
> >> >> >
> >> >>
> >> >> IIRC we don't show the actual faulting function in the call trace, so
> >> >> we probably shouldn't duplicate the entry after the show_regs.
> >> >
> >> > Just to clarify, that's true today for cases where the stack dump starts
> >> > from a handler which has regs. It starts dumping based on regs->ip and
> >> > regs->bp, so the regs themselves aren't dumped.
> >> >
> >> > But for cases where regs are in the middle of the stack, they aren't
> >> > detected today, and you'll still see the value of regs->ip dumped with a
> >> > question mark.
> >> >
> >> > That said, with this patch, now that regs in the middle of the stack
> >> > *are* being printed, I can't think of a good reason to print the return
> >> > address twice: both in regs and the stack trace. So removing it from
> >> > the stack trace is fine with me.
> >> >
> >> >> That being said, I'm still confused by the question marks. What
> >> >> exactly is going on? Is the code really doing the right thing wrt
> >> >> resuming the unwind? Is there a git tree with these patches applied
> >> >> somewhere so I can look at it easily in context?
> >> >
> >> > show_trace_log_lvl() is doing two things in parallel: scanning all
> >> > kernel text addresses on the stack while simultaneously using the
> >> > unwinder to walk the frame pointers. Only those scanned addresses which
> >> > are also found by the unwinder are printed without question marks.
> >> >
> >> > The pt_regs aren't in a frame of their own; they're just data inside of
> >> > a bigger frame. (You may recall that you objected to my proposal to put
> >> > them in their own frame :-)) So that's why the address stored in
> >> > regs->ip was printed with a question mark: it's not in the header of a
> >> > real frame; it's just data.
> >>
> >> It wasn't the separate frame part I was objecting to -- it was their
> >> encoding on the stack. Maybe they should unwind as though they're a
> >> separate frame. For example, the unwind API could give the frame that
> >> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
> >> could literally list regs->ip as its retaddr (and maybe that frame or
> >> even the following one should be the one with non-NULL
> >> unwind_get_entry_regs).
> >
> > Having the unwinder treat the pt_regs as a "fake" frame is problematic:
> >
> > - As I described above, you can't treat regs->ip as a normal return
> > value anyway.
> >
> > - Also, for exceptions and nested interrupts, the regs are stored on the
> > interrupting stack. But for non-nested interrupts, they're stored on
> > the thread stack. So the regs aren't always on the same stack as the
> > corresponding encoded pt_regs pointer. Another issue is that there's
> > not always a frame after the regs. For those reasons, creating a
> > "fake" frame abstraction in the state machine is quite a bit trickier
> > than just dealing with those details in the only place that cares
> > about them: show_trace_log_lvl().
> >
> >> In some sense, the regs belong to the frame that got interrupted, not
> >> the frame that did the interrupting. But maybe that's backwards -- if
> >> we have DWARF, then the regs correspond to the regs at the time of a
> >> call, and those regs are reasonably likely to contain the arguments to
> >> the called function.
> >>
> >> But regardless of which way this goes, it seems quite awkward to me
> >> that regs->ip never shows up as the return addr of any frame as
> >> exposed by the unwind API.
> >
> > Again, regs->ip is special. It's not a call return address and we
> > shouldn't force it to be.
>
> This is only mostly true. If the exception was a trap, then it is
> (e.g. if a function ends in int3, then regs->ip will be off the end).
> But that's just me being pedantic.
>
> More relevantly, regs->ip is a reliable address indicating a function
> that will be returned to if we ever return, and both
> show_trace_log_lvl() and the livepatch stuff should interpret it as
> such.
Actually livepatch doesn't care; once it sees that there are regs, it
will bail because the stack is unreliable.
> Whether this means the unwinder should change or
> show_trace_log_lvl() should change isn't a big deal, but I think one
> of them should change so we get this right.
I have no problem doing so, but can you clarify what you mean? Earlier
you said:
"IIRC we don't show the actual faulting function in the call trace, so
we probably shouldn't duplicate the entry after the show_regs."
Maybe I'm misunderstanding, but that seems to contradict what you're
saying now. So which is it? Do you want the RIP address printed twice
(both in the regs printout and in the stack trace)? Or not?
--
Josh
Powered by blists - more mailing lists