linux-kernel - Re: [RFC PATCH] x86/dumpstack: Fix unwind failure due to off-by-one-frame

Message-ID: <CAG48ez19qy0s2gnSWEtoF-C57uumdBb8gfoe41-JjqBcJ-5KvQ@mail.gmail.com>
Date:   Tue, 1 Feb 2022 18:38:42 +0100
From:   Jann Horn <jannh@...gle.com>
To:     Josh Poimboeuf <jpoimboe@...hat.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
        Miroslav Benes <mbenes@...e.cz>
Subject: Re: [RFC PATCH] x86/dumpstack: Fix unwind failure due to off-by-one-frame

On Tue, Feb 1, 2022 at 1:30 AM Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> On Thu, Jan 27, 2022 at 01:55:55AM +0100, Jann Horn wrote:
> > (emphasis on the "RFC", not the "PATCH"...)
> >
> > I've hit a bug where __dump_stack() ends up printing a stack trace that
> > consists purely of guesses (all printed frames start with "? ").
> >
> > Debugging the issue, I found that show_trace_log_lvl() is looking at a
> > stack that looks like this:
> >
> >     function             stored value    pointer in show_trace_log_lvl()
> >     ====================================================================
> >             show_stack   saved RIP
> >             show_stack   saved RBP       <-- stack
> >     show_trace_log_lvl   saved RIP       <-- unwind_get_return_address_ptr(...)
> >     show_trace_log_lvl   ...
> >     show_trace_log_lvl   ...
> >
> > show_trace_log_lvl() then iterates up the stack with its `stack`
> > variable; but because `unwind_get_return_address_ptr(&state)` is below the
> > starting point, the two never compare equal, and so `reliable` is never
> > set to 1.
>
> Thanks for reporting!  If I understand correctly, this only happens
> when show_stack() has an 8-byte stack size.

Yes, I think so. (Well, 16 bytes if you count the saved RIP at the top
as part of the frame.)
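
To make the failure mode concrete, here is a toy user-space model of the
scan described above. It is only a sketch, not the kernel code: the stack
is reduced to numbered word slots (a higher index means a higher address),
and count_reliable_hits() is a made-up name. The real loop also advances
the unwinder after each match, which this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model, not kernel code: scan word slots upward from
 * `start` and count how often the scan position equals the slot that
 * the unwinder reports as holding the return address. In the scenario
 * from the report, a word is only printed without "? " on such a match.
 */
static int count_reliable_hits(size_t start, size_t ret_addr_slot,
                               size_t nwords)
{
    int hits = 0;

    assert(start <= nwords);

    for (size_t pos = start; pos < nwords; pos++) {
        bool reliable = (pos == ret_addr_slot);

        if (reliable)
            hits++;
    }
    return hits;
}
```

With slot 2 standing in for show_trace_log_lvl's saved RIP and slot 3 for
show_stack's saved RBP, count_reliable_hits(3, 2, 8) returns 0 because the
scan starts one word above the slot it is waiting for, while
count_reliable_hits(2, 2, 8) returns 1.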

I just realized that this probably happened to me because I was
compiling with -fno-optimize-sibling-calls (to make stack traces more
readable, because sibling call optimization effectively makes stack
frames randomly disappear, which I find very frustrating).

> > Poking around a bit, I see two issues.
> >
> > The first issue is that __unwind_start() tries to figure out whether
> > `first_frame` is inside the current frame before even having looked up
> > the ORC entry that determines where the current frame ends.
> > That can't work and results in being off-by-one-frame in some cases no
> > matter how we twist the comparison between `state->sp` and `first_frame`.
>
> > The second issue is that show_trace_log_lvl() asks __unwind_start() to
> > stop when it finds the frame containing `stack`, but then tries
> > comparing `unwind_get_return_address_ptr(&state)` (which has to be below
> > `stack`, since it is part of the lower frame) with `stack`.
> > That can't work if __unwind_start() is working properly - we'll have to
> > unwind up another frame.
> >
> > This patch is an attempt to fix that, but I guess there might still be
> > issues with it in the interaction with show_regs_if_on_stack() in
> > show_trace_log_lvl(), or something like that?
> >
> > Another option might be to rework even more how ORC stack walking works,
> > and always compute the location of the next frame in __unwind_start()
> > and unwind_next(), such that it becomes possible to query for the top
> > of the current frame?
> >
> > Or a completely different approach, do more special-casing of different
> > unwinding scenarios in __unwind_start(), such that unwinding a remote
> > task doesn't go through the skip-ahead loop, and unwinding the current
> > task from a starting point is always guaranteed to skip the given frame
> > and stop at the following one? Or something along those lines?
> >
> > That would also make it more obviously correct what happens if a
> > function specifies its own frame as the starting point wrt changes to
> > that frame's contents before the call to unwind_next()... now that I'm
> > typing this out, I think that might be the best option?
>
> If I understand correctly, this last proposal is what the current
> __unwind_start() code already attempts to do (but obviously fails in the
> above off-by-one case).  It tries to start at the first frame it finds
> *beyond* the given 'first_frame' pointer, rather than the frame
> including it.  That makes the logic simpler, since you don't have to
> find the size of the frame.

Ahh, okay. I missed that those were the intended semantics...
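
For my own understanding, a small sketch of the "first frame beyond
first_frame" idea. This is a user-space toy, not the __unwind_start()
code: sps[] stands for the value of state->sp after each unwind step
(ascending), and first_reported() and example_sps are made-up names. The
only point is that whether the skip-ahead comparison includes equality
decides which frame is reported first when sp equals first_frame exactly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sp values seen at successive unwind steps (ascending). */
static const unsigned long example_sps[] = { 0x100, 0x110, 0x130 };

/*
 * Hypothetical model, not kernel code: return the index of the first
 * frame that survives the skip-ahead loop. With skip_equal set, a frame
 * whose sp equals first_frame is skipped as well; without it, that
 * frame is kept.
 */
static size_t first_reported(const unsigned long *sps, size_t n,
                             unsigned long first_frame, bool skip_equal)
{
    size_t i = 0;

    while (i < n &&
           (skip_equal ? sps[i] <= first_frame : sps[i] < first_frame))
        i++;
    return i;
}
```

For first_frame = 0x110, first_reported(example_sps, 3, 0x110, true)
returns 2 and first_reported(example_sps, 3, 0x110, false) returns 1. For
first_frame = 0x108 both variants return 1, so the two only differ in the
exact-match case.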


> So I think this bug could be fixed by reverting commit f1d9a2abff66
> ("x86/unwind/orc: Don't skip the first frame for inactive tasks").
>
> Can you confirm?

Yes.

When compiling with "gcc (Debian 11.2.0-13) 11.2.0" and
"-fno-optimize-sibling-calls", "echo l > /proc/sysrq-trigger" prints
an all-guesses trace for the current CPU:

[   99.465299][  T598] sysrq: Show backtrace of all active CPUs
[   99.466130][  T598] NMI backtrace for cpu 0
[   99.466533][  T598] CPU: 0 PID: 598 Comm: bash Not tainted 5.17.0-rc1-00082-g81c3649a14a2-dirty #944
[   99.467602][  T598] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   99.468511][  T598] Call Trace:
[   99.468927][  T598]  <TASK>
[   99.469268][  T598]  ? dump_stack_lvl+0x45/0x59
[   99.469807][  T598]  ? dump_stack+0xc/0xd
[   99.470284][  T598]  ? nmi_cpu_backtrace.cold+0xa4/0xa9
[   99.470760][  T598]  ? lapic_can_unplug_cpu+0x80/0x80
[   99.471330][  T598]  ? nmi_trigger_cpumask_backtrace+0x10e/0x140
[   99.471889][  T598]  ? arch_trigger_cpumask_backtrace+0x15/0x20
[   99.472427][  T598]  ? sysrq_handle_showallcpus+0x13/0x20
[   99.472970][  T598]  ? __handle_sysrq.cold+0x11c/0x37b
[   99.473450][  T598]  ? write_sysrq_trigger+0x3f/0x50
[   99.473963][  T598]  ? proc_reg_write+0x1b3/0x270
[   99.474408][  T598]  ? vfs_write+0x1c7/0x920
[   99.474920][  T598]  ? ksys_write+0xf9/0x1d0
[   99.475507][  T598]  ? __ia32_sys_read+0xb0/0xb0
[   99.476467][  T598]  ? lock_is_held_type+0xd7/0x130
[   99.477071][  T598]  ? syscall_enter_from_user_mode+0x1d/0x50
[   99.477708][  T598]  ? __x64_sys_write+0x6e/0xb0
[   99.478216][  T598]  ? syscall_enter_from_user_mode+0x1d/0x50
[   99.479137][  T598]  ? do_syscall_64+0x43/0x90
[   99.479576][  T598]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   99.480172][  T598]  </TASK>
[   99.480649][  T598] Sending NMI from CPU 0 to CPUs 1-3:

With f1d9a2abff66 reverted:

[   92.114861][  T598] sysrq: Show backtrace of all active CPUs
[   92.115989][  T598] NMI backtrace for cpu 2
[   92.116448][  T598] CPU: 2 PID: 598 Comm: bash Not tainted 5.17.0-rc1-00083-gc0df0cbee2c5-dirty #945
[   92.117493][  T598] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   92.118290][  T598] Call Trace:
[   92.118598][  T598]  <TASK>
[   92.118860][  T598]  dump_stack_lvl+0x45/0x59
[   92.119363][  T598]  dump_stack+0xc/0xd
[   92.119727][  T598]  nmi_cpu_backtrace.cold+0xa4/0xa9
[   92.120183][  T598]  ? lapic_can_unplug_cpu+0x80/0x80
[   92.120655][  T598]  nmi_trigger_cpumask_backtrace+0x10e/0x140
[   92.121328][  T598]  arch_trigger_cpumask_backtrace+0x15/0x20
[   92.121856][  T598]  sysrq_handle_showallcpus+0x13/0x20
[   92.122467][  T598]  __handle_sysrq.cold+0x11c/0x37b
[   92.122924][  T598]  write_sysrq_trigger+0x3f/0x50
[   92.123357][  T598]  proc_reg_write+0x1b3/0x270
[   92.123773][  T598]  vfs_write+0x1c7/0x920
[   92.124151][  T598]  ksys_write+0xf9/0x1d0
[   92.124523][  T598]  ? __ia32_sys_read+0xb0/0xb0
[   92.124998][  T598]  ? lock_is_held_type+0xd7/0x130
[   92.125512][  T598]  ? syscall_enter_from_user_mode+0x1d/0x50
[   92.127120][  T598]  __x64_sys_write+0x6e/0xb0
[   92.127537][  T598]  ? syscall_enter_from_user_mode+0x1d/0x50
[   92.128051][  T598]  do_syscall_64+0x43/0x90
[   92.128444][  T598]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   92.129046][  T598] RIP: 0033:0x7d8d29a7b504
[   92.129442][  T598] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[   92.131109][  T598] RSP: 002b:00007ffd74467c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   92.131913][  T598] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007d8d29a7b504
[   92.132682][  T598] RDX: 0000000000000002 RSI: 00007d8d2a6dcfb0 RDI: 0000000000000001
[   92.133460][  T598] RBP: 00007d8d2a6dcfb0 R08: 000000000000000a R09: 00007d8d29b4cca0
[   92.134219][  T598] R10: 000000000000000a R11: 0000000000000246 R12: 00007d8d29b4d760
[   92.134904][  T598] R13: 0000000000000002 R14: 00007d8d29b48760 R15: 0000000000000002
[   92.135591][  T598]  </TASK>
[   92.135961][  T598] Sending NMI from CPU 2 to CPUs 0-1,3: