lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20200730205112.2099429-1-ndesaulniers@google.com>
Date:   Thu, 30 Jul 2020 13:51:08 -0700
From:   Nick Desaulniers <ndesaulniers@...gle.com>
To:     Nathan Huckleberry <nhuck15@...il.com>,
        Russell King <linux@...linux.org.uk>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Chunyan Zhang <zhang.lyra@...il.com>,
        clang-built-linux@...glegroups.com,
        Dmitry Safonov <0x7f454c46@...il.com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-mediatek@...ts.infradead.org,
        Lvqiang Huang <lvqiang.huang@...soc.com>,
        Matthias Brugger <matthias.bgg@...il.com>,
        Nick Desaulniers <ndesaulniers@...gle.com>,
        Miles Chen <miles.chen@...iatek.com>
Subject: [PATCH 0/4] CONFIG_UNWINDER_FRAME_POINTER fixes+cleanups

We received a report of boot failure on stable/4.14.y using Clang with
CONFIG_UNWINDER_FRAME_POINTER. Turns out, this was a cascaded failure
with at least 4 different things going wrong. Working backwards from the
failure:

4) There was no fixup handler in backtrace-clang.S for a specific
address calculation. If an indirect access to that address triggers a
page fault for which no corresponding fixup exists, then a panic is
triggered. Panicking triggers another unwind, and all this repeats in an
infinite loop.  If the unwind was started from within start_kernel(),
this results in a kernel that does not boot.  We can install a fixup
handler to fix the infinite loop, but why was the unwinder accessing an
address that would trigger a fault?

3) The unwinder has multiple conditions to know when to stop unwinding,
but checking for a valid address in the link register was not one of
them. If there was a value for lr that we could check for before using
it, then we could add that as another stopping condition to terminate
unwinding. But the garbage value in lr in the case of save_stack()
wasn't particularly noteworthy from any other address; it was ambiguous
whether we had more frames to continue unwinding through or not, but
what value would we check for?

2) When following a frame chain, we can generally follow the addresses
pushed onto the stack from the link register, lr. The callee generally
pushes this value.  For our particular failure, the value in the link
register upon entry to save_stack() was garbage. The caller,
__mmap_switched, does a tail call into save_stack() since we don't plan
to return control flow back to __mmap_switched. It uses a `b` (branch)
instruction rather than a `bl` (branch+link) which is correct, since
there are no instructions after the `b save_stack` in __mmap_switched.
If we interpret the value of lr that was pushed on the stack in
save_stack(), then it appears that we have further frames to unwind.
When observing an unwind on mainline though, lr upon entry to
save_stack() was 0x00!

It turns out that this exact ambiguity was found+fixed already by
upstream
commit 59b6359dd92d ("ARM: 8702/1: head-common.S: Clear lr before jumping to start_kernel()")
which landed in 4.15-rc1 but was not yet backported to stable/4.14.y.
Sent to stable in:
https://lore.kernel.org/stable/20200730180340.1724137-1-ndesaulniers@google.com/T/#u
That gives us a value in lr upon entry to save_stack() that's noteworthy
as a terminal condition during unwinding.  But why did we start
unwinding in start_kernel() in the first place?

1) A simple WARN_ON_ONCE was being triggered during start_kernel() due
to another patch that landed in v4.15-rc9 but wasn't backported to
stable/4.14.y. Sent to stable in:
https://lore.kernel.org/stable/20200727191746.3644844-1-ndesaulniers@google.com/T/#u

Read (or unwound; pun intended) in the order 1), 2), 3), 4) explains the
cascading chain of failures that lead to a kernel not booting.

Patch 0001 fixes 3) by adding a test for NULL to the conditions to stop
unwinding. This prevents the cascade from going further.
Patch 0002 fixes 4) by adding a fixup handler. It's not strictly
necessary right now, but I get the feeling that we might not be able to
trust the value of the link register pushed on the stack.  I'm guessing
a stack buffer overflow could overwrite this value. Either way,
triggering an exception during unwind should be prevented at all costs
to avoid infinite loops.
Patches 0003/0004 are cleanup/bikeshed, feel free to NACK them and I
don't mind dropping them.  They're just minor touchups I felt helped
readability from when I was debugging these. 0001 (and slightly so 0002)
are the only patches I really care about.

Nick Desaulniers (4):
  ARM: backtrace-clang: check for NULL lr
  ARM: backtrace-clang: add fixup for lr dereference
  ARM: backtrace-clang: give labels more descriptive names
  ARM: backtrace: use more descriptive labels

 arch/arm/lib/backtrace-clang.S | 34 +++++++++++++++++++++-------------
 arch/arm/lib/backtrace.S       | 30 +++++++++++++++---------------
 2 files changed, 36 insertions(+), 28 deletions(-)

-- 
2.28.0.163.g6104cc2f0b6-goog

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ