linux-kernel - Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]


Message-ID: <CAHk-=wjoeZ9_aiu+642ur=iGhGjfBQhRPURxX9Py+-B6coctXw@mail.gmail.com>
Date:   Wed, 19 Jun 2019 13:42:53 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Chris Wilson <chris@...is-wilson.co.uk>
Cc:     Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]

On Wed, Jun 19, 2019 at 12:19 PM Chris Wilson <chris@...is-wilson.co.uk> wrote:
>
> > Do you have the oops itself at all?
>
> An example at
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log
>
> The bug causing the oops is clearly a driver problem. The rc5 fallout
> just seems to be because of some shrinker changes affecting some object
> reaping that were unfortunately still active. What perturbed the CI
> team was the machine failed to panic & reboot.

Hmm. It's hard to guess at the cause of that. The oopses themselves
don't look like they are happening in any particularly bad context, so
all the normal reboot-on-oops etc stuff _should_ work.

So it would help a lot if you could bisect the bad problem at least a
bit, if it is at all reproducible. Because with no other clues, it's
hard to even guess at what might be up.

The fact that you say "NMI watchdog firing as we dumped the ftrace"
means that maybe it might be some ftrace / stacktrace issue where the
dumping itself leads to some endless loop, but who knows.

For example, one thing that has happened during this development cycle
is the stacktrace common infrastructure changes (arch_stack_walk() and
friends). I'm not seeing why that would cause your issues, but I'm
adding a few random people for ftrace / stacktrace changes.

                     Linus