lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wjoeZ9_aiu+642ur=iGhGjfBQhRPURxX9Py+-B6coctXw@mail.gmail.com>
Date:   Wed, 19 Jun 2019 13:42:53 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Chris Wilson <chris@...is-wilson.co.uk>
Cc:     Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]

On Wed, Jun 19, 2019 at 12:19 PM Chris Wilson <chris@...is-wilson.co.uk> wrote:
>
> > Do you have the oops itself at all?
>
> An example at
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log
>
> The bug causing the oops is clearly a driver problem. The rc5 fallout
> just seems to be because of some shrinker changes affecting some object
> reaping that were unfortunately still active. What perturbed the CI
> team was the machine failed to panic & reboot.

Hmm. It's hard to guess at the cause of that. The oopses themselves
don't look like they are happening in any particularly bad context, so
all the normal reboot-on-oops etc stuff _should_ work.

So it would help a lot if you could bisect the bad problem at least a
bit, if it is at all reproducible. Because with no other clues, it's
hard to even guess at what might be up.

The fact that you say "NMI watchdog firing as we dumped the ftrace"
means that maybe it might be some ftrace / stacktrace issue where the
dumping itself leads to some endless loop, but who knows.

For example, one thing that has happened during this development cycle
is the stacktrace common infrastructure changes (arch_stack_walk() and
friends). I'm, not seeing why that would cause your issues, but I'm
adding a few random people for ftrace / stacktrace changes.

                     Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ