Message-ID: <20181001152324.72a20bea@gandalf.local.home>
Date:   Mon, 1 Oct 2018 15:23:24 -0400
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Daniel Wang <wonderfly@...gle.com>
Cc:     stable@...r.kernel.org, pmladek@...e.com,
        Alexander.Levin@...rosoft.com, akpm@...ux-foundation.org,
        byungchul.park@....com, dave.hansen@...el.com, hannes@...xchg.org,
        jack@...e.cz, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        mathieu.desnoyers@...icios.com, mgorman@...e.de, mhocko@...nel.org,
        pavel@....cz, penguin-kernel@...ove.SAKURA.ne.jp,
        peterz@...radead.org, tj@...nel.org, torvalds@...ux-foundation.org,
        vbabka@...e.cz, xiyou.wangcong@...il.com, pfeiner@...gle.com
Subject: Re: 4.14 backport request for dbdda842fe96f: "printk: Add console
 owner and waiter logic to load balance console writes"

On Thu, 27 Sep 2018 12:46:01 -0700
Daniel Wang <wonderfly@...gle.com> wrote:

> Prior to this change, the combination of `softlockup_panic=1` and
> `softlockup_all_cpu_backtrace=1` may result in a deadlock when the reboot path
> is trying to grab the console lock that is held by the stack trace printing
> path. What seems to be happening is that while there are multiple CPUs, only one
> of them is tasked to print the back trace of all CPUs. On a machine with many
> CPUs and a slow serial console (on Google Compute Engine for example), the stack
> trace printing routine hits a timeout and the reboot path kicks in. The latter
> then tries to print something else, but can't get the lock because it's still
> held by the earlier printing path. This is easily reproducible on a VM with 16+
> vCPUs on Google Compute Engine - which is a very common scenario.
> 
> A quick repro is available at
> https://github.com/wonderfly/printk-deadlock-repro. The system hangs 3 seconds
> into executing repro.sh. Credit for both the deadlock analysis and the repro
> goes to Peter Feiner.
> 
> Note that I have read previous discussions on backporting this to stable [1].
> The argument against the backport was that this is a non-trivial fix and
> is supposed to prevent hypothetical soft lockups. What we are hitting is a real
> deadlock, in production, however. Hence this request.
> 
> [1] https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3xk3t@pathway.suse.cz/T/#u
> 
> Serial console logs leading up to the deadlock. As can be seen the stack trace
> was incomplete because the printing path hit a timeout.

I'm fine with having this backported.

-- Steve
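
For readers following the thread, the quoted report comes down to one CPU holding the console lock while it drains a long backlog over a slow console, so a second path that needs the lock cannot get it in time. Below is a toy model of that situation, with and without an owner-to-waiter handoff of the kind named in the commit subject. It is a sketch under stated assumptions and not kernel code: the function name `lines_waited`, its parameters, and the one-line handoff granularity are all hypothetical and chosen only for illustration.

```python
def lines_waited(backlog: int, arrival: int, handoff: bool) -> int:
    """Toy model (hypothetical, not kernel code).

    One CPU owns the console lock and prints `backlog` lines. A second
    path starts wanting the lock when line index `arrival` is about to
    be printed. Returns how many lines the original owner prints after
    that point, before the second path can take the lock.
    """
    waited = 0
    for line in range(backlog):
        waiter_spinning = line >= arrival
        # The owner prints one line here while holding the lock.
        if waiter_spinning:
            waited += 1
            if handoff:
                # Assumed handoff rule: after finishing the current line,
                # the owner notices the waiter and passes the lock on.
                return waited
    # Without handoff the waiter only gets the lock once the backlog is empty.
    return waited


if __name__ == "__main__":
    # Hypothetical numbers: 1000 backtrace lines, waiter arrives at line 10.
    print("no handoff:", lines_waited(1000, 10, handoff=False))
    print("handoff:   ", lines_waited(1000, 10, handoff=True))
```

In this model the wait without handoff grows with the size of the backlog, which is why a large CPU count combined with a slow serial console makes the problem visible, while the wait with handoff is bounded by a single line.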
