linux-kernel - Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

Message-ID: <20181001201309.GA9835@amd>
Date:   Mon, 1 Oct 2018 22:13:10 +0200
From:   Pavel Machek <pavel@....cz>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Daniel Wang <wonderfly@...gle.com>, stable@...r.kernel.org,
        pmladek@...e.com, Alexander.Levin@...rosoft.com,
        akpm@...ux-foundation.org, byungchul.park@....com,
        dave.hansen@...el.com, hannes@...xchg.org, jack@...e.cz,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        mathieu.desnoyers@...icios.com, mgorman@...e.de, mhocko@...nel.org,
        penguin-kernel@...ove.SAKURA.ne.jp, peterz@...radead.org,
        tj@...nel.org, torvalds@...ux-foundation.org, vbabka@...e.cz,
        xiyou.wangcong@...il.com, pfeiner@...gle.com
Subject: Re: 4.14 backport request for dbdda842fe96f: "printk: Add console
 owner and waiter logic to load balance console writes"

On Mon 2018-10-01 15:23:24, Steven Rostedt wrote:
> On Thu, 27 Sep 2018 12:46:01 -0700
> Daniel Wang <wonderfly@...gle.com> wrote:
> 
> > Prior to this change, the combination of `softlockup_panic=1` and
> > `softlockup_all_cpu_stacktrace=1` may result in a deadlock when the reboot path
> > is trying to grab the console lock that is held by the stack trace printing
> > path. What seems to be happening is that while there are multiple CPUs, only one
> > of them is tasked to print the back trace of all CPUs. On a machine with many
> > CPUs and a slow serial console (on Google Compute Engine for example), the stack
> > trace printing routine hits a timeout and the reboot path kicks in. The latter
> > then tries to print something else, but can't get the lock because it's still
> > held by the earlier printing path. This is easily reproducible on a VM with 16+
> > vCPUs on Google Compute Engine, which is a very common scenario.
> > 
> > A quick repro is available at
> > https://github.com/wonderfly/printk-deadlock-repro. The system hangs 3 seconds
> > into executing repro.sh. Credit for both the deadlock analysis and the repro
> > goes to Peter Feiner.
> > 
> > Note that I have read previous discussions on backporting this to stable [1].
> > The argument against the backport was that this is a non-trivial fix and
> > is supposed to prevent hypothetical soft lockups. What we are hitting is a real
> > deadlock, in production, however. Hence this request.
> > 
> > [1] https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3xk3t@pathway.suse.cz/T/#u
> > 
> > Serial console logs leading up to the deadlock. As can be seen, the stack trace
> > was incomplete because the printing path hit a timeout.
> 
> I'm fine with having this backported.

Dunno. Is the patch perhaps a bit too complex? This is not exactly a
trivial bugfix.

pavel@duo:/data/l/clean-cg$ git show dbdda842fe96f | diffstat
 printk.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-

I see that it is pretty critical to Daniel, but maybe a kernel with
console locking redone should no longer be called 4.4?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
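
The lock contention described in the quoted report can be illustrated with a
small user-space model. This is only a sketch, not kernel code: the function
name, the events, and the timeout are hypothetical, and a plain Python lock
stands in for the console lock. It assumes the printing path makes no progress
while the reboot path is waiting, which is the situation the report describes.

```python
import threading


def reboot_path_gets_console_lock(timeout_s=0.05):
    """Return True if the simulated reboot path acquires the console lock."""
    console_lock = threading.Lock()        # stand-in for the console lock
    printer_holds_lock = threading.Event()
    let_printer_finish = threading.Event()

    def backtrace_printer():
        # One CPU prints the backtraces of all CPUs while holding the lock.
        with console_lock:
            printer_holds_lock.set()
            # Model a slow serial console: printing is not done until signalled.
            let_printer_finish.wait()

    printer = threading.Thread(target=backtrace_printer)
    printer.start()
    printer_holds_lock.wait()

    # The reboot path kicks in and wants to print something else. A truly
    # blocked path would wait forever; the timeout only lets the sketch end.
    acquired = console_lock.acquire(timeout=timeout_s)
    if acquired:
        console_lock.release()

    # Clean up so the helper thread exits.
    let_printer_finish.set()
    printer.join()
    return acquired
```

Calling reboot_path_gets_console_lock() returns False in this model, since the
printing thread still holds the lock when the reboot path asks for it.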
