Message-ID: <20180321072851.GB468@jagdpanzerIV>
Date: Wed, 21 Mar 2018 16:28:51 +0900
From: Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To: bugzilla-daemon@...zilla.kernel.org
Cc: sergey.senozhatsky@...il.com, Steven Rostedt <rostedt@...dmis.org>,
Petr Mladek <pmladek@...e.com>, linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [Bug 199003] console stalled, cause Hard LOCKUP.
On (03/20/18 09:34), bugzilla-daemon@...zilla.kernel.org wrote:
[..]
> Thanks very much.
> commit e480af09c49736848f749a43dff2c902104f6691 avoided the NMI watchdog
> trigger.
Hm, okay... But "touch_nmi_watchdog() everywhere printk/console-related"
is not exactly where I wanted us to be.
By the way e480af09c49736848f749a43dff2c902104f6691 is from 2006.
Are you sure you meant exactly that commit? What kernel do you use?
Are you saying that none of Steven's patches helped on your setups?
> And this patch may avoid long time blocking:
> https://lkml.org/lkml/2018/3/8/584
>
> We've tested it for several days.
Hm, printk_deferred is a bit dangerous; it moves console_unlock() to
IRQ context. So you can still have the problem of stuck CPUs; it's just
that now you shut up the watchdog. Did you test Steven's patches?
A tricky part about printk_deferred() is that it does not use the hand off
mechanism. And even more... What we have with the "printk vs printk"
scenario
    CPU0                  CPU1                  ...    CPUN
    printk                printk
     console_unlock        hand off                    printk
                            console_unlock              hand off
                                                         console_unlock
turns into a good old "one CPU prints it all" when we have a "printk vs
printk_deferred" case, because printk_deferred() just log_store()-s
messages and then _maybe_ grabs the console_sem from IRQ and invokes
console_unlock().
So it's something like this
    CPU0                  CPU1                  ...    CPUN
    printk                printk_deferred
     console_unlock                                    printk_deferred
     console_unlock
     console_unlock
     ...                  ...                          ...
                          printk_deferred              printk_deferred
     console_unlock
     console_unlock
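
The difference between the two timelines can be sketched with a small
userspace toy model. This is not kernel code and makes several simplifying
assumptions: CPUs log in strict round-robin order, the console owner prints
exactly one line per logged message, and a plain printk() caller always
takes over the console from the current owner. The names simulate(), NCPUS
and NMSGS are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define NCPUS 4     /* hypothetical number of CPUs in the toy model */
#define NMSGS 400   /* hypothetical number of messages logged in total */

/*
 * Toy model only. Fills printed[] with the number of lines each CPU pushed
 * to the console and returns the largest per-CPU count.
 *
 * deferred_callers == 0: every CPU uses printk(), so a caller takes over
 *                        the console from the current owner (hand off).
 * deferred_callers != 0: every CPU except the owner uses printk_deferred(),
 *                        which only stores the message and returns.
 */
static int simulate(int deferred_callers, int printed[NCPUS])
{
    int owner = 0;   /* CPU currently looping in console_unlock() */
    int total = 0;
    int max = 0;

    memset(printed, 0, NCPUS * sizeof(printed[0]));

    for (int i = 0; i < NMSGS; i++) {
        int caller = i % NCPUS;   /* CPUs log in round-robin order */

        /* The current owner pushes one line to the (slow) console. */
        printed[owner]++;

        /* Hand off: the new printk() caller becomes the console owner. */
        if (!deferred_callers && caller != owner)
            owner = caller;
    }

    for (int c = 0; c < NCPUS; c++) {
        total += printed[c];
        if (printed[c] > max)
            max = printed[c];
    }
    assert(total == NMSGS);   /* every line was printed exactly once */
    return max;
}
```

Calling simulate(0, printed) spreads the lines roughly evenly across the
CPUs, while simulate(1, printed) leaves all NMSGS lines on CPU0, which is
the "one CPU prints it all" pattern.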
// offtopic "I can has printk_kthread?"
You now touch_nmi_watchdog() from the console driver [well... at least this
is what e480af09c4973 is doing, but I'm not sure I see how come you didn't
have it applied], so that's why you don't see hard lockups on that CPU0. But
your printing CPU can still get stuck, which will defer RCU on that CPU, etc.
etc. etc. So I'd say that those two approaches
    printk_deferred + touch_nmi_watchdog
combined can do quite some harm. One thing is for sure - they don't really
fix any problems.
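
The point that touching the watchdog hides the symptom without shortening
the stall can be illustrated with another userspace toy model. It is only a
sketch under stated assumptions: one console line costs one "tick", the
watchdog fires once more than WATCHDOG_THRESH ticks pass without a
check-in, and resetting a counter stands in for touch_nmi_watchdog(). The
names print_backlog(), struct outcome and WATCHDOG_THRESH are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical: ticks without a check-in before the toy watchdog fires. */
#define WATCHDOG_THRESH 10

struct outcome {
    bool lockup_reported;   /* did the toy watchdog fire? */
    int  ticks_stuck;       /* ticks the CPU spent doing nothing but printing */
};

/*
 * Toy model only: one CPU prints `lines` console lines, one tick each,
 * and does no other work until the backlog is flushed.
 */
static struct outcome print_backlog(int lines, bool touch_watchdog)
{
    struct outcome out = { false, 0 };
    int since_checkin = 0;

    for (int i = 0; i < lines; i++) {
        out.ticks_stuck++;
        since_checkin++;

        if (touch_watchdog)
            since_checkin = 0;   /* stands in for touch_nmi_watchdog() */

        if (since_checkin > WATCHDOG_THRESH)
            out.lockup_reported = true;
    }
    return out;
}
```

With touch_watchdog set, lockup_reported stays false, but ticks_stuck is
the same as without it: the CPU is stuck for just as long, only silently.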
-ss