linux-kernel - Re: [PATCH] printk: Avoid softlockups in console

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130116235529.GA10251@quack.suse.cz>
Date:	Thu, 17 Jan 2013 00:55:29 +0100
From:	Jan Kara <jack@...e.cz>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Jan Kara <jack@...e.cz>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	LKML <linux-kernel@...r.kernel.org>, jslaby@...e.cz
Subject: Re: [PATCH] printk: Avoid softlockups in console_unlock()

On Wed 16-01-13 14:50:05, Andrew Morton wrote:
> > > > We fix the issue by printing at most 1 KB of messages (unless we are in an
> > > > early boot stage or oops is happening) in one console_unlock() call. The rest
> > > > of the buffer will be printed either by further callers to printk() or by a
> > > > queued work.
> > > 
> > > Complex.  Did you try just putting a touch_nmi_watchdog() in the loop?
> >   I didn't try that. I suppose touch_nmi_watchdog() +
> > rcu_cpu_stall_reset() would make the messages go away (yes, RCU eventually
> > freaks out as well). But is it really sane that we keep single CPU busy,
> > unusable for anything else, for such a long time? There can be no RCU
> > callbacks processed, no IPIs are processed (which is what triggers
> > softlockup reports), etc.
> 
> What's not sane is doing large amounts of printk over a slow device.
  Yeah, but serial consoles are handy... These sysadmins are pretty
religious about them and when the machine doesn't boot with serial console
enabled they complain. Nitpickers!

> > I agree that if we silence all the warnings, everything will eventually
> > hang waiting for the stalled CPU, that will finish the printing and things
> > start from the beginning (we tried silencing RCU with rcu_cpu_stall_reset()
> > and that makes the machine boot eventually). But it seems like papering
> > over a real problem?  
> 
> Well not really - we're doing what the printk() caller asked us to do -
> to synchronously print stuff.  And simply sitting there pumping out the
> characters is the simplest, most straightforward thing to do.  And
> printk() should be simple and straightforward.
  Except that printk() isn't always synchronous. It writes a message into
the kernel buffer. If there is noone else pumping characters from the buffer
into console at that moment, the printk caller starts doing so. But
otherwise printk() just returns and pumping of characters is left on
whoever started doing that thankless job. So that poor guy ends up doing
the pumping for all the others...

> If this is all a problem then the calling code should stop doing so
> much printing!
  It's mostly a device discovery that triggers the issues in practice. They
have over thousand of SCSI disks attached (multipath in use) and when each
disk prints ~400 bytes of messages (just check your dmesg) you end up with
~30s worth of printing over 115200 console.

> And punting the operation to a kernel thread is a pretty radical change
> - it surely adds significant risk that output will be lost.
  I agree there is a higher chance the output will be lost.

> So hrm, I dunno.  Can we just put the touch_nmi_watchdog() in there
> intially, see if it fixes things?  If people continue to hit problems
> then we can take a second look?
  OK, I'll see if I can get this tested on one of those machines...

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/