linux-kernel - Re: [PATCH v10 1/2] printk: Make printk() completely async

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160406082758.GA3554@quack.suse.cz>
Date:	Wed, 6 Apr 2016 10:27:58 +0200
From:	Jan Kara <jack@...e.cz>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
	Jan Kara <jack@...e.com>, Petr Mladek <pmladek@...e.com>,
	Tejun Heo <tj@...nel.org>,
	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
	linux-kernel@...r.kernel.org,
	Byungchul Park <byungchul.park@....com>,
	Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
	Jan Kara <jack@...e.cz>
Subject: Re: [PATCH v10 1/2] printk: Make printk() completely async

On Mon 04-04-16 15:51:49, Andrew Morton wrote:
> On Tue,  5 Apr 2016 01:57:27 +0900 Sergey Senozhatsky <sergey.senozhatsky@...il.com> wrote:
> 
> > From: Jan Kara <jack@...e.cz>
> > 
> > Currently, printk() sometimes waits for message to be printed to console
> > and sometimes it does not (when console_sem is held by some other
> > process). In case printk() grabs console_sem and starts printing to
> > console, it prints messages from kernel printk buffer until the buffer
> > is empty. When serial console is attached, printing is slow and thus
> > other CPUs in the system have plenty of time to append new messages to
> > the buffer while one CPU is printing. Thus the CPU can spend unbounded
> > amount of time doing printing in console_unlock().  This is especially
> > serious problem if the printk() calling console_unlock() was called with
> > interrupts disabled.
> > 
> > In practice users have observed a CPU can spend tens of seconds printing
> > in console_unlock() (usually during boot when hundreds of SCSI devices
> > are discovered) resulting in RCU stalls (CPU doing printing doesn't
> > reach quiescent state for a long time), softlockup reports (IPIs for the
> > printing CPU don't get served and thus other CPUs are spinning waiting
> > for the printing CPU to process IPIs), and eventually a machine death
> > (as messages from stalls and lockups append to printk buffer faster than
> > we are able to print). So these machines are unable to boot with serial
> > console attached. Another observed issue is that due to slow printk,
> > hardware discovery is slow and udev times out before kernel manages to
> > discover all the attached HW. Also during artificial stress testing SATA
> > disk disappears from the system because its interrupts aren't served for
> > too long.
> > 
> > This patch makes printk() completely asynchronous (similar to what
> > printk_deferred() did until now). It appends message to the kernel
> > printk buffer and wake_up()s a special dedicated kthread to do the
> > printing to console. This has the advantage that printing always happens
> > from a schedulable contex and thus we don't lockup any particular CPU or
> > even interrupts. Also it has the advantage that printk() is fast and
> > thus kernel booting is not slowed down by slow serial console.
> > Disadvantage of this method is that in case of crash there is higher
> > chance that important messages won't appear in console output (we may
> > need working scheduling to print message to console). We somewhat
> > mitigate this risk by switching printk to the original method of
> > immediate printing to console if oops is in progress.  Also for
> > debugging purposes we provide printk.synchronous kernel parameter which
> > resorts to the original printk behavior.
> > 
> > printk() is expected to work under different conditions and in different
> > scenarios, including corner cases of OOM when all of the workers are busy
> > (e.g. allocating memory), thus printk() uses its own dedicated printing
> > kthread, rather than relying on workqueue (even with WQ_MEM_RECLAIM bit
> > set we potentially can receive delays in printing until workqueue
> > declares a ->mayday, as noted by Tetsuo Handa).
> 
> The whole idea remains worrisome.  It is definitely making printk()
> less reliable in the vast majority of cases: what happens if the
> scheduler is busted or random memory has been scribbled on, etc.

So the scheduler dependency was always there due to handling of
console_sem and some console drivers possibly scheduling out while holding
console_sem. Recently the dependency has increased by adding cond_resched()
into console_unlock() loop in 8d91f8b15361 (printk: do cond_resched()
between lines while outputting to consoles). Also the changelog of that
commit explains another case besides 'tons of SCSI devices' which leads to
printk lockups. I have also seen a case where printing of 'martian packets'
killed the machine because it was too intensive (but it has been an older
kernel and I think networking people have added some ratelimiting for those
messages so it shouldn't happen anymore).

Now this commit makes the dependency on the scheduler even harder - unless
we know something is wrong (in_panic set) we rely on scheduler to schedule
the printing kthread. So I agree there is some risk in this patch.

> All this downside to handle (afaict) one special case.  Surely there is
> another way?  For example (but feel free to suggest other approaches!)
> can we put some limit on the number of extra characters which the
> printing task will print?  Once that limit is hit, new printk callers
> will spin until they can get in and do some printing themselves.  Or
> something else?

We have been through this a few times without much success. Last revision
of my patches did what you suggest but although the idea sounds simple, the
implementation turned out to be relatively complex and even you didn't
quite like it [1]. Moreover when I raised this issue at Kernel Summit 2015,
Hannes Reinecke mentioned that even these patches are not enough to make
his huge machine boot with serial console because printing still delays
device probing too much and systemd times out during boot. Then Linus said
that he'd want to keep things simple and just make printk asynchronous and
switch to synchronous in case problems are detected - which is what I and
Sergey have implemented in this patch.

So currently I'm of the opinion that we should merge this patch and see how
well it works in reality. I've tested it and it works for me and Sergey.
Maybe I can try breaking it some more by crashing the in various ways
kernel if it makes you calmer. In case we see problems in some setups,
there's always the kernel option to return to the original behavior and we
can work from there improving what we have...

[1] Message-Id: <20150918151411.a3fa65c3e4f33f9f2ddf1fd8@...ux-foundation.org>

> > index 3d28b50..c23a5bd 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -3122,6 +3122,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  	printk.time=	Show timing data prefixed to each printk message line
> >  			Format: <bool>  (1/Y/y=enable, 0/N/n=disable)
> >  
> > +	printk.synchronous=
> > +			By default kernel messages are printed to console
> > +			asynchronously (except during early boot or when oops
> > +			is happening). That avoids kernel stalling behind slow
> > +			serial console and thus avoids softlockups, interrupt
> > +			timeouts, or userspace timing out during heavy printing.
> > +			However for debugging problems, printing messages to
> > +			console immediately may be desirable. This option
> > +			enables such behavior.
> 
> Well, it's good that we have this.
> 
> It would be better if it was runtime-controllable - changing boot
> parameters is a bit of a pain.  In fact with this approach, your
> zillions-of-scsi-disks scenario becomes less problematic: do the async
> offloading during the boot process then switch back to the more
> reliable sync printing late in boot.

Doing this should be relatively easy. It would be userspace's decision
whether they want more reliable or faster printk. Sounds fine with me.

> > +static int __init init_printk_kthread(void)
> > +{
> > +	struct task_struct *thread;
> > +
> > +	if (printk_sync)
> > +		return 0;
> > +
> > +	thread = kthread_run(printk_kthread_func, NULL, "printk");
> 
> This gets normal scheduling policy, so a spinning userspace SCHED_FIFO
> task will block printk for ever.  This seems bad.

I have to research this a bit but won't the SCHED_FIFO task that has
potentially unbounded amount of work lockup the CPU even though it does
occasional cond_resched()?

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR