lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 19 Aug 2015 17:38:27 +0200
From:	Jan Kara <jack@...e.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	LKML <linux-kernel@...r.kernel.org>, pmladek@...e.com,
	rostedt@...dmis.org, Gavin Hu <gavin.hu.2010@...il.com>,
	KY Srinivasan <kys@...rosoft.com>, Jan Kara <jack@...e.cz>
Subject: [PATCH 0/4] printk: Softlockup avoidance

From: Jan Kara <jack@...e.cz>

Hello,

since lately there were several attempts at dealing with softlockups due
to heavy printk traffic [1] [2] and I've been also privately pinged by
couple of people about the state of the patch set, I've decided to respin
the patch set.

To remind the original problem:

Currently, console_unlock() prints messages from kernel printk buffer to
console while the buffer is non-empty. When serial console is attached,
printing is slow and thus other CPUs in the system have plenty of time
to append new messages to the buffer while one CPU is printing. Thus the
CPU can spend unbounded amount of time doing printing in console_unlock().
This is especially serious when printk() gets called under some critical
spinlock or with interrupts disabled.
    
In practice users have observed a CPU can spend tens of seconds printing
in console_unlock() (usually during boot when hundreds of SCSI devices
are discovered) resulting in RCU stalls (CPU doing printing doesn't
reach quiescent state for a long time), softlockup reports (IPIs for the
printing CPU don't get served and thus other CPUs are spinning waiting
for the printing CPU to process IPIs), and eventually a machine death
(as messages from stalls and lockups append to printk buffer faster than
we are able to print). So these machines are unable to boot with serial
console attached. Also during artificial stress testing SATA disk
disappears from the system because its interrupts aren't served for too
long.

This series addresses the problem in the following way: If CPU has printed
more that printk_offload (defaults to 1000) characters, it wakes up one
of dedicated printk kthreads (we don't use workqueue because that has
deadlock potential if printk was called from workqueue code). Once we find
out kthread is spinning on a lock, we stop printing, drop console_sem, and
let kthread continue printing. Since there are two printing kthreads, they
will pass printing between them and thus no CPU gets hogged by printing.

Changes since the last posting [3]:
* I have replaced the state machine to pass printing and spinning on
  console_sem with a simple spinlock which makes the code
  somewhat easier to read and verify.
* Some of the patches were merged so I dropped them.

								Honza

[1] https://lkml.org/lkml/2015/7/8/215
[2] http://marc.info/?l=linux-kernel&m=143929238407816&w=2
[3] https://lkml.org/lkml/2014/3/17/68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ