linux-kernel - Re: [bisected] pre-3.16 regression on open() scalability

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140619041900.GD4669@linux.vnet.ibm.com>
Date:	Wed, 18 Jun 2014 21:19:00 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Andi Kleen <ak@...ux.intel.com>
Cc:	Dave Hansen <dave.hansen@...el.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Josh Triplett <josh@...htriplett.org>,
	"Chen, Tim C" <tim.c.chen@...el.com>,
	Christoph Lameter <cl@...ux.com>
Subject: Re: [bisected] pre-3.16 regression on open() scalability

On Wed, Jun 18, 2014 at 08:38:16PM -0700, Andi Kleen wrote:
> On Wed, Jun 18, 2014 at 07:13:37PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 18, 2014 at 06:42:00PM -0700, Andi Kleen wrote:
> > > 
> > > I still think it's totally the wrong direction to pollute so 
> > > many fast paths with this obscure debugging check workaround
> > > unconditionally.
> > 
> > OOM prevention should count for something, I would hope.
> 
> OOM in what scenario? This is getting bizarre.

On the bizarre part, at least we agree on something.  ;-)

CONFIG_NO_HZ_FULL booted with at least one nohz_full CPU.  Said CPU
gets into the kernel and stays there, not necessarily generating RCU
callbacks.  The other CPUs are very likely generating RCU callbacks.
Because the nohz_full CPU is in the kernel, and because there are no
scheduling-clock interrupts on that CPU, grace periods do not complete.
Eventually, the callbacks from the other CPUs (and perhaps also some
from the nohz_full CPU, for that matter) OOM the machine.

Now this scenario constitutes an abuse of CONFIG_NO_HZ_FULL, because it
is intended for CPUs that execute either in userspace (in which case
those CPUs are in extended quiescent states so that RCU can happily
ignore them) or for real-time workloads with low CPU untilization (in
which case RCU sees them go idle, which is also a quiescent state).
But that won't stop people from abusing their kernels and complaining
when things break.

This same thing can also happen without CONFIG_NO_HZ full, though
the system has to work a bit harder.  In this case, the CPU looping
in the kernel has scheduling-clock interrupts, but if all it does
is cond_resched(), RCU is never informed of any quiescent states.
The whole point of this patch is to make those cond_resched() calls,
which are quiescent states, visible to RCU.

> If something keeps looping forever in the kernel creating 
> RCU callbacks without any real quiescent states it's simply broken.

I could get behind that.  But by that definition, there is a lot of
breakage in the current kernel, especially as we move to larger CPU
counts.

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/