Message-ID: <1403155458.5189.54.camel@marge.simpson.net>
Date:	Thu, 19 Jun 2014 07:24:18 +0200
From:	Mike Galbraith <umgwanakikbuti@...il.com>
To:	paulmck@...ux.vnet.ibm.com
Cc:	Andi Kleen <ak@...ux.intel.com>,
	Dave Hansen <dave.hansen@...el.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Josh Triplett <josh@...htriplett.org>,
	"Chen, Tim C" <tim.c.chen@...el.com>,
	Christoph Lameter <cl@...ux.com>
Subject: Re: [bisected] pre-3.16 regression on open() scalability

On Wed, 2014-06-18 at 21:19 -0700, Paul E. McKenney wrote: 
> On Wed, Jun 18, 2014 at 08:38:16PM -0700, Andi Kleen wrote:
> > On Wed, Jun 18, 2014 at 07:13:37PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 18, 2014 at 06:42:00PM -0700, Andi Kleen wrote:
> > > > 
> > > > I still think it's totally the wrong direction to pollute so 
> > > > many fast paths with this obscure debugging check workaround
> > > > unconditionally.
> > > 
> > > OOM prevention should count for something, I would hope.
> > 
> > OOM in what scenario? This is getting bizarre.
> 
> On the bizarre part, at least we agree on something.  ;-)
> 
> CONFIG_NO_HZ_FULL booted with at least one nohz_full CPU.  Said CPU
> gets into the kernel and stays there, not necessarily generating RCU
> callbacks.  The other CPUs are very likely generating RCU callbacks.
> Because the nohz_full CPU is in the kernel, and because there are no
> scheduling-clock interrupts on that CPU, grace periods do not complete.
> Eventually, the callbacks from the other CPUs (and perhaps also some
> from the nohz_full CPU, for that matter) OOM the machine.
> 
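To make the shape of that concrete, a sketch with invented names (not
code from any real subsystem): one CPU loops in the kernel while the
rest keep queueing callbacks with call_rcu().

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	extern void do_one_item(void);	/* invented helper */

	struct foo {
		struct rcu_head rcu;
		int payload;
	};

	static void foo_free(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}

	/* All the other CPUs: retire objects, deferring each free
	 * until a full grace period has elapsed. */
	static void retire_foo(struct foo *f)
	{
		call_rcu(&f->rcu, foo_free);
	}

	/* The nohz_full CPU: stays in the kernel and never passes
	 * through a quiescent state, so no grace period completes,
	 * no callback ever runs, and the retired objects pile up
	 * until the box OOMs. */
	static void nohz_full_cpu_loop(void)
	{
		for (;;)
			do_one_item();	/* never sleeps, never schedules */
	}
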
> Now this scenario constitutes an abuse of CONFIG_NO_HZ_FULL, because it
> is intended for CPUs that execute either in userspace (in which case
> those CPUs are in extended quiescent states so that RCU can happily
> ignore them) or for real-time workloads with low CPU utilization (in
> which case RCU sees them go idle, which is also a quiescent state).
> But that won't stop people from abusing their kernels and complaining
> when things break.
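
For reference, the intended configuration is something like booting
with (CPU list illustrative):

	nohz_full=1-7 rcu_nocbs=1-7

so the marked CPUs each run a single userspace task and their RCU
callbacks are offloaded to housekeeping CPUs.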

IMHO, those people can keep the pieces.

I don't even enable RCU_BOOST in -rt kernels, because that safety net
has a price.  The instant Joe User picks up the -rt shovel, it's his
grave, and he gets to do the digging.  Instead of trying to save his
bacon, I hand him a slightly better shovel and let him prioritize all
kthreads, including workqueues.  Joe can dig all he wants to, and it's
on him; I just make sure he has the means to bury himself properly :)
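
The "slightly better shovel" is mostly just standard priority tooling
on the kthread in question, e.g. (policy, priority and target
illustrative):

	chrt -f -p 10 <kthread pid>	# make it SCHED_FIFO, prio 10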

> This same thing can also happen without CONFIG_NO_HZ_FULL, though
> the system has to work a bit harder.  In this case, the CPU looping
> in the kernel has scheduling-clock interrupts, but if all it does
> is cond_resched(), RCU is never informed of any quiescent states.
> The whole point of this patch is to make those cond_resched() calls,
> which are quiescent states, visible to RCU.
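
The case being fixed, as a sketch (loop helpers invented): a kernel
path that never blocks but is polite about it.  Each cond_resched()
below is a perfectly good quiescent state; the patch makes RCU
actually hear about it.

	while (more_to_do()) {		/* invented helper */
		process_one_unit();	/* invented helper */
		cond_resched();		/* a QS, now visible to RCU */
	}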
> 
> > If something keeps looping forever in the kernel creating 
> > RCU callbacks without any real quiescent states, it's simply broken.
> 
> I could get behind that.  But by that definition, there is a lot of
> breakage in the current kernel, especially as we move to larger CPU
> counts.

Not only larger CPU counts: skipping the rq clock update on wakeup (a
cycle-saving optimization) turned out to be deadly to boxen with a
zillion disks.  Our wakeup latency can be so incredibly horrible that
it was falsely attributed to the next task to run (the watchdog),
which was then throttled for long enough that big IO boxen panicked
during boot.

The root cause of that wasn't the optimization; it was the horrific
amounts of time we can spend locked up in the kernel.
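
(The throttling in question is presumably the RT bandwidth throttle,
i.e. the pair of knobs below, defaults shown: an RT task that burns
through its runtime budget within the period gets parked for the
remainder of that period.)

	/proc/sys/kernel/sched_rt_runtime_us	950000
	/proc/sys/kernel/sched_rt_period_us	1000000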

-Mike

