linux-kernel - Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI


Message-ID: <20140507160504.GC16694@localhost.localdomain>
Date:	Wed, 7 May 2014 18:05:08 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>
Cc:	linux-kernel@...r.kernel.org, mingo@...nel.org, hpa@...or.com,
	paulmck@...ux.vnet.ibm.com, akpm@...ux-foundation.org,
	khilman@...aro.org, tglx@...utronix.de, axboe@...com,
	linux-tip-commits@...r.kernel.org
Subject: Re: [tip:timers/nohz] nohz: Move full nohz kick to its own IPI

On Wed, May 07, 2014 at 05:37:36PM +0200, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 05:29:24PM +0200, Frederic Weisbecker wrote:
> > On Wed, May 07, 2014 at 05:17:35PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 05, 2014 at 05:34:08PM +0200, Frederic Weisbecker wrote:
> > > > On Mon, May 05, 2014 at 05:12:28PM +0200, Peter Zijlstra wrote:
> > > > > > Note the current ordering:
> > > > > > 
> > > > > >     cmpxchg(&qsd->pending, 0, 1)       get ipi
> > > > > >     csd_lock(qsd->csd)                 xchg(&qsd->pending, 1)
> > > > > >     send ipi                           csd_unlock(qsd->csd)
> > > > > > 
> > > > > > 
> > > > > > > So there shouldn't be racing updaters. Also the ipi sender shouldn't
> > > > > > > race with the ipi receiver: the updater should always eventually see
> > > > > > > the unlock happening.
> > > > > 
> > > > > Yeah, I've not spotted how this particular train wreck happens either.
> > > > > 
> > > > > > The problem is reproduction: it took me 9 hours to confirm I could
> > > > > > reproduce the problem on my machine. So how long do I run it with this
> > > > > > patch reverted to show it's gone?
> > > > 
> > > > > Maybe it could be favoured by cpu hotplug. Anyway, converting to irq_work
> > > > > should fix it.
> > > 
> > > Ingo needs a commit msg for the revert of this patch; do you think you
> > > have time to look into _why_ this patch is broken and write such a
> > > thing?
> > 
> > I can try but I need to reproduce it. Do you have any clue on how to do so?
> > Also which HEAD were you guys using?
> 
> Ha!, so I was running a tip/master with that commit in -- a few days
> ago, v3.15-rc4-1644-g5c658b0cdf22 might've been it.
> 
> Then I ran it on my dual socket AMD interlagos, with:
> 
> while :; do make O=allyesconfig-build/ clean; make O=allyesconfig-build/ -j96 -s; done
> 
> for 9 hours, and then got empty RCU stall warns and a bricked machine.
> 
> I might still have the .config, but I don't think there was anything
> particularly odd about the config other than having NOHZ_FULL enabled.
> 
> The only way I found this patch was by staring at some RCU stall warns
> Ingo managed to get, sometimes they actually got backtraces in them
> apparently.
> 
> According to Ingo the bigger the machine the faster it reproduces, but
> reproduction times, even for these 32 cpu machines, are in the many
> hours range.

Ok then, I'll try something.

But note that those commits aren't upstream yet and they are in a separate
branch, tip:timers/nohz, with no other non-upstream commits.

And I work alone on this branch.

So we might as well zap these commits and replace them with the irq_work_on()
conversion (still preparing that).
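
For reference, the pending-flag ordering quoted above can be modeled in
userspace roughly as follows. This is only a sketch with C11 atomics, not the
kernel code: the struct and function names are made up for illustration, the
csd lock is reduced to a plain flag, and it assumes the receiver resets
pending to 0 so that a later kick can be queued again.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; names are illustrative, not kernel API. */
struct qsd_model {
	atomic_int pending;	/* 1 while a kick is queued and not yet consumed */
	atomic_int csd_locked;	/* stands in for csd_lock()/csd_unlock() */
	int ipis_sent;		/* counts how many "IPIs" the model has sent */
};

/* Sender side: only the caller that flips pending 0 -> 1 sends the IPI. */
static bool model_queue_kick(struct qsd_model *q)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&q->pending, &expected, 1))
		return false;			/* already pending, nothing to send */

	atomic_store(&q->csd_locked, 1);	/* models csd_lock() */
	q->ipis_sent++;				/* models "send ipi" */
	return true;
}

/* Receiver side: consume pending (assumed reset to 0), then release the csd. */
static bool model_handle_ipi(struct qsd_model *q)
{
	bool was_pending = atomic_exchange(&q->pending, 0) == 1;

	atomic_store(&q->csd_locked, 0);	/* models csd_unlock() */
	return was_pending;
}
```

In this model a second model_queue_kick() before the receiver runs returns
false and sends nothing, which is the "no racing updaters" property described
in the quoted diagram. It says nothing about why the real patch breaks.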