Message-ID: <20140904212925.GB3116@lerouge>
Date:	Thu, 4 Sep 2014 23:29:27 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Catalin Iacob <iacobcatalin@...il.com>
Cc:	Dave Jones <davej@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Linux Kernel <linux-kernel@...r.kernel.org>
Subject: Re: nohz fail (was: perf related boot hang.)

On Thu, Sep 04, 2014 at 11:05:02PM +0200, Catalin Iacob wrote:
> On Thu, Sep 4, 2014 at 10:17 PM, Frederic Weisbecker <fweisbec@...il.com> wrote:
> > Yeah, that's expected. You need to apply the nine patches on top of -rc1:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> >         nohz/fixes
> >
> > "nohz: Restore NMI safe local irq work for local nohz kick" only fixes
> > part of the issue.
> 
> Ok, but if the whole series is needed, isn't it better if it all goes
> into 3.17? Otherwise 3.17 is a clear regression for some users; it's
> definitely for me since before 3.17-rc1 I never saw this bug and now I
> see it every time I do something CPU intensive. Maybe the regression
> is acceptable because it's confined to some CONFIG_NO_HZ_*
> combination (I think) which is still rather experimental, that's your
> call to make, but it's still a regression.

Yeah, the bug has been there for a while, but likely something merged in the
last -rc1 made it more likely to happen.

This is probably because we converted the remote nohz kick to use irq work
instead of the scheduler IPI. So it is more likely to fire, and if we are
unlucky enough, some tick sees the irq work before the irq work IPI can fire.

Or some code enqueues that irq work from the tick itself.
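
To make the ordering issue concrete, here is a toy model of the first case
above. It is not kernel code: ToyCpu, queue_work(), tick(), ipi() and run() are
made-up names, and the assumption that the tick drains whatever irq work is
pending is only there to illustrate "some tick sees the irq work before the
irq work IPI can fire".

```python
from collections import deque


class ToyCpu:
    """Toy model of one CPU's pending irq work (hypothetical, not kernel code)."""

    def __init__(self):
        self.pending = deque()   # queued work items, oldest first
        self.ipi_raised = False  # an IPI has been requested but not handled yet
        self.log = []            # (work name, context it ran in)

    def queue_work(self, name):
        # Queueing work also requests an IPI that is supposed to run it.
        self.pending.append(name)
        self.ipi_raised = True

    def _run_pending(self, context):
        # Run every pending item and record which context ran it.
        while self.pending:
            self.log.append((self.pending.popleft(), context))

    def tick(self):
        # Assumption: the tick also runs any work it finds pending.
        self._run_pending("tick")

    def ipi(self):
        # The IPI handler runs whatever is still pending when it arrives.
        if self.ipi_raised:
            self.ipi_raised = False
            self._run_pending("ipi")


def run(events):
    """Replay a sequence of 'queue', 'tick' and 'ipi' events on a fresh ToyCpu."""
    cpu = ToyCpu()
    for ev in events:
        if ev == "queue":
            cpu.queue_work("nohz_kick")
        elif ev == "tick":
            cpu.tick()
        elif ev == "ipi":
            cpu.ipi()
        else:
            raise ValueError(f"unknown event: {ev}")
    return cpu.log


print(run(["queue", "ipi", "tick"]))  # IPI arrives first
print(run(["queue", "tick", "ipi"]))  # tick arrives first
```

In the first sequence the kick runs from the IPI as intended; in the second the
tick gets there first and the kick runs from tick context, with the IPI finding
nothing left to do.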

Anyway, you're right that it belongs in the category of regressions.
Unfortunately the fix is invasive.

Also, I don't know of many users of nohz full, so this probably won't
have much impact. Or this could be a good way to find out who uses this
feature after all :o)

I'm not sure what I should do. Let's see what the final fix will look like;
Peter is proposing some simplifications. Then we'll know better.

BTW, do you run some specific workloads to trigger this?

Thanks.