linux-kernel - Re: Crashes with 874bbfe600a6 in 3.18.25


Message-ID: <CA+55aFzpBgyWHh9bHUNW2vX+nJRLAmtXV3VFVazppb+SaY78AQ@mail.gmail.com>
Date:	Tue, 9 Feb 2016 10:06:04 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Tejun Heo <tj@...nel.org>
Cc:	Mike Galbraith <umgwanakikbuti@...il.com>,
	Michal Hocko <mhocko@...nel.org>, Jiri Slaby <jslaby@...e.cz>,
	Thomas Gleixner <tglx@...utronix.de>,
	Petr Mladek <pmladek@...e.com>, Jan Kara <jack@...e.cz>,
	Ben Hutchings <ben@...adent.org.uk>,
	Sasha Levin <sasha.levin@...cle.com>, Shaohua Li <shli@...com>,
	LKML <linux-kernel@...r.kernel.org>,
	stable <stable@...r.kernel.org>,
	Daniel Bilik <daniel.bilik@...system.cz>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: Crashes with 874bbfe600a6 in 3.18.25

On Tue, Feb 9, 2016 at 9:51 AM, Tejun Heo <tj@...nel.org> wrote:
>>
>>  (a) actually dequeue timers and work queues that are bound to a
>> particular CPU when a CPU goes down.
>>
> This goes the same for work items and timers.  If we want to do
> explicit dequeueing or flushing of cpu-bound stuff on cpu down, we'll
> have to either dedicate *_on() interfaces for correctness or introduce
> a separate set of interfaces to use for optimization and correctness.

We already do that. "add_timer_on()" for timers, and cpu !=
WORK_CPU_UNBOUND for work items.

>    Maybe we can get away with
> declaring that _on() usages are absolute.

I really think that anything else would be odd as hell. If you asked
for a timer (or work) on a particular CPU, and you get it on another
one, that's a bug.

It's much better to just dequeue those entries and say "sorry, your
CPU went away".

Of course, we could play around with just running them early at CPU-down
time (and anybody trying to requeue would get an error because the CPU
is in the process of going down), but that sounds like more work for
any users, and like a much more fundamental difference. The "just
silently dequeue" makes more sense, and pairs well with anything that
sets things up on CPU-up time (which a percpu entity will have to do
anyway).
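
To make the "just silently dequeue" idea concrete, here is a minimal userspace sketch. It is not kernel code: every name in it (`model_queue`, `model_cpu_down`, `MODEL_CPU_UNBOUND`, and so on) is hypothetical and only stands in for the real timer and workqueue machinery. It assumes that items queued for a specific CPU are dropped when that CPU goes down, that unbound items may migrate to another online CPU, and that queueing onto an offline CPU fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS     4
#define MODEL_MAX_ITEMS   8
#define MODEL_CPU_UNBOUND (-1)  /* hypothetical stand-in for WORK_CPU_UNBOUND */

struct model_item {
    int id;
    bool bound;                 /* true if the caller asked for this exact CPU */
};

struct model_cpu {
    bool online;
    size_t count;
    struct model_item items[MODEL_MAX_ITEMS];
};

static struct model_cpu model_cpus[MODEL_NR_CPUS];

/* Bring every modelled CPU online with an empty queue. */
static void model_init(void)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        model_cpus[cpu].online = true;
        model_cpus[cpu].count = 0;
    }
}

/* Lowest-numbered online CPU, or -1 if none is online. */
static int model_first_online(void)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if (model_cpus[cpu].online)
            return cpu;
    return -1;
}

/* Queue an item; returns the CPU it landed on, or -1 on failure. */
static int model_queue(int id, int cpu)
{
    bool bound = (cpu != MODEL_CPU_UNBOUND);

    if (bound) {
        /* A bound request is absolute: fail rather than run elsewhere. */
        if (cpu < 0 || cpu >= MODEL_NR_CPUS || !model_cpus[cpu].online)
            return -1;
    } else {
        cpu = model_first_online();
        if (cpu < 0)
            return -1;
    }

    struct model_cpu *q = &model_cpus[cpu];
    if (q->count == MODEL_MAX_ITEMS)
        return -1;
    q->items[q->count].id = id;
    q->items[q->count].bound = bound;
    q->count++;
    return cpu;
}

/* Take a CPU down; returns how many bound items were dropped, or -1. */
static int model_cpu_down(int cpu)
{
    if (cpu < 0 || cpu >= MODEL_NR_CPUS || !model_cpus[cpu].online)
        return -1;

    struct model_cpu *q = &model_cpus[cpu];
    q->online = false;

    int target = model_first_online();
    if (target < 0) {
        q->online = true;       /* refuse to offline the last CPU */
        return -1;
    }

    int dropped = 0;
    for (size_t i = 0; i < q->count; i++) {
        if (q->items[i].bound) {
            dropped++;          /* "sorry, your CPU went away" */
            continue;
        }
        /* Unbound items never cared where they ran, so they may move. */
        struct model_cpu *t = &model_cpus[target];
        assert(t->count < MODEL_MAX_ITEMS);
        t->items[t->count++] = q->items[i];
    }
    q->count = 0;
    return dropped;
}
```

In this model, a caller that wants per-CPU behaviour would requeue its item from whatever runs at CPU-up time, which is the pairing described above.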

> So, how about reverting 874bbfe6 and performing random foreign
> queueing during -rc's for a couple cycles so that we can at least find
> out the broken ones quickly in devel branch and backport fixes as
> they're found?

Yeah, that sounds good to me. Having some "cpu work/timer debug"
config option that ends up spreading out non-cpu-specific timers and
work in order to find bugs sounds like a good idea. And I don't think
it should be limited to rc releases, I think lots of people might be
willing to run that (the same way we had people - and even
distributions - that did PAGEALLOC_DEBUG which is a much bigger
hammer).
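
As a rough illustration of what such a debug option could do, the sketch below models the CPU selection step in userspace C. Nothing here is an existing kernel interface: `debug_spread`, `dbg_pick_cpu`, and `DBG_CPU_UNBOUND` are hypothetical names, and the model assumes that unbound items normally stay on the local CPU. With the option on, unbound items are scattered pseudo-randomly across CPUs, so code that wrongly relies on local execution breaks quickly, while explicit per-CPU requests are left alone.

```c
#include <stdbool.h>
#include <stdint.h>

#define DBG_NR_CPUS     4
#define DBG_CPU_UNBOUND (-1)    /* hypothetical stand-in for WORK_CPU_UNBOUND */

/* Hypothetical stand-in for a "cpu work/timer debug" config option. */
static bool debug_spread = false;

/* Small linear congruential generator, so the sketch needs no libc state. */
static uint32_t dbg_state = 1;

static uint32_t dbg_next(void)
{
    dbg_state = dbg_state * 1664525u + 1013904223u;
    return dbg_state >> 16;     /* drop the weak low-order bits */
}

/*
 * Choose the CPU for a timer or work item.
 * requested: a specific CPU, or DBG_CPU_UNBOUND for "don't care".
 * local_cpu: the CPU the caller is currently running on.
 */
static int dbg_pick_cpu(int requested, int local_cpu)
{
    if (requested != DBG_CPU_UNBOUND)
        return requested;       /* explicit requests are absolute */
    if (!debug_spread)
        return local_cpu;       /* assumed default: stay local */
    /* Debug mode: deliberately pick a possibly foreign CPU. */
    return (int)(dbg_next() % DBG_NR_CPUS);
}
```

A real option would also have to restrict the choice to online CPUs; the sketch leaves that out to keep the selection logic visible.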

             Linus