linux-kernel - Re: Workqueues splat due to ending up on wrong CPU

Message-ID: <20191203174547.GG2889@paulmck-ThinkPad-P72>
Date:   Tue, 3 Dec 2019 09:45:47 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Tejun Heo <tj@...nel.org>, jiangshanlai@...il.com,
        linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: Workqueues splat due to ending up on wrong CPU

On Tue, Dec 03, 2019 at 11:00:10AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 02, 2019 at 03:39:44PM -0800, Paul E. McKenney wrote:
> 
> > I think that I do not understand the code, but I never let that stop
> > me from asking stupid questions!  ;-)
> > 
> > Suppose that a given worker is bound to a particular CPU, but has no
> > work pending, and is therefore sleeping in the schedule() call near the
> > end of worker_thread().  During this time, its CPU goes offline and then
> > comes back online.  Doesn't this break that task's affinity to that CPU?
> 
> No. The thing about sleeping tasks is that they're not in fact on any
> CPU at all. Only when a task wakes up do we concern ourselves with
> placing it. If at that time we find the CPU it was constrained to is no
> longer with us, then we go break affinity.
> 
> But if the CPU went away and came back while the task was asleep, it
> will not notice anything.

Good point, and yes, you have told me this before.

Furthermore, in all of these cases, the process was supposed to be
running on CPU 0, which cannot be taken offline on any of the systems
under test.  That leads me to wonder whether the workqueue CPU-online
notifier is sometimes moving more kthreads to the newly onlined CPU than
it is supposed to.  Tejun, could that be happening?

							Thanx, Paul
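
The wake-time placement rule Peter describes above can be illustrated
with a small toy model.  This is only a sketch and not kernel code: the
Task class, the wake_up() and go_to_sleep() helpers, and the "pick the
lowest-numbered allowed CPU" policy are all assumptions made up for
illustration.  It shows just the one property under discussion: a
sleeping task is on no CPU, so an offline/online cycle that completes
while it sleeps leaves its affinity intact, and affinity is broken only
if the CPU is still gone at wakeup.

```python
# Toy model of wake-time CPU placement (illustrative only, not kernel code).

class Task:
    def __init__(self, name, allowed):
        self.name = name
        self.allowed = set(allowed)   # affinity mask (hypothetical field)
        self.cpu = None               # None while asleep: on no CPU at all
        self.affinity_broken = False


def go_to_sleep(task):
    """A sleeping task is not associated with any CPU."""
    task.cpu = None


def wake_up(task, online):
    """Place the task at wakeup; break affinity only if no allowed CPU is online."""
    candidates = task.allowed & online
    if not candidates:
        # The CPU the task was constrained to is gone: fall back to any online CPU.
        task.allowed = set(online)
        task.affinity_broken = True
        candidates = task.allowed
    task.cpu = min(candidates)        # assumed policy: lowest-numbered CPU
    return task.cpu


if __name__ == "__main__":
    online = {0, 1, 2, 3}
    worker = Task("worker", {2})
    wake_up(worker, online)
    go_to_sleep(worker)

    # CPU 2 goes offline and comes back while the worker sleeps.
    online.discard(2)
    online.add(2)
    print(wake_up(worker, online), worker.affinity_broken)

    # CPU 2 is still offline when the worker wakes up.
    go_to_sleep(worker)
    online.discard(2)
    print(wake_up(worker, online), worker.affinity_broken)
```

In this model the first wakeup after the offline/online cycle still
lands on CPU 2 with affinity intact; only the second wakeup, with CPU 2
still offline, breaks affinity.  A task bound to a CPU that never goes
offline would therefore never be moved by this path, which is consistent
with looking at the CPU-online handling instead.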