linux-kernel - Re: Workqueues splat due to ending up on wrong CPU

Message-ID: <20191203181359.GD2196666@devbig004.ftw2.facebook.com>
Date:   Tue, 3 Dec 2019 10:13:59 -0800
From:   Tejun Heo <tj@...nel.org>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>, jiangshanlai@...il.com,
        linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: Workqueues splat due to ending up on wrong CPU

Hello, Paul.

On Tue, Dec 03, 2019 at 09:45:47AM -0800, Paul E. McKenney wrote:
> Good point, and yes, you have told me this before.
> 
> Furthermore, in all of these cases, the process was supposed to be
> running on CPU 0, which cannot be taken offline on any of the systems
> under test.  Which is leading me to wonder if the workqueue CPU-online
> notifier is sometimes moving more kthreads to the newly onlined CPU than
> it is supposed to.  Tejun, could that be happening?

All the warnings that you posted are from rescuers, which jump between
CPUs so that they are on the correct CPU for the specific work item
being rescued.  This is completely separate from the usual worker
management, and rescuers don't interact with the hot[un]plug callbacks
in any way.  I think something like the following is what's
happening:

* A work item is queued on CPU 5 but hasn't been dispatched for a
  while, so the rescuer gets summoned.  The rescuer executes the work
  item and stays on CPU 5.

* CPU 5 goes down.  The rescuer is asleep and isn't affected.

* CPU 5 starts coming back up.  It is already marked online, but its
  stopper hasn't been enabled yet.

* A work item is queued on CPU 0 but hasn't been dispatched for a
  while, so the rescuer is woken up.

* The rescuer wakes up fine on CPU 5 because that CPU is marked
  online.  Seeing the CPU 0 work item, the rescuer tries to migrate to
  CPU 0 by calling set_cpus_allowed_ptr(); however, the stopper isn't
  up yet, so the migration doesn't actually happen.

* Boom.  The rescuer is now executing a CPU 0 work item on CPU 5.
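
The sequence above can be replayed in a small user-space toy model.
This is only a sketch of the suspected race, not kernel code: every
name in it (toy_cpu, toy_rescuer, toy_migrate, toy_rescue,
replay_scenario, NR_TOY_CPUS) is hypothetical, and it assumes that a
migration request quietly does nothing when the stopper of the CPU the
rescuer currently sits on is not enabled.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_TOY_CPUS 8	/* arbitrary size for the toy model */

/* Toy per-CPU state; these are illustrative fields, not kernel ones. */
struct toy_cpu {
	bool online;		/* CPU is marked online */
	bool stopper_enabled;	/* stopper can carry out migrations */
};

struct toy_rescuer {
	int cpu;		/* CPU the rescuer currently sits on */
};

static struct toy_cpu cpus[NR_TOY_CPUS];

/*
 * Hypothetical stand-in for the migration request.  Assumption of this
 * model: without an enabled stopper on the rescuer's current CPU, the
 * request returns without moving the rescuer.
 */
static void toy_migrate(struct toy_rescuer *r, int target)
{
	assert(target >= 0 && target < NR_TOY_CPUS);
	if (!cpus[target].online)
		return;
	if (!cpus[r->cpu].stopper_enabled)
		return;		/* migration silently does not happen */
	r->cpu = target;
}

/* Rescue a work item bound to work_cpu; return the CPU it ran on. */
static int toy_rescue(struct toy_rescuer *r, int work_cpu)
{
	toy_migrate(r, work_cpu);
	return r->cpu;
}

/*
 * Replay the steps from the list.  stopper_ready selects whether CPU 5
 * has its stopper enabled by the time the CPU 0 work item is rescued.
 * Returns the CPU on which the CPU 0 work item ends up executing.
 */
int replay_scenario(bool stopper_ready)
{
	struct toy_rescuer r = { .cpu = 0 };
	int i;

	for (i = 0; i < NR_TOY_CPUS; i++) {
		cpus[i].online = true;
		cpus[i].stopper_enabled = true;
	}

	/* A CPU 5 work item is rescued; the rescuer stays on CPU 5. */
	toy_rescue(&r, 5);

	/* CPU 5 goes down; the sleeping rescuer is not touched. */
	cpus[5].online = false;
	cpus[5].stopper_enabled = false;

	/* CPU 5 comes back: online is set first, the stopper later. */
	cpus[5].online = true;
	cpus[5].stopper_enabled = stopper_ready;

	/* A CPU 0 work item needs rescuing. */
	return toy_rescue(&r, 0);
}
```

In this model replay_scenario(false) returns 5, i.e. the CPU 0 work
item runs on CPU 5, while replay_scenario(true) returns 0 because the
migration goes through once the stopper is enabled.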

Thanks.

-- 
tejun