Message-ID: <20191203181359.GD2196666@devbig004.ftw2.facebook.com>
Date:   Tue, 3 Dec 2019 10:13:59 -0800
From:   Tejun Heo <tj@...nel.org>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>, jiangshanlai@...il.com,
        linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: Workqueues splat due to ending up on wrong CPU

Hello, Paul.

On Tue, Dec 03, 2019 at 09:45:47AM -0800, Paul E. McKenney wrote:
> Good point, and yes, you have told me this before.
> 
> Furthermore, in all of these cases, the process was supposed to be
> running on CPU 0, which cannot be taken offline on any of the systems
> under test.  Which is leading me to wonder if the workqueue CPU-online
> notifier is sometimes moving more kthreads to the newly onlined CPU than
> it is supposed to.  Tejun, could that be happening?

All the warnings that you posted are from rescuers, which jump between
CPUs so that they are on the correct CPU for the specific work item
being rescued.  This is completely separate from the usual worker
management, and rescuers don't interact with hot[un]plug callbacks in
any way.  I think something like the following is what's happening:

* A work item is queued on CPU 5 but hasn't been dispatched for a
  while, so the rescuer gets summoned.  The rescuer executes the work
  item on CPU 5 and stays there.

* CPU 5 goes down.  The rescuer is asleep and doesn't get affected.

* CPU 5 starts coming back up.  It is marked online, but its stopper
  hasn't been enabled yet.

* A work item is queued on CPU 0 but hasn't been dispatched for a
  while, so the rescuer is woken up.

* The rescuer wakes up fine on CPU 5, as it's online.  Seeing the
  CPU 0 work item, the rescuer tries to migrate to CPU 0 by calling
  set_cpus_allowed_ptr(); however, the stopper isn't up yet, so the
  migration doesn't actually happen.

* Boom.  The rescuer is now executing a CPU 0 work item on CPU 5.
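
The sequence above can be replayed with a small toy model.  This is
only a sketch of the suspected ordering, not kernel code: Cpu, Rescuer,
toy_migrate(), rescue() and replay() are all hypothetical names, and
the rule that a migration request silently does nothing while the
stopper on the task's current CPU is disabled is an assumption made to
mirror the scenario, not a description of what set_cpus_allowed_ptr()
really does internally.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    online: bool = True
    stopper_enabled: bool = True


@dataclass
class Rescuer:
    cpu: int  # CPU the rescuer task currently sits on


def toy_migrate(task, target, cpus):
    """Hypothetical stand-in for the migration request.

    Assumption: the move only takes effect if the stopper on the
    task's current CPU is enabled; otherwise nothing happens.
    """
    if task.cpu == target:
        return
    if cpus[task.cpu].stopper_enabled:
        task.cpu = target


def rescue(task, work_cpu, cpus):
    """Try to move to the work item's CPU, then report where it ran."""
    toy_migrate(task, work_cpu, cpus)
    return task.cpu


def replay(stopper_up_before_wakeup=False):
    cpus = [Cpu() for _ in range(6)]
    rescuer = Rescuer(cpu=0)

    # The rescuer handles a CPU 5 work item and stays on CPU 5.
    first = rescue(rescuer, 5, cpus)

    # CPU 5 goes down; the sleeping rescuer is not affected.
    cpus[5] = Cpu(online=False, stopper_enabled=False)

    # CPU 5 comes back: online is set before the stopper is enabled.
    cpus[5].online = True
    if stopper_up_before_wakeup:
        cpus[5].stopper_enabled = True

    # A CPU 0 work item needs rescuing while the rescuer is on CPU 5.
    second = rescue(rescuer, 0, cpus)
    return first, second


if __name__ == "__main__":
    for flag in (False, True):
        first, second = replay(stopper_up_before_wakeup=flag)
        print(f"stopper up before wakeup={flag}: "
              f"CPU 5 item ran on CPU {first}, "
              f"CPU 0 item ran on CPU {second}")
```

In this model the CPU 0 work item ends up running on CPU 5 only when
the wakeup lands in the window between online being set and the
stopper being enabled; with the stopper up first, the rescuer reaches
CPU 0 as intended.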

Thanks.

-- 
tejun
