Message-ID: <CAJhGHyBhtTDWw_xZ28_+CguhVx=x7pds0dZVkUT7YqjkjUdbNQ@mail.gmail.com>
Date:   Tue, 15 Dec 2020 17:46:23 +0800
From:   Lai Jiangshan <jiangshanlai@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Lai Jiangshan <laijs@...ux.alibaba.com>,
        Hillf Danton <hdanton@...a.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Qian Cai <cai@...hat.com>,
        Vincent Donnefort <vincent.donnefort@....com>,
        Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH 00/10] workqueue: break affinity initiatively

On Tue, Dec 15, 2020 at 4:49 PM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Tue, Dec 15, 2020 at 04:14:26PM +0800, Lai Jiangshan wrote:
> > On Tue, Dec 15, 2020 at 3:50 PM Peter Zijlstra <peterz@...radead.org> wrote:
> > >
> > > On Tue, Dec 15, 2020 at 01:44:53PM +0800, Lai Jiangshan wrote:
> > > > I don't know how the scheduler distinguishes all these
> > > > different cases under the "new assumption".
> > >
> > > The special case is:
> > >
> > >   (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1
> > >
> > >
> >
> > So unbound per-node workers can possibly match this test.  So there is
> > code needed to handle unbound workers/pools, which is done by this patchset.
>
> Curious; how could a per-node worker match this? Only if the node is a
> single CPU, or otherwise too?

We have /sys/devices/virtual/workqueue/cpumask, which can be read/written
to access wq_unbound_cpumask.

A per-node worker's cpumask is wq_unbound_cpumask & possible_cpumask_of_the_node.
Since wq_unbound_cpumask can be changed by the system admin, a per-node
worker's cpumask can end up being a single CPU.

wq_unbound_cpumask is used when a system admin wants to isolate some
CPUs from unbound workqueues.  But I think it is a rare case that the
admin causes a per-node worker's cpumask to be a single CPU.
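
To illustrate (just a sketch, not the actual workqueue code; the helper
name is made up, and it assumes it sits in kernel/workqueue.c where
wq_unbound_cpumask is visible):

	/* needs <linux/cpumask.h> and <linux/topology.h> */
	static bool node_pool_is_single_cpu(int node)
	{
		cpumask_var_t effective;
		bool single;

		if (!zalloc_cpumask_var(&effective, GFP_KERNEL))
			return false;

		/* the pool's effective mask: wq_unbound_cpumask & the node's CPUs */
		cpumask_and(effective, wq_unbound_cpumask, cpumask_of_node(node));

		/*
		 * If the admin-written mask intersects this node in only one
		 * CPU, the pool's workers run with nr_cpus_allowed == 1 and
		 * match the PF_KTHREAD special case above.
		 */
		single = cpumask_weight(effective) == 1;
		free_cpumask_var(effective);
		return single;
	}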

Even though it is a rare case, we have to handle it.

>
> > Is this the check in is_per_cpu_kthread()?  I think I should also have
> > used this function in workqueue and not break affinity for unbound
> > workers that have more than 1 CPU.
>
> Yes, that function captures it. If you want to use it, feel free to move
> it to include/linux/sched.h.

I will.  A single CPU for unbound workers/pools is the rare case, but it
is enough to require code that breaks affinity for unbound workers.
If we optimize for the common case (multiple CPUs for unbound workers),
the optimization amounts to additional code that only matters in the
slow path (hot-unplug).
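
Concretely, the change I have in mind would look something like this
(a sketch only, not tested; the helper name and placement are made up,
and is_per_cpu_kthread() is quoted roughly as it stands in
kernel/sched/sched.h):

	/* roughly the scheduler-side check, from kernel/sched/sched.h */
	static inline bool is_per_cpu_kthread(struct task_struct *p)
	{
		if (!(p->flags & PF_KTHREAD))
			return false;

		if (p->nr_cpus_allowed != 1)
			return false;

		return true;
	}

	/*
	 * Sketch only: on CPU-down, only bother resetting the affinity of
	 * workers the scheduler would treat as per-cpu kthreads; unbound
	 * workers that still have more than one allowed CPU are left alone.
	 */
	static void maybe_break_affinity(struct worker *worker)
	{
		if (!is_per_cpu_kthread(worker->task))
			return;

		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
						  cpu_possible_mask) < 0);
	}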

I will try it and see if it is worth it.

>
> This class of threads is 'special', since it needs to violate the
> regular hotplug rules, and migrate_disable() made it just this little
> bit more special. It basically comes down to how we need certain per-cpu
> kthreads to run on a CPU while it's brought up, before userspace is
> allowed on, and similarly they need to run on the CPU after userspace is
> no longer allowed on in order to bring it down.
>
> (IOW, they must be allowed to violate the active mask)
>
> Due to migrate_disable() we had to move the migration code from the very
> last cpu-down stage, to earlier. This in turn brought the expectation
> (which is normally met) that per-cpu kthreads will stop/park or
> otherwise make themselves scarce when the CPU goes down. We can no
> longer force migrate them.

Thanks for explaining the rationale.

>
> Workqueues are the sole exception to that, they've got some really
> 'dodgy' hotplug behaviour.
>

Indeed.  No one wants to wait for workqueues during hot-unplug, so we
have to do something after the fact.
