linux-kernel - Re: [RFC PATCH v3 2/3] workqueue: Unbind workers before sending them to exit()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xhsmh8ro2d4du.mognet@vschneid.remote.csb>
Date:   Fri, 05 Aug 2022 17:47:09 +0100
From:   Valentin Schneider <vschneid@...hat.com>
To:     Lai Jiangshan <jiangshanlai@...il.com>
Cc:     LKML <linux-kernel@...r.kernel.org>, Tejun Heo <tj@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Phil Auld <pauld@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>
Subject: Re: [RFC PATCH v3 2/3] workqueue: Unbind workers before sending
 them to exit()

On 05/08/22 11:16, Lai Jiangshan wrote:
> On Tue, Aug 2, 2022 at 4:42 PM Valentin Schneider <vschneid@...hat.com> wrote:
>> +/*
>> + * Unlikely as it may be, a worker could wake after destroy_worker() has
>> + * happened but before reap_workers(). WORKER_DIE would be set in worker->flags,
>> + * so it would be able to kfree(worker) and head out to do_exit().
>> + *
>> + * Rather than make the reaper wait for each to-be-reaped kworker to exit and
>> + * kfree(worker) itself, make the kworkers (which have nothing to do but go
>> + * do_exit() anyway) wait for the reaper to be done with them.
>> + */
>> +static void worker_wait_reaped(struct worker *worker)
>> +{
>> +       WARN_ON_ONCE(current != worker->task);
>> +
>> +       for (;;) {
>> +               set_current_state(TASK_INTERRUPTIBLE);
>> +               if (READ_ONCE(worker->reaped))
>> +                       break;
>> +               schedule();
>> +       }
>> +       __set_current_state(TASK_RUNNING);
>> +}
>
>
> It is not a good idea to add this scheduler-ist code here.
>
> Using wq_pool_attach_mutex to protects the whole body of idle_reaper_fn()
> can stop the worker from freeing itself since the worker has to
> get the mutex before exiting.
>

Right, there's worker_detach_from_pool() before kfree(worker), hadn't
thought of that. I want to limit how many locks I'm hoarding with the
reaper, but given that one is for attach/detach I think that's OK - and I
also really don't like this worker_wait_reaped() function, so will be happy
to get rid of it. I'll give this a try, thanks!

> And I don't think batching destruction is a good idea since
> it is not a hot path.
>

The batching is mostly there because checking & removing a worker from its
pool->idle_list has to be done under pool->lock, but changing its affinity
requires a sleepable context, so I batched that outside of the spinlock
section.

>>         while (too_many_workers(pool)) {
>> -               struct worker *worker;
>>                 unsigned long expires;
>> +               unsigned long now = jiffies;
>>
>>                 /* idle_list is kept in LIFO order, check the last one */
>>                 worker = list_entry(pool->idle_list.prev, struct worker, entry);
>>                 expires = worker->last_active + IDLE_WORKER_TIMEOUT;
>>
>> -               if (time_before(jiffies, expires)) {
>> -                       mod_timer(&pool->idle_timer, expires);
>> +               /*
>> +                * Careful: queueing a work item from here can and will cause a
>> +                * self-deadlock when dealing with an unbound pool. However,
>> +                * here the delay *cannot* be zero and *has* to be in the
>> +                * future, which works.
>> +                */
>> +               if (time_before(now, expires)) {
>
> IMHO, using raw_spin_unlock_irq(&pool->lock) here is better than
> violating locking rules *overtly* and documenting that it can not be
> really violated. But It would bring a "goto" statement.

I was worried about serializing accesses to pool->idle_reaper_work and its
underlying timer (worker_enter_idle() vs idle_reaper_fn()), though I think
the worst that can happen if idle_reaper_fn() does that without holding
pool->lock is worker_enter_idle() pushing back the timer to
IDLE_WORKER_TIMEOUT (rather than (last_active + IDLE_WORKER_TIMEOUT) -
now).

>> +                       mod_delayed_work(system_unbound_wq,
>> +                                        &pool->idle_reaper_work,
>> +                                        expires - now);
>>                         break;
>>                 }

>> @@ -5030,11 +5128,8 @@ static void rebind_workers(struct worker_pool *pool)
>>          * of all workers first and then clear UNBOUND.  As we're called
>>          * from CPU_ONLINE, the following shouldn't fail.
>>          */
>> -       for_each_pool_worker(worker, pool) {
>> -               kthread_set_per_cpu(worker->task, pool->cpu);
>> -               WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
>> -                                                 pool->attrs->cpumask) < 0);
>> -       }
>> +       for_each_pool_worker(worker, pool)
>> +               rebind_worker(worker, pool);
>
>
> It is better to skip the workers which are WORKER_DIE.
> Or just detach the worker when reaping it.

Hadn't even thought about this racing with to-be-destroyed workers. Having
worker_detach_from_pool() done by the worker itself is convenient for the
serialization with wq_pool_attach_mutex as you suggested, let me scratch my
head some more.

>
>>
>>         raw_spin_lock_irq(&pool->lock);
>>
>> diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
>> index e00b1204a8e9..a3d60e10a76f 100644
>> --- a/kernel/workqueue_internal.h
>> +++ b/kernel/workqueue_internal.h
>> @@ -46,6 +46,7 @@ struct worker {
>>         unsigned int            flags;          /* X: flags */
>>         int                     id;             /* I: worker id */
>>         int                     sleeping;       /* None */
>> +       int                     reaped;         /* None */
>>
>>         /*
>>          * Opaque string set with work_set_desc().  Printed out with task
>> --
>> 2.31.1
>>