Message-ID: <87d008odew.fsf@notabene.neil.brown.name>
Date:   Fri, 20 Nov 2020 15:33:27 +1100
From:   NeilBrown <neilb@...e.de>
To:     Hillf Danton <hdanton@...a.com>
Cc:     TJ <tj@...nel.org>, Trond Myklebust <trondmy@...merspace.com>,
        PeterZ <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

On Fri, Nov 20 2020, Hillf Danton wrote:

> On Fri, 20 Nov 2020 10:07:56 +1100 NeilBrown wrote:
>>On Wed, Nov 18 2020, Hillf Danton wrote:
>>> On Wed, 18 Nov 2020 16:11:44 +1100 NeilBrown wrote:
>>>> On Wed, Nov 18 2020, Hillf Danton wrote:
>>>> ...
>>>> I don't think this is a good idea.
>>>
>>> Let me add a few more words.
>>>
>>>> cond_resched() is expected to be called often.  Adding all this extra
>>>
>>> They are those only invoked in concurrency-managed worker contexts and
>>> are thus supposed to be less often than thought; what is more the callers
>>> know what they are doing if a schedule() follows up, needless to say it
>>> is an ant-antenna-size add-in to check WORKER_CPU_INTENSIVE given
>>> 	WARN_ON_ONCE(workqueue_mustnt_use_cpu())
>>> added in cond_resched().
>>
>>"supposed to be less often" is the central point here.
>
> No, it is not in any shape, see below.
>
>>Because the facts are that they sometimes happen with high frequency
>>despite what is "supposed" to happen.
>
> Feel free to point me to a couple of such workers. I want to see
> how high it is and why.

The patch should suggest some.
Any work item which calls iput() might find itself in iput_final() and
then truncate_inode_pages_range() which will call cond_resched() once
for every 16 or fewer pages.  If there are millions of pages ....
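
To put a rough number on "millions of pages", here is a minimal userspace
sketch. It is not the kernel loop; it only counts how often a yield point
would be hit if one is reached per batch. BATCH_PAGES comes from the
"16 or fewer" figure above, and fake_cond_resched() and
truncate_pages_sim() are hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed batch size, taken from the "16 or fewer pages" figure above. */
#define BATCH_PAGES 16

/* Hypothetical stand-in for cond_resched(): it only counts invocations. */
static uint64_t resched_calls;

static void fake_cond_resched(void)
{
	resched_calls++;
}

/*
 * Walk nr_pages in batches of at most BATCH_PAGES and hit the yield
 * point once per batch.  Returns the number of yield-point calls.
 */
static uint64_t truncate_pages_sim(uint64_t nr_pages)
{
	uint64_t done = 0;

	resched_calls = 0;
	while (done < nr_pages) {
		uint64_t batch = nr_pages - done;

		if (batch > BATCH_PAGES)
			batch = BATCH_PAGES;
		assert(batch > 0);

		done += batch;		/* "free" this batch of pages */
		fake_cond_resched();	/* one yield point per batch */
	}
	return resched_calls;
}
```

Under these assumptions truncate_pages_sim(1000000) returns 62500, so a
single work item dropping the last reference on a million-page file would
pass through the yield point tens of thousands of times.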

When a reply is received for an async NFS request (e.g. WRITE, but
several others), the processing happens in a workqueue (nfsiod), and this
will often call iput(), but rarely will that lead to iput_final().
Also, lots of non-workqueue code calls iput(), so adding code to an
inner-loop would cost everyone.

Any worker which allocates memory might find itself in
should_reclaim_retry() which calls cond_resched().  I don't know how
frequently this will fire.

The slab memory allocator uses a system_wq worker to reap a cache.  I
don't know exactly what that means but cache_reap() seems to need to
call cond_resched() periodically.  Maybe it should be using a
WQ_CPU_INTENSIVE workqueue, but there isn't a system_cpu_wq....
Using system_unbound_wq is not an option, as it is doing per-CPU work.

>
>>Either the assumption that CM-workers don't call cond_resched() is
>>wrong, or the code that schedules such workers on CM-queues is wrong.
>>
>>I much prefer the perspective that the assumption is wrong.  If that is
>>agreed then we need to handle that circumstance without making
>>cond_resched() more expensive.
>
> This is the central point I think; it is a mile in between what
> you are trying to fix and what you are adding in cond_resched().

My latest patch only adds a WARNING to cond_resched(), so that we can
find problem code before it becomes a problem.  I did previously try
adding more to cond_resched(), and PeterZ didn't like that at all.

I agree that fixing the problem cannot be in cond_resched().  I think
that finding the scope of the problem is best done by instrumenting
cond_resched() (when DEBUG_KERNEL is selected).

>
>>Note that adding WARN_ON_ONCE() does not make it more expensive as it is
>>only enabled with DEBUG_KERNEL (and WQ_WATCHDOG, though the particular
>>config option could be changed). It isn't needed in production.
>
> Because cond_resched() is not the right place from the beginning
> for debugging like this, something in workqueue's backyard by
> design.  It's been there for a while, in production or not.

I don't understand your reasoning.  I don't see why one subsystem cannot
provide debugging to help some other subsystem.  Many subsystems add
"might_sleep()", not to detect bugs in themselves but to detect bugs in
their callers.  Adding a WARNING to cond_resched() helps us find bugs in
code that calls cond_resched()...
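
As a rough illustration of a callee carrying a check on behalf of its
callers, here is a userspace toy.  It is not the patch and not kernel
code: in_cm_worker, warn_on_once(), sim_cond_resched(), busy_work_item()
and run_as_cm_worker() are all hypothetical stand-ins, and the real check
discussed in this thread is the workqueue_mustnt_use_cpu() test quoted
earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag: true while a concurrency-managed work item runs. */
static bool in_cm_worker;

/* Number of warnings actually printed; shows the "once" behaviour. */
static unsigned int warnings_emitted;

/* Toy analogue of WARN_ON_ONCE(): report only the first occurrence. */
static void warn_on_once(bool cond, const char *what)
{
	static bool already_warned;

	if (cond && !already_warned) {
		already_warned = true;
		warnings_emitted++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
}

/* Stand-in for cond_resched(): the callee checks its caller's context. */
static void sim_cond_resched(void)
{
	warn_on_once(in_cm_worker,
		     "yield point reached from a CM work item");
	/* The real function might yield the CPU here; the toy does not. */
}

/* Hypothetical work item that loops and hits the yield point each time. */
static void busy_work_item(int iterations)
{
	for (int i = 0; i < iterations; i++)
		sim_cond_resched();
}

/* Run a work item as if it were on a concurrency-managed worker. */
static void run_as_cm_worker(void (*fn)(int), int arg)
{
	assert(fn != NULL);
	in_cm_worker = true;
	fn(arg);
	in_cm_worker = false;
}
```

Calling busy_work_item() directly prints nothing, while
run_as_cm_worker(busy_work_item, 1000) prints a single warning however
many iterations it runs.  The cost when the condition is false is one
branch, and in the kernel case even that would be compiled out of
non-debug builds.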

>>
>>If the workqueue maintainers are unmovable in the position that a
>
> They are open to any good thoughts, yesterday and tomorrow.
>
>>CM-workitem must not use excessive CPU ever, and so must not call
>>cond_resched(), then I can take that back to the NFS maintainers and
>>negotiate different workqueue settings. 
>
> That sounds like an easy road to go without either touching
> cond_resched() or adding a couple of lines in workqueue.  But
> the rising question is why you are branching to a new direction
> overnight if you think your thoughts are fine to fix an issue
> you observed in wq's property.

I'm branching off because I'm getting push-back and so am trying to
explore the problem space.
My first idea was to add WQ_CPU_INTENSIVE to the nfsiod workqueue, but
Trond wondered what was special about NFS.  Many filesystems call iput
from a workqueue, so finding a solution that helps them all is best.
I then suggested getting cond_resched() to do something more useful when
called by a worker.  PeterZ didn't like the overhead.
Also, TJ seemed to be against auto-adjusting for cpu-intensive code,
preferring the right sort of workqueue to be chosen up front.

I'm not really well placed to assess the validity of any of these
objections, so I'm trying to respond to them without completely giving
in to any of them.  Hence the "new direction overnight".

As a "user" of workqueues I would much much rather there was only one,
and that it always did the right thing.  Maybe I would have to put up
with two, but we currently have
 system_wq, system_highpri_wq, system_long_wq,
 system_unbound_wq, system_freezable_wq, system_power_efficient_wq,
 system_freezable_power_efficient_wq
plus the ability to create your own.  It is an embarrassment of riches
and I really wonder how many people know how to choose the right one.

So I'm not very keen on "make sure you choose the right type of wq", but
if that really is best, I certainly want automatic help to know when
I've made a harmful choice.

>
>>But as I've said, I think this
>>is requiring the decision to be made in a place that is not well
>>positioned to make it.
>
> I say no to asking NFS to take a pill because WQ got a cold.

That's good to know!  I think that means that when a work item happens
to consume a lot of CPU, it needs to stop blocking other work items and
instead share the CPU with them.

Thanks,
NeilBrown
