linux-kernel - Re: [PATCH rfc] workqueue: honour cond

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201109075058.GC12240@dhcp22.suse.cz>
Date:   Mon, 9 Nov 2020 08:50:58 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     NeilBrown <neilb@...e.de>
Cc:     Tejun Heo <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Trond Myklebust <trondmy@...merspace.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

On Mon 09-11-20 13:54:59, Neil Brown wrote:
> 
> If a worker task for a normal (bound, non-CPU-intensive) calls
> cond_resched(), this allows other non-workqueue processes to run, but
> does *not* allow other workqueue workers to run.  This is because
> workqueue will only attempt to run one task at a time on each CPU,
> unless the current task actually sleeps.
> 
> This is already a problem for should_reclaim_retry() in mm/page_alloc.c,
> which inserts a small sleep just to convince workqueue to allow other
> workers to run.
> 
> It can also be a problem for NFS when closing a very large file (e.g.
> 100 million pages in memory).  NFS can call the final iput() from a
> workqueue, which can then take long enough to trigger a workqueue-lockup
> warning, and long enough for performance problems to be observed.
> 
> I added a WARN_ON_ONCE() in cond_resched() to report when it is run from
> a workqueue, and got about 20 hits during boot, many of system_wq (aka
> "events") which suggests there is a real chance that worker are being
> delayed unnecessarily be tasks which are written to politely relinquish
> the CPU.
> 
> I think that once a worker calls cond_resched(), it should be treated as
> though it was run from a WQ_CPU_INTENSIVE queue, because only cpu-intensive
> tasks need to call cond_resched().  This would allow other workers to be
> scheduled.
> 
> The following patch achieves this I believe.

I cannot really judge the implementation because my understanding of the
WQ concurrency control is very superficial but I echo that the existing
behavior is really nonintuitive. It certainly burnt me for the oom
situations where the page allocator cannot make much progress to reclaim
memory and it has to retry really hard. Having to handle worker context
explicitly/differently is error prone and as your example of final iput
in NFS shows that the allocator is not the only path affected so having
a general solution is better.

That being said I would really love to see cond_resched to work
transparently.

Thanks!
-- 
Michal Hocko
SUSE Labs