linux-kernel - Re: Is it safe for kthreadd to drain_all

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.LSU.2.11.1704071141110.3348@eggly.anvils>
Date:   Fri, 7 Apr 2017 11:46:16 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: Is it safe for kthreadd to drain_all_pages?

On Fri, 7 Apr 2017, Michal Hocko wrote:
> On Fri 07-04-17 09:58:17, Hugh Dickins wrote:
> > On Fri, 7 Apr 2017, Michal Hocko wrote:
> > > On Fri 07-04-17 09:25:33, Hugh Dickins wrote:
> > > [...]
> > > > 24 hours so far, and with a clean /var/log/messages.  Not conclusive
> > > > yet, and of course I'll leave it running another couple of days, but
> > > > I'm increasingly sure that it works as you intended: I agree that
> > > > 
> > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> > > > mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch
> > > > 
> > > > should go to Linus as soon as convenient.  Though I think the commit
> > > > message needs something a bit stronger than "Quite annoying though".
> > > > Maybe add a line:
> > > > 
> > > > Fixes serious hang under load, observed repeatedly on 4.11-rc.
> > > 
> > > Yeah, it is much less theoretical now. I will rephrase and ask Andrew to
> > > update the chagelog and send it to Linus once I've got your final go.
> > 
> > I don't know akpm's timetable, but your fix being more than a two-liner,
> > I think it would be better if it could get into rc6, than wait another
> > week for rc7, just in case others then find problems with it.  So I
> > think it's safer *not* to wait for my final go, but proceed on the
> > assumption that it will follow a day later.
> 
> Fair enough. Andrew, could you update the changelog of
> mm-move-pcp-and-lru-pcp-drainging-into-single-wq.patch
> and send it to Linus along with
> mm-move-pcp-and-lru-pcp-drainging-into-single-wq-fix.patch before rc6?
> 
> I would add your Teste-by Hugh but I guess you want to give your testing
> more time before feeling comfortable to give it.

Yes, fair enough: at the moment it's just
Half-Tested-by: Hugh Dickins <hughd@...gle.com>
and I hope to take the Half- off in about 21 hours.
But I certainly wouldn't mind if it found its way to Linus without my
final seal of approval.

> ---
> mm: move pcp and lru-pcp draining into single wq
> 
> We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
> vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
> per cpu lru caches.  This seems more than necessary because both can run
> on a single WQ.  Both do not block on locks requiring a memory allocation
> nor perform any allocations themselves.  We will save one rescuer thread
> this way.
> 
> On the other hand drain_all_pages() queues work on the system wq which
> doesn't have rescuer and so this depend on memory allocation (when all
> workers are stuck allocating and new ones cannot be created). Initially
> we thought this would be more of a theoretical problem but Hugh Dickins
> has reported:
> : 4.11-rc has been giving me hangs after hours of swapping load.  At
> : first they looked like memory leaks ("fork: Cannot allocate memory");
> : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> : before looking at /proc/meminfo one time, and the stat_refresh stuck
> : in D state, waiting for completion of flush_work like many kworkers.
> : kthreadd waiting for completion of flush_work in drain_all_pages().
> 
> This worker should be using WQ_RECLAIM as well in order to guarantee
> a forward progress. We can reuse the same one as for lru draining and
> vmstat.
> 
> Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
> Fixes: 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue context")
> Signed-off-by: Michal Hocko <mhocko@...e.com>
> Suggested-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
> Acked-by: Vlastimil Babka <vbabka@...e.cz>
> Acked-by: Mel Gorman <mgorman@...e.de>
> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> -- 
> Michal Hocko
> SUSE Labs