Message-ID: <YIvLHX74rFrq0n9w@dhcp22.suse.cz>
Date:   Fri, 30 Apr 2021 11:17:17 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Yu Zhao <yuzhao@...gle.com>
Cc:     Xing Zhengjun <zhengjun.xing@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Huang Ying <ying.huang@...el.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Shakeel Butt <shakeelb@...gle.com>, wfg@...l.ustc.edu.cn,
        Rik van Riel <riel@...riel.com>,
        Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [RFC] mm/vmscan.c: avoid possible long latency caused by
 too_many_isolated()

On Fri 30-04-21 02:34:28, Yu Zhao wrote:
> On Thu, Apr 29, 2021 at 4:00 AM Michal Hocko <mhocko@...e.com> wrote:
[...]
> > Still, I do not think that the above heuristic will work properly.
> > Different reclaimers have a different reclaim target (e.g. lower zones
> > and/or numa node mask) and strength (e.g.  GFP_NOFS vs. GFP_KERNEL). A
> > simple count-based throttling would be prone to different sorts of
> > priority inversions.
> 
> I see where your concern is coming from. Let's look at it from
> multiple angles, and hopefully this will clear things up.
> 
> 1, looking into this approach:
> This approach limits the number of direct reclaimers without any bias.
> It doesn't favor or disfavor anybody. IOW, everyone has an equal
> chance to run, regardless of the reclaim parameters. So where does the
> inversion come from?

Say you have a flood of GFP_KERNEL allocations contending with *MOVABLE
allocations. The former will not be able to reclaim for any non-kernel
zones. A similar effect would be a heavy GFP_NOFS workload contending
with others while not being able to release filesystem metadata.
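
To make that concrete (a simplified sketch from memory, not the exact
mm/vmscan.c code): the scan_control a direct reclaimer builds is already
constrained by its gfp mask, so two reclaimers hitting the same throttle
can be capable of very different amounts of useful work:

	struct scan_control sc = {
		.gfp_mask	= gfp_mask,
		/* a GFP_KERNEL reclaimer is limited to the zones it can use */
		.reclaim_idx	= gfp_zone(gfp_mask),
		.nodemask	= nodemask,	/* numa constraint */
		.may_writepage	= !laptop_mode,
		.may_unmap	= 1,
		.may_swap	= 1,
	};
	/*
	 * Later on, shrink_page_list() and the fs shrinkers check
	 * sc->gfp_mask for __GFP_FS/__GFP_IO, so a GFP_NOFS reclaimer
	 * cannot drop filesystem metadata at all.
	 */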

> 2, comparing it with the existing code:
> Both try to limit direct reclaims: one by the number of isolated
> pages and the other by the number of concurrent direct reclaimers.
> Neither number is correlated with any of the parameters you mentioned above
> except the following:
> 
> too_many_isolated()
> {
> ...
>         /*
>          * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
>          * won't get blocked by normal direct-reclaimers, forming a circular
>          * deadlock.
>          */
>         if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>                 inactive >>= 3;
> ...
> }
> 
> Let's look at the commit that added the above:
> 
> commit 3cf23841b4b7 ("mm/vmscan.c: avoid possible deadlock caused by
> too_many_isolated()"):
> Date:   Tue Dec 18 14:23:31 2012 -0800
> 
>     Neil found that if too_many_isolated() returns true while performing
>     direct reclaim we can end up waiting for other threads to complete their
>     direct reclaim.  If those threads are allowed to enter the FS or IO to
>     free memory, but this thread is not, then it is possible that those
>     threads will be waiting on this thread and so we get a circular deadlock.
> 
>     some task enters direct reclaim with GFP_KERNEL
>       => too_many_isolated() false
>         => vmscan and run into dirty pages
>           => pageout()
>             => take some FS lock
>               => fs/block code does GFP_NOIO allocation
>                 => enter direct reclaim again
>                   => too_many_isolated() true
>                     => waiting for others to progress, however the other
>                        tasks may be circular waiting for the FS lock..
> 
> Hmm, how could reclaim be recursive nowadays?

I do not think it is. And I doubt it was back then, and I also think the
above is not really suggesting a recursion. It tries to avoid a situation
where the fs/block layer cannot make forward progress because it is being
blocked.
 
> __alloc_pages_slowpath()
> {
> ...
> 
>         /* Avoid recursion of direct reclaim */
>         if (current->flags & PF_MEMALLOC)
>                 goto nopage;
> 
>         /* Try direct reclaim and then allocating */
>         page = __alloc_pages_direct_reclaim()
> ...
> }
> 
> Let's assume it still could, do you remember the following commit?
> 
> commit db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for ever")
> Date:   Wed Sep 6 16:21:11 2017 -0700
> 
> If too_many_isolated() doesn't loop forever anymore, how could the above
> deadlock happen? IOW, why would we need the first commit nowadays?

Whether Neil's commit is still needed would require a deeper
analysis. Even back then we didn't perform pageout for dirty fs pages
from direct reclaim IIRC.
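
For reference, this is roughly the check in shrink_page_list() that
defers writeback of dirty file pages to kswapd (paraphrased from memory
and simplified, not the exact code):

	if (PageDirty(page)) {
		if (page_is_file_lru(page) && !current_is_kswapd()) {
			/*
			 * Direct reclaimers do not write back dirty fs
			 * pages; they tag the page for immediate reclaim
			 * once the flusher threads have cleaned it.
			 */
			SetPageReclaim(page);
			goto activate_locked;
		}
		/* dirty anon pages can still be swapped via pageout() */
	}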

> If you don't remember the second commit, let me jog your memory:

Yes, I do remember that one. It was handling a dependency where kswapd
(which is allowed to perform pageout on dirty fs data) gets blocked and
thereby prevents direct reclaimers from making forward progress, e.g.
by declaring OOM. This was mostly a band-aid rather than a systematic
solution, and it clearly shows the limits of the existing approach. Please
note that I am not trying to defend what we have now. I am just pointing
out that a strict count-based approach will hit other problems.
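
For context, that commit turned the unbounded wait into a bounded one,
roughly like this (paraphrased, simplified):

	/* in shrink_inactive_list(), after db73ee0d4637, roughly: */
	bool stalled = false;

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (stalled)
			return 0;	/* give up rather than loop forever */

		/* wait a while and then retry exactly once */
		msleep(100);
		stalled = true;

		/* we are about to die and free our memory, return now */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}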

> Author: Michal Hocko <mhocko@...e.com>
> 
> 3, thinking abstractly
> A problem hard to solve in one domain can become a walk in the park in
> another domain. This problem is a perfect example: it's difficult to
> solve based on the number of isolated pages; but it becomes a lot
> easier based on the number of direct reclaimers.

This would really be true if all those reclaimers were equal in their
capabilities. But they are not, due to reclaim constraints if nothing
else.

IMHO the best way forward would be removing the throttling from the
reclaim path altogether. The reclaim should only be throttled by the
work it does. Both the allocator and the memcg charging path implement
some sort of retry logic, and I believe that would be a much better
place to implement any backoff.
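
To sketch the direction (purely illustrative, not a patch;
reclaim_backoff() is a made-up placeholder): shrink_inactive_list()
would simply return with whatever progress it made, and the allocator's
existing retry loop would be the one to back off:

	/* illustrative only -- the backoff lives in the retry loop */
	for (;;) {
		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
						    ac, &did_some_progress);
		if (page)
			return page;

		if (!did_some_progress)
			no_progress_loops++;

		if (no_progress_loops > MAX_RECLAIM_RETRIES)
			break;

		/* hypothetical: sleep in proportion to how little was reclaimed */
		reclaim_backoff(did_some_progress, no_progress_loops);
	}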
-- 
Michal Hocko
SUSE Labs
