linux-kernel - Re: [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated thp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5efa255b-3689-0c91-1980-c0f0562d84e9@google.com>
Date: Thu, 5 Sep 2024 11:05:09 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Usama Arif <usamaarif642@...il.com>
cc: Hugh Dickins <hughd@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
    Yu Zhao <yuzhao@...gle.com>, linux-mm@...ck.org, hannes@...xchg.org, 
    riel@...riel.com, shakeel.butt@...ux.dev, roman.gushchin@...ux.dev, 
    david@...hat.com, npache@...hat.com, baohua@...nel.org, 
    ryan.roberts@....com, rppt@...nel.org, willy@...radead.org, 
    cerasuolodomenico@...il.com, ryncsn@...il.com, corbet@....net, 
    linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, 
    kernel-team@...a.com, Shuang Zhai <zhais@...gle.com>
Subject: Re: [PATCH v5 1/6] mm: free zapped tail pages when splitting isolated
 thp

On Thu, 5 Sep 2024, Usama Arif wrote:
> On 05/09/2024 09:46, Hugh Dickins wrote:
> > On Fri, 30 Aug 2024, Usama Arif wrote:
> > 
> >> From: Yu Zhao <yuzhao@...gle.com>
> >>
> >> If a tail page has only two references left, one inherited from the
> >> isolation of its head and the other from lru_add_page_tail() which we
> >> are about to drop, it means this tail page was concurrently zapped.
> >> Then we can safely free it and save page reclaim or migration the
> >> trouble of trying it.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@...gle.com>
> >> Tested-by: Shuang Zhai <zhais@...gle.com>
> >> Acked-by: Johannes Weiner <hannes@...xchg.org>
> >> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> > 
> > I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> > it is only an optimization, and unless a persuasive performance case
> > can be made to extend it, it ought to go (perhaps revisited later).
> > 
> 
> I am ok for patch 1 only to be dropped. Patches 2-6 are not dependent on it.
> 
> Its an optimization and underused shrinker doesn't depend on it.
> Its possible that folio->new_folio below might fix it? But if it doesn't,
> I can retry later on to make this work and resend it only if it alone shows
> a significant performance improvement.
> 
> Thanks a lot for debugging this! and sorry it caused an issue.
> 
> 
> > The problem I kept hitting was that all my work, requiring compaction and
> > reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> > because nr_isolated_anon had grown high - and remained high even when the
> > load had all been killed.
> > 
> > Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> > is the one to blame. I was intending to send this patch to "fix" it:
> > 
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
> >  			folio_clear_active(new_folio);
> >  			folio_clear_unevictable(new_folio);
> >  			list_del(&new_folio->lru);
> > +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> > +						folio_is_file_lru(folio));
> 
> Maybe this should have been below? (Notice the folio->new_folio)
> 
> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
> +						folio_is_file_lru(new_folio));

Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
and ought not to have done a blind copy and paste of those lines): thanks.

However... it makes no difference. I gave yours a run, expecting a
much smaller negative number, but actually it works out much the same:
because, of course, by this point in the code "folio" is left pointing
to a small folio, and is almost equivalent to new_folio for the
node_stat calculations.

(I say "almost" because I guess there's a chance that the page at
folio got reused as part of a larger folio meanwhile, which would
throw the stats off (if they weren't off already).)

So, even with your fix to my fix, this code doesn't work.
Whereas a revert of this 1/6 does work: nr_isolated_anon and
nr_isolated_file come to 0 when the system is at rest, as expected
(and as silence from vmstat_refresh confirms - /proc/vmstat itself
presents negative stats as 0, in order to hide transient oddities).

Hugh

> 
> >  			if (!folio_batch_add(&free_folios, new_folio)) {
> >  				mem_cgroup_uncharge_folios(&free_folios);
> >  				free_unref_folios(&free_folios);
> > 
> > And that ran nicely, until I terminated the run and did
> > grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> > at the end: stat_refresh kindly left a pr_warn in dmesg to say
> > nr_isolated_anon -334013737
> > 
> > My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> > know how many pages they isolated and decremented the stats by, and
> > increment by that same number at the end; whereas other split_huge_pagers
> > (migration?) decrement one by one as they go through the list afterwards.
> > 
> > I've run out of time (I'm about to take a break): I gave up researching
> > who needs what, and was already feeling this optimization does too much
> > second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> > rather admits to that).
> > 
> > And I don't think it's as simple as moving the node_stat_sub_folio()
> > into 2/6 where the zero pte is substituted: that would probably handle
> > the vast majority of cases, but aren't there others which pass the
> > folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> > or racily truncated now that the folio has been unlocked, for example?
> > 
> > Hugh