Message-ID: <3f4fc787-c1d4-4d29-355e-d5ae0737e55c@google.com>
Date: Sun, 27 Oct 2024 00:07:14 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: David Hildenbrand <david@...hat.com>
cc: Hugh Dickins <hughd@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
Usama Arif <usamaarif642@...il.com>, Yang Shi <shy828301@...il.com>,
Wei Yang <richard.weiyang@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>, Johannes Weiner <hannes@...xchg.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Barry Song <baohua@...nel.org>, Kefeng Wang <wangkefeng.wang@...wei.com>,
Ryan Roberts <ryan.roberts@....com>, Nhat Pham <nphamcs@...il.com>,
Zi Yan <ziy@...dia.com>, Chris Li <chrisl@...nel.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming
and locking
On Thu, 24 Oct 2024, David Hildenbrand wrote:
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without). The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.
>
> Yes please. I stumbled into that myself recently -- leftover from previous
> rework.
>
> It would have been reasonable to move that into a separate (follow-up?) patch.
I was expecting someone to ask for a breakup: I plead, as the one who
will try some backports, to keep this all in one patch to make that job
easier. I doubt I'll go back further than 6.6 LTS. Porting this patch
itself is easy; the hard part is gathering together the little bits and
pieces it also needs to be good (I'm thinking of folio migration's
freezing of anon refcount to 0 in particular, which will lead to others).
> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of how unsafe that is while the
> > refcount is non-0); that code was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.
>
> I recall this can happen, though I'm not sure about 5.6: assume we have an anon THP
> with 1 PMD mapping and a single PTE mapping for simplicity.
>
> Assume we want to migrate that folio and first remove the PMD mapping, then
> the PTE mapping. After removing the PMD mapping, we add it to the deferred
> split queue (single PTE mapped).
>
> Now assume migration fails and we remove migration entries -> remap.
>
> We now have a PMD-mapped THP on the deferred split queue.
>
> (again, I might be wrong but that's from memory without digging into the code)
I believe you're right; thank you for doing the heavy thinking for me.
Then I came up with 5.8's 5503fbf2b0b8 ("khugepaged: allow to
collapse PTE-mapped compound pages") introducing another way for a
PMD-mapped THP to end up on the deferred queue.
It looks to me like there is a case for backporting this patch,
but not a very strong case: worth some effort, but not too much -
I've not heard of the predicted issues in practice.
>
>
> > I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
>
> We only remap PTEs from the swapcache, never PMDs.
Right, thanks.
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> > if (!partially_mapped && !split_underused_thp)
> > return;
> > - /*
> > - * The try_to_unmap() in page reclaim path might reach here too,
> > - * this may cause a race condition to corrupt deferred split queue.
> > - * And, if page reclaim is already handling the same folio, it is
> > - * unnecessary to handle it again in shrinker.
> > - *
> > - * Check the swapcache flag to determine if the folio is being
> > - * handled by page reclaim since THP swap would add the folio into
> > - * swap cache before calling try_to_unmap().
> > - */
> > - if (folio_test_swapcache(folio))
> > - return;
> > -
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > if (partially_mapped) {
> > if (!folio_test_partially_mapped(folio)) {
I'm changing this part in v2: keeping the return on swapcache, and
updating the comment to say that it's no longer essential, but a likely
optimization. I noticed the stats for deferred split were very different
when I made this change (no doubt largely because I'm doing a lot of
unrealistic swapping), but I don't want to take the risk of changing
behaviour when all we're trying to fix is list corruption.
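For concreteness, a sketch of roughly how I expect the v2 check to read
(my approximation, not the final patch text):

	/*
	 * Exclude swapcache: the early return here used to be essential
	 * to avoid racing with page reclaim on the deferred split queue;
	 * that is no longer the case, but if reclaim is already handling
	 * this folio, queueing it again is wasted effort, so keep the
	 * check as a likely optimization.
	 */
	if (folio_test_swapcache(folio))
		return;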
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> > #endif
> > }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> > {
> > if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
>
> The rmappable check here is still confusing to me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...
Yes, I believe that is what it's about. Somewhere else excludes hugetlb
explicitly instead. I know Matthew would have liked to find a better name
than "large_rmappable" (aren't hugetlb folios large and rmappable?), but
nobody suggested a better one.
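To spell out the check being discussed: the renamed helper ends up
looking something like this (a sketch pieced together from the hunk
above, not a verbatim quote of the patch):

	bool __folio_unqueue_deferred_split(struct folio *folio);
	static inline bool folio_unqueue_deferred_split(struct folio *folio)
	{
		/*
		 * Order 0/1 folios have no _deferred_list; and hugetlb
		 * (or other non-"large_rmappable" large folios) reuse
		 * that field for their own purposes, so must not reach
		 * __folio_unqueue_deferred_split().
		 */
		if (folio_order(folio) <= 1 ||
		    !folio_test_large_rmappable(folio))
			return false;
		return __folio_unqueue_deferred_split(folio);
	}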
You've reminded me to worry about how hugetlb pages get through
free_tail_page_prepare()'s case 2 check for list_empty(_deferred_list).
Ah, there's an INIT_LIST_HEAD(&folio->_deferred_list) in mm/hugetlb.c,
before freeing a hugetlb folio. (The kind of detail which may be
different in backports.)
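That is (a sketch from memory, so the exact placement in mm/hugetlb.c
may differ):

	/*
	 * _deferred_list shares its space with hugetlb-specific fields:
	 * reinitialize it before the folio is freed, so that
	 * free_tail_page_prepare()'s case 2 check for
	 * list_empty(&folio->_deferred_list) passes.
	 */
	INIT_LIST_HEAD(&folio->_deferred_list);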
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> > unsigned long pfn = folio_pfn(folio);
> > unsigned int order = folio_order(folio);
> > - folio_undo_large_rmappable(folio);
> > + if (mem_cgroup_disabled())
> > + folio_unqueue_deferred_split(folio);
> > +
>
> Is it worth adding a comment here where we take care of this with memcg
> enabled?
I've confessed to Yang Shi that I got this wrong: so in v2 there is not
even a line left to add such a comment on (but there is a comment about
it in the commit message).
>
> It's complicated stuff; nothing jumped out at me, but it's late here so
> my brain is not fully functional ...
Even your not fully functional brain is so much quicker than mine:
many thanks for applying it.
Hugh