[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHbLzkrWuTEsjq++M4hBkuhAao9d92nOoaCJdij9aHvUP018VQ@mail.gmail.com>
Date: Thu, 24 Oct 2024 18:25:28 -0700
From: Yang Shi <shy828301@...il.com>
To: David Hildenbrand <david@...hat.com>
Cc: Hugh Dickins <hughd@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
Usama Arif <usamaarif642@...il.com>, Wei Yang <richard.weiyang@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>,
Johannes Weiner <hannes@...xchg.org>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Barry Song <baohua@...nel.org>, Kefeng Wang <wangkefeng.wang@...wei.com>,
Ryan Roberts <ryan.roberts@....com>, Nhat Pham <nphamcs@...il.com>, Zi Yan <ziy@...dia.com>,
Chris Li <chrisl@...nel.org>, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
On Thu, Oct 24, 2024 at 1:52 PM David Hildenbrand <david@...hat.com> wrote:
>
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without). The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.
>
> Yes please. I stumbled into that myself recently -- leftover from
> previous rework.
>
> It would have been reasonable to move that into a separate (follow-up?)
> patch.
>
> > But that and its out-of-line __callee are mm
> > internals of very limited usability: add comment and WARN_ON_ONCEs to
> > check usage; and return a bool to say if a deferred split was unqueued,
> > which can then be used in WARN_ON_ONCEs around safety checks (sparing
> > callers the arcane conditionals in __folio_unqueue_deferred_split()).
> >
> > Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> > without checking and unqueueing a THP folio from deferred split list;
> > which is unfortunate, since the split_queue_lock depends on the memcg
> > (when memcg is enabled); so swapout has been unqueueing such THPs later,
> > when freeing the folio, using the pgdat's lock instead: potentially
> > corrupting the memcg's list. __remove_mapping() has frozen refcount to
> > 0 here, so no problem with calling folio_unqueue_deferred_split() before
> > resetting memcg_data.
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
> >
> > Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> > folio->memcg_data without checking and unqueueing a THP folio from the
> > deferred list, sometimes corrupting "from" memcg's list, like swapout.
> > Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> > used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> > mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> > of course unqueues), or skip it if that fails. Not ideal, but moving
> > charge has been requested, and khugepaged should repair the THP later:
> > nobody wants new custom unqueueing code just for this deprecated case.
> >
> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.
>
> I recall this can happen, not sure on 5.6 though: assume we have an anon
> THP with 1 PMD mapping and a single PTE mapping for simplicity.
>
> Assume we want to migrate that folio and first remove the PMD mapping,
> then the PTE mapping. After removing the PMD mapping, we add it to the
> deferred split queue (single PTE mapped).
>
> Now assume migration fails and we remove migration entries -> remap.
>
> We now have a PMD-mapped THP on the deferred split queue.
>
> (again, I might be wrong but that's from memory without digging into the
> code)
It sounds possible to me on 5.6 too. I didn't see
remove_migration_ptes() remove PMD-mapped THP from deferred list. As I
said in earlier email, this case should be not common and we were luck
not to hit it.
>
>
> > I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
>
> We only remap PTEs from the swapcache, never PMDs.
>
> >
> > [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> > large but not PMD-mapped: that's safe, but perhaps not intended: it's
> > arguable whether the deprecated feature should be updated to work
> > better with the new feature; but certainly not in this patch.]
> >
> > Backport to 6.11 should be straightforward. Earlier backports must take
> > care that other _deferred_list fixes and dependencies are included. It
> > is unclear whether these fixes are realistically needed before 6.12.
> >
> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > Signed-off-by: Hugh Dickins <hughd@...gle.com>
> > Cc: <stable@...r.kernel.org>
> > ---
> > mm/huge_memory.c | 35 +++++++++++++++++++++--------------
> > mm/internal.h | 10 +++++-----
> > mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
> > mm/memcontrol.c | 8 +++++---
> > mm/migrate.c | 4 ++--
> > mm/page_alloc.c | 4 +++-
> > mm/swap.c | 4 ++--
> > mm/vmscan.c | 4 ++--
> > 8 files changed, 65 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a1d345f1680c..dc7d5bb76495 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
> > return split_huge_page_to_list_to_order(&folio->page, list, ret);
> > }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio)
> > +/*
> > + * __folio_unqueue_deferred_split() is not to be called directly:
> > + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> > + * limits its calls to those folios which may have a _deferred_list for
> > + * queueing THP splits, and that list is (racily observed to be) non-empty.
> > + *
> > + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> > + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> > + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> > + *
> > + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> > + * therefore important to unqueue deferred split before changing folio memcg.
> > + */
> > +bool __folio_unqueue_deferred_split(struct folio *folio)
> > {
> > struct deferred_split *ds_queue;
> > unsigned long flags;
> > + bool unqueued = false;
> > +
> > + WARN_ON_ONCE(folio_ref_count(folio));
> > + WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> >
> > ds_queue = get_deferred_split_queue(folio);
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
> > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> > }
> > list_del_init(&folio->_deferred_list);
> > + unqueued = true;
> > }
> > spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > +
> > + return unqueued; /* useful for debug warnings */
> > }
> >
> > /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> > if (!partially_mapped && !split_underused_thp)
> > return;
> >
> > - /*
> > - * The try_to_unmap() in page reclaim path might reach here too,
> > - * this may cause a race condition to corrupt deferred split queue.
> > - * And, if page reclaim is already handling the same folio, it is
> > - * unnecessary to handle it again in shrinker.
> > - *
> > - * Check the swapcache flag to determine if the folio is being
> > - * handled by page reclaim since THP swap would add the folio into
> > - * swap cache before calling try_to_unmap().
> > - */
> > - if (folio_test_swapcache(folio))
> > - return;
> > -
> > spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > if (partially_mapped) {
> > if (!folio_test_partially_mapped(folio)) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> > #endif
> > }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> > {
> > if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
>
> The rmappable check here is still confusing for me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...
>
> > - return;
> > + return false;
> >
> > /*
> > * At this point, there is no one trying to add the folio to
> > @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
> > * to check without acquiring the split_queue_lock.
> > */
> > if (data_race(list_empty(&folio->_deferred_list)))
> > - return;
> > + return false;
> >
> > - __folio_undo_large_rmappable(folio);
> > + return __folio_unqueue_deferred_split(folio);
> > }
> >
> > static inline struct folio *page_rmappable_folio(struct page *page)
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 81d8819f13cd..f8744f5630bb 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
> > css_get(&to->css);
> > css_put(&from->css);
> >
> > + /* Warning should never happen, so don't worry about refcount non-0 */
> > + WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> > folio->memcg_data = (unsigned long)to;
> >
> > __folio_memcg_unlock(from);
> > @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> > enum mc_target_type target_type;
> > union mc_target target;
> > struct folio *folio;
> > + bool tried_split_before = false;
> >
> > +retry_pmd:
> > ptl = pmd_trans_huge_lock(pmd, vma);
> > if (ptl) {
> > if (mc.precharge < HPAGE_PMD_NR) {
> > @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> > target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> > if (target_type == MC_TARGET_PAGE) {
> > folio = target.folio;
> > + /*
> > + * Deferred split queue locking depends on memcg,
> > + * and unqueue is unsafe unless folio refcount is 0:
> > + * split or skip if on the queue? first try to split.
> > + */
> > + if (!list_empty(&folio->_deferred_list)) {
> > + spin_unlock(ptl);
> > + if (!tried_split_before)
> > + split_folio(folio);
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + if (tried_split_before)
> > + return 0;
> > + tried_split_before = true;
> > + goto retry_pmd;
> > + }
> > + /*
> > + * So long as that pmd lock is held, the folio cannot
> > + * be racily added to the _deferred_list, because
> > + * __folio_remove_rmap() will find !partially_mapped.
> > + */
> > if (folio_isolate_lru(folio)) {
> > if (!mem_cgroup_move_account(folio, true,
> > mc.from, mc.to)) {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2703227cce88..06df2af97415 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> > struct obj_cgroup *objcg;
> >
> > VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > - VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> > - !folio_test_hugetlb(folio) &&
> > - !list_empty(&folio->_deferred_list), folio);
> >
> > /*
> > * Nobody should be changing or seriously looking at
> > @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> > ug->nr_memory += nr_pages;
> > ug->pgpgout++;
> >
> > + WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> > folio->memcg_data = 0;
> > }
> >
> > @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> >
> > /* Transfer the charge and the css ref */
> > commit_charge(new, memcg);
> > +
> > + /* Warning should never happen, so don't worry about refcount non-0 */
> > + WARN_ON_ONCE(folio_unqueue_deferred_split(old));
> > old->memcg_data = 0;
> > }
> >
> > @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
> > VM_BUG_ON_FOLIO(oldid, folio);
> > mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
> >
> > + folio_unqueue_deferred_split(folio);
> > folio->memcg_data = 0;
> >
> > if (!mem_cgroup_is_root(memcg))
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..691f25ee2489 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> > folio_test_large_rmappable(folio)) {
> > if (!folio_ref_freeze(folio, expected_count))
> > return -EAGAIN;
> > - folio_undo_large_rmappable(folio);
> > + folio_unqueue_deferred_split(folio);
> > folio_ref_unfreeze(folio, expected_count);
> > }
> >
> > @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> > }
> >
> > /* Take off deferred split queue while frozen and memcg set */
> > - folio_undo_large_rmappable(folio);
> > + folio_unqueue_deferred_split(folio);
> >
> > /*
> > * Now we know that no one else is looking at the folio:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> > unsigned long pfn = folio_pfn(folio);
> > unsigned int order = folio_order(folio);
> >
> > - folio_undo_large_rmappable(folio);
> > + if (mem_cgroup_disabled())
> > + folio_unqueue_deferred_split(folio);
> > +
>
> Is it worth adding a comment here where we take care of ths with memcg
> enabled?
>
> It's complicated stuff, nothing jumped at me, but it's late here so my
> brain is not fully functional ...
>
> --
> Cheers,
>
> David / dhildenb
>
Powered by blists - more mailing lists