[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com>
Date: Tue, 27 May 2025 23:11:28 +0800
From: Kairui Song <ryncsn@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, David Hildenbrand <david@...hat.com>,
Johannes Weiner <hannes@...xchg.org>, Hugh Dickins <hughd@...gle.com>,
Kalesh Singh <kaleshsingh@...gle.com>, LKML <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>, Nhat Pham <nphamcs@...il.com>,
Ryan Roberts <ryan.roberts@....com>, Kemeng Shi <shikemeng@...weicloud.com>,
Tim Chen <tim.c.chen@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@...il.com> wrote:
>
> On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@...il.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@...il.com> wrote:
> > > >
> > > > Barry Song <21cnbao@...il.com> 于 2025年5月21日周三 06:33写道:
> > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > >
> > > > That will cause quite a performance regression I think. Good thing is,
> > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > performance, eliminating things like this. One more reason to justify
> > > > the approach :)
> >
> > Hi Barry,
> >
> > >
> > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > moves the swap PTE
> > > using non-synchronized I/O. Technically, this seems possible; however,
> > > I was unable
> > > to reproduce it, likely because the time window between swapin_readahead and
> > > taking the page table lock within do_swap_page() is too short.
> >
> > Thank you so much for trying this!
> >
> > I have been trying to reproduce it too, and so far I didn't observe
> > any crash or warn. I added following debug code:
> >
> > static __always_inline
> > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > pmd_t dummy_pmdval;
> > pmd_t dst_pmdval;
> > struct folio *src_folio = NULL;
> > + struct folio *tmp_folio = NULL;
> > struct anon_vma *src_anon_vma = NULL;
> > struct mmu_notifier_range range;
> > int err = 0;
> > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > if (!src_folio)
> > folio = filemap_get_folio(swap_address_space(entry),
> > swap_cache_index(entry));
> > + udelay(get_random_u32_below(1000));
> > + tmp_folio = filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR_OR_NULL(tmp_folio)) {
> > + if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > + pr_err("UFFDIO_MOVE: UNSTABLE folio
> > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > tmp_folio->swap.val);
> > + }
> > + folio_put(tmp_folio);
> > + }
> > if (!IS_ERR_OR_NULL(folio)) {
> > if (folio_test_large(folio)) {
> > err = -EBUSY;
> > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > dst_pte, src_pte,
> > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > dst_ptl, src_ptl, src_folio);
> > + if (tmp_folio != folio && !err)
> > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > check: %lx -> %lx\n", folio, tmp_folio);
> > }
> >
> > And I saw these two prints are getting triggered like this (not a real
> > issue though, just help to understand the problem)
> > ...
> > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > fffffdffc7ccac80 (51)
> > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > fffffdffc3435e00 (3b)
> > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > fffffdffc34ab8c0 (76)
> > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > fffffdffc34ab8c0 (6d)
> > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > fffffdffc34eda80 (32)
> > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > fffffdffc38cedc0 (b)
> > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > fffffdffc3465cc0 (21)
> > ...
> >
> > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > folio is indeed freed completely from the swap cache after the first
> > lookup, so another swapout can reuse the entry. But as you mentioned
> > __remove_mapping won't release a folio if the refcount check fails, so
> > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > are many places that can happen. But these two helpers won't free a
> > folio from swap cache if its swap count is not zero. And the folio
> > will either be swapped out (swap count non zero), or mapped (freeing
> > it is fine, PTE is non_swap, and another swapout will still use the
> > same folio).
> >
> > So after more investigation and dumping the pages, it's actually the
> > second lookup (tmp_folio) seeing the entry being reused by another
> > page table entry, after the first folio is swapped back and released.
> > So the page table check below will always fail just fine.
> >
> > But this also proves the first look up can see a completely irrelevant
> > folio too: If the src folio is swapped out, but got swapped back and
> > freed, then another folio B shortly got added to swap cache reuse the
> > src folio's old swap entry, then the folio B got seen by the look up
> > here and get freed from swap cache, then src folio got swapped out
> > again also reusing the same entry, then we have a problem as PTE seems
> > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > not wrong:
> >
> > Something like this:
> > CPU1 CPU2
> > move_pages_pte
> > entry = pte_to_swp_entry(orig_src_pte);
> > | Got Swap Entry S1 from src_pte
> > ...
> > <swapin src_pte, using folio A>
>
> I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> sure where folio B comes from in the statement `<someone else tried to
> swap out folio B>`.
>
> If that assumption is correct, and folio A is still in the swapcache,
> how could someone swap in folio B without hitting folio A? That would
> suggest folio A must have been removed from the swapcache earlier—right?
>
> > <free folio A from swap cache freeing S1>
> > <someone else try swap out folio B >
Sorry my bad, I think I made people think folio B is related to
src_pte at this point. What I actually mean is that: Another random
folio B, unrelated to src_pte, could got swapped out, and using the
swap entry S1.
> > <put folio B to swap cache using S1 >
> > ...
> > folio = swap_cache_get_folio(S1)
> > | Got folio B here !!!
> > move_swap_pte
> > <free folio B from swap cache>
> > | Holding a reference doesn't pin the cache
> > | as we have demonstrated
> > <Swapout folio A also using S1>
> > double_pt_lock
> > is_pte_pages_stable
> > | Passed because of S1 is reused
> > folio_move_anon_rmap(...)
> > | Moved invalid folio B here !!!
> >
> > But this is extremely hard to reproduce though, even if doing it
> > deliberately...
> >
> > So I think a "folio_swap_contains" or equivalent check here is a good
> > thing to have, to make it more robust and easier to understand. The
> > checking after locking a folio has very tiny overhead and can
> > definitely ensure the folio's swap entry is valid and stable.
> >
> > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > here might seem problematic, but it's still not a real problem. That's
> > the case where the swapin in src region happens after the lookup, and
> > before the PTE lock. It will pass the PTE check without moving the
> > folio. But the folio is guaranteed to be a completely new folio here
> > because the folio can't be added back to the page table without
> > holding the PTE lock, and if that happens the following PTE check here
> > will fail.
> >
> > So I think we should patch the current kernel only adding a
> > "folio_swap_contains" equivalent check here, and maybe more comments,
> > how do you think?
>
> The description appears to have some inconsistencies.
> Would you mind rephrasing it?
Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
fffffdffc385fb00" part first, as both you and me have come into a
conclusion that "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to swap cache" is OK, and this
output only proves that happens.
So the problematic race is:
Here move_pages_pte is moving src_pte to dst_pte, and it begins with
src_pte == swap entry S1, and S1 isn't cached.
CPU1 CPU2
move_pages_pte()
entry = pte_to_swp_entry(orig_src_pte);
| src_pte is absent, and got entry == S1
... < Somehow interrupted> ...
<swapin src_pte, using folio A>
| folio A is just a new allocated folio
| for resolving the swap in fault.
<free folio A from swap cache freeing S1>
| swap in fault is resolved, src_pte
| now points to folio A, so folio A
| can get freed just fine.
| And now S1 is free to be used
| by anyone.
<someone else try swap out another folio B >
| Folio B is a completely unrelated
| folio swapped out by random process.
| (has nothing to do with src_pte)
| But S1 is freed so it may use S1
| as its swap entry.
<put folio B to swap cache with index S1 >
...
folio = filemap_get_folio(S1)
| The lookup is using S1, so it
| got folio B here !!!
... < Somehow interrupted> ...
<free folio B from swap cache>
| Folio B could fail to be swapped out
| or got swapped in again, so it can
| be freed by folio_free_swap or
| swap cache reclaim.
| CPU1 is holding a reference but it
| doesn't pin the swap cache folio
| as I have demonstrated with the
| test C program previously.
| New S1 is free to be used again.
<Swapout src_pte again using S1>
| No thing blocks this from happening
| The swapout is still using folio A,
| and src_pte == S1.
folio_trylock(folio)
move_swap_pte
double_pt_lock
is_pte_pages_stable
| Passed because of S1 is reused so src_pte == S1.
folio_move_anon_rmap(...)
| Moved invalid folio B here !!!
It's a long and complex one, I don't think it's practically possible
to happen in reality but in theory doable, once in a million maybe...
Still we have to fix it, or did I got anything wrong here?
>
> Thanks
> barry
Powered by blists - more mailing lists