linux-kernel - Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGsJ_4ymRwXhQdzabstHhkK0OM0JEWtvR3tjeyQppm7sKZ8FUw@mail.gmail.com>
Date: Tue, 27 May 2025 19:58:47 +1200
From: Barry Song <21cnbao@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: akpm@...ux-foundation.org, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Hugh Dickins <hughd@...gle.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, LKML <linux-kernel@...r.kernel.org>, 
	linux-mm <linux-mm@...ck.org>, Nhat Pham <nphamcs@...il.com>, 
	Ryan Roberts <ryan.roberts@....com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Tim Chen <tim.c.chen@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@...il.com> wrote:
> >
> > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@...il.com> wrote:
> > >
> > > Barry Song <21cnbao@...il.com> 于 2025年5月21日周三 06:33写道：
> > > >
> > > > On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@...il.com> wrote:
> > > > > >
> > > > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@...il.com> wrote:
> > > > > > >
> > > > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@...il.com> wrote:
> > > > > > > >
> > > > > > > > > From: Kairui Song <kasong@...cent.com>
> > > > > > > >
> > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > > > --- a/mm/userfaultfd.c
> > > > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > > >                               goto retry;
> > > > > > > > >                       }
> > > > > > > > >               }
> > > > > > > > > +             if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > > > +                     err = -EBUSY;
> > > > > > > > > +                     goto out;
> > > > > > > > > +             }
> > > > > > > >
> > > > > > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > > > > > are stable:
> > > > > > > >
> > > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > > > >                 return -EAGAIN;
> > > > > > > >         }
> > > > > > >
> > > > > > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > > > > > folio and ptes are unlocked. So is it possible that someone else
> > > > > > > swapped in the entries, then swapped them out again using the same
> > > > > > > entries?
> > > > > > >
> > > > > > > The folio will be different here but PTEs are still the same value to
> > > > > > > they will pass the is_pte_pages_stable check, we previously saw
> > > > > > > similar races with anon fault or shmem. I think more strict checking
> > > > > > > won't hurt here.
> > > > > >
> > > > > > This doesn't seem to be the same case as the one you fixed in
> > > > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > > > case, there was no one hitting the swap cache, and you used
> > > > > > swap_prepare() to set up the cache to fix the issue.
> > > > > >
> > > > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > > > not guard against that case either.
> > > > >
> > > > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > > > above. Thanks for catching this!
> > > > >
> > > > > > But I suspect we won't have a problem, since we're not swapping in —
> > > > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > > > move the PTEs.
> > > > >
> > > > > My concern is that a parallel swapin / swapout could result in the
> > > > > folio to be a completely irrelevant or invalid folio.
> > > > >
> > > > > It's not about the dst, but in the move src side, something like:
> > > > >
> > > > > CPU1                             CPU2
> > > > > move_pages_pte
> > > > >   folio = swap_cache_get_folio(...)
> > > > >     | Got folio A here
> > > > >   move_swap_pte
> > > > >                                  <swapin src_pte, using folio A>
> > > > >                                  <swapout src_pte, put folio A>
> > > > >                                    | Now folio A is no longer valid.
> > > > >                                    | It's very unlikely but here SWAP
> > > > >                                    | could reuse the same entry as above.
> > > >
> > > >
> > > > swap_cache_get_folio() does increment the folio's refcount, but it seems this
> > > > doesn't prevent do_swap_page() from freeing the swap entry after swapping
> > > > in src_pte with folio A, if it's a read fault.
> > > > for write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > > > will be false:
> > > >
> > > > static inline bool should_try_to_free_swap(struct folio *folio,
> > > >                                            struct vm_area_struct *vma,
> > > >                                            unsigned int fault_flags)
> > > > {
> > > >        ...
> > > >
> > > >         /*
> > > >          * If we want to map a page that's in the swapcache writable, we
> > > >          * have to detect via the refcount if we're really the exclusive
> > > >          * user. Try freeing the swapcache to get rid of the swapcache
> > > >          * reference only in case it's likely that we'll be the exlusive user.
> > > >          */
> > > >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > >                 folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > > }
> > > >
> > > > and for swapout, __removing_mapping does check refcount as well:
> > > >
> > > > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > > >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > > > {
> > > >         refcount = 1 + folio_nr_pages(folio);
> > > >         if (!folio_ref_freeze(folio, refcount))
> > > >                 goto cannot_free;
> > > >
> > > > }
> > > >
> > > > However, since __remove_mapping() occurs after pageout(), it seems
> > > > this also doesn't prevent swapout from allocating a new swap entry to
> > > > fill src_pte.
> > > >
> > > > It seems your concern is valid—unless I'm missing something.
> > > > Do you have a reproducer? If so, this will likely need a separate fix
> > > > patch rather than being hidden in this patchset.
> > >
> > > Thanks for the analysis. I don't have a reproducer yet, I did some
> > > local experiments and that seems possible, but the race window is so
> > > tiny and it's very difficult to make the swap entry reuse to collide
> > > with that, I'll try more but in theory this seems possible, or at
> > > least looks very fragile.
> > >
> > > And yeah, let's patch the kernel first if that's a real issue.
> > >
> > > >
> > > > >     double_pt_lock
> > > > >     is_pte_pages_stable
> > > > >       | Passed because of entry reuse.
> > > > >     folio_move_anon_rmap(...)
> > > > >       | Moved invalid folio A.
> > > > >
> > > > > And could it be possible that the swap_cache_get_folio returns NULL
> > > > > here, but later right before the double_pt_lock, a folio is added to
> > > > > swap cache? Maybe we better check the swap cache after clear and
> > > > > releasing dst lock, but before releasing src lock?
> > > >
> > > > It seems you're suggesting that a parallel swap-in allocates and adds
> > > > a folio to the swap cache, but the PTE has not yet been updated from
> > > > a swap entry to a present mapping?
> > > >
> > > > As long as do_swap_page() adds the folio to the swap cache
> > > > before updating the PTE to present, this scenario seems possible.
> > >
> > > Yes, that's two kinds of problems here. I suspected there could be an
> > > ABA problem while working on the series, but wasn't certain. And just
> > > realised there could be another missed cache read here thanks to your
> > > review and discussion :)
> > >
> > > >
> > > > It seems we need to double-check:
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..976053bd2bf1 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm,
> > > > struct vm_area_struct *dst_vma,
> > > >         if (src_folio) {
> > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > +       } else {
> > > > +               struct folio *folio =
> > > > filemap_get_folio(swap_address_space(entry),
> > > > +                                       swap_cache_index(entry));
> > > > +               if (!IS_ERR_OR_NULL(folio)) {
> > > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > > +                       return -EAGAIN;
> > > > +               }
> > > >         }
> > > > -
> > > >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > > >         orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
> > >
> > > Maybe it has to get even dirtier here to call swapcache_prepare too to
> > > cover the SYNC_IO case?
> > >
> > > >
> > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > >
> > > That will cause quite a performance regression I think. Good thing is,
> > > that's exactly the problem this series is solving by dropping the SYNC
> > > IO swapin path and never bypassing the swap cache, while improving the
> > > performance, eliminating things like this. One more reason to justify
> > > the approach :)
>
> Hi Barry,
>
> >
> > I attempted to reproduce the scenario where a folio is added to the swapcache
> > after filemap_get_folio() returns NULL but before move_swap_pte()
> > moves the swap PTE
> > using non-synchronized I/O. Technically, this seems possible; however,
> > I was unable
> > to reproduce it, likely because the time window between swapin_readahead and
> > taking the page table lock within do_swap_page() is too short.
>
> Thank you so much for trying this!
>
> I have been trying to reproduce it too, and so far I didn't observe
> any crash or warn. I added following debug code:
>
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>         pmd_t dummy_pmdval;
>         pmd_t dst_pmdval;
>         struct folio *src_folio = NULL;
> +       struct folio *tmp_folio = NULL;
>         struct anon_vma *src_anon_vma = NULL;
>         struct mmu_notifier_range range;
>         int err = 0;
> @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                 if (!src_folio)
>                         folio = filemap_get_folio(swap_address_space(entry),
>                                         swap_cache_index(entry));
> +               udelay(get_random_u32_below(1000));
> +               tmp_folio = filemap_get_folio(swap_address_space(entry),
> +                                       swap_cache_index(entry));
> +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> +                               pr_err("UFFDIO_MOVE: UNSTABLE folio
> %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> tmp_folio->swap.val);
> +                       }
> +                       folio_put(tmp_folio);
> +               }
>                 if (!IS_ERR_OR_NULL(folio)) {
>                         if (folio_test_large(folio)) {
>                                 err = -EBUSY;
> @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> dst_pte, src_pte,
>                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
>                                 dst_ptl, src_ptl, src_folio);
> +               if (tmp_folio != folio && !err)
> +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> check: %lx -> %lx\n", folio, tmp_folio);
>         }
>
> And I saw these two prints are getting triggered like this (not a real
> issue though, just help to understand the problem)
> ...
> [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> fffffdffc7ccac80 (51)
> [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> fffffdffc3435e00 (3b)
> [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> fffffdffc34ab8c0 (76)
> [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> fffffdffc34ab8c0 (6d)
> [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> fffffdffc34eda80 (32)
> [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> fffffdffc38cedc0 (b)
> [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> fffffdffc3465cc0 (21)
> ...
>
> The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> fffffdffc3853540 (53)" worries me at first. On first look it seems the
> folio is indeed freed completely from the swap cache after the first
> lookup, so another swapout can reuse the entry. But as you mentioned
> __remove_mapping won't release a folio if the refcount check fails, so
> they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> are many places that can happen. But these two helpers won't free a
> folio from swap cache if its swap count is not zero. And the folio
> will either be swapped out (swap count non zero), or mapped (freeing
> it is fine, PTE is non_swap, and another swapout will still use the
> same folio).
>
> So after more investigation and dumping the pages, it's actually the
> second lookup (tmp_folio) seeing the entry being reused by another
> page table entry, after the first folio is swapped back and released.
> So the page table check below will always fail just fine.
>
> But this also proves the first look up can see a completely irrelevant
> folio too: If the src folio is swapped out, but got swapped back and
> freed, then another folio B shortly got added to swap cache reuse the
> src folio's old swap entry, then the folio B got seen by the look up
> here and get freed from swap cache, then src folio got swapped out
> again also reusing the same entry, then we have a problem as PTE seems
> untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> not wrong:
>
> Something like this:
> CPU1                             CPU2
> move_pages_pte
>   entry = pte_to_swp_entry(orig_src_pte);
>     | Got Swap Entry S1 from src_pte
>   ...
>                                  <swapin src_pte, using folio A>

I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
sure where folio B comes from in the statement `<someone else tried to
swap out folio B>`.

If that assumption is correct, and folio A is still in the swapcache,
how could someone swap in folio B without hitting folio A? That would
suggest folio A must have been removed from the swapcache earlier—right?

>                                  <free folio A from swap cache freeing S1>
>                                  <someone else try swap out folio B >
>                                  <put folio B to swap cache using S1 >
>                                 ...
>   folio = swap_cache_get_folio(S1)
>     | Got folio B here !!!
>   move_swap_pte
>                                  <free folio B from swap cache>
>                                    | Holding a reference doesn't pin the cache
>                                    | as we have demonstrated
>                                  <Swapout folio A also using S1>
>     double_pt_lock
>     is_pte_pages_stable
>       | Passed because of S1 is reused
>     folio_move_anon_rmap(...)
>       | Moved invalid folio B here !!!
>
> But this is extremely hard to reproduce though, even if doing it
> deliberately...
>
> So I think a "folio_swap_contains" or equivalent check here is a good
> thing to have, to make it more robust and easier to understand. The
> checking after locking a folio has very tiny overhead and can
> definitely ensure the folio's swap entry is valid and stable.
>
> The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> here might seem problematic, but it's still not a real problem. That's
> the case where the swapin in src region happens after the lookup, and
> before the PTE lock. It will pass the PTE check without moving the
> folio. But the folio is guaranteed to be a completely new folio here
> because the folio can't be added back to the page table without
> holding the PTE lock, and if that happens the following PTE check here
> will fail.
>
> So I think we should patch the current kernel only adding a
> "folio_swap_contains" equivalent check here, and maybe more comments,
> how do you think?

The description appears to have some inconsistencies.
Would you mind rephrasing it?

Thanks
barry