lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com>
Date: Tue, 27 May 2025 23:11:28 +0800
From: Kairui Song <ryncsn@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Hugh Dickins <hughd@...gle.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, LKML <linux-kernel@...r.kernel.org>, 
	linux-mm <linux-mm@...ck.org>, Nhat Pham <nphamcs@...il.com>, 
	Ryan Roberts <ryan.roberts@....com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Tim Chen <tim.c.chen@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@...il.com> wrote:
>
> On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@...il.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@...il.com> wrote:
> > > >
> > > > Barry Song <21cnbao@...il.com> 于 2025年5月21日周三 06:33写道:
> > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > >
> > > > That will cause quite a performance regression I think. Good thing is,
> > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > performance, eliminating things like this. One more reason to justify
> > > > the approach :)
> >
> > Hi Barry,
> >
> > >
> > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > moves the swap PTE
> > > using non-synchronized I/O. Technically, this seems possible; however,
> > > I was unable
> > > to reproduce it, likely because the time window between swapin_readahead and
> > > taking the page table lock within do_swap_page() is too short.
> >
> > Thank you so much for trying this!
> >
> > I have been trying to reproduce it too, and so far I didn't observe
> > any crash or warn. I added following debug code:
> >
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >         pmd_t dummy_pmdval;
> >         pmd_t dst_pmdval;
> >         struct folio *src_folio = NULL;
> > +       struct folio *tmp_folio = NULL;
> >         struct anon_vma *src_anon_vma = NULL;
> >         struct mmu_notifier_range range;
> >         int err = 0;
> > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >                 if (!src_folio)
> >                         folio = filemap_get_folio(swap_address_space(entry),
> >                                         swap_cache_index(entry));
> > +               udelay(get_random_u32_below(1000));
> > +               tmp_folio = filemap_get_folio(swap_address_space(entry),
> > +                                       swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> > +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > +                               pr_err("UFFDIO_MOVE: UNSTABLE folio
> > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > tmp_folio->swap.val);
> > +                       }
> > +                       folio_put(tmp_folio);
> > +               }
> >                 if (!IS_ERR_OR_NULL(folio)) {
> >                         if (folio_test_large(folio)) {
> >                                 err = -EBUSY;
> > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > dst_pte, src_pte,
> >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> >                                 dst_ptl, src_ptl, src_folio);
> > +               if (tmp_folio != folio && !err)
> > +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > check: %lx -> %lx\n", folio, tmp_folio);
> >         }
> >
> > And I saw these two prints are getting triggered like this (not a real
> > issue though, just help to understand the problem)
> > ...
> > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > fffffdffc7ccac80 (51)
> > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > fffffdffc3435e00 (3b)
> > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > fffffdffc34ab8c0 (76)
> > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > fffffdffc34ab8c0 (6d)
> > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > fffffdffc34eda80 (32)
> > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > fffffdffc38cedc0 (b)
> > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > fffffdffc3465cc0 (21)
> > ...
> >
> > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > folio is indeed freed completely from the swap cache after the first
> > lookup, so another swapout can reuse the entry. But as you mentioned
> > __remove_mapping won't release a folio if the refcount check fails, so
> > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > are many places that can happen. But these two helpers won't free a
> > folio from swap cache if its swap count is not zero. And the folio
> > will either be swapped out (swap count non zero), or mapped (freeing
> > it is fine, PTE is non_swap, and another swapout will still use the
> > same folio).
> >
> > So after more investigation and dumping the pages, it's actually the
> > second lookup (tmp_folio) seeing the entry being reused by another
> > page table entry, after the first folio is swapped back and released.
> > So the page table check below will always fail just fine.
> >
> > But this also proves the first look up can see a completely irrelevant
> > folio too: If the src folio is swapped out, but got swapped back and
> > freed, then another folio B shortly got added to swap cache reuse the
> > src folio's old swap entry, then the folio B got seen by the look up
> > here and get freed from swap cache, then src folio got swapped out
> > again also reusing the same entry, then we have a problem as PTE seems
> > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > not wrong:
> >
> > Something like this:
> > CPU1                             CPU2
> > move_pages_pte
> >   entry = pte_to_swp_entry(orig_src_pte);
> >     | Got Swap Entry S1 from src_pte
> >   ...
> >                                  <swapin src_pte, using folio A>
>
> I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> sure where folio B comes from in the statement `<someone else tried to
> swap out folio B>`.
>
> If that assumption is correct, and folio A is still in the swapcache,
> how could someone swap in folio B without hitting folio A? That would
> suggest folio A must have been removed from the swapcache earlier—right?
>
> >                                  <free folio A from swap cache freeing S1>
> >                                  <someone else try swap out folio B >

Sorry my bad, I think I made people think folio B is related to
src_pte at this point. What I actually mean is that: Another random
folio B, unrelated to src_pte, could got swapped out, and using the
swap entry S1.

> >                                  <put folio B to swap cache using S1 >
> >                                 ...
> >   folio = swap_cache_get_folio(S1)
> >     | Got folio B here !!!
> >   move_swap_pte
> >                                  <free folio B from swap cache>
> >                                    | Holding a reference doesn't pin the cache
> >                                    | as we have demonstrated
> >                                  <Swapout folio A also using S1>
> >     double_pt_lock
> >     is_pte_pages_stable
> >       | Passed because of S1 is reused
> >     folio_move_anon_rmap(...)
> >       | Moved invalid folio B here !!!
> >
> > But this is extremely hard to reproduce though, even if doing it
> > deliberately...
> >
> > So I think a "folio_swap_contains" or equivalent check here is a good
> > thing to have, to make it more robust and easier to understand. The
> > checking after locking a folio has very tiny overhead and can
> > definitely ensure the folio's swap entry is valid and stable.
> >
> > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > here might seem problematic, but it's still not a real problem. That's
> > the case where the swapin in src region happens after the lookup, and
> > before the PTE lock. It will pass the PTE check without moving the
> > folio. But the folio is guaranteed to be a completely new folio here
> > because the folio can't be added back to the page table without
> > holding the PTE lock, and if that happens the following PTE check here
> > will fail.
> >
> > So I think we should patch the current kernel only adding a
> > "folio_swap_contains" equivalent check here, and maybe more comments,
> > how do you think?
>
> The description appears to have some inconsistencies.
> Would you mind rephrasing it?

Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
fffffdffc385fb00" part first, as both you and me have come into a
conclusion that "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to swap cache" is OK, and this
output only proves that happens.

So the problematic race is:

Here move_pages_pte is moving src_pte to dst_pte, and it begins with
src_pte == swap entry S1, and S1 isn't cached.

CPU1                             CPU2
move_pages_pte()
  entry = pte_to_swp_entry(orig_src_pte);
    | src_pte is absent, and got entry == S1
  ... < Somehow interrupted> ...
                                 <swapin src_pte, using folio A>
                                   | folio A is just a new allocated folio
                                   | for resolving the swap in fault.
                                 <free folio A from swap cache freeing S1>
                                   | swap in fault is resolved, src_pte
                                   | now points to folio A, so folio A
                                   | can get freed just fine.
                                   | And now S1 is free to be used
                                   | by anyone.
                                 <someone else try swap out another folio B >
                                   | Folio B is a completely unrelated
                                   | folio swapped out by random process.
                                   | (has nothing to do with src_pte)
                                   | But S1 is freed so it may use S1
                                   | as its swap entry.
                                 <put folio B to swap cache with index S1 >
                                 ...
  folio = filemap_get_folio(S1)
    | The lookup is using S1, so it
    | got folio B here !!!
  ... < Somehow interrupted> ...
                                 <free folio B from swap cache>
                                   | Folio B could fail to be swapped out
                                   | or got swapped in again, so it can
                                   | be freed by folio_free_swap or
                                   | swap cache reclaim.
                                   | CPU1 is holding a reference but it
                                   | doesn't pin the swap cache folio
                                   | as I have demonstrated with the
                                   | test C program previously.
                                   | New S1 is free to be used again.
                                 <Swapout src_pte again using S1>
                                   | No thing blocks this from happening
                                   | The swapout is still using folio A,
                                   | and src_pte == S1.
  folio_trylock(folio)
  move_swap_pte
    double_pt_lock
    is_pte_pages_stable
      | Passed because of S1 is reused so src_pte == S1.
    folio_move_anon_rmap(...)
      | Moved invalid folio B here !!!

It's a long and complex one, I don't think it's practically possible
to happen in reality but in theory doable, once in a million maybe...
Still we have to fix it, or did I got anything wrong here?

>
> Thanks
> barry

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ