Message-ID: <CAGsJ_4wo6u1WSXdzj8RUUDNdk5_YCfLV_mcJtvhiv2UonXw+nw@mail.gmail.com>
Date: Fri, 23 May 2025 14:29:54 +1200
From: Barry Song <21cnbao@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: akpm@...ux-foundation.org, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Hugh Dickins <hughd@...gle.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, LKML <linux-kernel@...r.kernel.org>, 
	linux-mm <linux-mm@...ck.org>, Nhat Pham <nphamcs@...il.com>, 
	Ryan Roberts <ryan.roberts@....com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Tim Chen <tim.c.chen@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@...il.com> wrote:
>
> On Wed, May 21, 2025 at 06:33 Barry Song <21cnbao@...il.com> wrote:
> >
> > On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@...il.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@...il.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@...il.com> wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@...il.com> wrote:
> > > > > >
> > > > > > > From: Kairui Song <kasong@...cent.com>
> > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > >                               goto retry;
> > > > > > >                       }
> > > > > > >               }
> > > > > > > +             if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > +                     err = -EBUSY;
> > > > > > > +                     goto out;
> > > > > > > +             }
> > > > > >
> > > > > > It seems we don't need this. In move_swap_pte(), we already check
> > > > > > that the PTE pages are stable:
> > > > > >
> > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > >                 return -EAGAIN;
> > > > > >         }
> > > > >
> > > > > The tricky part is that when swap_cache_get_folio() returns the
> > > > > folio, both the folio and the PTEs are unlocked. So is it possible
> > > > > that someone else swapped in the entries, then swapped them out again
> > > > > using the same entries?
> > > > >
> > > > > The folio will be different here, but the PTEs still hold the same
> > > > > value, so they will pass the is_pte_pages_stable check. We previously
> > > > > saw similar races with anon fault and shmem. I think stricter checking
> > > > > won't hurt here.
> > > >
> > > > This doesn't seem to be the same case as the one you fixed in
> > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > case nobody was hitting the swap cache, and you used
> > > > swapcache_prepare() to set up the cache to fix the issue.
> > > >
> > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > not guard against that case either.
> > >
> > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > above. Thanks for catching this!
> > >
> > > > But I suspect we won't have a problem, since we're not swapping in —
> > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > move the PTEs.
> > >
> > > My concern is that a parallel swapin / swapout could result in the
> > > folio being a completely unrelated or invalid folio.
> > >
> > > It's not about the dst, but in the move src side, something like:
> > >
> > > CPU1                             CPU2
> > > move_pages_pte
> > >   folio = swap_cache_get_folio(...)
> > >     | Got folio A here
> > >   move_swap_pte
> > >                                  <swapin src_pte, using folio A>
> > >                                  <swapout src_pte, put folio A>
> > >                                    | Now folio A is no longer valid.
> > >                                    | It's very unlikely but here SWAP
> > >                                    | could reuse the same entry as above.
> >
> >
> > swap_cache_get_folio() does increment the folio's refcount, but it seems this
> > doesn't prevent do_swap_page() from freeing the swap entry after swapping
> > in src_pte with folio A, if it's a read fault.
> > For a write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > will be false:
> >
> > static inline bool should_try_to_free_swap(struct folio *folio,
> >                                            struct vm_area_struct *vma,
> >                                            unsigned int fault_flags)
> > {
> >        ...
> >
> >         /*
> >          * If we want to map a page that's in the swapcache writable, we
> >          * have to detect via the refcount if we're really the exclusive
> >          * user. Try freeing the swapcache to get rid of the swapcache
> >          * reference only in case it's likely that we'll be the exclusive user.
> >          */
> >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> >                 folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > }
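> >
> > For example, if I count correctly, with an order-0 folio A in the swap
> > cache and mapped by one PTE after a read swapin, plus our reference from
> > swap_cache_get_folio():
> >
> >         folio_ref_count(folio)    == 3  (swap cache + PTE map + ours)
> >         1 + folio_nr_pages(folio) == 2
> >
> > so a write fault sees the counts differ and keeps the swap cache entry,
> > but the read fault path never consults our extra reference at all.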
> >
> > and for swapout, __remove_mapping() does check the refcount as well:
> >
> > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> >         refcount = 1 + folio_nr_pages(folio);
> >         if (!folio_ref_freeze(folio, refcount))
> >                 goto cannot_free;
> >
> > }
> >
> > However, since __remove_mapping() occurs after pageout(), it seems
> > this also doesn't prevent swapout from allocating a new swap entry to
> > fill src_pte.
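> >
> > For reference, reclaim orders these steps roughly as follows (going by
> > shrink_folio_list(), if I remember correctly):
> >
> >         add_to_swap()        <- new swap entry allocated
> >         try_to_unmap()       <- src_pte now holds that entry
> >         pageout()
> >         __remove_mapping()   <- the refcount check only happens here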
> >
> > It seems your concern is valid—unless I'm missing something.
> > Do you have a reproducer? If so, this will likely need a separate fix
> > patch rather than being hidden in this patchset.
>
> Thanks for the analysis. I don't have a reproducer yet. I did some
> local experiments and it seems possible, but the race window is so
> tiny that it's very difficult to make the swap entry reuse collide
> with it. I'll try more, but in theory this seems possible, or at
> least looks very fragile.
>
> And yeah, let's patch the kernel first if that's a real issue.
>
> >
> > >     double_pt_lock
> > >     is_pte_pages_stable
> > >       | Passed because of entry reuse.
> > >     folio_move_anon_rmap(...)
> > >       | Moved invalid folio A.
> > >
> > > And could it be possible that swap_cache_get_folio returns NULL
> > > here, but right before the double_pt_lock, a folio is added to the
> > > swap cache? Maybe we'd better check the swap cache after clearing the
> > > PTE and releasing the dst lock, but before releasing the src lock?
> >
> > It seems you're suggesting that a parallel swap-in allocates and adds
> > a folio to the swap cache, but the PTE has not yet been updated from
> > a swap entry to a present mapping?
> >
> > As long as do_swap_page() adds the folio to the swap cache
> > before updating the PTE to present, this scenario seems possible.
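> >
> > If that's the case, the interleaving would be something like:
> >
> > CPU1                             CPU2
> > move_pages_pte
> >   folio = swap_cache_get_folio(...)
> >     | Returns NULL, nothing in
> >     | the swap cache yet.
> >                                  do_swap_page
> >                                    | Allocates folio B and adds it to
> >                                    | the swap cache for the same entry,
> >                                    | src_pte not yet made present.
> >   move_swap_pte
> >     double_pt_lock
> >     is_pte_pages_stable
> >       | Passes, src_pte still holds
> >       | the swap entry.
> >     | Moves src_pte without ever
> >     | seeing folio B.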
>
> Yes, there are two kinds of problems here. I suspected there could be an
> ABA problem while working on the series, but wasn't certain. And I just
> realised there could also be a missed cache read here, thanks to your
> review and discussion :)
>
> >
> > It seems we need to double-check:
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..976053bd2bf1 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > +                                                       swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       return -EAGAIN;
> > +               }
> >         }
> > -
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> >  #ifdef CONFIG_MEM_SOFT_DIRTY
> >         orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
>
> Maybe it has to get even dirtier here and call swapcache_prepare() too,
> to cover the SYNC IO case?
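>
> Something like this, perhaps? A completely untested sketch on top of the
> diff above; I'm assuming the swapcache_prepare(entry, nr) signature from
> recent kernels, and that swapcache_clear() is usable from this context:
>
>         } else {
>                 /*
>                  * A SYNC IO swapin bypasses the swap cache but still
>                  * pins the entry with SWAP_HAS_CACHE, so probing with
>                  * swapcache_prepare() should catch it even when
>                  * filemap_get_folio() sees nothing.
>                  */
>                 if (swapcache_prepare(entry, 1)) {
>                         /* A swapin (or cache insert) is in flight. */
>                         double_pt_unlock(dst_ptl, src_ptl);
>                         return -EAGAIN;
>                 }
>                 /* Drop the temporary SWAP_HAS_CACHE we just set. */
>                 swapcache_clear(swp_swap_info(entry), entry, 1);
>         }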
>
> >
> > Let me run test case [1] to check whether this ever happens. I guess I
> > need to hack the kernel a bit to always add the folio to the swap cache
> > even for SYNC IO.
>
> That will cause quite a performance regression, I think. The good thing
> is, that's exactly the problem this series solves by dropping the SYNC
> IO swapin path and never bypassing the swap cache, improving performance
> while eliminating issues like this. One more reason to justify the
> approach :)

I attempted to reproduce the scenario where a folio is added to the swapcache
after filemap_get_folio() returns NULL but before move_swap_pte() moves the
swap PTE, using non-synchronized I/O. Technically, this seems possible;
however, I was unable to reproduce it, likely because the window between
swapin_readahead() and taking the page table lock within do_swap_page() is
too short.

Upon reconsideration, even if this situation occurs, it is not an issue,
because move_swap_pte() takes both the source and destination page table
locks and *clears* the source PTE. Thus, when do_swap_page() subsequently
acquires the source page table lock, it cannot map the new swapcache folio
into the PTE, since pte_same() will return false.
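
That is, the recheck under the PTL in do_swap_page(), roughly (quoting
from memory, so the exact code may differ across versions):

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (unlikely(!vmf->pte ||
		     !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
		goto out_nomap;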

>
> >
> > [1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
>
> I'll try this too.
>
> >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Also, -EBUSY seems like the wrong error code.
> > > > >
> > > > > Yes, thanks, I'll use -EAGAIN here, just like move_swap_pte().
> > > > >
> > > > >
> > > > > >
> > > > > > >               err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > > > >                               orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > > > >                               dst_ptl, src_ptl, src_folio);
> > > > > > >
> > > > > >
> > > >
> >

Thanks
Barry
