lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAMgjq7DEPHd5TPXtq86pyu-PGQZ+8KOk_6TS-yD7edmTxS2-SQ@mail.gmail.com>
Date: Sat, 31 May 2025 03:24:33 +0800
From: Kairui Song <ryncsn@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Hugh Dickins <hughd@...gle.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, LKML <linux-kernel@...r.kernel.org>, 
	linux-mm <linux-mm@...ck.org>, Nhat Pham <nphamcs@...il.com>, 
	Ryan Roberts <ryan.roberts@....com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Tim Chen <tim.c.chen@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

On Fri, May 30, 2025 at 4:49 PM Kairui Song <ryncsn@...il.com> wrote:
>
> On Tue, May 27, 2025 at 11:11 PM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@...il.com> wrote:
> > >
> > > On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@...il.com> wrote:
> > > >
> > > > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@...il.com> wrote:
> > > > >
> > > > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@...il.com> wrote:
> > > > > >
> > > > > > Barry Song <21cnbao@...il.com> 于 2025年5月21日周三 06:33写道:
> > > > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > > > >
> > > > > > That will cause quite a performance regression I think. Good thing is,
> > > > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > > > performance, eliminating things like this. One more reason to justify
> > > > > > the approach :)
> > > >
> > > > Hi Barry,
> > > >
> > > > >
> > > > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > > > moves the swap PTE
> > > > > using non-synchronized I/O. Technically, this seems possible; however,
> > > > > I was unable
> > > > > to reproduce it, likely because the time window between swapin_readahead and
> > > > > taking the page table lock within do_swap_page() is too short.
> > > >
> > > > Thank you so much for trying this!
> > > >
> > > > I have been trying to reproduce it too, and so far I didn't observe
> > > > any crash or warn. I added following debug code:
> > > >
> > > >  static __always_inline
> > > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > >         pmd_t dummy_pmdval;
> > > >         pmd_t dst_pmdval;
> > > >         struct folio *src_folio = NULL;
> > > > +       struct folio *tmp_folio = NULL;
> > > >         struct anon_vma *src_anon_vma = NULL;
> > > >         struct mmu_notifier_range range;
> > > >         int err = 0;
> > > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > >                 if (!src_folio)
> > > >                         folio = filemap_get_folio(swap_address_space(entry),
> > > >                                         swap_cache_index(entry));
> > > > +               udelay(get_random_u32_below(1000));
> > > > +               tmp_folio = filemap_get_folio(swap_address_space(entry),
> > > > +                                       swap_cache_index(entry));
> > > > +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> > > > +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > > > +                               pr_err("UFFDIO_MOVE: UNSTABLE folio
> > > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > > > tmp_folio->swap.val);
> > > > +                       }
> > > > +                       folio_put(tmp_folio);
> > > > +               }
> > > >                 if (!IS_ERR_OR_NULL(folio)) {
> > > >                         if (folio_test_large(folio)) {
> > > >                                 err = -EBUSY;
> > > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > > > dst_pte, src_pte,
> > > >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > >                                 dst_ptl, src_ptl, src_folio);
> > > > +               if (tmp_folio != folio && !err)
> > > > +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > > > check: %lx -> %lx\n", folio, tmp_folio);
> > > >         }
> > > >
> > > > And I saw these two prints are getting triggered like this (not a real
> > > > issue though, just help to understand the problem)
> > > > ...
> > > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > > > fffffdffc7ccac80 (51)
> > > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > > > fffffdffc3435e00 (3b)
> > > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > > > fffffdffc34ab8c0 (76)
> > > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > > > fffffdffc34ab8c0 (6d)
> > > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > > > fffffdffc34eda80 (32)
> > > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > > > fffffdffc38cedc0 (b)
> > > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > > > fffffdffc3465cc0 (21)
> > > > ...
> > > >
> > > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > > > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > > > folio is indeed freed completely from the swap cache after the first
> > > > lookup, so another swapout can reuse the entry. But as you mentioned
> > > > __remove_mapping won't release a folio if the refcount check fails, so
> > > > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > > > are many places that can happen. But these two helpers won't free a
> > > > folio from swap cache if its swap count is not zero. And the folio
> > > > will either be swapped out (swap count non zero), or mapped (freeing
> > > > it is fine, PTE is non_swap, and another swapout will still use the
> > > > same folio).
> > > >
> > > > So after more investigation and dumping the pages, it's actually the
> > > > second lookup (tmp_folio) seeing the entry being reused by another
> > > > page table entry, after the first folio is swapped back and released.
> > > > So the page table check below will always fail just fine.
> > > >
> > > > But this also proves the first look up can see a completely irrelevant
> > > > folio too: If the src folio is swapped out, but got swapped back and
> > > > freed, then another folio B shortly got added to swap cache reuse the
> > > > src folio's old swap entry, then the folio B got seen by the look up
> > > > here and get freed from swap cache, then src folio got swapped out
> > > > again also reusing the same entry, then we have a problem as PTE seems
> > > > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > > > not wrong:
> > > >
> > > > Something like this:
> > > > CPU1                             CPU2
> > > > move_pages_pte
> > > >   entry = pte_to_swp_entry(orig_src_pte);
> > > >     | Got Swap Entry S1 from src_pte
> > > >   ...
> > > >                                  <swapin src_pte, using folio A>
> > >
> > > I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> > > sure where folio B comes from in the statement `<someone else tried to
> > > swap out folio B>`.
> > >
> > > If that assumption is correct, and folio A is still in the swapcache,
> > > how could someone swap in folio B without hitting folio A? That would
> > > suggest folio A must have been removed from the swapcache earlier—right?
> > >
> > > >                                  <free folio A from swap cache freeing S1>
> > > >                                  <someone else try swap out folio B >
> >
> > Sorry my bad, I think I made people think folio B is related to
> > src_pte at this point. What I actually mean is that: Another random
> > folio B, unrelated to src_pte, could got swapped out, and using the
> > swap entry S1.
> >
> > > >                                  <put folio B to swap cache using S1 >
> > > >                                 ...
> > > >   folio = swap_cache_get_folio(S1)
> > > >     | Got folio B here !!!
> > > >   move_swap_pte
> > > >                                  <free folio B from swap cache>
> > > >                                    | Holding a reference doesn't pin the cache
> > > >                                    | as we have demonstrated
> > > >                                  <Swapout folio A also using S1>
> > > >     double_pt_lock
> > > >     is_pte_pages_stable
> > > >       | Passed because of S1 is reused
> > > >     folio_move_anon_rmap(...)
> > > >       | Moved invalid folio B here !!!
> > > >
> > > > But this is extremely hard to reproduce though, even if doing it
> > > > deliberately...
> > > >
> > > > So I think a "folio_swap_contains" or equivalent check here is a good
> > > > thing to have, to make it more robust and easier to understand. The
> > > > checking after locking a folio has very tiny overhead and can
> > > > definitely ensure the folio's swap entry is valid and stable.
> > > >
> > > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > > > here might seem problematic, but it's still not a real problem. That's
> > > > the case where the swapin in src region happens after the lookup, and
> > > > before the PTE lock. It will pass the PTE check without moving the
> > > > folio. But the folio is guaranteed to be a completely new folio here
> > > > because the folio can't be added back to the page table without
> > > > holding the PTE lock, and if that happens the following PTE check here
> > > > will fail.
> > > >
> > > > So I think we should patch the current kernel only adding a
> > > > "folio_swap_contains" equivalent check here, and maybe more comments,
> > > > how do you think?
> > >
> > > The description appears to have some inconsistencies.
> > > Would you mind rephrasing it?
> >
> > Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
> > fffffdffc385fb00" part first, as both you and me have come into a
> > conclusion that "filemap_get_folio() returns NULL before
> > move_swap_pte, but a folio was added to swap cache" is OK, and this
> > output only proves that happens.
> >
> > So the problematic race is:
> >
> > Here move_pages_pte is moving src_pte to dst_pte, and it begins with
> > src_pte == swap entry S1, and S1 isn't cached.
> >
> > CPU1                             CPU2
> > move_pages_pte()
> >   entry = pte_to_swp_entry(orig_src_pte);
> >     | src_pte is absent, and got entry == S1
> >   ... < Somehow interrupted> ...
> >                                  <swapin src_pte, using folio A>
> >                                    | folio A is just a new allocated folio
> >                                    | for resolving the swap in fault.
> >                                  <free folio A from swap cache freeing S1>
> >                                    | swap in fault is resolved, src_pte
> >                                    | now points to folio A, so folio A
> >                                    | can get freed just fine.
> >                                    | And now S1 is free to be used
> >                                    | by anyone.
> >                                  <someone else try swap out another folio B >
> >                                    | Folio B is a completely unrelated
> >                                    | folio swapped out by random process.
> >                                    | (has nothing to do with src_pte)
> >                                    | But S1 is freed so it may use S1
> >                                    | as its swap entry.
> >                                  <put folio B to swap cache with index S1 >
> >                                  ...
> >   folio = filemap_get_folio(S1)
> >     | The lookup is using S1, so it
> >     | got folio B here !!!
> >   ... < Somehow interrupted> ...
> >                                  <free folio B from swap cache>
> >                                    | Folio B could fail to be swapped out
> >                                    | or got swapped in again, so it can
> >                                    | be freed by folio_free_swap or
> >                                    | swap cache reclaim.
> >                                    | CPU1 is holding a reference but it
> >                                    | doesn't pin the swap cache folio
> >                                    | as I have demonstrated with the
> >                                    | test C program previously.
> >                                    | New S1 is free to be used again.
> >                                  <Swapout src_pte again using S1>
> >                                    | No thing blocks this from happening
> >                                    | The swapout is still using folio A,
> >                                    | and src_pte == S1.
> >   folio_trylock(folio)
> >   move_swap_pte
> >     double_pt_lock
> >     is_pte_pages_stable
> >       | Passed because of S1 is reused so src_pte == S1.
> >     folio_move_anon_rmap(...)
> >       | Moved invalid folio B here !!!
> >
> > It's a long and complex one, I don't think it's practically possible
> > to happen in reality but in theory doable, once in a million maybe...
> > Still we have to fix it, or did I got anything wrong here?
>
> Hi Barry,
>
> I managed to reproduce this issue, by hacking the kernel a bit (Only
> adding only delay to increase the race window, and adding a WARN to
> indicate the problem)
>
> 1. Applying following patch for kernel:
> ===
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..1d710adf9839 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -15,6 +15,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/delay.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -1100,6 +1101,10 @@ static int move_swap_pte(struct mm_struct *mm,
> struct vm_area_struct *dst_vma,
>          * occur and hit the swapcache after moving the PTE.
>          */
>         if (src_folio) {
> +               if (WARN_ON(src_folio->swap.val !=
> pte_to_swp_entry(orig_src_pte).val))
> +                       pr_err("Moving folio %lx (folio->swap = %lx),
> orig_src_pte = %lx\n",
> +                              (unsigned long)src_folio, src_folio->swap.val,
> +                              pte_to_swp_entry(orig_src_pte).val);
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
>         }
> @@ -1388,9 +1393,13 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                  * folios in the swapcache. This issue needs to be resolved
>                  * separately to allow proper handling.
>                  */
> +               pr_err("DEBUG: Will do the lookup using entry %lx,
> wait 3s...\n", entry.val);
> +               mdelay(1000 * 3);
>                 if (!src_folio)
>                         folio = filemap_get_folio(swap_address_space(entry),
>                                         swap_cache_index(entry));
> +               pr_err("DEBUG: Got folio value %lx, wait 3s...\n",
> (unsigned long)folio);
> +               mdelay(1000 * 3);
>                 if (!IS_ERR_OR_NULL(folio)) {
>                         if (folio_test_large(folio)) {
>                                 err = -EBUSY;
>
> 2. Save following program in userspace (didn't bother with error check
> for simpler code):
> ===
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <sys/syscall.h>
> #include <linux/userfaultfd.h>
> #include <fcntl.h>
> #include <pthread.h>
> #include <unistd.h>
> #include <poll.h>
> #include <errno.h>
>
> #define PAGE_SIZE 4096
> /* Need to consume all slots so define the swap device size here */
> #define SWAP_DEVICE_SIZE (PAGE_SIZE * 1024 - 1)
>
> char *src, *race, *dst, *place_holder;
> int uffd;
>
> void read_in(char *p) {
>     /* This test program initials memory with 0xAA to bypass zeromap */
>     while (*((volatile char*)p) != 0xAA);
> }
>
> void *reader_thread(void *arg) {
>     /* Test requires kernel to wait upon uffd move */
>     read_in(dst);
>     return NULL;
> }
>
> void *fault_handler_thread(void *arg) {
>     int ret;
>     struct uffd_msg msg;
>     struct uffdio_move move;
>     struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
>     pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
>     pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
>
>     while (1) {
>         poll(&pollfd, 1, -1);
>         read(uffd, &msg, sizeof(msg));
>
>         move.src = (unsigned long)src + (msg.arg.pagefault.address -
>             (unsigned long)dst);
>         move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
>         move.len = PAGE_SIZE;
>         move.mode = 0;
>
>         ioctl(uffd, UFFDIO_MOVE, &move);
>     }
>     return NULL;
> }
>
> int main() {
>     pthread_t fault_handler_thr, reader_thr;
>     struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
>     struct uffdio_register uffdio_register;
>
>     src = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS, -1, 0);
>     dst = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS, -1, 0);
>     memset(src, 0xAA, PAGE_SIZE);
>     madvise(src, PAGE_SIZE, MADV_PAGEOUT);
>
>     /* Consume all slots on swap device left only one entry (S1) */
>     place_holder = mmap(NULL, SWAP_DEVICE_SIZE - 1, PROT_READ |
> PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     memset(place_holder, 0xAA, SWAP_DEVICE_SIZE - 1);
>     madvise(place_holder, SWAP_DEVICE_SIZE - 1, MADV_PAGEOUT);
>
>     /* Setup uffd handler and dst reader */
>     uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>     ioctl(uffd, UFFDIO_API, &uffdio_api);
>     uffdio_register.range.start = (unsigned long)dst;
>     uffdio_register.range.len = PAGE_SIZE;
>     uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
>     ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
>     pthread_create(&fault_handler_thr, NULL, fault_handler_thread, NULL);
>     pthread_create(&reader_thr, NULL, reader_thread, NULL);
>
>     /* Wait for UFFDIO to start */
>     sleep(1);
>
>     /* Release src folio (A) from swap, freeing the entry S1 */
>     read_in(src);
>
>     /* Swapout another race folio (B) using S1 */
>     race = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED |
> MAP_ANONYMOUS, -1, 0);
>     memset(race, 0xAA, PAGE_SIZE);
>     madvise(race, PAGE_SIZE, MADV_PAGEOUT);
>
>     /* Wait for UFFDIO swap lookup to see the race folio (B) */
>     sleep(3);
>
>     /* Free the race folio (B) from swap */
>     read_in(race);
>     /* And swap out src folio (A) again, using S1 */
>     madvise(src, PAGE_SIZE, MADV_PAGEOUT);
>
>     /* Kernel should have moved a wrong folio by now */
>
>     pthread_join(reader_thr, NULL);
>     pthread_cancel(fault_handler_thr);
>     pthread_join(fault_handler_thr, NULL);
>     munmap(race, PAGE_SIZE);
>     munmap(src, PAGE_SIZE);
>     munmap(dst, PAGE_SIZE);
>     close(uffd);
>
>     return 0;
> }
>
> 3. Run the test with (ensure no other swap device is mounted and
> current dir is on a block device):
> ===
> dd if=/dev/zero of=swap.img bs=1M count=1; mkswap swap.img; swapon
> swap.img; gcc test-uffd.c && ./a.out
>
> Then we get the WARN:
> [  348.200587] ------------[ cut here ]------------
> [  348.200599] WARNING: CPU: 7 PID: 1856 at mm/userfaultfd.c:1104
> move_pages_pte+0xdb8/0x11a0
> [  348.207544] Modules linked in: loop
> [  348.209401] CPU: 7 UID: 0 PID: 1856 Comm: a.out Kdump: loaded Not
> tainted 6.15.0-rc6ptch-00381-g99f00d7c6c6f-dirty #304
> PREEMPT(voluntary)
> [  348.214579] Hardware name: QEMU QEMU Virtual Machine, BIOS
> edk2-stable202408-prebuilt.qemu.org 08/13/2024
> [  348.218656] pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [  348.222013] pc : move_pages_pte+0xdb8/0x11a0
> [  348.224062] lr : move_pages_pte+0x928/0x11a0
> [  348.225881] sp : ffff800088b2b8f0
> [  348.227360] x29: ffff800088b2b970 x28: 0000000000000000 x27: 0000ffffbc920000
> [  348.230228] x26: fffffdffc335e4a8 x25: 0000000000000001 x24: fffffdffc3e4dd40
> [  348.233159] x23: 080000010d792403 x22: ffff0000cd792900 x21: ffff0000c5a6d2c0
> [  348.236339] x20: fffffdffc335e4a8 x19: 0000000000001004 x18: 0000000000000006
> [  348.239269] x17: 0000ffffbc920000 x16: 0000ffffbc922fff x15: 0000000000000003
> [  348.242703] x14: ffff8000812c3b68 x13: 0000000000000003 x12: 0000000000000003
> [  348.245947] x11: 0000000000000000 x10: ffff800081e4feb8 x9 : 0000000000000001
> [  348.249284] x8 : 0000000000000000 x7 : 6f696c6f6620746f x6 : 47203a4755424544
> [  348.252071] x5 : ffff8000815789e3 x4 : ffff8000815789e5 x3 : 0000000000000000
> [  348.255358] x2 : ffff0001fed2aef0 x1 : 0000000000000000 x0 : fffffdffc335e4a8
> [  348.258134] Call trace:
> [  348.259468]  move_pages_pte+0xdb8/0x11a0 (P)
> [  348.261348]  move_pages+0x3c0/0x738
> [  348.262987]  userfaultfd_ioctl+0x3d8/0x1f98
> [  348.264916]  __arm64_sys_ioctl+0x88/0xd0
> [  348.266779]  invoke_syscall+0x64/0xec
> [  348.268347]  el0_svc_common+0xa8/0xd8
> [  348.269967]  do_el0_svc+0x1c/0x28
> [  348.271711]  el0_svc+0x40/0xe0
> [  348.273345]  el0t_64_sync_handler+0x78/0x108
> [  348.274821]  el0t_64_sync+0x19c/0x1a0
> [  348.276117] ---[ end trace 0000000000000000 ]---
> [  348.278638] Moving folio fffffdffc3e4dd40 (folio->swap = 0), orig_src_pte = 1
>
> That's the new added WARN, but the test program also hung with D
> forever, and with errors with other tests like:
> [  406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
> type:MM_ANONPAGES val:-1
> [  406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
> type:MM_SHMEMPAGES val:1
>
> Because the kernel just moved the wrong folio, so unmap takes forever
> looking for the missing folio, and counting went wrong too.
>
> So this race is real. It's extremely unlikely to happen because it
> requires multiple collisions of multiple tiny race windows, however
> it's not impossible.
>
> I'll post a fix very soon.

On second thought, the "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to swap cache" case is also
buggy. It can also be reproduced with the program above with slight
modification:

--- test-uffd.c 2025-05-30 08:34:00.485206529 +0000
+++ test-uffd-same-folio.c      2025-05-30 19:04:13.826078271 +0000
@@ -83,20 +83,20 @@
     /* Release src folio (A) from swap, freeing the entry S1 */
     read_in(src);

-    /* Swapout another race folio (B) using S1 */
+    /* Swapout and free another race folio (B) forcing reclaiming S1
and folio (A) */
     race = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_ANONYMOUS, -1, 0);
     memset(race, 0xAA, PAGE_SIZE);
     madvise(race, PAGE_SIZE, MADV_PAGEOUT);
+    read_in(race);
+    printf("RECLAMING A?\n");

-    /* Wait for UFFDIO swap lookup to see the race folio (B) */
+    /* Wait for UFFDIO swap lookup to see NULL */
     sleep(3);

-    /* Free the race folio (B) from swap */
-    read_in(race);
     /* And swap out src folio (A) again, using S1 */
     madvise(src, PAGE_SIZE, MADV_PAGEOUT);

-    /* Kernel should have moved a wrong folio by now */
+    /* Kernel should have moved folio (A) but it didn't */

     pthread_join(reader_thr, NULL);
     pthread_cancel(fault_handler_thr);

I'll fix them together.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ