linux-kernel - Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7CTbSeNq7HEzZP6raK_6ngC6HAzt46ZSBu0f2RQftZUBQ@mail.gmail.com>
Date: Fri, 30 May 2025 16:49:02 +0800
From: Kairui Song <ryncsn@...il.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Hugh Dickins <hughd@...gle.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, LKML <linux-kernel@...r.kernel.org>, 
	linux-mm <linux-mm@...ck.org>, Nhat Pham <nphamcs@...il.com>, 
	Ryan Roberts <ryan.roberts@....com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Tim Chen <tim.c.chen@...ux.intel.com>, Matthew Wilcox <willy@...radead.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention

On Tue, May 27, 2025 at 11:11 PM Kairui Song <ryncsn@...il.com> wrote:
>
> On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@...il.com> wrote:
> >
> > On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@...il.com> wrote:
> > >
> > > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@...il.com> wrote:
> > > >
> > > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@...il.com> wrote:
> > > > >
> > > > > Barry Song <21cnbao@...il.com> 于 2025年5月21日周三 06:33写道：
> > > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > > >
> > > > > That will cause quite a performance regression I think. Good thing is,
> > > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > > performance, eliminating things like this. One more reason to justify
> > > > > the approach :)
> > >
> > > Hi Barry,
> > >
> > > >
> > > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > > moves the swap PTE
> > > > using non-synchronized I/O. Technically, this seems possible; however,
> > > > I was unable
> > > > to reproduce it, likely because the time window between swapin_readahead and
> > > > taking the page table lock within do_swap_page() is too short.
> > >
> > > Thank you so much for trying this!
> > >
> > > I have been trying to reproduce it too, and so far I didn't observe
> > > any crash or warn. I added following debug code:
> > >
> > >  static __always_inline
> > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > >         pmd_t dummy_pmdval;
> > >         pmd_t dst_pmdval;
> > >         struct folio *src_folio = NULL;
> > > +       struct folio *tmp_folio = NULL;
> > >         struct anon_vma *src_anon_vma = NULL;
> > >         struct mmu_notifier_range range;
> > >         int err = 0;
> > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > >                 if (!src_folio)
> > >                         folio = filemap_get_folio(swap_address_space(entry),
> > >                                         swap_cache_index(entry));
> > > +               udelay(get_random_u32_below(1000));
> > > +               tmp_folio = filemap_get_folio(swap_address_space(entry),
> > > +                                       swap_cache_index(entry));
> > > +               if (!IS_ERR_OR_NULL(tmp_folio)) {
> > > +                       if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > > +                               pr_err("UFFDIO_MOVE: UNSTABLE folio
> > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > > tmp_folio->swap.val);
> > > +                       }
> > > +                       folio_put(tmp_folio);
> > > +               }
> > >                 if (!IS_ERR_OR_NULL(folio)) {
> > >                         if (folio_test_large(folio)) {
> > >                                 err = -EBUSY;
> > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > > dst_pte, src_pte,
> > >                                 orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > >                                 dst_ptl, src_ptl, src_folio);
> > > +               if (tmp_folio != folio && !err)
> > > +                       pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > > check: %lx -> %lx\n", folio, tmp_folio);
> > >         }
> > >
> > > And I saw these two prints are getting triggered like this (not a real
> > > issue though, just help to understand the problem)
> > > ...
> > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > > fffffdffc7ccac80 (51)
> > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > > fffffdffc3435e00 (3b)
> > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > > fffffdffc34ab8c0 (76)
> > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > > fffffdffc34ab8c0 (6d)
> > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > > fffffdffc34eda80 (32)
> > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > > fffffdffc38cedc0 (b)
> > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > > fffffdffc3465cc0 (21)
> > > ...
> > >
> > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > > folio is indeed freed completely from the swap cache after the first
> > > lookup, so another swapout can reuse the entry. But as you mentioned
> > > __remove_mapping won't release a folio if the refcount check fails, so
> > > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > > are many places that can happen. But these two helpers won't free a
> > > folio from swap cache if its swap count is not zero. And the folio
> > > will either be swapped out (swap count non zero), or mapped (freeing
> > > it is fine, PTE is non_swap, and another swapout will still use the
> > > same folio).
> > >
> > > So after more investigation and dumping the pages, it's actually the
> > > second lookup (tmp_folio) seeing the entry being reused by another
> > > page table entry, after the first folio is swapped back and released.
> > > So the page table check below will always fail just fine.
> > >
> > > But this also proves the first look up can see a completely irrelevant
> > > folio too: If the src folio is swapped out, but got swapped back and
> > > freed, then another folio B shortly got added to swap cache reuse the
> > > src folio's old swap entry, then the folio B got seen by the look up
> > > here and get freed from swap cache, then src folio got swapped out
> > > again also reusing the same entry, then we have a problem as PTE seems
> > > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > > not wrong:
> > >
> > > Something like this:
> > > CPU1                             CPU2
> > > move_pages_pte
> > >   entry = pte_to_swp_entry(orig_src_pte);
> > >     | Got Swap Entry S1 from src_pte
> > >   ...
> > >                                  <swapin src_pte, using folio A>
> >
> > I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> > sure where folio B comes from in the statement `<someone else tried to
> > swap out folio B>`.
> >
> > If that assumption is correct, and folio A is still in the swapcache,
> > how could someone swap in folio B without hitting folio A? That would
> > suggest folio A must have been removed from the swapcache earlier—right?
> >
> > >                                  <free folio A from swap cache freeing S1>
> > >                                  <someone else try swap out folio B >
>
> Sorry my bad, I think I made people think folio B is related to
> src_pte at this point. What I actually mean is that: Another random
> folio B, unrelated to src_pte, could got swapped out, and using the
> swap entry S1.
>
> > >                                  <put folio B to swap cache using S1 >
> > >                                 ...
> > >   folio = swap_cache_get_folio(S1)
> > >     | Got folio B here !!!
> > >   move_swap_pte
> > >                                  <free folio B from swap cache>
> > >                                    | Holding a reference doesn't pin the cache
> > >                                    | as we have demonstrated
> > >                                  <Swapout folio A also using S1>
> > >     double_pt_lock
> > >     is_pte_pages_stable
> > >       | Passed because of S1 is reused
> > >     folio_move_anon_rmap(...)
> > >       | Moved invalid folio B here !!!
> > >
> > > But this is extremely hard to reproduce though, even if doing it
> > > deliberately...
> > >
> > > So I think a "folio_swap_contains" or equivalent check here is a good
> > > thing to have, to make it more robust and easier to understand. The
> > > checking after locking a folio has very tiny overhead and can
> > > definitely ensure the folio's swap entry is valid and stable.
> > >
> > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > > here might seem problematic, but it's still not a real problem. That's
> > > the case where the swapin in src region happens after the lookup, and
> > > before the PTE lock. It will pass the PTE check without moving the
> > > folio. But the folio is guaranteed to be a completely new folio here
> > > because the folio can't be added back to the page table without
> > > holding the PTE lock, and if that happens the following PTE check here
> > > will fail.
> > >
> > > So I think we should patch the current kernel only adding a
> > > "folio_swap_contains" equivalent check here, and maybe more comments,
> > > how do you think?
> >
> > The description appears to have some inconsistencies.
> > Would you mind rephrasing it?
>
> Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
> fffffdffc385fb00" part first, as both you and me have come into a
> conclusion that "filemap_get_folio() returns NULL before
> move_swap_pte, but a folio was added to swap cache" is OK, and this
> output only proves that happens.
>
> So the problematic race is:
>
> Here move_pages_pte is moving src_pte to dst_pte, and it begins with
> src_pte == swap entry S1, and S1 isn't cached.
>
> CPU1                             CPU2
> move_pages_pte()
>   entry = pte_to_swp_entry(orig_src_pte);
>     | src_pte is absent, and got entry == S1
>   ... < Somehow interrupted> ...
>                                  <swapin src_pte, using folio A>
>                                    | folio A is just a new allocated folio
>                                    | for resolving the swap in fault.
>                                  <free folio A from swap cache freeing S1>
>                                    | swap in fault is resolved, src_pte
>                                    | now points to folio A, so folio A
>                                    | can get freed just fine.
>                                    | And now S1 is free to be used
>                                    | by anyone.
>                                  <someone else try swap out another folio B >
>                                    | Folio B is a completely unrelated
>                                    | folio swapped out by random process.
>                                    | (has nothing to do with src_pte)
>                                    | But S1 is freed so it may use S1
>                                    | as its swap entry.
>                                  <put folio B to swap cache with index S1 >
>                                  ...
>   folio = filemap_get_folio(S1)
>     | The lookup is using S1, so it
>     | got folio B here !!!
>   ... < Somehow interrupted> ...
>                                  <free folio B from swap cache>
>                                    | Folio B could fail to be swapped out
>                                    | or got swapped in again, so it can
>                                    | be freed by folio_free_swap or
>                                    | swap cache reclaim.
>                                    | CPU1 is holding a reference but it
>                                    | doesn't pin the swap cache folio
>                                    | as I have demonstrated with the
>                                    | test C program previously.
>                                    | New S1 is free to be used again.
>                                  <Swapout src_pte again using S1>
>                                    | No thing blocks this from happening
>                                    | The swapout is still using folio A,
>                                    | and src_pte == S1.
>   folio_trylock(folio)
>   move_swap_pte
>     double_pt_lock
>     is_pte_pages_stable
>       | Passed because of S1 is reused so src_pte == S1.
>     folio_move_anon_rmap(...)
>       | Moved invalid folio B here !!!
>
> It's a long and complex one, I don't think it's practically possible
> to happen in reality but in theory doable, once in a million maybe...
> Still we have to fix it, or did I got anything wrong here?

Hi Barry,

I managed to reproduce this issue, by hacking the kernel a bit (Only
adding only delay to increase the race window, and adding a WARN to
indicate the problem)

1. Applying following patch for kernel:
===

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..1d710adf9839 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -15,6 +15,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
+#include <linux/delay.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1100,6 +1101,10 @@ static int move_swap_pte(struct mm_struct *mm,
struct vm_area_struct *dst_vma,
         * occur and hit the swapcache after moving the PTE.
         */
        if (src_folio) {
+               if (WARN_ON(src_folio->swap.val !=
pte_to_swp_entry(orig_src_pte).val))
+                       pr_err("Moving folio %lx (folio->swap = %lx),
orig_src_pte = %lx\n",
+                              (unsigned long)src_folio, src_folio->swap.val,
+                              pte_to_swp_entry(orig_src_pte).val);
                folio_move_anon_rmap(src_folio, dst_vma);
                src_folio->index = linear_page_index(dst_vma, dst_addr);
        }
@@ -1388,9 +1393,13 @@ static int move_pages_pte(struct mm_struct *mm,
pmd_t *dst_pmd, pmd_t *src_pmd,
                 * folios in the swapcache. This issue needs to be resolved
                 * separately to allow proper handling.
                 */
+               pr_err("DEBUG: Will do the lookup using entry %lx,
wait 3s...\n", entry.val);
+               mdelay(1000 * 3);
                if (!src_folio)
                        folio = filemap_get_folio(swap_address_space(entry),
                                        swap_cache_index(entry));
+               pr_err("DEBUG: Got folio value %lx, wait 3s...\n",
(unsigned long)folio);
+               mdelay(1000 * 3);
                if (!IS_ERR_OR_NULL(folio)) {
                        if (folio_test_large(folio)) {
                                err = -EBUSY;

2. Save following program in userspace (didn't bother with error check
for simpler code):
===
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <poll.h>
#include <errno.h>

#define PAGE_SIZE 4096
/* Need to consume all slots so define the swap device size here */
#define SWAP_DEVICE_SIZE (PAGE_SIZE * 1024 - 1)

char *src, *race, *dst, *place_holder;
int uffd;

void read_in(char *p) {
    /* This test program initials memory with 0xAA to bypass zeromap */
    while (*((volatile char*)p) != 0xAA);
}

void *reader_thread(void *arg) {
    /* Test requires kernel to wait upon uffd move */
    read_in(dst);
    return NULL;
}

void *fault_handler_thread(void *arg) {
    int ret;
    struct uffd_msg msg;
    struct uffdio_move move;
    struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);

    while (1) {
        poll(&pollfd, 1, -1);
        read(uffd, &msg, sizeof(msg));

        move.src = (unsigned long)src + (msg.arg.pagefault.address -
            (unsigned long)dst);
        move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
        move.len = PAGE_SIZE;
        move.mode = 0;

        ioctl(uffd, UFFDIO_MOVE, &move);
    }
    return NULL;
}

int main() {
    pthread_t fault_handler_thr, reader_thr;
    struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
    struct uffdio_register uffdio_register;

    src = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
MAP_ANONYMOUS, -1, 0);
    dst = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
MAP_ANONYMOUS, -1, 0);
    memset(src, 0xAA, PAGE_SIZE);
    madvise(src, PAGE_SIZE, MADV_PAGEOUT);

    /* Consume all slots on swap device left only one entry (S1) */
    place_holder = mmap(NULL, SWAP_DEVICE_SIZE - 1, PROT_READ |
PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(place_holder, 0xAA, SWAP_DEVICE_SIZE - 1);
    madvise(place_holder, SWAP_DEVICE_SIZE - 1, MADV_PAGEOUT);

    /* Setup uffd handler and dst reader */
    uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    ioctl(uffd, UFFDIO_API, &uffdio_api);
    uffdio_register.range.start = (unsigned long)dst;
    uffdio_register.range.len = PAGE_SIZE;
    uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
    ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
    pthread_create(&fault_handler_thr, NULL, fault_handler_thread, NULL);
    pthread_create(&reader_thr, NULL, reader_thread, NULL);

    /* Wait for UFFDIO to start */
    sleep(1);

    /* Release src folio (A) from swap, freeing the entry S1 */
    read_in(src);

    /* Swapout another race folio (B) using S1 */
    race = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_ANONYMOUS, -1, 0);
    memset(race, 0xAA, PAGE_SIZE);
    madvise(race, PAGE_SIZE, MADV_PAGEOUT);

    /* Wait for UFFDIO swap lookup to see the race folio (B) */
    sleep(3);

    /* Free the race folio (B) from swap */
    read_in(race);
    /* And swap out src folio (A) again, using S1 */
    madvise(src, PAGE_SIZE, MADV_PAGEOUT);

    /* Kernel should have moved a wrong folio by now */

    pthread_join(reader_thr, NULL);
    pthread_cancel(fault_handler_thr);
    pthread_join(fault_handler_thr, NULL);
    munmap(race, PAGE_SIZE);
    munmap(src, PAGE_SIZE);
    munmap(dst, PAGE_SIZE);
    close(uffd);

    return 0;
}

3. Run the test with (ensure no other swap device is mounted and
current dir is on a block device):
===
dd if=/dev/zero of=swap.img bs=1M count=1; mkswap swap.img; swapon
swap.img; gcc test-uffd.c && ./a.out

Then we get the WARN:
[  348.200587] ------------[ cut here ]------------
[  348.200599] WARNING: CPU: 7 PID: 1856 at mm/userfaultfd.c:1104
move_pages_pte+0xdb8/0x11a0
[  348.207544] Modules linked in: loop
[  348.209401] CPU: 7 UID: 0 PID: 1856 Comm: a.out Kdump: loaded Not
tainted 6.15.0-rc6ptch-00381-g99f00d7c6c6f-dirty #304
PREEMPT(voluntary)
[  348.214579] Hardware name: QEMU QEMU Virtual Machine, BIOS
edk2-stable202408-prebuilt.qemu.org 08/13/2024
[  348.218656] pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[  348.222013] pc : move_pages_pte+0xdb8/0x11a0
[  348.224062] lr : move_pages_pte+0x928/0x11a0
[  348.225881] sp : ffff800088b2b8f0
[  348.227360] x29: ffff800088b2b970 x28: 0000000000000000 x27: 0000ffffbc920000
[  348.230228] x26: fffffdffc335e4a8 x25: 0000000000000001 x24: fffffdffc3e4dd40
[  348.233159] x23: 080000010d792403 x22: ffff0000cd792900 x21: ffff0000c5a6d2c0
[  348.236339] x20: fffffdffc335e4a8 x19: 0000000000001004 x18: 0000000000000006
[  348.239269] x17: 0000ffffbc920000 x16: 0000ffffbc922fff x15: 0000000000000003
[  348.242703] x14: ffff8000812c3b68 x13: 0000000000000003 x12: 0000000000000003
[  348.245947] x11: 0000000000000000 x10: ffff800081e4feb8 x9 : 0000000000000001
[  348.249284] x8 : 0000000000000000 x7 : 6f696c6f6620746f x6 : 47203a4755424544
[  348.252071] x5 : ffff8000815789e3 x4 : ffff8000815789e5 x3 : 0000000000000000
[  348.255358] x2 : ffff0001fed2aef0 x1 : 0000000000000000 x0 : fffffdffc335e4a8
[  348.258134] Call trace:
[  348.259468]  move_pages_pte+0xdb8/0x11a0 (P)
[  348.261348]  move_pages+0x3c0/0x738
[  348.262987]  userfaultfd_ioctl+0x3d8/0x1f98
[  348.264916]  __arm64_sys_ioctl+0x88/0xd0
[  348.266779]  invoke_syscall+0x64/0xec
[  348.268347]  el0_svc_common+0xa8/0xd8
[  348.269967]  do_el0_svc+0x1c/0x28
[  348.271711]  el0_svc+0x40/0xe0
[  348.273345]  el0t_64_sync_handler+0x78/0x108
[  348.274821]  el0t_64_sync+0x19c/0x1a0
[  348.276117] ---[ end trace 0000000000000000 ]---
[  348.278638] Moving folio fffffdffc3e4dd40 (folio->swap = 0), orig_src_pte = 1

That's the new added WARN, but the test program also hung with D
forever, and with errors with other tests like:
[  406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_ANONPAGES val:-1
[  406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_SHMEMPAGES val:1

Because the kernel just moved the wrong folio, so unmap takes forever
looking for the missing folio, and counting went wrong too.

So this race is real. It's extremely unlikely to happen because it
requires multiple collisions of multiple tiny race windows, however
it's not impossible.

I'll post a fix very soon.