linux-kernel - Re: [PATCH v2 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS


Message-ID: <CAGsJ_4zSqQax-Ct-zz3uMZ_dD7vnvTjWZ_BsnXBnMVpVNNDYUA@mail.gmail.com>
Date: Fri, 21 Nov 2025 08:55:23 +0800
From: Barry Song <21cnbao@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, Nhat Pham <nphamcs@...il.com>, 
	Yosry Ahmed <yosry.ahmed@...ux.dev>, David Hildenbrand <david@...nel.org>, 
	Johannes Weiner <hannes@...xchg.org>, Youngjun Park <youngjun.park@....com>, 
	Hugh Dickins <hughd@...gle.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Ying Huang <ying.huang@...ux.alibaba.com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-kernel@...r.kernel.org, 
	Kairui Song <kasong@...cent.com>
Subject: Re: [PATCH v2 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO

Hi Kairui,

>
> +       /*
> +        * If a large folio already belongs to anon mapping, then we
> +        * can just go on and map it partially.

This is right.

> +        * If not, with the large swapin check above failing, the page table
> +        * have changed, so sub pages might got charged to the wrong cgroup,
> +        * or even should be shmem. So we have to free it and fallback.
> +        * Nothing should have touched it, both anon and shmem checks if a
> +        * large folio is fully appliable before use.

I'm curious about one case:

- Process 1: all nr_pages pages are in swap.
- Process 2: only "nr_pages - m" pages are in swap (m slots have already
  been unmapped).

Sequence:

1. Process 1 swaps in the folio: it allocates it and adds it to the
   swapcache, but the rmap hasn't been added yet.
2. Process 2 swaps in the same folio and finds it in the swapcache, but
   it's not associated with an anon mapping yet.

What will process 2 do in this situation? Does it go to out_nomap? If so,
what happens on the second swapin attempt? Will it keep retrying
indefinitely until Process 1 completes the rmap installation?
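
To make the scenario concrete, here is a rough userspace model of what I
am asking about. It is only a sketch: struct model_folio, swapin_decide()
and faults_until_mapped() are made-up names, the model assumes the folio
stays in the swapcache between Process 2's faults, and it ignores folio
locking and the swap_cache_del_folio() call entirely. It copies only the
condition from the hunk below, so it says nothing about what the kernel
actually does here.

```c
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code. */
struct model_folio {
	bool anon;	/* stands in for folio_test_anon() */
	int nr_pages;	/* stands in for folio_nr_pages() */
};

enum fault_result { MAP_FOLIO, DROP_AND_RETRY };

/*
 * Same condition as the quoted hunk: a large folio that is not yet anon
 * and cannot be mapped in full by this fault takes the out_nomap path.
 */
static enum fault_result swapin_decide(const struct model_folio *f,
				       int nr_pages)
{
	bool large = f->nr_pages > 1;

	if (!f->anon && large && nr_pages != f->nr_pages)
		return DROP_AND_RETRY;
	return MAP_FOLIO;
}

/*
 * Number of faults the caller takes before it can map the folio, if
 * Process 1 installs the rmap only after 'delay' of those faults.
 * Assumes delay >= 0; a mismatching caller loops until then.
 */
static int faults_until_mapped(int folio_pages, int my_pages, int delay)
{
	struct model_folio f = { .anon = false, .nr_pages = folio_pages };
	int faults = 0;

	for (;;) {
		if (faults == delay)
			f.anon = true;	/* Process 1 finished the rmap */
		faults++;
		if (swapin_decide(&f, my_pages) == MAP_FOLIO)
			return faults;
	}
}
```

In this model, faults_until_mapped(16, 12, 3) is 4: Process 2 faults three
times for nothing and only succeeds once Process 1 is done, while a caller
that can map all 16 pages succeeds on its first fault regardless of delay.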

> +        *
> +        * This will be removed once we unify folio allocation in the swap cache
> +        * layer, where allocation of a folio stabilizes the swap entries.
> +        */
> +       if (!folio_test_anon(folio) && folio_test_large(folio) &&
> +           nr_pages != folio_nr_pages(folio)) {
> +               if (!WARN_ON_ONCE(folio_test_dirty(folio)))
> +                       swap_cache_del_folio(folio);
> +               goto out_nomap;
> +       }
> +
>         /*
>          * Check under PT lock (to protect against concurrent fork() sharing
>          * the swap entry concurrently) for certainly exclusive pages.
>          */
>         if (!folio_test_ksm(folio)) {
> +               /*
> +                * The can_swapin_thp check above ensures all PTE have
> +                * same exclusivenss, only check one PTE is fine.

Typos? "exclusivenss" should be "exclusiveness", and "only check one PTE
is fine" would read better as "checking just one PTE is fine".

> +                */
>                 exclusive = pte_swp_exclusive(vmf->orig_pte);
> +               if (exclusive)
> +                       check_swap_exclusive(folio, entry, nr_pages);
>                 if (folio != swapcache) {
>                         /*
>                          * We have a fresh page that is not exposed to the
> @@ -4985,18 +4962,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         vmf->orig_pte = pte_advance_pfn(pte, page_idx);
>
>         /* ksm created a completely new copy */
> -       if (unlikely(folio != swapcache && swapcache)) {
> +       if (unlikely(folio != swapcache)) {
>                 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>                 folio_add_lru_vma(folio, vma);
>         } else if (!folio_test_anon(folio)) {
>                 /*
> -                * We currently only expect small !anon folios which are either
> -                * fully exclusive or fully shared, or new allocated large
> -                * folios which are fully exclusive. If we ever get large
> -                * folios within swapcache here, we have to be careful.
> +                * We currently only expect !anon folios that are fully
> +                * mappable. See the comment after can_swapin_thp above.
>                  */
> -               VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
> -               VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +               VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
> +               VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);

We have this guard to ensure that a large folio is always added to the
rmap in one shot, since we only support partial rmap addition for folios
that have already been mapped before.

It now seems you rely on repeated page faults to ensure the partially
mapped process runs after the fully mapped one, which doesn’t look ideal
to me as it may cause priority inversion.

Thanks
Barry