linux-kernel - Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS

Message-ID: <CAGsJ_4xsUwUH_VyeYaXHURqTS66Fbuxa00GTM5izwK-=Vg_20g@mail.gmail.com>
Date: Tue, 4 Nov 2025 11:47:35 +0800
From: Barry Song <21cnbao@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, Nhat Pham <nphamcs@...il.com>, 
	Johannes Weiner <hannes@...xchg.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>, 
	David Hildenbrand <david@...hat.com>, Youngjun Park <youngjun.park@....com>, 
	Hugh Dickins <hughd@...gle.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-kernel@...r.kernel.org, 
	Kairui Song <kasong@...cent.com>
Subject: Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO

On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> Now the overhead of the swap cache is trivial, bypassing the swap
> cache is no longer a valid optimization. So unify the swapin path using
> the swap cache. This changes the swap in behavior in multiple ways:
>
> We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
> the indicator to bypass both the swap cache and readahead. The swap
> count check is not a good indicator for readahead. It existed because
> the previous swap design made readahead strictly coupled with swap
> cache bypassing. We actually want to always bypass readahead for
> SWP_SYNCHRONOUS_IO devices even if swap count > 1, but bypassing the
> swap cache will cause redundant IO.

I suppose bypassing causes not only redundant I/O but also additional
memory copies, since each swap-in allocates a new folio. Using the swap
cache allows the folio to be shared instead?
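
To make the sharing point concrete, here is a toy userspace sketch (not
kernel code). It models two faults on the same swap slot, once with a
bypass path that allocates and reads on every fault, and once with a
cache lookup in front. All toy_* names and the slot/array layout are
made up for illustration and do not correspond to the real swap cache
helpers.

```c
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
#define TOY_NR_SLOTS  8
#define TOY_NR_FOLIOS 16

struct toy_folio {
	int id;
};

struct toy_swap {
	struct toy_folio folios[TOY_NR_FOLIOS];	/* backing store for "allocations" */
	struct toy_folio *cache[TOY_NR_SLOTS];	/* slot -> cached folio, NULL if absent */
	int nr_allocs;				/* folios allocated so far */
	int nr_reads;				/* device reads issued so far */
};

/* Allocate a fresh folio and account one device read for it. */
static struct toy_folio *toy_alloc_and_read(struct toy_swap *s)
{
	struct toy_folio *folio;

	if (s->nr_allocs >= TOY_NR_FOLIOS)
		return NULL;
	folio = &s->folios[s->nr_allocs];
	folio->id = s->nr_allocs++;
	s->nr_reads++;
	return folio;
}

/* Bypass path: every fault gets a private folio and re-reads the slot. */
static struct toy_folio *toy_swapin_bypass(struct toy_swap *s, size_t slot)
{
	(void)slot;
	return toy_alloc_and_read(s);
}

/* Cached path: the first fault fills the cache, later faults share the folio. */
static struct toy_folio *toy_swapin_cached(struct toy_swap *s, size_t slot)
{
	if (!s->cache[slot])
		s->cache[slot] = toy_alloc_and_read(s);
	return s->cache[slot];
}
```

In this model, two faults on one slot through toy_swapin_bypass() return
two distinct folios and issue two reads, while toy_swapin_cached()
returns the same folio twice after a single read.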

>
> Now that limitation is gone. With the newly introduced helpers and design,
> we will always use the swap cache, so this check can be simplified to check
> SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
> SWP_SYNCHRONOUS_IO cases. This is a huge win for many workloads.
>
> The second thing here is that this enables large swap-in for all swap
> entries on SWP_SYNCHRONOUS_IO devices. Previously, large swap-in was
> also coupled with swap cache bypassing, so the count check's side
> effect also made large swap-in less effective. Now this is also fixed:
> large swap-in is always supported for all SWP_SYNCHRONOUS_IO
> cases.
>

In your cover letter, you mentioned: “it’s especially better for workloads
with swap count > 1 on SYNC_IO devices, about ~20% gain in the above test.”
Is this improvement mainly from mTHP swap-in?
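
For reference, my reading of the before/after behavior described in the
commit message can be sketched as two predicates. This is a userspace
illustration only: the toy_* names, the policy struct, and the flag bit
value are hypothetical, and the real logic lives in do_swap_page() and
the swap cache helpers.

```c
#include <stdbool.h>

/* Hypothetical flag bit for illustration; not the kernel's value. */
#define TOY_SWP_SYNCHRONOUS_IO (1u << 0)

struct toy_swapin_policy {
	bool bypass_swap_cache;
	bool skip_readahead;
	bool allow_large_swapin;
};

/*
 * Old behavior as described: one combined check gated cache bypass,
 * readahead bypass, and (as a side effect) large swap-in.
 */
static struct toy_swapin_policy toy_policy_old(unsigned int si_flags,
					       int swap_count)
{
	bool bypass = (si_flags & TOY_SWP_SYNCHRONOUS_IO) != 0 &&
		      swap_count == 1;
	struct toy_swapin_policy p = { bypass, bypass, bypass };

	return p;
}

/*
 * New behavior as described: the swap cache is never bypassed, and the
 * device flag alone disables readahead and permits large swap-in.
 */
static struct toy_swapin_policy toy_policy_new(unsigned int si_flags,
					       int swap_count)
{
	bool sync_io = (si_flags & TOY_SWP_SYNCHRONOUS_IO) != 0;
	struct toy_swapin_policy p = { false, sync_io, sync_io };

	(void)swap_count;	/* the count no longer affects the decision */
	return p;
}
```

With swap count 2 on a sync IO device, the old predicate yields
readahead and no large swap-in, while the new one skips readahead and
allows large swap-in, which is the case the cover letter's number
refers to.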


> And to catch potential issues with large swap in, especially with page
> exclusiveness and swap cache, more debug sanity checks and comments are
> added. But overall, the code is simpler. And new helper and routines
> will be used by other components in later commits too. And now it's
> possible to rely on the swap cache layer for resolving synchronization
> issues, which will also be done by a later commit.
>
> Worth mentioning that for a large folio workload, this may cause more
> serious thrashing. This isn't a problem with this commit, but a generic
> large folio issue. For a 4K workload, this commit increases the
> performance.
>
> Signed-off-by: Kairui Song <kasong@...cent.com>
> ---
>  mm/memory.c     | 136 +++++++++++++++++++++-----------------------------------
>  mm/swap.h       |   6 +++
>  mm/swap_state.c |  27 +++++++++++
>  3 files changed, 84 insertions(+), 85 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..9a43d4811781 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> +/* Sanity check that a folio is fully exclusive */
> +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> +                                unsigned int nr_pages)
> +{
> +       do {
> +               VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
> +               entry.val++;
> +       } while (--nr_pages);
> +}
>
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
>  vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
> -       struct folio *swapcache, *folio = NULL;
> -       DECLARE_WAITQUEUE(wait, current);
> +       struct folio *swapcache = NULL, *folio;
>         struct page *page;
>         struct swap_info_struct *si = NULL;
>         rmap_t rmap_flags = RMAP_NONE;
> -       bool need_clear_cache = false;
>         bool exclusive = false;
>         swp_entry_t entry;
>         pte_t pte;
>         vm_fault_t ret = 0;
> -       void *shadow = NULL;
>         int nr_pages;
>         unsigned long page_idx;
>         unsigned long address;
> @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         folio = swap_cache_get_folio(entry);
>         if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
> -       swapcache = folio;
> -

I wonder if we should move swap_update_readahead() elsewhere. Since
you've completely dropped readahead for sync IO, why do we still need to
call swap_update_readahead() here?

Thanks
Barry