Message-ID: <CAGsJ_4xsUwUH_VyeYaXHURqTS66Fbuxa00GTM5izwK-=Vg_20g@mail.gmail.com>
Date: Tue, 4 Nov 2025 11:47:35 +0800
From: Barry Song <21cnbao@...il.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>, Nhat Pham <nphamcs@...il.com>,
Johannes Weiner <hannes@...xchg.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>,
David Hildenbrand <david@...hat.com>, Youngjun Park <youngjun.park@....com>,
Hugh Dickins <hughd@...gle.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>, Kemeng Shi <shikemeng@...weicloud.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-kernel@...r.kernel.org,
Kairui Song <kasong@...cent.com>
Subject: Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> Now the overhead of the swap cache is trivial, bypassing the swap
> cache is no longer a valid optimization. So unify the swapin path using
> the swap cache. This changes the swap in behavior in multiple ways:
>
> We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
> the indicator to bypass both the swap cache and readahead. The swap
> count check is not a good indicator for readahead. It existed because
> the previous swap design made readahead strictly coupled with swap
> cache bypassing. We actually want to always bypass readahead for
> SWP_SYNCHRONOUS_IO devices even if swap count > 1, but bypassing the
> swap cache will cause redundant IO.
I suppose the cost is not only redundant I/O but also extra memory
copies, since each swap-in allocates a new folio. Using the swap cache
lets the folio be shared instead, right?
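
To make the sharing point concrete, here is a userspace toy model, not
kernel code. Every name in it (toy_swap, toy_swapin_bypass,
toy_swapin_cached, and so on) is hypothetical and exists only for
illustration. It assumes two faults hit the same swap entry and counts
simulated reads and folio allocations on each path.

```c
/* Userspace toy model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_ENTRIES 8

struct toy_folio {
	unsigned long entry;	/* swap entry this folio holds data for */
	int refs;		/* number of users sharing this folio */
};

struct toy_swap {
	struct toy_folio *cache[TOY_NR_ENTRIES];	/* toy swap cache, indexed by entry */
	int nr_reads;		/* simulated device reads */
	int nr_allocs;		/* simulated folio allocations */
};

/* Allocate a folio and account one simulated read from the device. */
static struct toy_folio *toy_read_entry(struct toy_swap *s, unsigned long entry)
{
	struct toy_folio *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->entry = entry;
	f->refs = 1;
	s->nr_allocs++;
	s->nr_reads++;
	return f;
}

/* Bypass path: every fault allocates and reads its own private copy. */
static struct toy_folio *toy_swapin_bypass(struct toy_swap *s, unsigned long entry)
{
	return toy_read_entry(s, entry);
}

/* Cached path: the first fault fills the cache, later faults share the folio. */
static struct toy_folio *toy_swapin_cached(struct toy_swap *s, unsigned long entry)
{
	struct toy_folio *f;

	if (entry >= TOY_NR_ENTRIES)
		return NULL;
	f = s->cache[entry];
	if (f) {
		f->refs++;	/* share the existing folio, no new read */
		return f;
	}
	f = toy_read_entry(s, entry);
	s->cache[entry] = f;
	return f;
}
```

In this model, two bypass swap-ins of the same entry return two distinct
folios and account two reads, while two cached swap-ins return the same
folio and account one read. Real kernel behavior also involves locking,
reference counting, and reclaim, which the sketch ignores.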
>
> Now that limitation is gone. With the newly introduced helpers and design,
> we will always use the swap cache, so this check can be simplified to check
> SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
> SWP_SYNCHRONOUS_IO cases. This is a huge win for many workloads.
>
> The second thing here is that this enables large swap-in for all swap
> entries on SWP_SYNCHRONOUS_IO devices. Previously, large swap-in was
> also coupled with swap cache bypassing, so the count check's side
> effect also made large swap-in less effective. Now this is also fixed.
> We will always have large swap-in support for all SWP_SYNCHRONOUS_IO
> cases.
>
In your cover letter, you mentioned: “it’s especially better for workloads
with swap count > 1 on SYNC_IO devices, about ~20% gain in the above test.”
Is this improvement mainly from mTHP swap-in?
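
For reference, this is how I read the old and new gating conditions from
the commit message, written as a userspace sketch rather than kernel
code. The function names and the TOY_SWP_SYNCHRONOUS_IO bit value are
made up for illustration. The assumption is that the same condition
decided both "skip readahead" and "eligible for large swap-in".

```c
/* Userspace sketch, NOT kernel code. Names and flag value are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define TOY_SWP_SYNCHRONOUS_IO 0x1u	/* placeholder bit, not the kernel's value */

/*
 * Old gating as described in the commit message: the direct path
 * (no readahead, large swap-in allowed) needed a sync-IO device AND
 * a swap count of exactly 1.
 */
static bool toy_old_direct_swapin(unsigned int si_flags, int swap_count)
{
	return (si_flags & TOY_SWP_SYNCHRONOUS_IO) != 0 && swap_count == 1;
}

/*
 * New gating as described in the commit message: only the device flag
 * matters, so entries with swap count > 1 take the same path.
 */
static bool toy_new_direct_swapin(unsigned int si_flags, int swap_count)
{
	(void)swap_count;	/* intentionally ignored in the new scheme */
	return (si_flags & TOY_SWP_SYNCHRONOUS_IO) != 0;
}
```

The only case where the two differ is a sync-IO device with swap count
> 1, which is the case the cover letter's number refers to.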
> And to catch potential issues with large swap in, especially with page
> exclusiveness and swap cache, more debug sanity checks and comments are
> added. But overall, the code is simpler. The new helpers and routines
> will be used by other components in later commits too. And now it's
> possible to rely on the swap cache layer for resolving synchronization
> issues, which will also be done by a later commit.
>
> Worth mentioning that for a large folio workload, this may cause more
> serious thrashing. This isn't a problem with this commit, but a generic
> large folio issue. For a 4K workload, this commit increases the
> performance.
>
> Signed-off-by: Kairui Song <kasong@...cent.com>
> ---
> mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
> mm/swap.h | 6 +++
> mm/swap_state.c | 27 +++++++++++
> 3 files changed, 84 insertions(+), 85 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..9a43d4811781 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> +/* Sanity check that a folio is fully exclusive */
> +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> + unsigned int nr_pages)
> +{
> + do {
> + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
> + entry.val++;
> + } while (--nr_pages);
> +}
>
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> - struct folio *swapcache, *folio = NULL;
> - DECLARE_WAITQUEUE(wait, current);
> + struct folio *swapcache = NULL, *folio;
> struct page *page;
> struct swap_info_struct *si = NULL;
> rmap_t rmap_flags = RMAP_NONE;
> - bool need_clear_cache = false;
> bool exclusive = false;
> swp_entry_t entry;
> pte_t pte;
> vm_fault_t ret = 0;
> - void *shadow = NULL;
> int nr_pages;
> unsigned long page_idx;
> unsigned long address;
> @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio = swap_cache_get_folio(entry);
> if (folio)
> swap_update_readahead(folio, vma, vmf->address);
> - swapcache = folio;
> -
I wonder if we should move swap_update_readahead() elsewhere. Since
you've completely dropped readahead for sync IO, why do we still need to
call swap_update_readahead() here?
Thanks
Barry