Message-ID: <SJ0PR11MB5678B9E52ACCD1C2BD9DB115C9402@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Fri, 18 Oct 2024 18:09:04 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: David Hildenbrand <david@...hat.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"hannes@...xchg.org" <hannes@...xchg.org>, "yosryahmed@...gle.com"
<yosryahmed@...gle.com>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
"21cnbao@...il.com" <21cnbao@...il.com>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "hughd@...gle.com" <hughd@...gle.com>,
"willy@...radead.org" <willy@...radead.org>, "bfoster@...hat.com"
<bfoster@...hat.com>, "dchinner@...hat.com" <dchinner@...hat.com>,
"chrisl@...nel.org" <chrisl@...nel.org>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
CC: "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
<vinodh.gopal@...el.com>
Subject: RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead()
zswap load batching interface.
Hi David,
> -----Original Message-----
> From: David Hildenbrand <david@...hat.com>
> Sent: Friday, October 18, 2024 12:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>;
> linux-kernel@...r.kernel.org; linux-mm@...ck.org; hannes@...xchg.org;
> yosryahmed@...gle.com; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; Huang, Ying
> <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-foundation.org;
> hughd@...gle.com; willy@...radead.org; bfoster@...hat.com;
> dchinner@...hat.com; chrisl@...nel.org
> Cc: Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
>
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > This patch invokes the swapin_readahead() based batching interface to
> > prefetch a batch of 4K folios for zswap load, with batch decompressions
> > in parallel using IAA hardware. swapin_readahead() prefetches folios
> > based on vm.page-cluster and the usefulness of prior prefetches to the
> > workload. As folios are created in the swapcache and the readahead code
> > calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> > respective folio_batches get populated with the folios to be read.
> >
> > Finally, the swapin_readahead() procedures will call the newly added
> > process_ra_batch_of_same_type(), which:
> >
> > 1) Reads all the non_zswap_batch folios sequentially by calling
> >    swap_read_folio().
> > 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which
> >    calls zswap_finish_load_batch() to finally decompress each
> >    SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e., up to 8 pages in a
> >    prefetch batch of, say, 32 folios) in parallel with IAA.
> >
> > Within do_swap_page(), we try to benefit from batch decompressions in
> > both of these scenarios:
> >
> > 1) single-mapped, SWP_SYNCHRONOUS_IO:
> >    We call swapin_readahead() with "single_mapped_path = true". This is
> >    done only in the !zswap_never_enabled() case.
> > 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >    We call swapin_readahead() with "single_mapped_path = false".
> >
> > This will place folios in the swapcache: a design choice that handles
> > cases where a folio that is "single-mapped" in process 1 could be
> > prefetched in process 2, and that handles highly contended server
> > scenarios with stability. Checks are added at the end of do_swap_page(),
> > after the folio has been successfully loaded, to detect if the
> > single-mapped swapcache folio is still single-mapped, and if so,
> > folio_free_swap() is called on the folio.
> >
> > Within the swapin_readahead() functions, if single_mapped_path is true,
> > and either the platform does not have IAA, or the platform has IAA and
> > the user selects a software compressor for zswap (details of the sysfs
> > knob follow), readahead/batching are skipped and the folio is loaded
> > using zswap_load().
> >
> > A new swap parameter "singlemapped_ra_enabled" (false by default) is
> > added for platforms that have IAA, where zswap_load_batching_enabled()
> > is true, and we want to give the user the option to run experiments
> > with IAA and with software compressors for zswap (swap device is
> > SWP_SYNCHRONOUS_IO):
> >
> > For IAA:
> >   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > For software compressors:
> >   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> > skip prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> > do_swap_page() path.
> >
> > Thanks to Ying Huang for the really helpful brainstorming discussions
> > on the swap_read_folio() plug design.
> >
> > Suggested-by: Ying Huang <ying.huang@...el.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> > ---
> >  mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
> >  mm/shmem.c      |   2 +-
> >  mm/swap.h       |  12 ++--
> >  mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >  mm/swapfile.c   |   2 +-
> >  5 files changed, 299 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b5745b9ffdf7..9655b85fc243 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >  	return 0;
> >  }
> >
> > +/*
> > + * swapin readahead based batching interface for zswap batched loads using IAA:
> > + *
> > + * Should only be called if the faulting swap entry in do_swap_page()
> > + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > + *
> > + * Detect if the folio is in the swapcache, is still mapped to only this
> > + * process, and further, that there are no additional references to this
> > + * folio (e.g., if another process simultaneously did readahead on this
> > + * swap entry while this process was handling the page-fault, and got a
> > + * pointer to the folio allocated by this process in the swapcache),
> > + * besides the references that were obtained within
> > + * __read_swap_cache_async() by this process that is faulting in this
> > + * single-mapped swap entry.
> > + */
>
> How is this supposed to work for large folios?

Thanks for your code review comments. The main idea behind this
patch-series is to work within the existing kernel page-fault granularity
of 4K folios, which swapin_readahead() builds upon to prefetch other
"useful" 4K folios. The intent is not to make modifications at
page-fault time to opportunistically synthesize large folios for swapin.

As we know, __read_swap_cache_async() allocates an order-0 folio, which
explains the implementation of should_free_singlemap_swapcache() in this
patch. IOW, this is not supposed to work for large folios: it builds on
the existing page-fault behavior without modifying it.
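
To make this concrete, the end-of-fault check composes with the helper
like this (a simplified sketch of the intent, not the exact patch hunk;
locking and error handling omitted):

	/*
	 * At the end of do_swap_page(), once the single-mapped
	 * SWP_SYNCHRONOUS_IO folio has been loaded through the swapcache:
	 * if it is still single-mapped and has no extra references, we can
	 * release its swap entry, mirroring the "skip swapcache" behavior
	 * of the non-zswap path.
	 */
	if (single_mapped_swapcache &&
	    should_free_singlemap_swapcache(entry, folio))
		folio_free_swap(folio);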
>
> > +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> > +						   struct folio *folio)
> > +{
> > +	if (!folio_test_swapcache(folio))
> > +		return false;
> > +
> > +	if (__swap_count(entry) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * The folio ref count for a single-mapped folio that was allocated
> > +	 * in __read_swap_cache_async() can be a maximum of 3. These are the
> > +	 * incrementors of the folio ref count in __read_swap_cache_async():
> > +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> > +	 */
> > +
> > +	if (folio_ref_count(folio) <= 3)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  static inline bool should_try_to_free_swap(struct folio *folio,
> >  					   struct vm_area_struct *vma,
> >  					   unsigned int fault_flags)
> > @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	swp_entry_t entry;
> >  	pte_t pte;
> >  	vm_fault_t ret = 0;
> > +	bool single_mapped_swapcache = false;
> >  	void *shadow = NULL;
> >  	int nr_pages;
> >  	unsigned long page_idx;
> > @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (!folio) {
> >  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >  		    __swap_count(entry) == 1) {
> > -			/* skip swapcache */
> > -			folio = alloc_swap_folio(vmf);
> > -			if (folio) {
> > -				__folio_set_locked(folio);
> > -				__folio_set_swapbacked(folio);
> > -
> > -				nr_pages = folio_nr_pages(folio);
> > -				if (folio_test_large(folio))
> > -					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > -				/*
> > -				 * Prevent parallel swapin from proceeding with
> > -				 * the cache flag. Otherwise, another thread
> > -				 * may finish swapin first, free the entry, and
> > -				 * swapout reusing the same entry. It's
> > -				 * undetectable as pte_same() returns true due
> > -				 * to entry reuse.
> > -				 */
> > -				if (swapcache_prepare(entry, nr_pages)) {
> > +			if (zswap_never_enabled()) {
> > +				/* skip swapcache */
> > +				folio = alloc_swap_folio(vmf);
> > +				if (folio) {
> > +					__folio_set_locked(folio);
> > +					__folio_set_swapbacked(folio);
> > +
> > +					nr_pages = folio_nr_pages(folio);
> > +					if (folio_test_large(folio))
> > +						entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >  					/*
> > -					 * Relax a bit to prevent rapid
> > -					 * repeated page faults.
> > +					 * Prevent parallel swapin from proceeding with
> > +					 * the cache flag. Otherwise, another thread
> > +					 * may finish swapin first, free the entry, and
> > +					 * swapout reusing the same entry. It's
> > +					 * undetectable as pte_same() returns true due
> > +					 * to entry reuse.
> >  					 */
> > -					add_wait_queue(&swapcache_wq, &wait);
> > -					schedule_timeout_uninterruptible(1);
> > -					remove_wait_queue(&swapcache_wq, &wait);
> > -					goto out_page;
> > +					if (swapcache_prepare(entry, nr_pages)) {
> > +						/*
> > +						 * Relax a bit to prevent rapid
> > +						 * repeated page faults.
> > +						 */
> > +						add_wait_queue(&swapcache_wq, &wait);
> > +						schedule_timeout_uninterruptible(1);
> > +						remove_wait_queue(&swapcache_wq, &wait);
> > +						goto out_page;
> > +					}
> > +					need_clear_cache = true;
> > +
> > +					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > +
> > +					shadow = get_shadow_from_swap_cache(entry);
> > +					if (shadow)
> > +						workingset_refault(folio, shadow);
> > +
> > +					folio_add_lru(folio);
> > +
> > +					/* To provide entry to swap_read_folio() */
> > +					folio->swap = entry;
> > +					swap_read_folio(folio, NULL, NULL, NULL);
> > +					folio->private = NULL;
> > +				}
> > +			} else {
> > +				/*
> > +				 * zswap is enabled or was enabled at some point.
> > +				 * Don't skip swapcache.
> > +				 *
> > +				 * swapin readahead based batching interface
> > +				 * for zswap batched loads using IAA:
> > +				 *
> > +				 * Readahead is invoked in this path only if
> > +				 * the sys swap "singlemapped_ra_enabled" swap
> > +				 * parameter is set to true. By default,
> > +				 * "singlemapped_ra_enabled" is set to false,
> > +				 * the recommended setting for software compressors.
> > +				 * For IAA, if "singlemapped_ra_enabled" is set
> > +				 * to true, readahead will be deployed in this path
> > +				 * as well.
> > +				 *
> > +				 * For single-mapped pages, the batching interface
> > +				 * calls __read_swap_cache_async() to allocate and
> > +				 * place the faulting page in the swapcache. This is
> > +				 * to handle a scenario where the faulting page in
> > +				 * this process happens to simultaneously be a
> > +				 * readahead page in another process. By placing the
> > +				 * single-mapped faulting page in the swapcache,
> > +				 * we avoid race conditions and duplicate page
> > +				 * allocations under these scenarios.
> > +				 */
> > +				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +							 vmf, true);
> > +				if (!folio) {
> > +					ret = VM_FAULT_OOM;
> > +					goto out;
> >  				}
> > -				need_clear_cache = true;
> > -
> > -				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > -
> > -				shadow = get_shadow_from_swap_cache(entry);
> > -				if (shadow)
> > -					workingset_refault(folio, shadow);
> > -
> > -				folio_add_lru(folio);
> >
> > -				/* To provide entry to swap_read_folio() */
> > -				folio->swap = entry;
> > -				swap_read_folio(folio, NULL, NULL, NULL);
> > -				folio->private = NULL;
> > -			}
> > +				single_mapped_swapcache = true;
> > +				nr_pages = folio_nr_pages(folio);
> > +				swapcache = folio;
> > +			} /* swapin with zswap support. */
> >  		} else {
> >  			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -						 vmf);
> > +						 vmf, false);
> >  			swapcache = folio;
>
> I'm sorry, but making this function ever more complicated and ugly is
> not going to fly. The zswap special casing is quite ugly here as well.
>
> Is there a way forward that we can make this code actually readable and
> avoid zswap special casing?

Yes, I realize this is now quite cluttered. I need to think some more
about how to make it more readable, and would appreciate any suggestions.
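
One direction I will explore is to factor both SWP_SYNCHRONOUS_IO swap-in
flavors out of do_swap_page() into helpers, so the fault handler no longer
carries the zswap special case inline. A rough, not compile-tested sketch
of the shape I have in mind (the helper names are hypothetical, not from
this series):

	/*
	 * Hypothetical refactor sketch: hide the zswap decision behind
	 * one helper so that do_swap_page() makes a single call.
	 */
	static struct folio *swapin_sync_no_swapcache(struct vm_fault *vmf,
						      swp_entry_t entry);
	static struct folio *swapin_sync_via_swapcache(struct vm_fault *vmf,
						       swp_entry_t entry);

	static struct folio *swapin_synchronous(struct vm_fault *vmf,
						swp_entry_t entry)
	{
		/* zswap was never enabled: safe to bypass the swapcache. */
		if (zswap_never_enabled())
			return swapin_sync_no_swapcache(vmf, entry);

		/*
		 * Otherwise, go through the swapcache so that readahead
		 * and batched zswap loads can be applied.
		 */
		return swapin_sync_via_swapcache(vmf, entry);
	}

Would something along these lines address the special-casing concern?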
Thanks,
Kanchana
>
> --
> Cheers,
>
> David / dhildenb