Message-ID: <71bcbd3f-a8bd-4014-aabe-081006cc62f8@redhat.com>
Date: Fri, 18 Oct 2024 09:26:30 +0200
From: David Hildenbrand <david@...hat.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
yosryahmed@...gle.com, nphamcs@...il.com, chengming.zhou@...ux.dev,
usamaarif642@...il.com, ryan.roberts@....com, ying.huang@...el.com,
21cnbao@...il.com, akpm@...ux-foundation.org, hughd@...gle.com,
willy@...radead.org, bfoster@...hat.com, dchinner@...hat.com,
chrisl@...nel.org
Cc: wajdi.k.feghali@...el.com, vinodh.gopal@...el.com
Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead()
zswap load batching interface.
On 18.10.24 08:48, Kanchana P Sridhar wrote:
> This patch invokes the swapin_readahead()-based batching interface to
> prefetch a batch of 4K folios for zswap loads, with the decompressions
> batched and done in parallel using IAA hardware. swapin_readahead()
> prefetches folios based on vm.page-cluster and on how useful prior
> prefetches were to the workload. As folios are created in the swapcache
> and the readahead code calls swap_read_folio() with a "zswap_batch" and a
> "non_zswap_batch", the respective folio_batches get populated with the
> folios to be read.
>
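> A minimal sketch of the caller side, assuming the extended
> swap_read_folio() signature proposed in this series (the surrounding
> readahead loop and folio allocation are elided):
>
> 	struct swap_iocb *splug = NULL;
> 	struct folio_batch zswap_batch, non_zswap_batch;
>
> 	folio_batch_init(&zswap_batch);
> 	folio_batch_init(&non_zswap_batch);
>
> 	/* For each folio the readahead code creates in the swapcache: */
> 	swap_read_folio(folio, &splug, &zswap_batch, &non_zswap_batch);
>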
> Finally, the swapin_readahead() procedures call the newly added
> process_ra_batch_of_same_type() (a minimal sketch follows this list),
> which:
>
> 1) Reads all the non_zswap_batch folios sequentially by calling
> swap_read_folio().
> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which calls
> zswap_finish_load_batch() to finally decompress each
> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e., up to 8 pages in a prefetch
> batch of, say, 32 folios) in parallel with IAA.
>
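> A minimal sketch of that helper, under the assumption that it only has
> to dispatch the two batches (the plug plumbing and error handling are
> omitted):
>
> 	static void process_ra_batch_of_same_type(struct swap_iocb **splug,
> 			struct folio_batch *zswap_batch,
> 			struct folio_batch *non_zswap_batch)
> 	{
> 		unsigned int i;
>
> 		/* 1) Sequential reads for the non-zswap folios. */
> 		for (i = 0; i < folio_batch_count(non_zswap_batch); i++)
> 			swap_read_folio(non_zswap_batch->folios[i], splug,
> 					NULL, NULL);
>
> 		/*
> 		 * 2) Batched zswap loads: each SWAP_CRYPTO_SUB_BATCH_SIZE
> 		 * sub-batch is decompressed in parallel with IAA.
> 		 */
> 		swap_read_zswap_batch_unplug(zswap_batch, splug);
> 	}
>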
> Within do_swap_page(), we try to benefit from batch decompressions in both
> of these scenarios:
>
> 1) Single-mapped, SWP_SYNCHRONOUS_IO folios:
> We call swapin_readahead() with "single_mapped_path = true". This is
> done only in the !zswap_never_enabled() case.
> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> We call swapin_readahead() with "single_mapped_path = false".
>
> This will place folios in the swapcache: a design choice that handles the
> case where a folio that is "single-mapped" in process 1 could be
> prefetched in process 2, and that keeps highly contended server scenarios
> stable. Checks are added at the end of do_swap_page(), after the folio has
> been successfully loaded, to detect whether the single-mapped swapcache
> folio is still single-mapped; if so, folio_free_swap() is called on the
> folio.
>
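> For illustration, that check could look as follows (a sketch only;
> should_free_singlemap_swapcache() is the helper added by this patch,
> and the exact call site may differ):
>
> 	/* After the single-mapped folio has been successfully loaded: */
> 	if (single_mapped_swapcache &&
> 	    should_free_singlemap_swapcache(entry, folio))
> 		folio_free_swap(folio);
>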
> Within the swapin_readahead() functions, if single_mapped_path is true,
> and either the platform does not have IAA, or the platform has IAA but the
> user has selected a software compressor for zswap (details of the sysfs
> knob follow), readahead/batching are skipped and the folio is loaded using
> zswap_load().
>
> A new swap parameter, "singlemapped_ra_enabled" (false by default), is
> added for platforms that have IAA and on which
> zswap_load_batching_enabled() is true; it gives the user the option to run
> experiments with IAA and with software compressors for zswap (the swap
> device is SWP_SYNCHRONOUS_IO):
>
> For IAA:
> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>
> For software compressors:
> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>
> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> path.
>
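> Such a knob would presumably be wired up like the existing
> "vma_ra_enabled" attribute in mm/swap_state.c; a sketch, with the
> show/store handler names being illustrative (only the knob name and
> path are from this series):
>
> 	static bool singlemapped_ra_enabled __read_mostly;
>
> 	static ssize_t singlemapped_ra_enabled_show(struct kobject *kobj,
> 			struct kobj_attribute *attr, char *buf)
> 	{
> 		return sysfs_emit(buf, "%s\n",
> 				  singlemapped_ra_enabled ? "true" : "false");
> 	}
>
> 	static ssize_t singlemapped_ra_enabled_store(struct kobject *kobj,
> 			struct kobj_attribute *attr,
> 			const char *buf, size_t count)
> 	{
> 		if (kstrtobool(buf, &singlemapped_ra_enabled))
> 			return -EINVAL;
> 		return count;
> 	}
>
> 	static struct kobj_attribute singlemapped_ra_enabled_attr =
> 		__ATTR(singlemapped_ra_enabled, 0644,
> 		       singlemapped_ra_enabled_show,
> 		       singlemapped_ra_enabled_store);
>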
> Thanks to Ying Huang for the really helpful brainstorming discussions on
> the swap_read_folio() plug design.
>
> Suggested-by: Ying Huang <ying.huang@...el.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> ---
> mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++-----------
> mm/shmem.c | 2 +-
> mm/swap.h | 12 ++--
> mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> mm/swapfile.c | 2 +-
> 5 files changed, 299 insertions(+), 61 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b5745b9ffdf7..9655b85fc243 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> return 0;
> }
>
> +/*
> + * swapin readahead based batching interface for zswap batched loads using IAA:
> + *
> + * Should only be called for and if the faulting swap entry in do_swap_page
> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> + *
> + * Detect if the folio is in the swapcache, is still mapped only by this
> + * process, and, further, that there are no additional references to this
> + * folio (e.g., if another process simultaneously did readahead on this swap
> + * entry while this process was handling the page-fault, and got a pointer
> + * to the folio allocated by this process in the swapcache), besides the
> + * references that were obtained within __read_swap_cache_async() by this
> + * process that is faulting in this single-mapped swap entry.
> + */
How is this supposed to work for large folios?
> +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> + struct folio *folio)
> +{
> + if (!folio_test_swapcache(folio))
> + return false;
> +
> + if (__swap_count(entry) != 0)
> + return false;
> +
> + /*
> + * The folio ref count for a single-mapped folio that was allocated
> + * in __read_swap_cache_async() can be a maximum of 3. These are the
> + * incrementors of the folio ref count in __read_swap_cache_async():
> + * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> + */
> +
> + if (folio_ref_count(folio) <= 3)
> + return true;
> +
> + return false;
> +}
> +
> static inline bool should_try_to_free_swap(struct folio *folio,
> struct vm_area_struct *vma,
> unsigned int fault_flags)
> @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> swp_entry_t entry;
> pte_t pte;
> vm_fault_t ret = 0;
> + bool single_mapped_swapcache = false;
> void *shadow = NULL;
> int nr_pages;
> unsigned long page_idx;
> @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> - /* skip swapcache */
> - folio = alloc_swap_folio(vmf);
> - if (folio) {
> - __folio_set_locked(folio);
> - __folio_set_swapbacked(folio);
> -
> - nr_pages = folio_nr_pages(folio);
> - if (folio_test_large(folio))
> - entry.val = ALIGN_DOWN(entry.val, nr_pages);
> - /*
> - * Prevent parallel swapin from proceeding with
> - * the cache flag. Otherwise, another thread
> - * may finish swapin first, free the entry, and
> - * swapout reusing the same entry. It's
> - * undetectable as pte_same() returns true due
> - * to entry reuse.
> - */
> - if (swapcache_prepare(entry, nr_pages)) {
> + if (zswap_never_enabled()) {
> + /* skip swapcache */
> + folio = alloc_swap_folio(vmf);
> + if (folio) {
> + __folio_set_locked(folio);
> + __folio_set_swapbacked(folio);
> +
> + nr_pages = folio_nr_pages(folio);
> + if (folio_test_large(folio))
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> /*
> - * Relax a bit to prevent rapid
> - * repeated page faults.
> + * Prevent parallel swapin from proceeding with
> + * the cache flag. Otherwise, another thread
> + * may finish swapin first, free the entry, and
> + * swapout reusing the same entry. It's
> + * undetectable as pte_same() returns true due
> + * to entry reuse.
> */
> - add_wait_queue(&swapcache_wq, &wait);
> - schedule_timeout_uninterruptible(1);
> - remove_wait_queue(&swapcache_wq, &wait);
> - goto out_page;
> + if (swapcache_prepare(entry, nr_pages)) {
> + /*
> + * Relax a bit to prevent rapid
> + * repeated page faults.
> + */
> + add_wait_queue(&swapcache_wq, &wait);
> + schedule_timeout_uninterruptible(1);
> + remove_wait_queue(&swapcache_wq, &wait);
> + goto out_page;
> + }
> + need_clear_cache = true;
> +
> + mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> +
> + shadow = get_shadow_from_swap_cache(entry);
> + if (shadow)
> + workingset_refault(folio, shadow);
> +
> + folio_add_lru(folio);
> +
> + /* To provide entry to swap_read_folio() */
> + folio->swap = entry;
> + swap_read_folio(folio, NULL, NULL, NULL);
> + folio->private = NULL;
> + }
> + } else {
> + /*
> + * zswap is enabled or was enabled at some point.
> + * Don't skip swapcache.
> + *
> + * swapin readahead based batching interface
> + * for zswap batched loads using IAA:
> + *
> + * Readahead is invoked in this path only if
> + * the "singlemapped_ra_enabled" sysfs swap
> + * parameter is set to true. By default,
> + * "singlemapped_ra_enabled" is set to false,
> + * the recommended setting for software compressors.
> + * For IAA, if "singlemapped_ra_enabled" is set
> + * to true, readahead will be deployed in this path
> + * as well.
> + *
> + * For single-mapped pages, the batching interface
> + * calls __read_swap_cache_async() to allocate and
> + * place the faulting page in the swapcache. This is
> + * to handle a scenario where the faulting page in
> + * this process happens to simultaneously be a
> + * readahead page in another process. By placing the
> + * single-mapped faulting page in the swapcache,
> + * we avoid race conditions and duplicate page
> + * allocations under these scenarios.
> + */
> + folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> + vmf, true);
> + if (!folio) {
> + ret = VM_FAULT_OOM;
> + goto out;
> }
> - need_clear_cache = true;
> -
> - mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> -
> - shadow = get_shadow_from_swap_cache(entry);
> - if (shadow)
> - workingset_refault(folio, shadow);
> -
> - folio_add_lru(folio);
>
> - /* To provide entry to swap_read_folio() */
> - folio->swap = entry;
> - swap_read_folio(folio, NULL, NULL, NULL);
> - folio->private = NULL;
> - }
> + single_mapped_swapcache = true;
> + nr_pages = folio_nr_pages(folio);
> + swapcache = folio;
> + } /* swapin with zswap support. */
> } else {
> folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> - vmf);
> + vmf, false);
> swapcache = folio;
I'm sorry, but making this function ever more complicated and ugly is
not going to fly. The zswap special casing is quite ugly here as well.
Is there a way forward so that we can make this code actually readable and
avoid the zswap special casing?
--
Cheers,
David / dhildenb