Message-ID: <CAJD7tkbQMB1RBr1mDb_Ye+wvk97mD1D-+uFAxOEw0ZRLZp1yRQ@mail.gmail.com>
Date: Thu, 1 Aug 2024 13:02:13 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Nhat Pham <nphamcs@...il.com>
Cc: akpm@...ux-foundation.org, hannes@...xchg.org, shakeel.butt@...ux.dev, 
	linux-mm@...ck.org, kernel-team@...a.com, linux-kernel@...r.kernel.org, 
	flintglass@...il.com, chengming.zhou@...ux.dev
Subject: Re: [PATCH v2 2/2] zswap: increment swapin count for non-pivot
 swapped in pages

On Tue, Jul 30, 2024 at 3:27 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> Currently, we only increment the swapin counter on pivot pages. This
> means we are not taking into account pages that also need to be swapped
> in, but are already taken care of as part of the readahead window. We

Hmm, but there is a chance that these pages are not actually needed,
in which case we will unnecessarily increase the zswap protection.
Does the readahead window self-correct if the pages were not used?

> are also incrementing when the pages are read from the zswap pool, which
> is inaccurate.

I feel like this is the more important part. It should be the focus of
the commit log with more details (i.e. why is it wrong to increment
the zswap protection in this case).

Do we need a Fixes and cc:stable for this one? Maybe it can be moved
first to make backports easy.

>
> This patch rectifies this issue by incrementing whenever we need to
> perform a non-zswap read.
>
> To test this change, I built the kernel under a cgroup with its
> memory.max set to 2 GB:
>
> real: 236.66s
> user: 4286.06s
> sys: 652.86s
> swapins: 81552
>
> For comparison, with just the new second chance algorithm, the build
> time is as follows:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> Without either:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> (average over 5 runs)
>
> With this change, the kernel CPU time reduces by a further 1.7%, and
> the real time is reduced by another 3.3%, compared to just the second
> chance algorithm by itself. The swapins count also reduces by another
> 13.85%.
>
> Combining the two changes, we reduce the real time by 10.32%, kernel CPU
> time by 3%, and number of swapins by 64.12%.
>
> To gauge the new scheme's ability to offload cold data, I ran another
> benchmark, in which the kernel was built under a cgroup with memory.max
> set to 3 GB, but with 0.5 GB worth of cold data allocated before each
> build (in a shmem file).
>
> Under the old scheme:
>
> real: 197.18s
> user: 4365.08s
> sys: 289.02s
> zswpwb: 72115.2
>
> Under the new scheme:
>
> real: 195.8s
> user: 4362.25s
> sys: 290.14s
> zswpwb: 87277.8
>
> (average over 5 runs)
>
> Notice that we actually observe a 21% increase in the number of written
> back pages, so the new scheme is just as good, if not better, at
> offloading pages from the zswap pool when they are cold. Build time
> reduces by around 0.7% as a result.
>
> Suggested-by: Johannes Weiner <hannes@...xchg.org>
> Signed-off-by: Nhat Pham <nphamcs@...il.com>
> ---
>  mm/page_io.c    | 11 ++++++++++-
>  mm/swap_state.c |  8 ++------
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ff8c99ee3af7..0004c9fbf7e8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -521,7 +521,15 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>
>         if (zswap_load(folio)) {
>                 folio_unlock(folio);
> -       } else if (data_race(sis->flags & SWP_FS_OPS)) {
> +               goto finish;
> +       }
> +
> +       /*
> +        * We have to read the page from slower devices. Increase zswap protection.
> +        */
> +       zswap_folio_swapin(folio);
> +
> +       if (data_race(sis->flags & SWP_FS_OPS)) {
>                 swap_read_folio_fs(folio, plug);
>         } else if (synchronous) {
>                 swap_read_folio_bdev_sync(folio, sis);
> @@ -529,6 +537,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>                 swap_read_folio_bdev_async(folio, sis);
>         }
>
> +finish:
>         if (workingset) {
>                 delayacct_thrashing_end(&in_thrashing);
>                 psi_memstall_leave(&pflags);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index a1726e49a5eb..3a0cf965f32b 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -698,10 +698,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         /* The page was likely read above, so no need for plugging here */
>         folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>                                         &page_allocated, false);
> -       if (unlikely(page_allocated)) {
> -               zswap_folio_swapin(folio);
> +       if (unlikely(page_allocated))
>                 swap_read_folio(folio, NULL);
> -       }
>         return folio;
>  }
>
> @@ -850,10 +848,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         /* The folio was likely read above, so no need for plugging here */
>         folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
>                                         &page_allocated, false);
> -       if (unlikely(page_allocated)) {
> -               zswap_folio_swapin(folio);
> +       if (unlikely(page_allocated))
>                 swap_read_folio(folio, NULL);
> -       }
>         return folio;
>  }
>
> --
> 2.43.0
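
For readers following the diff above, the control-flow change in swap_read_folio() can be modelled in user space roughly as follows. This is only a sketch under assumptions: struct swapin_model, its fields and model_swap_read_folio() are hypothetical names made up for illustration, and the in_zswap flag stands in for the result of zswap_load(). It shows just the counting behaviour the patch describes (bump protection only on a non-zswap read), not the real kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters; none of these names exist in the kernel. */
struct swapin_model {
	unsigned long zswap_hits;        /* reads satisfied by the zswap pool */
	unsigned long protection_bumps;  /* stands in for zswap_folio_swapin() calls */
	unsigned long backing_reads;     /* reads that went to the slower backing store */
};

/*
 * Models the patched swap_read_folio(): a successful zswap load returns
 * early (the "goto finish" path) without touching the protection counter;
 * every other read bumps it before going to the slower backing store.
 */
static void model_swap_read_folio(struct swapin_model *m, bool in_zswap)
{
	assert(m != NULL);

	if (in_zswap) {
		m->zswap_hits++;
		return;
	}

	m->protection_bumps++;
	m->backing_reads++;
}
```

In this model the caller's role (pivot page or readahead page) does not matter: feeding it one zswap hit and two misses leaves protection_bumps at 2, whichever path issued the reads.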

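The percentages in the quoted commit message can be re-derived from the raw numbers with a small helper. A minimal sketch, assuming every figure is a plain relative change against the earlier measurement; pct_reduction() is a hypothetical name, not something from the patch.

```c
#include <assert.h>

/*
 * Percentage by which `after` is lower than `before`.
 * A negative result means the value went up instead.
 */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For example, pct_reduction(244.85, 236.66) comes out at roughly 3.3 (the real-time improvement over the second chance algorithm alone), and pct_reduction(72115.2, 87277.8) at roughly -21, i.e. the 21% increase in zswpwb.
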