Message-Id: <20241018064805.336490-1-kanchana.p.sridhar@intel.com>
Date: Thu, 17 Oct 2024 23:47:58 -0700
From: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
To: linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
hannes@...xchg.org,
yosryahmed@...gle.com,
nphamcs@...il.com,
chengming.zhou@...ux.dev,
usamaarif642@...il.com,
ryan.roberts@....com,
ying.huang@...el.com,
21cnbao@...il.com,
akpm@...ux-foundation.org,
hughd@...gle.com,
willy@...radead.org,
bfoster@...hat.com,
dchinner@...hat.com,
chrisl@...nel.org,
david@...hat.com
Cc: wajdi.k.feghali@...el.com,
vinodh.gopal@...el.com,
kanchana.p.sridhar@...el.com
Subject: [RFC PATCH v1 0/7] zswap IAA decompress batching

IAA Decompression Batching:
===========================

This patch-series applies over [1], the IAA compress batching patch-series.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537

This RFC patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel decompression of 4K folios prefetched by
swapin_readahead(). We have developed zswap batched loading of these
prefetched folios, which uses IAA to decompress them in parallel.
swapin_readahead() provides a natural batching interface because it adapts
the readahead window to the usefulness of prior prefetches. Hence, it allows
the page-cluster to be set based on workload characteristics. For workloads
that are prefetch-friendly, this can form the basis for reading ahead up to
32 folios (vm.page-cluster of 5 gives a 2^5 = 32 page window) with zswap
load batching, to significantly reduce swapin latency, major page-faults and
sys time, thereby improving workload performance.

The patch-series builds upon the IAA compress batching patch-series [1],
and is organized as follows:
1) A centralized batch decompression API that can be used by swap modules.
2) "struct folio_batch" modifications, e.g., PAGEVEC_SIZE is increased to
2^5.
3) Addition of "zswap_batch" and "non_zswap_batch" folio_batches in
swap_read_folio() to serve as a plug.
4) swap_read_zswap_batch_unplug() API in page_io.c to process a read
batch of entries found in zswap.
5) zswap API to add a swap entry to a load batch, init/reinit the batch,
process the batch using the batch decompression API.
6) Modifications to the swapin_readahead() functions,
swap_vma_readahead() and swap_cluster_readahead(), to:
a) Call swap_read_folio() to add prefetch swap entries to the "zswap_batch"
and "non_zswap_batch" folio_batches.
b) Process the two readahead folio batches: "non_zswap_batch" folios
are read sequentially, while "zswap_batch" folios are batch
decompressed with IAA (see the sketch following this list).
7) Modifications to do_swap_page() to invoke swapin_readahead() from both
the single-mapped SWP_SYNCHRONOUS_IO and the shared/non-SWP_SYNCHRONOUS_IO
branches. In the former path, we call swapin_readahead() only in the
!zswap_never_enabled() case.
a) This causes folios to be read into the swapcache in both paths. This
design choice was motivated by stability: to handle race conditions where,
say, process 1 faults in a single-mapped folio while process 2 is
simultaneously prefetching it as a "readahead" folio.
b) If the single-mapped folio was successfully read and the race did not
occur, checks are added to free the swapcache entry for the folio before
do_swap_page() returns.
8) Finally, for IAA batching, we reduce SWAP_BATCH to 16 and modify the
swap slots cache thresholds to alleviate lock contention on the
swap_info_struct lock due to reduced swap page-fault latencies.
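
To make the flow concrete, the following is a minimal C sketch of how items
3) through 6) are intended to fit together. The function and folio_batch
names are taken from this cover letter and the patch subjects; the exact
signatures, the entry_for_readahead_index() helper and the "NULL batch means
read now" convention are assumptions for illustration only, not the code in
the patches.

/*
 * Illustrative sketch only -- not the actual patch code.  The names
 * swap_read_folio(), swap_read_zswap_batch_unplug(), "zswap_batch" and
 * "non_zswap_batch" come from this cover letter; their signatures, the
 * helper entry_for_readahead_index() and the NULL-batch convention are
 * assumptions.
 */
static void swapin_readahead_batched_sketch(struct vm_fault *vmf,
					    unsigned long ra_window)
{
	struct folio_batch zswap_batch;		/* entries found in zswap */
	struct folio_batch non_zswap_batch;	/* entries needing disk reads */
	unsigned long i;

	folio_batch_init(&zswap_batch);
	folio_batch_init(&non_zswap_batch);

	/*
	 * Plug phase: for each prefetch candidate in the readahead window,
	 * swap_read_folio() files the folio into one of the two batches
	 * instead of reading it immediately.
	 */
	for (i = 0; i < ra_window; i++) {
		struct folio *folio = entry_for_readahead_index(vmf, i);

		swap_read_folio(folio, &zswap_batch, &non_zswap_batch);
	}

	/* "non_zswap_batch" folios are read sequentially, as before. */
	for (i = 0; i < folio_batch_count(&non_zswap_batch); i++)
		swap_read_folio(non_zswap_batch.folios[i], NULL, NULL);

	/*
	 * "zswap_batch" folios are decompressed in parallel by IAA via the
	 * centralized swap_crypto_acomp_decompress_batch() API, invoked
	 * from swap_read_zswap_batch_unplug() in page_io.c.
	 */
	swap_read_zswap_batch_unplug(&zswap_batch);
}
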
IAA decompress batching can be enabled only on platforms that have IAA, by
setting this config variable:
CONFIG_ZSWAP_LOAD_BATCHING_ENABLED="y"
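
As a rough illustration (not the actual patch code), the config option could
gate a runtime helper along the following lines; zswap_is_enabled() is an
existing zswap API, while this particular definition of
zswap_load_batching_enabled() is an assumption:

#ifdef CONFIG_ZSWAP_LOAD_BATCHING_ENABLED
static inline bool zswap_load_batching_enabled(void)
{
	/* Assumed runtime check; the series may gate on more state. */
	return zswap_is_enabled();
}
#else
static inline bool zswap_load_batching_enabled(void)
{
	return false;
}
#endif
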
A new swap parameter, "singlemapped_ra_enabled" (false by default), is added
for use on platforms that have IAA. When zswap_load_batching_enabled() is
true, it gives the user the option to run experiments with both IAA and
software compressors for zswap.

These are the recommended settings for "singlemapped_ra_enabled", which
takes effect only in the do_swap_page() single-mapped SWP_SYNCHRONOUS_IO
path:

For IAA:
echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

For software compressors:
echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled

If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
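
To show where the knob takes effect, below is a rough sketch of the
single-mapped SWP_SYNCHRONOUS_IO branch in do_swap_page(). The outer
condition mirrors the existing upstream check; the swapin_readahead() call
and the comments about "singlemapped_ra_enabled" follow the description
above and are assumptions, not the patch's actual code.

	/* Illustrative sketch only -- not the actual patch code. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		if (!zswap_never_enabled()) {
			/*
			 * Read through the swapcache via swapin_readahead().
			 * When "singlemapped_ra_enabled" is false,
			 * swapin_readahead() skips prefetching here and only
			 * faults in the target folio.
			 */
			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
						 vmf);
		} else {
			/* Existing synchronous, swapcache-bypassing path. */
			folio = alloc_swap_folio(vmf);
		}
	}
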
IAA decompress batching performance testing was done using the kernel
compilation test "allmodconfig" run in tmpfs, which demonstrates a
significant amount of readahead activity. vm-scalability usemem is not
ideal for decompress batching because there is very little readahead
activity even with a page-cluster of 5 (swap_ra is < 150 with 4K/16K/32K/64K
folios).

The kernel compilation experiments with decompress batching demonstrate
significant latency reductions: up to 4% lower elapsed time and 14% lower
sys time than mm-unstable/zstd. When combined with compress batching, we see
a 5% reduction in elapsed time and a 20% reduction in sys time as compared
to mm-unstable commit 817952b8be34 with zstd.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to a
50% reduction in sys time and a 3.5% reduction in workload run time as
compared to software compressors.

System setup for testing:
=========================

Testing of this patch-series was done with mm-unstable as of 10-16-2024,
commit 817952b8be34, without and with this patch-series ("this
patch-series" includes [1]). Data was gathered on an Intel Sapphire Rapids
server: dual-socket, 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and a 525G SSD disk partition as swap. Core frequency was fixed at
2500MHz.

The kernel compilation test was run in tmpfs using the "allmodconfig" build,
so that significant swapout and readahead activity can be observed to
quantify decompress batching.

Other kernel configuration parameters:

zswap compressor : deflate-iaa
zswap allocator  : zsmalloc
vm.page-cluster  : 3,4

IAA "compression verification" is disabled and the async poll acomp
interface is used in the iaa_crypto driver (the defaults with this series).

Performance testing (Kernel compilation):
=========================================

For workloads that see a lot of swapout activity, we can benefit from
configuring 2 WQs per IAA device, with compress jobs from all same-socket
cores being distributed to the wq.1 of all IAAs on the socket, using the
"global_wq" developed in the compress batching patch-series [1].

Although this data includes IAA decompress batching (this patch-series), I
am listing it here to quantify the benefit of distributing compress jobs
among all IAAs. The kernel compilation test with "allmodconfig" is able to
quantify this well:

4K folios: deflate-iaa: kernel compilation
==========================================

------------------------------------------------------------------------------
                          mm-unstable-10-16-2024    zswap_load_batch with
                                                    IAA decompress batching
------------------------------------------------------------------------------
zswap compressor                     zstd           deflate-iaa
vm.compress-batchsize                 n/a                     1
vm.page-cluster                         3                     3
------------------------------------------------------------------------------
real_sec                           783.87                752.99
user_sec                        15,750.07             15,746.37
sys_sec                          6,522.32              5,638.16
Max_Res_Set_Size_KB             1,872,640             1,872,640
------------------------------------------------------------------------------
zswpout                        82,364,991           105,190,461
zswpin                         21,303,393            29,684,653
pswpout                                13                     1
pswpin                                 12                     1
pgmajfault                     17,114,339            24,034,146
swap_ra                         4,596,035             6,219,484
swap_ra_hit                     2,903,249             3,876,195
------------------------------------------------------------------------------

Progression of kernel compilation latency improvements with
compress/decompress batching:
============================================================

-------------------------------------------------------------------------------
              mm-unstable-10-16-2024     shrink_folio_     zswap_load_batch
                                         list()            w/ IAA decompress
                                         batching          batching
                                         of folios
-------------------------------------------------------------------------------
zswap compr        zstd  deflate-iaa     deflate-iaa     deflate-iaa  deflate-iaa
vm.compress-        n/a          n/a              32               1           32
batchsize
vm.page-              3            3               3               3            3
cluster
-------------------------------------------------------------------------------
real_sec         783.87       761.69          747.32          752.99       749.25
user_sec      15,750.07    15,716.69       15,728.39       15,746.37    15,741.71
sys_sec        6,522.32     5,725.28        5,399.44        5,638.16     5,482.12
Max_RSS_KB    1,872,640    1,870,848       1,874,432       1,872,640    1,872,640
zswpout      82,364,991   97,739,600     102,780,612     105,190,461  106,729,372
zswpin       21,303,393   27,684,166      29,016,252      29,684,653   30,717,819
pswpout              13          222             213               1           12
pswpin               12          209             202               1            8
pgmajfault   17,114,339   22,421,211      23,378,161      24,034,146   24,852,985
swap_ra       4,596,035    5,840,082       6,231,646       6,219,484    6,504,878
swap_ra_hit   2,903,249    3,682,444       3,940,420       3,876,195    4,092,852
-------------------------------------------------------------------------------

The last 2 columns of the latency reduction progression are as follows:

IAA decompress batching combined with distributing compress jobs to all
same-socket IAA devices:
=======================================================================

------------------------------------------------------------------------------
                         IAA shrink_folio_list() compress batching and
                             swapin_readahead() decompress batching

                         1WQ                   2WQ (distribute compress jobs)
                         1 local WQ (wq.0)     1 local WQ (wq.0) +
                         per IAA               1 global WQ (wq.1) per IAA
------------------------------------------------------------------------------
zswap compressor         deflate-iaa           deflate-iaa
vm.compress-batchsize             32                    32
vm.page-cluster                    4                     4
------------------------------------------------------------------------------
real_sec                      746.77                745.42
user_sec                   15,732.66             15,738.85
sys_sec                     5,384.14              5,247.86
Max_Res_Set_Size_KB        1,874,432             1,872,640
------------------------------------------------------------------------------
zswpout                  101,648,460           104,882,982
zswpin                    27,418,319            29,428,515
pswpout                          213                    22
pswpin                           207                     6
pgmajfault                21,896,616            23,629,768
swap_ra                    6,054,409             6,385,080
swap_ra_hit                3,791,628             3,985,141
------------------------------------------------------------------------------

I would greatly appreciate code review comments for this RFC series!

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537

Thanks,
Kanchana

Kanchana P Sridhar (7):
mm: zswap: Config variable to enable zswap loads with decompress
batching.
mm: swap: Add IAA batch decompression API
swap_crypto_acomp_decompress_batch().
pagevec: struct folio_batch changes for decompress batching interface.
mm: swap: swap_read_folio() can add a folio to a folio_batch if it is
in zswap.
mm: swap, zswap: zswap folio_batch processing with IAA decompression
batching.
mm: do_swap_page() calls swapin_readahead() zswap load batching
interface.
mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache
thresholds.
include/linux/pagevec.h | 13 +-
include/linux/swap.h | 7 +
include/linux/swap_slots.h | 7 +
include/linux/zswap.h | 65 +++++++++
mm/Kconfig | 13 ++
mm/memory.c | 187 +++++++++++++++++++------
mm/page_io.c | 61 ++++++++-
mm/shmem.c | 2 +-
mm/swap.h | 102 ++++++++++++--
mm/swap_state.c | 272 ++++++++++++++++++++++++++++++++++---
mm/swapfile.c | 2 +-
mm/zswap.c | 272 +++++++++++++++++++++++++++++++++++++
12 files changed, 927 insertions(+), 76 deletions(-)
--
2.27.0