Message-ID: <SJ0PR11MB5678CB95FA021718B184CFF4C9812@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Fri, 16 Aug 2024 17:50:25 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: "Huang, Ying" <ying.huang@...el.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"nphamcs@...il.com" <nphamcs@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
Hi Ying,
> -----Original Message-----
> From: Huang, Ying <ying.huang@...el.com>
> Sent: Friday, August 16, 2024 2:03 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; nphamcs@...il.com;
> ryan.roberts@....com; 21cnbao@...il.com; akpm@...ux-foundation.org;
> Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
>
> Kanchana P Sridhar <kanchana.p.sridhar@...el.com> writes:
>
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@....com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs.
> >
> > For instance, the determination of whether a folio is same-filled is
> > based on mapping an index into the folio to derive the page. Likewise,
> > there is a function "zswap_store_entry" added to store a zswap_entry in
> > the xarray.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> > Ryan's initial RFC [1]:
> > - Added a comment about the cgroup zswap limit checks occurring once per
> > folio at the beginning of zswap_store().
> > Nhat, Ryan, please do let me know if the comments convey the summary
> > from the RFC discussion. Thanks!
> > - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the backing
> > swap device. Core frequency was fixed at 2500MHz.
>
> I don't think that this is a reasonable test configuration; there's no
> benefit to using ZSWAP+ZRAM. We should use a normal SSD as the backing swap
> device.
Thanks for this suggestion. Sure, I will gather data using SSD instead of ZRAM
as the backing swap device.
>
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> > usemem --init-time -w -O -n 70 1g
> >
> > Other kernel configuration parameters:
> >
> > ZSWAP Compressor : LZ4, DEFLATE-IAA
> > ZSWAP Allocator : ZSMALLOC
> > ZRAM Compressor : LZO-RLE
> > SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows:
> >
> > 64KB mTHP:
> > ==========
> > ------------------------------------------------------------------
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> > | | | KB/s | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | ZRAM lzo-rle | 118,928 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 82,665 | -30% |
>
> Because the test configuration isn't reasonable, the performance drop
> isn't reasonable either. We should compare zswap+SSD w/o mTHP
> zswap against zswap+SSD w/ mTHP zswap. I think that there should be
> a performance improvement for that.
Sure, I will gather and post the data with these two configurations.
Thanks,
Kanchana
>
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 176,210 | 48% |
> > |------------------------------------------------------------------|
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> > | | | sec | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | ZRAM lzo-rle | 1,032.20 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 1,854.51 | -80% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 582.71 | 44% |
> > ------------------------------------------------------------------
> >
> > -----------------------------------------------------------------------
> > | VMSTATS, mTHP ZSWAP stats, | v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> > | mTHP ZRAM stats: | mainline | Store | Store |
> > | | | lz4 | deflate-iaa |
> > |-----------------------------------------------------------------------|
> > | pswpin | 16 | 0 | 0 |
> > | pswpout | 7,770,720 | 0 | 0 |
> > | zswpin | 547 | 695 | 579 |
> > | zswpout | 1,394 | 15,462,778 | 7,284,554 |
> > |-----------------------------------------------------------------------|
> > | thp_swpout | 0 | 0 | 0 |
> > | thp_swpout_fallback | 0 | 0 | 0 |
> > | pgmajfault | 3,786 | 3,541 | 3,367 |
> > |-----------------------------------------------------------------------|
> > | hugepages-64kB/stats/zswpout | | 966,328 | 455,196 |
> > |-----------------------------------------------------------------------|
> > | hugepages-64kB/stats/swpout | 485,670 | 0 | 0 |
> > -----------------------------------------------------------------------
> >
> >
> > 2MB PMD-THP/2048K mTHP:
> > =======================
> > ------------------------------------------------------------------
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> > | | | KB/s | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | ZRAM lzo-rle | 177,340 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 84,030 | -53% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 185,691 | 5% |
> > |------------------------------------------------------------------|
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> > | | | sec | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | ZRAM lzo-rle | 876.29 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 1,740.55 | -99% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 650.33 | 26% |
> > ------------------------------------------------------------------
> >
> > -------------------------------------------------------------------------
> > | VMSTATS, mTHP ZSWAP stats, | v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> > | mTHP ZRAM stats: | mainline | Store | Store |
> > | | | lz4 | deflate-iaa |
> > |-------------------------------------------------------------------------|
> > | pswpin | 0 | 0 | 0 |
> > | pswpout | 8,628,224 | 0 | 0 |
> > | zswpin | 678 | 22,733 | 1,641 |
> > | zswpout | 1,481 | 14,828,597 | 9,404,937 |
> > |-------------------------------------------------------------------------|
> > | thp_swpout | 16,852 | 0 | 0 |
> > | thp_swpout_fallback | 0 | 0 | 0 |
> > | pgmajfault | 3,467 | 25,550 | 4,800 |
> > |-------------------------------------------------------------------------|
> > | hugepages-2048kB/stats/zswpout | | 28,924 | 18,366 |
> > |-------------------------------------------------------------------------|
> > | hugepages-2048kB/stats/swpout | 16,852 | 0 | 0 |
> > -------------------------------------------------------------------------
> >
> > As expected, in the "Before" experiment, there are relatively fewer
> > swapouts because ZRAM utilization is not accounted in the cgroup.
> >
> > With the introduction of zswap_store mTHP, the "After" data reflects the
> > higher swapout activity, and consequent throughput/sys time degradation
> > when LZ4 is used as the zswap compressor. However, we observe considerable
> > throughput and sys time improvement in the "After" data when DEFLATE-IAA
> > is the zswap compressor. This observation holds for 64K mTHP and 2MB THP
> > experiments. The fewer swap-outs and major page-faults can be attributed
> > to IAA's higher compression ratio and better compress latency, and in
> > turn result in better throughput and sys time.
> >
> > Our goal is to improve ZSWAP mTHP store performance using batching. With
> > Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
> > additional RFC series), we are able to demonstrate significant
> > performance improvements and memory savings with IAA as compared to
> > software compressors.
> >
> > cgroup zswap kselftest:
> > =======================
> >
> > "Before":
> > =========
> > Test run with v6.11-rc3 and no code changes:
> > mTHP 64K set to 'always'
> > zswap compressor set to 'lz4'
> > page-cluster = 3
> >
> > zswap shrinker_enabled = N:
> > ---------------------------
> > ok 1 test_zswap_usage
> > ok 2 test_swapin_nozswap
> > # at least 24MB should be brought back from zswap
> > not ok 3 test_zswapin
> > # zswpwb_after is 0 while wb is enabled
> > not ok 4 test_zswap_writeback_enabled
> > # Failed to reclaim all of the requested memory
> > not ok 5 test_zswap_writeback_disabled
> > ok 6 # SKIP test_no_kmem_bypass
> > ok 7 test_no_invasive_cgroup_shrink
> >
> > zswap shrinker_enabled = Y:
> > ---------------------------
> > ok 1 test_zswap_usage
> > ok 2 test_swapin_nozswap
> > # at least 24MB should be brought back from zswap
> > not ok 3 test_zswapin
> > # zswpwb_after is 0 while wb is enabled
> > not ok 4 test_zswap_writeback_enabled
> > # Failed to reclaim all of the requested memory
> > not ok 5 test_zswap_writeback_disabled
> > ok 6 # SKIP test_no_kmem_bypass
> > not ok 7 test_no_invasive_cgroup_shrink
> >
> > "After":
> > ========
> > Test run with this patch-series and v6.11-rc3:
> > mTHP 64K set to 'always'
> > zswap compressor set to 'deflate-iaa'
> > page-cluster = 3
> >
> > zswap shrinker_enabled = N:
> > ---------------------------
> > ok 1 test_zswap_usage
> > ok 2 test_swapin_nozswap
> > ok 3 test_zswapin
> > ok 4 test_zswap_writeback_enabled
> > ok 5 test_zswap_writeback_disabled
> > ok 6 # SKIP test_no_kmem_bypass
> > ok 7 test_no_invasive_cgroup_shrink
> >
> > zswap shrinker_enabled = Y:
> > ---------------------------
> > ok 1 test_zswap_usage
> > ok 2 test_swapin_nozswap
> > # at least 24MB should be brought back from zswap
> > not ok 3 test_zswapin
> > ok 4 test_zswap_writeback_enabled
> > ok 5 test_zswap_writeback_disabled
> > ok 6 # SKIP test_no_kmem_bypass
> > not ok 7 test_no_invasive_cgroup_shrink
> >
> > I haven't taken an in-depth look into the cgroup zswap tests, but it
> > looks like the results with the patch-series are no worse than without,
> > and in some cases better (not exactly sure why, this needs more
> > analysis).
> >
> > I would greatly appreciate your code review comments and suggestions!
> >
> > Thanks,
> > Kanchana
> >
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@....com/
> >
> >
> > Kanchana P Sridhar (4):
> > mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
> > mm: zswap: zswap_store() extended to handle mTHP folios.
> > mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
> > mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
> >
> > include/linux/huge_mm.h | 1 +
> > mm/huge_memory.c | 2 +
> > mm/page_io.c | 7 ++
> > mm/zswap.c | 238 +++++++++++++++++++++++++++++-----------
> > 4 files changed, 184 insertions(+), 64 deletions(-)
>
> --
> Best Regards,
> Huang, Ying
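
For readers who want to cross-check the figures quoted above, the following
is a minimal Python sketch, not part of the patch-series. It recomputes the
"Improvement" columns from the reported throughput and sys time, and checks
that the mainline per-order swpout counters multiplied by the pages per
folio match pswpout. The helper names (improvement_pct, pages_per_folio,
report) are hypothetical, and a 4 KiB base page size is assumed; the thread
does not state the page size.

```python
# Sanity check of the figures quoted in the tables of this thread.
# Assumption: 4 KiB base pages. All helper names are hypothetical.


def improvement_pct(baseline, value, higher_is_better=True):
    """Signed improvement of value over baseline, in percent."""
    delta = value - baseline if higher_is_better else baseline - value
    return delta / baseline * 100.0


def pages_per_folio(folio_kb, page_kb=4):
    """Number of base pages covered by one folio of folio_kb KiB."""
    return folio_kb // page_kb


# (mainline baseline, zswap lz4, zswap deflate-iaa) as reported per experiment.
THROUGHPUT_KBPS = {"64K": (118_928, 82_665, 176_210),
                   "2M": (177_340, 84_030, 185_691)}
SYS_TIME_SEC = {"64K": (1032.20, 1854.51, 582.71),
                "2M": (876.29, 1740.55, 650.33)}


def report():
    """Return (size, compressor, throughput %, sys time %) rows, rounded."""
    rows = []
    for size in ("64K", "2M"):
        t_base, t_lz4, t_iaa = THROUGHPUT_KBPS[size]
        s_base, s_lz4, s_iaa = SYS_TIME_SEC[size]
        for name, tput, sys_t in (("lz4", t_lz4, s_lz4),
                                  ("deflate-iaa", t_iaa, s_iaa)):
            rows.append((size, name,
                         round(improvement_pct(t_base, tput)),
                         # Lower sys time is better, so flip the sign.
                         round(improvement_pct(s_base, sys_t, False))))
    return rows


if __name__ == "__main__":
    for size, name, tput_pct, sys_pct in report():
        print(f"{size:>3} {name:<12} throughput {tput_pct:+d}%  sys time {sys_pct:+d}%")
    # Mainline: per-order swpout folios * pages per folio vs. pswpout.
    print("64K swpout matches pswpout:",
          485_670 * pages_per_folio(64) == 7_770_720)
    print("2M swpout matches pswpout: ",
          16_852 * pages_per_folio(2048) == 8_628_224)
```

In the zswap-mTHP-Store columns, the same multiplication lands slightly below
the global zswpout count (for example 966,328 * 16 = 15,461,248 against
15,462,778), which would be consistent with a small number of order-0 stores
being counted only in the global counter; that reading is an inference from
the numbers, not something stated in the thread.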