Message-ID: <SJ0PR11MB567808155B93421F4EC004FDC9962@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Thu, 29 Aug 2024 00:20:41 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
"ryan.roberts@....com" <ryan.roberts@....com>, "Huang, Ying"
<ying.huang@...el.com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@...gle.com>
> Sent: Wednesday, August 28, 2024 3:37 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; ryan.roberts@....com;
> Huang, Ying <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-
> foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@...el.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/3 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@....com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the
> > functionality in zswap_store(), to make it more amenable to supporting
> > any-order mTHPs. For instance, the function zswap_store_entry() stores
> > a zswap_entry in the xarray. Likewise, zswap_delete_stored_offsets()
> > can be used to delete all offsets corresponding to a higher-order
> > folio stored in zswap.
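> >
> > A minimal sketch of the resulting store loop (simplified: locking, LRU
> > handling and most error paths omitted; zswap_compress_page() is an
> > illustrative helper name, not necessarily the one in the patches):
> >
> >     static bool zswap_store(struct folio *folio)
> >     {
> >             long index, nr_pages = folio_nr_pages(folio);
> >
> >             for (index = 0; index < nr_pages; ++index) {
> >                     struct zswap_entry *entry;
> >
> >                     /* compress page 'index' of the folio into an entry */
> >                     if (!zswap_compress_page(folio, index, &entry))
> >                             goto err;
> >                     /* publish the entry in the xarray at its swap offset */
> >                     if (!zswap_store_entry(entry))
> >                             goto err;
> >             }
> >             return true;
> > err:
> >             /* delete all offsets already stored for this folio */
> >             zswap_delete_stored_offsets(folio);
> >             return false;
> >     }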
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
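> >
> > For instance, after a successful store, the counter bump could look
> > like this (a sketch; MTHP_STAT_ZSWPOUT is the stat item added by this
> > series):
> >
> >     count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);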
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> > swapin_readahead(), using Intel IAA hardware acceleration, which we
> > would like to submit in subsequent RFC patch-series, with performance
> > improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> > Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> > commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get()
> >    if CONFIG_MEMCG is not defined, to resolve build errors reported by
> >    the kernel test robot; as per Nhat's and Michal's suggestion to not
> >    require a separate patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> > suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define the new mTHP zswpout stat counters
> >    and invoke count_mthp_stat() after successful zswap_store()s into a
> >    single commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit
> >    8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > changes to count_mthp_stat() so that it's always defined, even when THP
> > is disabled. Barry, I have also made one other change in page_io.c
> > where count_mthp_stat() is called by count_swpout_vm_event(). I would
> > appreciate it if you can review this. Thanks!
> > Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> > as suggested by Ying Huang. Ying, I would appreciate it if you can
> > review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> > the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> >    in Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring
> >      once per folio at the beginning of zswap_store().
> >    Nhat, Ryan, please do let me know if the comments convey the summary
> >    from the RFC discussion. Thanks!
> > - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline,
> > without and with this patch-series, on an Intel Sapphire Rapids server:
> > dual-socket, 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176 GiB of ZRAM (35% of available RAM)
> > as the backing swap device for ZSWAP. zstd is configured as the ZRAM
> > compressor. Core frequency was fixed at 2500 MHz.
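> >
> > For reference, the ZRAM swap device was configured along these lines
> > (a sketch; device zram0 assumed):
> >
> >     echo zstd > /sys/block/zram0/comp_algorithm
> >     echo 176G > /sys/block/zram0/disksize
> >     mkswap /dev/zram0
> >     swapon /dev/zram0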
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G
> > of memory:
> >
> > usemem --init-time -w -O -n 70 1g
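> >
> > The cgroup setup is a sketch along these lines (cgroup v2; the cgroup
> > name "test" is illustrative):
> >
> >     mkdir /sys/fs/cgroup/test
> >     echo 40G > /sys/fs/cgroup/test/memory.high
> >     echo $$ > /sys/fs/cgroup/test/cgroup.procs
> >     usemem --init-time -w -O -n 70 1g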
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> > details on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> > ZSWAP Compressors : zstd, deflate-iaa
> > ZSWAP Allocator : zsmalloc
> > SWAP page-cluster : 2
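> >
> > These were configured through the usual knobs, e.g.:
> >
> >     echo zstd > /sys/module/zswap/parameters/compressor
> >     echo zsmalloc > /sys/module/zswap/parameters/zpool
> >     echo 2 > /proc/sys/vm/page-cluster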
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence, each IAA compression
> > is decompressed internally by the "iaa_crypto" driver, and the CRCs
> > returned by the hardware are compared; errors are reported in case of
> > mismatch. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
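> >
> > Verification is enabled through the iaa_crypto driver attribute, e.g.
> > (see the iaa_crypto documentation for the exact knob):
> >
> >     echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress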
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > -------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline        zswap-mTHP          Change wrt
> >                           Baseline                                  Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor         zstd  deflate-       zstd  deflate-    zstd  deflate-
> >                                     iaa                  iaa              iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)     161,496   156,343    140,363   151,938    -13%      -3%
> > sys time (sec)         771.68    802.08     954.85    735.47    -24%       8%
> > memcg_high            111,223   110,889    138,651   133,884
> > memcg_swap_high             0         0          0         0
> > memcg_swap_fail             0         0          0         0
> > pswpin                     16        16          0         0
> > pswpout             7,471,472 7,527,963          0         0
> > zswpin                    635       605        624       639
> > zswpout                 1,509     1,478  9,453,761 9,385,910
> > thp_swpout                  0         0          0         0
> > thp_swpout_fallback         0         0          0         0
> > pgmajfault              3,616     3,430      4,633     3,611
> > ZSWPOUT-64kB              n/a       n/a    590,768   586,521
> > SWPOUT-64kB           466,967   470,498          0         0
> > -------------------------------------------------------------------------------
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> > -------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline        zswap-mTHP          Change wrt
> >                           Baseline                                  Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor         zstd  deflate-       zstd  deflate-    zstd  deflate-
> >                                     iaa                  iaa              iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)     192,164   194,643    165,005   174,536    -14%     -10%
> > sys time (sec)         823.55    830.42     801.72    676.65      3%      19%
> > memcg_high             16,054    15,936     14,951    16,096
> > memcg_swap_high             0         0          0         0
> > memcg_swap_fail             0         0          0         0
> > pswpin                      0         0          0         0
> > pswpout             8,629,248 8,628,907          0         0
> > zswpin                    560       645      5,333       781
> > zswpout                 1,416     1,503  8,546,895 9,355,760
> > thp_swpout             16,854    16,853          0         0
> > thp_swpout_fallback         0         0          0         0
> > pgmajfault              3,341     3,574      8,139     3,582
> > ZSWPOUT-2048kB            n/a       n/a     16,684    18,270
> > SWPOUT-2048kB          16,854    16,853          0         0
> > -------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted
> towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
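> >
> > For reference, the mechanism behind this: when an entry is stored,
> > zswap charges the compressed size to the memcg (simplified sketch of
> > the existing charging call):
> >
> >     if (objcg)
> >             obj_cgroup_charge_zswap(objcg, entry->length);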
> >
> > This causes degradation in throughput and sys time with zswap mTHP,
> > more so in the case of zstd than deflate-iaa. Compress latency could
> > play a part in this: when there is more swapout activity happening, a
> > slower compressor would cause allocations to stall for any/all of the
> > 70 processes.
> >
> > In my opinion, even though the test setup does not provide an accurate
> > way to make a direct before/after comparison (because zswap usage is
> > counted in the cgroup, hence towards memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further
> > performance improvements can be implemented.
>
> Are you saying that in the "Before" data we end up skipping zswap
> completely because of using mTHPs?
That's right, Yosry.
>
> Does it make more sense to turn CONFIG_THP_SWAP off in the "Before" data
> to force the mTHPs to be split and for the data to be stored in zswap?
> This would be a more fair Before/After comparison where the memory
> goes to zswap in both cases, but "Before" has to be split because of
> zswap's lack of support for mTHP. I assume most setups relying on
> zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> Nhat, is this something you can share?

We could do this; however, I am not sure whether turning off CONFIG_THP_SWAP
would have other side-effects, such as disabling mm code paths outside of
zswap that are intended as mTHP optimizations, which could again skew the
before/after comparisons.

Will wait for Nhat's comments as well.

Thanks,
Kanchana