Message-ID: <875xrigy21.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 30 Aug 2024 17:27:50 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
nphamcs@...il.com, chengming.zhou@...ux.dev, usamaarif642@...il.com,
ryan.roberts@....com, 21cnbao@...il.com, akpm@...ux-foundation.org,
nanhai.zou@...el.com, wajdi.k.feghali@...el.com, vinodh.gopal@...el.com
Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
Yosry Ahmed <yosryahmed@...gle.com> writes:
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@...el.com> wrote:
>>
>> Hi All,
>>
>> This patch-series enables zswap_store() to accept and store mTHP
>> folios. The most significant contribution in this series is from the
>> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
>> migrated to v6.11-rc3 in patch 2/3 of this series.
>>
>> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>>
>> Additionally, there is an attempt to modularize some of the functionality
>> in zswap_store(), to make it more amenable to supporting any-order
>> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
>> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
>> delete all offsets corresponding to a higher-order folio stored in zswap.
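>>
>> As a rough illustration, the resulting store loop might look like the
>> sketch below (only zswap_store_entry() and zswap_delete_stored_offsets()
>> are names from this series; the loop structure and the
>> zswap_compress_page() helper are hypothetical):
>>
>>     /* tree: the xarray mentioned above */
>>     for (i = 0; i < folio_nr_pages(folio); i++) {
>>         /* compress subpage i and build its zswap_entry */
>>         entry = zswap_compress_page(folio, i);    /* hypothetical */
>>         if (!entry || !zswap_store_entry(tree, entry)) {
>>             /* unwind every offset already stored for this folio */
>>             zswap_delete_stored_offsets(tree, folio);
>>             return false;    /* reject; caller falls back to swap */
>>         }
>>     }
>>     return true;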
>>
>> For accounting purposes, the patch-series adds per-order mTHP sysfs
>> "zswpout" counters that get incremented upon successful zswap_store of
>> an mTHP folio:
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
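>>
>> For illustration, the per-order increment after a successful store would
>> look roughly like this (count_mthp_stat() comes from Ryan Roberts' mTHP
>> stats work; the exact stat enum name is an assumption):
>>
>>     /* all subpages stored successfully: credit the folio's order */
>>     count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);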
>>
>> A new config variable, CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default),
>> enables/disables zswap storing of (m)THP. When disabled, zswap falls back
>> to rejecting the mTHP folio, which is then processed by the backing
>> swap device.
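>>
>> A minimal sketch of the gate this implies in zswap_store(), assuming a
>> boolean derived from the Kconfig default (the variable name below is an
>> assumption):
>>
>>     /* reject large folios when zswap mTHP storing is disabled */
>>     if (folio_test_large(folio) && !zswap_mthp_enabled)
>>         return false;    /* folio is processed by the backing swap device */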
>>
>> This patch-series is a precursor to ZSWAP compress batching of mTHP
>> swap-out and decompress batching of swap-ins based on swapin_readahead(),
>> using Intel IAA hardware acceleration, which we would like to submit in
>> a subsequent RFC patch-series, along with performance improvement data.
>>
>> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>>
>> Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
>> reviews and suggestions!
>>
>> Changes since v5:
>> =================
>> 1) Rebased to mm-unstable as of 8/29/2024,
>> commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
>> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>> enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>> suggestion to add a knob by which users can enable/disable this
>> change. Nhat, I hope this is along the lines of what you were
>> thinking.
>> 3) Added vm-scalability usemem data with 4K folios, with
>> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, which I gathered to make sure
>> there is no regression with this change.
>> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>> before/after, as suggested by Yosry, so we can understand the impact of
>> mTHPs being split into 4K folios in shrink_folio_list()
>> (CONFIG_THP_SWAP off) vs. not being split (CONFIG_THP_SWAP on) and
>> stored in zswap. Thanks Yosry for this suggestion.
>>
>> Changes since v4:
>> =================
>> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>> Nhat for the data reviews!).
>> 2) Rebased to mm-unstable from 8/27/2024,
>> commit b659edec079c90012cf8d05624e312d1062b8b87.
>> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>> CONFIG_MEMCG is not defined, to resolve build errors reported by the
>> kernel robot, as per Nhat's and Michal's suggestion to not require a
>> separate patch to fix the build errors (thanks both!).
>> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>> suggested by Yosry (Thanks Yosry!).
>> 5) Squashed the commits that define new mthp zswpout stat counters, and
>> invoke count_mthp_stat() after successful zswap_store()s; into a single
>> commit. Thanks Yosry for this suggestion!
>>
>> Changes since v3:
>> =================
>> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>> Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>> changes to count_mthp_stat() so that it's always defined, even when THP
>> is disabled. Barry, I have also made one other change in page_io.c
>> where count_mthp_stat() is called by count_swpout_vm_event(). I would
>> appreciate it if you can review this. Thanks!
>> Hopefully, this resolves the kernel robot build errors.
>>
>> Changes since v2:
>> =================
>> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>> as suggested by Ying Huang. Ying, I would appreciate it if you can
>> review the latest data. Thanks!
>> 2) Generated the base commit info in the patches to attempt to address
>> the kernel test robot build errors.
>> 3) No code changes to the individual patches themselves.
>>
>> Changes since RFC v1:
>> =====================
>>
>> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>> Thanks Barry!
>> 2) Addressed some of the code review comments that Nhat Pham provided in
>> Ryan's initial RFC [1]:
>> - Added a comment about the cgroup zswap limit checks occurring once per
>> folio at the beginning of zswap_store().
>> Nhat, Ryan, please do let me know if the comments convey the summary
>> from the RFC discussion. Thanks!
>> - Posted data on running the cgroup suite's zswap kselftest.
>> 3) Rebased to v6.11-rc3.
>> 4) Gathered performance data with usemem and the rebased patch-series.
>>
>>
>> Regression Testing:
>> ===================
>> I ran vm-scalability usemem with 70 processes and without mTHP, i.e.,
>> only 4K folios, with mm-unstable and with this patch-series. The main
>> goal was to make sure that there is no functional or performance
>> regression wrt the earlier zswap behavior for 4K folios when
>> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
>> pages goes through the newly added code path [zswap_store(),
>> zswap_store_page()].
>>
>> The data indicates there is no regression.
>>
>> ------------------------------------------------------------------------------
>> mm-unstable 8-28-2024 zswap-mTHP v6
>> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>> is not set
>> ------------------------------------------------------------------------------
>> ZSWAP compressor zstd deflate- zstd deflate-
>> iaa iaa
>> ------------------------------------------------------------------------------
>> Throughput (KB/s) 110,775 113,010 111,550 121,937
>> sys time (sec) 1,141.72 954.87 1,131.95 828.47
>> memcg_high 140,500 153,737 139,772 134,129
>> memcg_swap_high 0 0 0 0
>> memcg_swap_fail 0 0 0 0
>> pswpin 0 0 0 0
>> pswpout 0 0 0 0
>> zswpin 675 690 682 684
>> zswpout 9,552,298 10,603,271 9,566,392 9,267,213
>> thp_swpout 0 0 0 0
>> thp_swpout_ 0 0 0 0
>> fallback
>> pgmajfault 3,453 3,468 3,841 3,487
>> ZSWPOUT-64kB-mTHP n/a n/a 0 0
>> SWPOUT-64kB-mTHP 0 0 0 0
>> ------------------------------------------------------------------------------
>>
>>
>> Performance Testing:
>> ====================
>> Testing was done with the v6.11-rc3 mainline, without and with this
>> patch-series, on an Intel Sapphire Rapids server, dual-socket, 56 cores
>> per socket, 4 IAA devices per socket.
>>
>> The system has 503 GiB RAM, with 176 GiB ZRAM (35% of available RAM) as the
>> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
>> Core frequency was fixed at 2500 MHz.
>>
>> The vm-scalability "usemem" test was run in a cgroup whose memory.high
>> was fixed at 40G. There is no swap limit set for the cgroup. Following a
>> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
>> series [2], 70 usemem processes were run, each allocating and writing 1G of
>> memory:
>>
>> usemem --init-time -w -O -n 70 1g
>>
>> The vm/sysfs mTHP stats included with the performance data provide details
>> on the swapout activity to ZSWAP/swap.
>>
>> Other kernel configuration parameters:
>>
>> ZSWAP Compressors : zstd, deflate-iaa
>> ZSWAP Allocator : zsmalloc
>> SWAP page-cluster : 2
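>>
>> (A page-cluster value of 2 means swap-in readahead operates on up to
>> 2^2 = 4 pages at a time.)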
>>
>> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
>> IAA "compression verification" is enabled. Hence, each IAA compression
>> is decompressed internally by the "iaa_crypto" driver, the CRCs
>> returned by the hardware are compared, and errors are reported in case
>> of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
>> compared to the software compressors.
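>>
>> Conceptually, the verification path is (a sketch of the idea, not the
>> literal iaa_crypto code):
>>
>>     compress(src, &dst, &crc);           /* hardware returns a CRC */
>>     decompress(dst, &chk, &crc_verify);  /* internal decompress to verify */
>>     if (crc != crc_verify)
>>         return -EIO;    /* report the mismatch instead of storing bad data */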
>>
>> Throughput is derived by averaging the individual 70 processes' throughputs
>> reported by usemem. sys time is measured with perf. All data points are
>> averaged across 3 runs.
>>
>> Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
>> ===============================================================================
>>
>> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
>> in 64K/2M (m)THP being split, and only 4K folios processed by zswap.
>>
>> The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
>> results in 64K/2M (m)THP not being split, and being processed by zswap.
>>
>> 64KB mTHP (cgroup memory.high set to 40G):
>> ==========================================
>>
>> -------------------------------------------------------------------------------
>> v6.11-rc3 mainline zswap-mTHP Change wrt
>> Baseline Baseline
>> CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y
>> -------------------------------------------------------------------------------
>> ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
>> iaa iaa iaa
>> -------------------------------------------------------------------------------
>> Throughput (KB/s) 136,113 140,044 140,363 151,938 3% 8%
>> sys time (sec) 986.78 951.95 954.85 735.47 3% 23%
>> memcg_high 124,183 127,513 138,651 133,884
>> memcg_swap_high 0 0 0 0
>> memcg_swap_fail 619,020 751,099 0 0
>> pswpin 0 0 0 0
>> pswpout 0 0 0 0
>> zswpin 656 569 624 639
>> zswpout 9,413,603 11,284,812 9,453,761 9,385,910
>> thp_swpout 0 0 0 0
>> thp_swpout_ 0 0 0 0
>> fallback
>> pgmajfault 3,470 3,382 4,633 3,611
>> ZSWPOUT-64kB n/a n/a 590,768 586,521
>> SWPOUT-64kB 0 0 0 0
>> -------------------------------------------------------------------------------
>>
>>
>> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>> =======================================================
>>
>> ------------------------------------------------------------------------------
>> v6.11-rc3 mainline zswap-mTHP Change wrt
>> Baseline Baseline
>> CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y
>> ------------------------------------------------------------------------------
>> ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
>> iaa iaa iaa
>> ------------------------------------------------------------------------------
>> Throughput (KB/s) 164,220 172,523 165,005 174,536 0.5% 1%
>> sys time (sec) 855.76 686.94 801.72 676.65 6% 1%
>> memcg_high 14,628 16,247 14,951 16,096
>> memcg_swap_high 0 0 0 0
>> memcg_swap_fail 18,698 21,114 0 0
>> pswpin 0 0 0 0
>> pswpout 0 0 0 0
>> zswpin 663 665 5,333 781
>> zswpout 8,419,458 8,992,065 8,546,895 9,355,760
>> thp_swpout 0 0 0 0
>> thp_swpout_ 18,697 21,113 0 0
>> fallback
>> pgmajfault 3,439 3,496 8,139 3,582
>> ZSWPOUT-2048kB n/a n/a 16,684 18,270
>> SWPOUT-2048kB 0 0 0 0
>> -----------------------------------------------------------------------------
>>
>> We see improvements overall in throughput and sys time for zstd and
>> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
>>
>>
>> Case 2: Baseline with CONFIG_THP_SWAP enabled.
>> ==============================================
>>
>> In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
>> being stored by the backing swap device.
>>
>> The "after" represents data with this patch-series, that results in 64K/2M
>> (m)THP being processed by zswap.
>>
>> 64KB mTHP (cgroup memory.high set to 40G):
>> ==========================================
>>
>> ------------------------------------------------------------------------------
>> v6.11-rc3 mainline zswap-mTHP Change wrt
>> Baseline Baseline
>> ------------------------------------------------------------------------------
>> ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
>> iaa iaa iaa
>> ------------------------------------------------------------------------------
>> Throughput (KB/s) 161,496 156,343 140,363 151,938 -13% -3%
>> sys time (sec) 771.68 802.08 954.85 735.47 -24% 8%
>> memcg_high 111,223 110,889 138,651 133,884
>> memcg_swap_high 0 0 0 0
>> memcg_swap_fail 0 0 0 0
>> pswpin 16 16 0 0
>> pswpout 7,471,472 7,527,963 0 0
>> zswpin 635 605 624 639
>> zswpout 1,509 1,478 9,453,761 9,385,910
>> thp_swpout 0 0 0 0
>> thp_swpout_ 0 0 0 0
>> fallback
>> pgmajfault 3,616 3,430 4,633 3,611
>> ZSWPOUT-64kB n/a n/a 590,768 586,521
>> SWPOUT-64kB 466,967 470,498 0 0
>> ------------------------------------------------------------------------------
>>
>> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>> =======================================================
>>
>> ------------------------------------------------------------------------------
>> v6.11-rc3 mainline zswap-mTHP Change wrt
>> Baseline Baseline
>> ------------------------------------------------------------------------------
>> ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
>> iaa iaa iaa
>> ------------------------------------------------------------------------------
>> Throughput (KB/s) 192,164 194,643 165,005 174,536 -14% -10%
>> sys time (sec) 823.55 830.42 801.72 676.65 3% 19%
>> memcg_high 16,054 15,936 14,951 16,096
>> memcg_swap_high 0 0 0 0
>> memcg_swap_fail 0 0 0 0
>> pswpin 0 0 0 0
>> pswpout 8,629,248 8,628,907 0 0
>> zswpin 560 645 5,333 781
>> zswpout 1,416 1,503 8,546,895 9,355,760
>> thp_swpout 16,854 16,853 0 0
>> thp_swpout_ 0 0 0 0
>> fallback
>> pgmajfault 3,341 3,574 8,139 3,582
>> ZSWPOUT-2048kB n/a n/a 16,684 18,270
>> SWPOUT-2048kB 16,854 16,853 0 0
>> ------------------------------------------------------------------------------
>>
>> In the "Before" scenario, when zswap does not store mTHP, only allocations
>> count towards the cgroup memory limit. However, in the "After" scenario,
>> with the introduction of zswap_store() mTHP, both, allocations as well as
>> the zswap compressed pool usage from all 70 processes are counted towards
>> the memory limit. As a result, we see higher swapout activity in the
>> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
>> charge leads to more frequent memory.high breaches.
>>
>> This causes degradation in throughput and sys time with zswap mTHP, more so
>> in the case of zstd than deflate-iaa. Compression latency could play a part
>> in this: when there is more swapout activity happening, a slower compressor
>> would cause allocations to stall for any/all of the 70 processes.
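>>
>> (As a rough illustration: 70 processes x 1 GiB is ~70 GiB of anonymous
>> memory against a 40 GiB memory.high. Pages swapped out to ZRAM stop
>> being charged to the cgroup, but with zswap at an assumed 3:1
>> compression ratio, every GiB swapped out still charges ~1/3 GiB of
>> compressed pool to the cgroup, so reclaim must run again to make room
>> for the pool itself.)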
>
> We are basically comparing zram with zswap in this case, and it's not
> fair because, as you mentioned, the zswap compressed data is being
> accounted for while the zram compressed data isn't. I am not really
> sure how valuable these test results are. Even if we remove the cgroup
> accounting from zswap, we won't see an improvement; we should expect
> performance similar to zram's.
>
> I think the test results that are really valuable are case 1, where
> zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> it after this series.
>
> If we really want to compare CONFIG_THP_SWAP on before and after, it
> should be with SSD because that's a more conventional setup. In this
> case the users that have CONFIG_THP_SWAP=y only experience the
> benefits of zswap with this series.
Yes. I think so too.
> You mentioned experimenting with
> usemem to keep the memory allocated longer so that you're able to have
> a fair test with the small SSD swap setup. Did that work?
Looking forward to the results of this test too.
> I am hoping Nhat or Johannes would shed some light on whether they
> usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> Otherwise the testing results from case 1 should be sufficient.
I guess that even if 2MB THP swapping is not popular, 64KB mTHP
swapping to SSD or zswap looks much more appealing.
>>
>> In my opinion, even though the test setup does not provide an accurate
>> way for a direct before/after comparison (because of zswap usage being
>> counted in the cgroup, hence towards the memory.high), it still seems
>> reasonable for zswap_store to support (m)THP, so that further performance
>> improvements can be implemented.
>
> This is only referring to the results of case 2, right?
>
> Honestly, I wouldn't want to merge mTHP swapout support on its own
> just because it enables further performance improvements without
> having actual patches for them. But I don't think this captures the
> results accurately as it dismisses case 1 results (which I think are
> more reasonable).
--
Best Regards,
Huang, Ying