Message-ID: <SJ0PR11MB567893E61CB522991ED1379EC96C2@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Fri, 20 Sep 2024 01:41:02 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
	"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
	<ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
	"21cnbao@...il.com" <21cnbao@...il.com>, "akpm@...ux-foundation.org"
	<akpm@...ux-foundation.org>, "Zou, Nanhai" <nanhai.zou@...el.com>, "Feghali,
 Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
	<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
	<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@...gle.com>
> Sent: Thursday, August 29, 2024 3:49 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; Huang, Ying
> <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-foundation.org;
> Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@...el.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@....com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. When disabled, zswap will
> > fall back to rejecting the mTHP folio, which is then processed by the
> > backing swap device.
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
> > reviews and suggestions!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> > folios with mm-unstable and with this patch-series. The main goal was
> > to make sure that there is no functional or performance regression
> > wrt the earlier zswap behavior for 4K folios, when
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> > pages goes through the newly added code path [zswap_store(),
> > zswap_store_page()].
> >
> > The data indicates there is no regression.
> >
> >  ------------------------------------------------------------------------------
> >                      mm-unstable 8-28-2024                        zswap-mTHP v6
> >                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                                      is not set
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
> >                                        iaa                                  iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    110,775      113,010               111,550        121,937
> >  sys time (sec)      1,141.72       954.87              1,131.95         828.47
> >  memcg_high           140,500      153,737               139,772        134,129
> >  memcg_swap_high            0            0                     0              0
> >  memcg_swap_fail            0            0                     0              0
> >  pswpin                     0            0                     0              0
> >  pswpout                    0            0                     0              0
> >  zswpin                   675          690                   682            684
> >  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
> >  thp_swpout                 0            0                     0              0
> >  thp_swpout_                0            0                     0              0
> >   fallback
> >  pgmajfault             3,453        3,468                 3,841          3,487
> >  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
> >  SWPOUT-64kB-mTHP           0            0                     0              0
> >  ------------------------------------------------------------------------------
> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176 GiB ZRAM (35% of available RAM) as the
> > backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> > Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> > Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
> > ===============================================================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> > in 64K/2M (m)THP being split, and only 4K folios being processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, with this patch-series, which
> > results in 64K/2M (m)THP not being split, and being processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                  Baseline                               Baseline
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
> >  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
> >  memcg_high          124,183      127,513     138,651     133,884
> >  memcg_swap_high           0            0           0           0
> >  memcg_swap_fail     619,020      751,099           0           0
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  656          569         624         639
> >  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            3,470        3,382       4,633       3,611
> >  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
> >                                  Baseline                              Baseline
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
> >  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
> >  memcg_high            14,628     16,247       14,951      16,096
> >  memcg_swap_high            0          0            0           0
> >  memcg_swap_fail       18,698     21,114            0           0
> >  pswpin                     0          0            0           0
> >  pswpout                    0          0            0           0
> >  zswpin                   663        665        5,333         781
> >  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
> >  thp_swpout                 0          0            0           0
> >  thp_swpout_           18,697     21,113            0           0
> >   fallback
> >  pgmajfault             3,439      3,496        8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
> >  SWPOUT-2048kB              0          0            0           0
> >  -----------------------------------------------------------------------------
> >
> > We see improvements overall in throughput and sys time for zstd and
> > deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> >
> >
> > Case 2: Baseline with CONFIG_THP_SWAP enabled.
> > ==============================================
> >
> > In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
> > being stored by the backing swap device.
> >
> > The "after" represents data with this patch-series, that results in 64K/2M
> > (m)THP being processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  ------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
> >                                Baseline                                Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
> >                                     iaa                     iaa             iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
> >  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
> >  memcg_high          111,223    110,889     138,651     133,884
> >  memcg_swap_high           0          0           0           0
> >  memcg_swap_fail           0          0           0           0
> >  pswpin                   16         16           0           0
> >  pswpout           7,471,472  7,527,963           0           0
> >  zswpin                  635        605         624         639
> >  zswpout               1,509      1,478   9,453,761   9,385,910
> >  thp_swpout                0          0           0           0
> >  thp_swpout_               0          0           0           0
> >   fallback
> >  pgmajfault            3,616      3,430       4,633       3,611
> >  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
> >  SWPOUT-64kB         466,967    470,498           0           0
> >  ------------------------------------------------------------------------------
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                 Baseline                               Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
> >  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
> >  memcg_high            16,054     15,936      14,951      16,096
> >  memcg_swap_high            0          0           0           0
> >  memcg_swap_fail            0          0           0           0
> >  pswpin                     0          0           0           0
> >  pswpout            8,629,248  8,628,907           0           0
> >  zswpin                   560        645       5,333         781
> >  zswpout                1,416      1,503   8,546,895   9,355,760
> >  thp_swpout            16,854     16,853           0           0
> >  thp_swpout_                0          0           0           0
> >   fallback
> >  pgmajfault             3,341      3,574       8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
> >  SWPOUT-2048kB         16,854     16,853           0           0
> >  ------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both allocations and
> > the zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP, more so
> > in case of zstd than deflate-iaa. Compress latency could play a part in
> > this - when there is more swapout activity happening, a slower compressor
> > would cause allocations to stall for any/all of the 70 processes.
> 
> We are basically comparing zram with zswap in this case, and it's not
> fair because, as you mentioned, the zswap compressed data is being
> accounted for while the zram compressed data isn't. I am not really
> sure how valuable these test results are. Even if we remove the cgroup
> accounting from zswap, we won't see an improvement, we should expect a
> similar performance to zram.
> 
> I think the test results that are really valuable are case 1, where
> zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> it after this series.
> 
> If we really want to compare CONFIG_THP_SWAP on before and after, it
> should be with SSD because that's a more conventional setup. In this
> case the users that have CONFIG_THP_SWAP=y only experience the
> benefits of zswap with this series. You mentioned experimenting with
> usemem to keep the memory allocated longer so that you're able to have
> a fair test with the small SSD swap setup. Did that work?

Thanks, these are good points. I ran this experiment with mm-unstable 9-17-2024,
commit 248ba8004e76eb335d7e6079724c3ee89a011389.

Data is the average of 3 runs of the vm-scalability "usemem" test.

 4G SSD backing zswap, each process sleeps before exiting
 ========================================================

 64KB mTHP (cgroup memory.high set to 60G, no swap limit):
 =========================================================
 CONFIG_THP_SWAP=Y
 Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
 for zswap.

 Experiment 1: Each process sleeps for 0 sec after allocating memory
 (usemem --init-time -w -O --sleep 0 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"      (sleep 0)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
 sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
 memcg_high            3,503        3,769      44,425      27,154
 memcg_swap_fail           0            0     115,814     141,936
 pswpin                   17            0           0           0
 pswpout             370,853      393,232           0           0
 zswpin                  693          123         666         667
 zswpout               1,484          123   1,366,680   1,199,645
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,384        2,951       3,656       3,468
 ZSWPOUT-64kB            n/a          n/a      82,940      73,121
 SWPOUT-64kB          23,178       24,577           0           0
 -------------------------------------------------------------------------------


 Experiment 2: Each process sleeps for 10 sec after allocating memory
 (usemem --init-time -w -O --sleep 10 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"     (sleep 10)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
 sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%
 memcg_high          169,450      188,700     143,691     177,887
 memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
 pswpin                   17           16           0           0
 pswpout           1,154,779    1,210,485           0           0
 zswpin                  711          659       1,016         736
 zswpout              70,212       50,128   1,235,560   1,275,917
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            6,120        6,291       8,789       6,474
 ZSWPOUT-64kB            n/a          n/a      67,587      68,912
 SWPOUT-64kB          72,174       75,655           0           0
 -------------------------------------------------------------------------------
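For reference, the "Change wrt Baseline" columns in the two tables above can
be reproduced with a few lines of Python. This is only a sketch of the
arithmetic: the helper names are hypothetical and not part of any tool, and
the sign convention is an assumption inferred from the tables (a positive
number is an improvement, i.e. higher throughput or lower sys time).

```python
def throughput_change(before, after):
    """Percent change in throughput; positive means 'after' is faster."""
    return round(100.0 * (after - before) / before)


def sys_time_change(before, after):
    """Percent change in sys time; positive means 'after' spends less time."""
    return round(100.0 * (before - after) / before)


# (before, after) values copied from the "sleep 0" and "sleep 10" tables.
rows = {
    "sleep 0  zstd":        ((296_684, 359_722), (92.67, 251.06)),
    "sleep 0  deflate-iaa": ((274_207, 390_162), (93.33, 237.56)),
    "sleep 10 zstd":        ((86_744, 157_528), (308.87, 477.55)),
    "sleep 10 deflate-iaa": ((93_730, 113_110), (315.29, 629.98)),
}

for name, (tput, sys_time) in rows.items():
    print(f"{name:22s} throughput {throughput_change(*tput):4d}%  "
          f"sys time {sys_time_change(*sys_time):5d}%")
```

The printed percentages match the last two columns of each table.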


Conclusions from the experiments:
=================================
1) zswap-mTHP improves throughput as compared to the baseline, for zstd and
   deflate-iaa.

2) Yosry's theory is proved correct in the 4G constrained swap setup.
   When the processes are constrained to sleep 10 sec after allocating
   memory, thereby keeping the memory allocated longer, the "Baseline" or
   "before" with mTHP getting stored in SSD shows a degradation of up to
   71% in throughput (zstd) and up to 238% in sys time (deflate-iaa), as
   compared to the "Baseline" with sleep 0 that benefits from serialization
   of disk IO not allowing all processes to allocate memory at the same
   time.

3) In the 4G SSD "sleep 0" case, zswap-mTHP shows an increase in sys time
   due to the cgroup charging and consequently more memory.high breaches
   and swapout activity.

   However, the "sleep 10" case's sys time seems to degrade less, and the
   memory.high breaches and swapout activity are roughly similar between
   the before/after (confirming Yosry's hypothesis). Further, the
   memcg_swap_fail activity in the "after" scenario is almost 2X that of
   the "before". This indicates failure to obtain swap offsets, resulting
   in the folio remaining active in memory.

   I tried to better understand this through the 64k mTHP swpout_fallback
   stats in the "sleep 10" zstd experiments:

   --------------------------------------------------------------
                                           "before"       "after"
   --------------------------------------------------------------
   64k mTHP swpout_fallback                 627,308       897,407
   64k folio swapouts                        72,174        67,587
   [p|z]swpout events due to 64k mTHP     1,154,779     1,081,397
   4k folio swapouts                         70,212       154,163
   --------------------------------------------------------------

   The data indicates a higher number of 64k folio swpout_fallback events
   with zswap-mTHP, which correlates with the higher memcg_swap_fail counts
   and 4k folio swapouts with zswap-mTHP. Could the root cause be
   fragmentation of the swap space due to zswap swapout being faster than
   SSD swapout?
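The rows of the swpout_fallback table above are related by simple arithmetic,
sketched below in Python. The assumptions here are not stated in the table
itself: 4K base pages, so that one 64K folio accounts for 16 [p|z]swpout
events, and the "after" 4k folio swapouts being zswpout minus the events
attributed to 64K mTHP. Because the counters are averages over 3 runs, the
16x estimate only approximately matches the reported event counts.

```python
PAGES_PER_64K_FOLIO = 64 // 4  # assumes 4K base pages


def estimated_events(folio_swapouts):
    """Approximate per-page swpout events produced by 64K folio swapouts."""
    return folio_swapouts * PAGES_PER_64K_FOLIO


# "sleep 10" zstd numbers copied from the tables above.
before = {"folios_64k": 72_174, "mthp_events": 1_154_779, "zswpout": 70_212}
after = {"folios_64k": 67_587, "mthp_events": 1_081_397, "zswpout": 1_235_560}

# In the "after" case zswap stores both 64K and 4K folios, so the 4K share
# is whatever remains of zswpout once the 64K mTHP events are subtracted.
after_4k_swapouts = after["zswpout"] - after["mthp_events"]

for name, col in (("before", before), ("after", after)):
    est = estimated_events(col["folios_64k"])
    print(f"{name}: estimated {est}, reported {col['mthp_events']}, "
          f"difference {est - col['mthp_events']}")
print("after: 4k folio swapouts =", after_4k_swapouts)
```

In the "before" case the 64K folios go to the SSD, so the 4k folio swapouts
row is simply the zswpout counter (70,212).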

> 
> I am hoping Nhat or Johannes would shed some light on whether they
> usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> Otherwise the testing results from case 1 should be sufficient.
> 
> >
> > In my opinion, even though the test set up does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
> 
> This is only referring to the results of case 2, right?

To begin with, yes. With IAA batching, we can submit, say, up to 8 pages
in an mTHP for parallel compression in hardware. We have also implemented
batching of any-order folios (e.g. a mix of 4K/16K/64K/.. folios) reclaimed in
the shrink_folio_list() -- swap_writepage() path, which demonstrates
performance and memory savings improvements with IAA.
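
To make the batching idea concrete, here is a toy Python sketch of how the
pages of one folio could be grouped into batches of up to 8 for parallel
compression. It is not the IAA driver or zswap code: the function name and
the MAX_BATCH constant are purely illustrative, and 4K base pages are assumed.

```python
MAX_BATCH = 8  # illustrative only: "say, up to 8 pages" per batch


def page_batches(nr_pages, max_batch=MAX_BATCH):
    """Yield (first_page_index, page_count) for each batch within one folio."""
    for start in range(0, nr_pages, max_batch):
        yield start, min(max_batch, nr_pages - start)


# A 64K folio has 16 pages of 4K each, which gives two full batches.
print(list(page_batches(64 // 4)))
# A 16K folio has 4 pages, which gives a single partial batch.
print(list(page_batches(16 // 4)))
```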

> 
> Honestly, I wouldn't want to merge mTHP swapout support on its own
> just because it enables further performance improvements without
> having actual patches for them. But I don't think this captures the
> results accurately as it dismisses case 1 results (which I think are
> more reasonable).

Based on the latest set of data, we do see consistent throughput improvements
with zswap mTHP swapout using zstd and deflate-iaa, as compared to a baseline
where mTHP are swapped to disk (CONFIG_THP_SWAP=y).

The Intel IAA batching related patches would enable the additional performance
improvements I was referring to, only for configurations that have the hardware
acceleration, without impacting performance of software compressors.

Hence, I was thinking we could separate the patch-sets as:

1) zswap-mTHP swapout that could benefit all compressors: this patch series.
2) Additional IAA batching performance improvements that would only
   benefit users of IAA.

I would appreciate your thoughts on this.

Thanks,
Kanchana

> 
> Thanks
