Message-ID: <SJ0PR11MB567801EE747E2810EF183C73C9892@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Sat, 24 Aug 2024 06:21:15 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Nhat Pham <nphamcs@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"ryan.roberts@....com" <ryan.roberts@....com>, "Huang, Ying"
<ying.huang@...el.com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
Hi Nhat,
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Sent: Wednesday, August 21, 2024 12:08 PM
> To: Nhat Pham <nphamcs@...il.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; ryan.roberts@....com;
> Huang, Ying <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-
> foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> Hi Nhat,
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@...il.com>
> > Sent: Wednesday, August 21, 2024 7:43 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > hannes@...xchg.org; yosryahmed@...gle.com; ryan.roberts@....com;
> > Huang, Ying <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-
> > foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@...el.com> wrote:
> > >
> > > Hi All,
> > >
> > > This patch-series enables zswap_store() to accept and store mTHP
> > > folios. The most significant contribution in this series is from the
> > > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > > migrated to v6.11-rc3 in patch 2/4 of this series.
> > >
> > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@....com/T/#u
> > >
> > > Additionally, there is an attempt to modularize some of the functionality
> > > in zswap_store(), to make it more amenable to supporting any-order
> > > mTHPs.
> > >
> > > For instance, the determination of whether a folio is same-filled is
> > > based on mapping an index into the folio to derive the page. Likewise,
> > > there is a function "zswap_store_entry" added to store a zswap_entry in
> > > the xarray.
> > >
> > > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > > "zswpout" counters that get incremented upon successful zswap_store of
> > > an mTHP folio:
> > >
> > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> > >
> > > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > > using Intel IAA hardware acceleration, which we would like to submit in
> > > subsequent RFC patch-series, with performance improvement data.
> > >
> > > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> > >
> > > Changes since v3:
> > > =================
> > > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > >    changes to count_mthp_stat() so that it's always defined, even when THP
> > >    is disabled. Barry, I have also made one other change in page_io.c
> > >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> > >    appreciate it if you can review this. Thanks!
> > >    Hopefully this should resolve the kernel robot build errors.
> > >
> > > Changes since v2:
> > > =================
> > > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> > >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> > >    review the latest data. Thanks!
> > > 2) Generated the base commit info in the patches to attempt to address
> > > the kernel test robot build errors.
> > > 3) No code changes to the individual patches themselves.
> > >
> > > Changes since RFC v1:
> > > =====================
> > >
> > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > > Thanks Barry!
> > > 2) Addressed some of the code review comments that Nhat Pham provided in
> > >    Ryan's initial RFC [1]:
> > >    - Added a comment about the cgroup zswap limit checks occurring once per
> > >      folio at the beginning of zswap_store().
> > >    Nhat, Ryan, please do let me know if the comments convey the summary
> > >    from the RFC discussion. Thanks!
> > > - Posted data on running the cgroup suite's zswap kselftest.
> > > 3) Rebased to v6.11-rc3.
> > > 4) Gathered performance data with usemem and the rebased patch-series.
> > >
> > > Performance Testing:
> > > ====================
> > > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > > and with this patch-series, on an Intel Sapphire Rapids server,
> > > dual-socket 56 cores per socket, 4 IAA devices per socket.
> > >
> > > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> > > ZSWAP. Core frequency was fixed at 2500MHz.
> > >
> > > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > > was fixed. Following a similar methodology as in Ryan Roberts'
> > > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > > run, each allocating and writing 1G of memory:
> > >
> > > usemem --init-time -w -O -n 70 1g
> > >
> > > Since I was constrained to get the 70 usemem processes to generate
> > > swapout activity with the 4G SSD, I ended up using different cgroup
> > > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> > >
> > > 64K mTHP experiments: cgroup memory fixed at 60G
> > > 2M THP experiments : cgroup memory fixed at 55G
> > >
> > > The vm/sysfs stats included after the performance data provide details
> > > on the swapout activity to SSD/ZSWAP.
> > >
> > > Other kernel configuration parameters:
> > >
> > > ZSWAP Compressor : LZ4, DEFLATE-IAA
> > > ZSWAP Allocator : ZSMALLOC
> > > SWAP page-cluster : 2
> > >
> > > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > > IAA "compression verification" is enabled. Hence each IAA compression
> > > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > > returned by the hardware will be compared and errors reported in case of
> > > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > > compared to the software compressors.
> > >
> > > Throughput reported by usemem and perf sys time for running the test
> > > are as follows, averaged across 3 runs:
> > >
> > > 64KB mTHP (cgroup memory.high set to 60G):
> > > ==========================================
> > > ------------------------------------------------------------------
> > > | | | | |
> > > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> > > | | | KB/s | |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline | SSD | 335,346 | Baseline |
> > > |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% |
> > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 388,154 | 16% |
> > > |------------------------------------------------------------------|
> > > | | | | |
> > > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> > > | | | sec | |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline | SSD | 91.37 | Baseline |
> > > |zswap-mTHP-Store | ZSWAP lz4 | 265.43 | -191% |
> > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 235.60 | -158% |
> > > ------------------------------------------------------------------
> >
> > Yeah no, this is not good. That throughput regression is concerning...
> >
> > Is this tied to lz4 only, or do you observe similar trends in other
> > compressors that are not deflate-iaa?
>
> Let me gather data with other software compressors.
>
> >
> >
> > >
> > > -----------------------------------------------------------------------
> > > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> > > | | mainline | Store | Store |
> > > | | | lz4 | deflate-iaa |
> > > |-----------------------------------------------------------------------|
> > > | pswpin | 0 | 0 | 0 |
> > > | pswpout | 174,432 | 0 | 0 |
> > > | zswpin | 703 | 534 | 721 |
> > > | zswpout | 1,501 | 1,491,654 | 1,398,805 |
> > > |-----------------------------------------------------------------------|
> > > | thp_swpout | 0 | 0 | 0 |
> > > | thp_swpout_fallback | 0 | 0 | 0 |
> > > | pgmajfault | 3,364 | 3,650 | 3,431 |
> > > |-----------------------------------------------------------------------|
> > > | hugepages-64kB/stats/zswpout | | 63,200 | 63,244 |
> > > |-----------------------------------------------------------------------|
> > > | hugepages-64kB/stats/swpout | 10,902 | 0 | 0 |
> > > -----------------------------------------------------------------------
> > >
> >
> > Yeah this is not good. Something fishy is going on, if we see this
> > ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> > pages. That's a massive jump.
> >
> > Either it's:
> >
> > 1. Your theory - zswap store keeps banging on the limit (which suggests
> > incompatibility between the way zswap currently behaves and our
> > reclaim logic)
> >
> > 2. The data here is ridiculously incompressible. We're needing to
> > zswpout roughly 8.5 times the number of pages, so the saving is 8.5x
> > less => we only save 11.76% of memory for each page??? That's not
> > right...
> >
> > 3. There's an outright bug somewhere.
> >
> > Very suspicious.
> >
> > >
> > > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> > > =======================================================
> > > ------------------------------------------------------------------
> > > | | | | |
> > > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> > > | | | KB/s | |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline | SSD | 190,827 | Baseline |
> > > |zswap-mTHP-Store | ZSWAP lz4 | 32,026 | -83% |
> > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 203,772 | 7% |
> > > |------------------------------------------------------------------|
> > > | | | | |
> > > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> > > | | | sec | |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline | SSD | 27.23 | Baseline |
> > > |zswap-mTHP-Store | ZSWAP lz4 | 156.52 | -475% |
> > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 171.45 | -530% |
> > > ------------------------------------------------------------------
> >
> > I'm confused. This is a *regression* right? A massive one that is -
> > sys time is *more* than 5 times the old value?
> >
> > >
> > > -------------------------------------------------------------------------
> > > | VMSTATS, mTHP ZSWAP/SSD stats | v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> > > | | mainline | Store | Store |
> > > | | | lz4 | deflate-iaa |
> > > |-------------------------------------------------------------------------|
> > > | pswpin | 0 | 0 | 0 |
> > > | pswpout | 797,184 | 0 | 0 |
> > > | zswpin | 690 | 649 | 669 |
> > > | zswpout | 1,465 | 1,596,382 | 1,540,766 |
> > > |-------------------------------------------------------------------------|
> > > | thp_swpout | 1,557 | 0 | 0 |
> > > | thp_swpout_fallback | 0 | 3,248 | 3,752 |
> >
> > This is also increased, but I suppose we're just doing more
> > (z)swapping out in general...
> >
> > > | pgmajfault | 3,726 | 6,470 | 5,691 |
> > > |-------------------------------------------------------------------------|
> > > | hugepages-2048kB/stats/zswpout | | 2,416 | 2,261 |
> > > |-------------------------------------------------------------------------|
> > > | hugepages-2048kB/stats/swpout | 1,557 | 0 | 0 |
> > > -------------------------------------------------------------------------
> > >
> >
> > I'm not trying to delay this patch - I fully believe in supporting
> > zswap for larger pages (both mTHP and THP - whatever the memory
> > reclaim subsystem throws at us).
> >
> > But we need to get to the bottom of this :) These are very suspicious
> > and concerning data. If this is something urgent, I can live with a
> > gate to enable/disable this, but I'd much prefer we understand what's
> > going on here.
I started out with 2 main hypotheses to explain why zswap incurs more
reclaim relative to SSD:

1) The cgroup zswap charge, which hastens the breach of the memory.high
limit and adds to the reclaim being triggered in
mem_cgroup_handle_over_high().

2) Does a faster reclaim path somehow cause fewer allocation stalls, thereby
causing more breaches of memory.high, hence more reclaim -- and does this
cycle repeat, potentially leading to higher swapout activity with zswap?
I focused on gathering data with lz4 for this debug, under the reasonable
assumption that results with deflate-iaa will be better. Once we figure out
an overall direction on next steps, I will publish results with zswap lz4,
deflate-iaa, etc.
All experiments except "Exp 1.A" are run with
usemem --init-time -w -O -n 70 1g.
General settings for all data presented in this patch-series:
vm.swappiness = 100
zswap shrinker_enabled = N
Experiment 1 - impact of not doing cgroup zswap charge:
-------------------------------------------------------
I wanted to first understand by how much we improve without the cgroup
zswap charge. I commented out both the calls to obj_cgroup_charge_zswap()
and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
We improve throughput by quite a bit with this change, and are now better
than mTHP getting swapped out to SSD. We have also slightly improved on the
sys time, though this is still a regression as compared to SSD. If you
recall, we were worse on throughput and sys time with v4.
Averages over 3 runs are summarized in each case.
Exp 1.A: usemem -n 1 58g:
-------------------------
64KB mTHP (cgroup memory.high set to 60G):
==========================================
SSD mTHP zswap mTHP v4 zswap mTHP no_charge
----------------------------------------------------------------
pswpout 586,352 0 0
zswpout 1,005 1,042,963 587,181
----------------------------------------------------------------
Total swapout 587,357 1,042,963 587,181
----------------------------------------------------------------
Without the zswap charge to cgroup, the total swapout activity for
zswap-mTHP is on par with that of SSD-mTHP for the single process case.
Exp 1.B: usemem -n 70 1g:
-------------------------
v4 results with cgroup zswap charge:
------------------------------------
64KB mTHP (cgroup memory.high set to 60G):
==========================================
------------------------------------------------------------------
| | | | |
|Kernel | mTHP SWAP-OUT | Throughput | Change |
| | | KB/s | |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline | SSD | 335,346 | Baseline |
|zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% |
|------------------------------------------------------------------|
| | | | |
|Kernel | mTHP SWAP-OUT | Sys time | Change |
| | | sec | |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline | SSD | 91.37 | Baseline |
|zswap-mTHP-Store | ZSWAP lz4 | 265.43 | -191% |
------------------------------------------------------------------
-----------------------------------------------------------------------
| VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap-mTHP |
| | mainline | Store | Store |
| | | lz4 | deflate-iaa |
|-----------------------------------------------------------------------|
| pswpout | 174,432 | 0 | 0 |
| zswpout | 1,501 | 1,491,654 | 1,398,805 |
|-----------------------------------------------------------------------|
| hugepages-64kB/stats/zswpout | | 63,200 | 63,244 |
|-----------------------------------------------------------------------|
| hugepages-64kB/stats/swpout | 10,902 | 0 | 0 |
-----------------------------------------------------------------------
Debug results without cgroup zswap charge in both "Before" and "After":
------------------------------------------------------------------------
64KB mTHP (cgroup memory.high set to 60G):
==========================================
------------------------------------------------------------------
| | | | |
|Kernel | mTHP SWAP-OUT | Throughput | Change |
| | | KB/s | |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline | SSD | 300,565 | Baseline |
|zswap-mTHP-Store | ZSWAP lz4 | 420,125 | 40% |
|------------------------------------------------------------------|
| | | | |
|Kernel | mTHP SWAP-OUT | Sys time | Change |
| | | sec | |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline | SSD | 90.76 | Baseline |
|zswap-mTHP-Store | ZSWAP lz4 | 213.09 | -135% |
------------------------------------------------------------------
---------------------------------------------------------
| VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP |
| | mainline | Store |
| | | lz4 |
|----------------------------------------------------------
| pswpout | 330,640 | 0 |
| zswpout | 1,527 | 1,384,725 |
|----------------------------------------------------------
| hugepages-64kB/stats/zswpout | | 63,335 |
|----------------------------------------------------------
| hugepages-64kB/stats/swpout | 18,242 | 0 |
---------------------------------------------------------
Based on these results, I kept the cgroup zswap charging commented out in
subsequent debug steps, so as to not place zswap at a disadvantage when
trying to determine further causes for hypothesis (1).
Experiment 2 - swap latency/reclamation with 64K mTHP:
------------------------------------------------------
Number of swap_writepage Total swap_writepage Average swap_writepage
calls from all cores Latency (millisec) Latency (microsec)
---------------------------------------------------------------------------
SSD 21,373 165,434.9 7,740
zswap 344,109 55,446.8 161
---------------------------------------------------------------------------
Reclamation analysis: 64k mTHP swapout:
---------------------------------------
"Before":
Total SSD compressed data size = 1,362,296,832 bytes
Total SSD write IO latency = 887,861 milliseconds
Average SSD compressed data size = 1,089,837 bytes
Average SSD write IO latency = 710,289 microseconds
"After":
Total ZSWAP compressed pool size = 2,610,657,430 bytes
Total ZSWAP compress latency = 55,984 milliseconds
Average ZSWAP compress length = 2,055 bytes
Average ZSWAP compress latency = 44 microseconds
zswap-LZ4 mTHP compression ratio = 1.99
All moderately compressible pages. 0 zswap_store errors.
84% of pages compress to 2056 bytes.
Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
------------------------------------------------------------
I wanted to take a step back and understand how the mainline v6.11-rc3
handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
swapped out to ZSWAP. Interestingly, higher swapout activity is observed
with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
cgroup).
v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
-------------------------------------------------------------
SSD (CONFIG_ZSWAP is OFF) ZSWAP lz4 lzo-rle
-------------------------------------------------------------
cgroup memory.events: cgroup memory.events:
low 0 low 0 0
high 5,068 high 321,923 375,116
max 0 max 0 0
oom 0 oom 0 0
oom_kill 0 oom_kill 0 0
oom_group_kill 0 oom_group_kill 0 0
-------------------------------------------------------------
SSD (CONFIG_ZSWAP is OFF):
--------------------------
pswpout 415,709
sys time (sec) 301.02
Throughput KB/s 155,970
memcg_high events 5,068
--------------------------
ZSWAP lz4 lz4 lz4 lzo-rle
--------------------------------------------------------------
zswpout 1,598,550 1,515,151 1,449,432 1,493,917
sys time (sec) 889.36 481.21 581.22 635.75
Throughput KB/s 35,176 14,765 20,253 21,407
memcg_high events 321,923 412,733 369,976 375,116
--------------------------------------------------------------
This shows a performance regression of 60% to 195% with zswap as compared
to SSD with 4K folios. The higher swapout activity with zswap is seen here
too (i.e., this doesn't appear to be mTHP-specific).
I verified this to be the case even with the v6.7 kernel, which also
showed a 2.3X throughput improvement when we don't charge zswap:
ZSWAP lz4 v6.7 v6.7 with no cgroup zswap charge
--------------------------------------------------------------------
zswpout 1,419,802 1,398,620
sys time (sec) 535.4 613.41
Throughput KB/s 8,671 20,045
memcg_high events 574,046 451,859
--------------------------------------------------------------------
Summary from the debug:
-----------------------
1) Excess reclaim is exacerbated by the zswap charge to the cgroup. Without
the charge, reclaim is on par with SSD for mTHP in the single process
case. The multiple-process excess reclaim most likely results from
over-reclaim done by the cores in their respective calls to
mem_cgroup_handle_over_high().
2) The higher swapout activity with zswap as compared to SSD does not
appear to be specific to mTHP. Higher reclaim activity and sys time
regression with zswap (as compared to a setup where there is only SSD
configured as swap) exists with 4K pages as far back as v6.7.
3) The debug indicates that hypothesis (2) is worth more investigation:
Does a faster reclaim path somehow cause fewer allocation stalls, thereby
causing more breaches of memory.high, hence more reclaim -- and does this
cycle repeat, potentially leading to higher swapout activity with zswap?
Any advice on this being a possibility, and suggestions/pointers to
verify this, would be greatly appreciated.
4) Interestingly, the # of memcg_high events reduces significantly with 64K
mTHP as compared to the above 4K high events data, when tested with v4
and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This
potentially indicates something to do with allocation efficiency
countering the higher reclaim that seems to be caused by swapout
efficiency.
5) Nhat, Yosry: would it be possible for you to run the 4K folios
usemem -n 70 1g (with 60G memory.high) experiment with 4G and some higher
value SSD configuration in your setup, on say, v6.11-rc3? I would like
to rule out the memory-constrained 4G SSD in my setup somehow skewing
the behavior of zswap vis-a-vis
allocation/memcg_handle_over_high/reclaim. I realize your time is
valuable; however, I think an independent confirmation of what I have
been observing would be really helpful for us to figure out potential
root-causes and solutions.
6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to
break out of the loop if we have reclaimed a total of at least
"nr_pages":
        nr_reclaimed = reclaim_high(memcg,
                                    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
                                    gfp_mask);
+       nr_reclaimed_total += nr_reclaimed;
+
+       if (nr_reclaimed_total >= nr_pages)
+               goto out;
This was only for debug purposes, and did seem to mitigate the higher
reclaim behavior for 4K folios:
ZSWAP lz4 lz4 lz4
----------------------------------------------------------
zswpout 1,305,367 1,349,195 1,529,235
sys time (sec) 472.06 507.76 646.39
Throughput KB/s 55,144 21,811 88,310
memcg_high events 257,890 343,213 172,351
----------------------------------------------------------
On average, this change results in 17% improvement in sys time, 2.35X
improvement in throughput and 30% fewer memcg_high events.
I look forward to further inputs on next steps.
Thanks,
Kanchana
>
> Thanks for this analysis. I will debug this some more, so we can better
> understand these results.
>
> Thanks,
> Kanchana