Message-ID: <SJ0PR11MB5678288EE890E8CB882839CDC98E2@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Wed, 21 Aug 2024 19:07:46 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Nhat Pham <nphamcs@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"ryan.roberts@....com" <ryan.roberts@....com>, "Huang, Ying"
<ying.huang@...el.com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Hi Nhat,
> -----Original Message-----
> From: Nhat Pham <nphamcs@...il.com>
> Sent: Wednesday, August 21, 2024 7:43 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; ryan.roberts@....com;
> Huang, Ying <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-
> foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@...el.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@....com/T/#u
> >
> > Additionally, some of the functionality in zswap_store() has been
> > modularized to make it more amenable to supporting any-order mTHPs.
> >
> > For instance, the determination of whether a page is same-filled is
> > based on mapping an index into the folio to derive that page. Likewise,
> > a function "zswap_store_entry" has been added to store a zswap_entry in
> > the xarray.
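
To illustrate the xarray part: a simplified sketch of what such a helper
can look like is below (not the exact patch code; the signature, the use
of the swap offset as the key, and the duplicate-entry handling are
assumptions for illustration):

  /* Store one zswap_entry in the tree's xarray, keyed by swap offset. */
  static bool zswap_store_entry(struct xarray *tree, pgoff_t offset,
                                struct zswap_entry *entry)
  {
          /* xa_store() returns the previous entry at this index, or an
           * xa_err() pointer on failure. */
          struct zswap_entry *old = xa_store(tree, offset, entry, GFP_KERNEL);

          if (xa_is_err(old))
                  return false;

          /* A duplicate means an older entry exists for this offset;
           * free it so it is not leaked. */
          if (old)
                  zswap_entry_free(old);

          return true;
  }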
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
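
For reference, the increment itself is essentially a one-liner after the
whole folio has been stored. A simplified sketch (the exact call site may
differ), where MTHP_STAT_ZSWPOUT is the new stat item added by this series:

  /* After all subpages of the folio were stored successfully: */
  if (folio_test_large(folio))
          count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);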
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration. We would like to submit that work
> > in subsequent RFC patch-series, along with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > changes to count_mthp_stat() so that it's always defined, even when THP
> > is disabled. Barry, I have also made one other change in page_io.c
> > where count_mthp_stat() is called by count_swpout_vm_event(). I would
> > appreciate it if you can review this. Thanks!
> > Hopefully, this resolves the kernel robot build errors.
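
For context, the relevant call site is count_swpout_vm_event() in
page_io.c; a simplified sketch of its shape after the change (not the
literal diff) is:

  static inline void count_swpout_vm_event(struct folio *folio)
  {
  #ifdef CONFIG_THP_SWAP
          if (unlikely(folio_test_pmd_mappable(folio)))
                  count_vm_event(THP_SWPOUT);
  #endif
          /* count_mthp_stat() is now always defined, so it no longer
           * needs to sit inside the CONFIG_THP_SWAP block. */
          count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
          count_vm_events(PSWPOUT, folio_nr_pages(folio));
  }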
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> > as suggested by Ying Huang. Ying, I would appreciate it if you can
> > review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> > the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> > Ryan's initial RFC [1]:
> > - Added a comment about the cgroup zswap limit checks occurring once per
> > folio at the beginning of zswap_store().
> > Nhat, Ryan, please do let me know if the comments convey the summary
> > from the RFC discussion. Thanks!
> > - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing was done on the v6.11-rc3 mainline, without and with this
> > patch-series, on an Intel Sapphire Rapids server (dual-socket, 56 cores
> > per socket, 4 IAA devices per socket).
> >
> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> > ZSWAP. Core frequency was fixed at 2500 MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> > usemem --init-time -w -O -n 70 1g
> >
> > Since the 4G SSD constrained how much swapout activity the 70 usemem
> > processes could generate, I ended up using different fixed cgroup
> > memory.high limits for the experiments with 64K mTHP and 2M THP:
> >
> > 64K mTHP experiments: cgroup memory fixed at 60G
> > 2M THP experiments : cgroup memory fixed at 55G
> >
> > The vm/sysfs stats included after the performance data provide details
> > on the swapout activity to SSD/ZSWAP.
> >
> > Other kernel configuration parameters:
> >
> > ZSWAP Compressor : LZ4, DEFLATE-IAA
> > ZSWAP Allocator : ZSMALLOC
> > SWAP page-cluster : 2
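
For anyone reproducing the setup: these knobs map to the standard zswap
module parameters and vm.page-cluster. A minimal (hypothetical, untested)
userspace helper that applies them, assuming root and zswap built in:

  #include <stdio.h>

  /* Write a string value to a sysfs/procfs file; returns 0 on success. */
  static int write_knob(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fputs(val, f);
          return fclose(f);
  }

  int main(void)
  {
          write_knob("/sys/module/zswap/parameters/compressor", "lz4");
          write_knob("/sys/module/zswap/parameters/zpool", "zsmalloc");
          write_knob("/proc/sys/vm/page-cluster", "2");
          return 0;
  }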
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > is decompressed internally by the "iaa_crypto" driver, the CRCs returned
> > by the hardware are compared, and errors are reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity than
> > the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows, averaged across 3 runs:
> >
> > 64KB mTHP (cgroup memory.high set to 60G):
> > ==========================================
> > ------------------------------------------------------------------
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> > | | | KB/s | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | SSD | 335,346 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 388,154 | 16% |
> > |------------------------------------------------------------------|
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> > | | | sec | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | SSD | 91.37 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 265.43 | -191% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 235.60 | -158% |
> > ------------------------------------------------------------------
>
> Yeah no, this is not good. That throughput regression is concerning...
>
> Is this tied to lz4 only, or do you observe similar trends in other
> compressors that are not deflate-iaa?

Let me gather data with other software compressors.
>
>
> >
> > -----------------------------------------------------------------------
> > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> > | | mainline | Store | Store |
> > | | | lz4 | deflate-iaa |
> > |-----------------------------------------------------------------------|
> > | pswpin | 0 | 0 | 0 |
> > | pswpout | 174,432 | 0 | 0 |
> > | zswpin | 703 | 534 | 721 |
> > | zswpout | 1,501 | 1,491,654 | 1,398,805 |
> > |-----------------------------------------------------------------------|
> > | thp_swpout | 0 | 0 | 0 |
> > | thp_swpout_fallback | 0 | 0 | 0 |
> > | pgmajfault | 3,364 | 3,650 | 3,431 |
> > |-----------------------------------------------------------------------|
> > | hugepages-64kB/stats/zswpout | | 63,200 | 63,244 |
> > |-----------------------------------------------------------------------|
> > | hugepages-64kB/stats/swpout | 10,902 | 0 | 0 |
> > -----------------------------------------------------------------------
> >
>
> Yeah this is not good. Something fishy is going on, if we see this
> ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> pages. That's a massive jump.
>
> Either it's:
>
> 1. Your theory - zswap store keeps banging on the limit (which suggests
> incompatibility between the way zswap currently behaves and our
> reclaim logic)
>
> 2. The data here is ridiculously incompressible. We need to zswpout
> roughly 8.5 times the number of pages, so the saving per page is 8.5
> times less => we only save ~11.76% of memory for each page??? That's
> not right...
>
> 3. There's an outright bug somewhere.
>
> Very suspicious.
>
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> > =======================================================
> > ------------------------------------------------------------------
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> > | | | KB/s | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | SSD | 190,827 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 32,026 | -83% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 203,772 | 7% |
> > |------------------------------------------------------------------|
> > | | | | |
> > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> > | | | sec | |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline | SSD | 27.23 | Baseline |
> > |zswap-mTHP-Store | ZSWAP lz4 | 156.52 | -475% |
> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 171.45 | -530% |
> > ------------------------------------------------------------------
>
> I'm confused. This is a *regression* right? A massive one that is -
> sys time is *more* than 5 times the old value?
>
> >
> > -------------------------------------------------------------------------
> > | VMSTATS, mTHP ZSWAP/SSD stats | v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> > | | mainline | Store | Store |
> > | | | lz4 | deflate-iaa |
> > |-------------------------------------------------------------------------|
> > | pswpin | 0 | 0 | 0 |
> > | pswpout | 797,184 | 0 | 0 |
> > | zswpin | 690 | 649 | 669 |
> > | zswpout | 1,465 | 1,596,382 | 1,540,766 |
> > |-------------------------------------------------------------------------|
> > | thp_swpout | 1,557 | 0 | 0 |
> > | thp_swpout_fallback | 0 | 3,248 | 3,752 |
>
> This is also increased, but I suppose we're just doing more
> (z)swapping out in general...
>
> > | pgmajfault | 3,726 | 6,470 | 5,691 |
> > |-------------------------------------------------------------------------|
> > | hugepages-2048kB/stats/zswpout | | 2,416 | 2,261 |
> > |-------------------------------------------------------------------------|
> > | hugepages-2048kB/stats/swpout | 1,557 | 0 | 0 |
> > -------------------------------------------------------------------------
> >
>
> I'm not trying to delay this patch - I fully believe in supporting
> zswap for larger pages (both mTHP and THP - whatever the memory
> reclaim subsystem throws at us).
>
> But we need to get to the bottom of this :) These are very suspicious
> and concerning data. If this is something urgent, I can live with a
> gate to enable/disable this, but I'd much prefer we understand what's
> going on here.

Thanks for this analysis. I will debug this some more, so we can better
understand these results.

Thanks,
Kanchana