Message-ID: <SJ0PR11MB56783F8762A5AE6FF20BBAF2C9802@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Thu, 15 Aug 2024 01:37:07 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Barry Song <21cnbao@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"nphamcs@...il.com" <nphamcs@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat
event counters.
Hi Barry,
> -----Original Message-----
> From: Barry Song <21cnbao@...il.com>
> Sent: Wednesday, August 14, 2024 4:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; nphamcs@...il.com;
> ryan.roberts@....com; Huang, Ying <ying.huang@...el.com>; akpm@...ux-
> foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store
> vmstat event counters.
>
> On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@...el.com> wrote:
> >
> > Hi Barry,
> >
> > > -----Original Message-----
> > > From: Barry Song <21cnbao@...il.com>
> > > Sent: Wednesday, August 14, 2024 12:49 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > hannes@...xchg.org; yosryahmed@...gle.com; nphamcs@...il.com;
> > > ryan.roberts@....com; Huang, Ying <ying.huang@...el.com>;
> > > akpm@...ux-foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> > > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store
> > > vmstat event counters.
> > >
> > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@...el.com> wrote:
> > > >
> > > > Added vmstat event counters per mTHP-size that can be used to account
> > > > for folios of different sizes being successfully stored in ZSWAP.
> > > >
> > > > For this RFC, it is not clear if these zswpout counters should instead
> > > > be added as part of the existing mTHP stats in
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats.
> > > >
> > > > The following is also a viable option, should it make better sense,
> > > > for instance, as:
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout.
> > > >
> > > > If so, we would be able to distinguish between mTHP zswap and
> > > > non-zswap swapouts through:
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> > > >
> > > > and
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
> > > >
> > > > respectively.
> > > >
> > > > Comments would be appreciated as to which approach is preferable.
> > >
> > > Even though a swapout might go through zswap, from the perspective of
> > > the mm core, it shouldn't need to be aware of that. Shouldn't zswpout be
> > > part of swpout? Why are they separate? Whether or not an mTHP has been
> > > put in zswap, it has been swapped out as far as the mm core is concerned, no?
> >
> > Thanks for the code review comments. This is a good point. I was keeping in
> > mind the convention used by existing vmstat event counters that distinguish
> > zswpout/zswpin from pswpout/pswpin events.
> >
> > If we want to keep the distinction in mTHP swapouts, would adding a
> > separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok?
> >
>
> I'm not entirely sure how important the zswpout counter is. To me, it doesn't
> seem as critical as swpout and swpout_fallback, which are more useful for
> system profiling. zswpout feels more like an internal detail of how the
> swap-out process is handled. If that is the case, we might not need this
> per-size counter.
>
> Otherwise, I believe sysfs is a better place, to avoid all the chaos in vmstat
> of handling various orders and sizes. So the question is: is the per-size
> zswpout counter really important, or is it just for debugging purposes?
I agree: sysfs would be a cleaner place for mTHP stats accounting, given the
existing per-order mTHP swpout stats under
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout.
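To make the sysfs option concrete, here is a rough sketch of how a per-order
zswpout counter could be wired into the existing mTHP stats machinery. The
MTHP_STAT_ZSWPOUT name and the exact hook point in the swap-out path are only
tentative assumptions for illustration:

/* include/linux/huge_mm.h: add a per-order item next to MTHP_STAT_SWPOUT */
enum mthp_stat_item {
	...
	MTHP_STAT_SWPOUT,
	MTHP_STAT_SWPOUT_FALLBACK,
	MTHP_STAT_ZSWPOUT,	/* tentative name */
	...
	__MTHP_STAT_COUNT
};

/* mm/huge_memory.c: exposes hugepages-<size>kB/stats/zswpout */
DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);

static struct attribute *stats_attrs[] = {
	...
	&swpout_attr.attr,
	&zswpout_attr.attr,
	NULL,
};

/* e.g. mm/page_io.c:swap_writepage(), once per folio successfully stored */
if (zswap_store(folio)) {
	count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
	folio_unlock(folio);
	return 0;
}

With something like the above, the zswap vs. non-zswap split per order would be
visible by comparing stats/zswpout with stats/swpout, without adding per-size
entries to vmstat.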
I personally find distinct accounting of zswap vs. bdev/fs swapouts useful
for debugging, and for overall reclaim path characterization: for instance,
the impact of different zswap compressors' compression latency on zswpout
activity for a given workload. Is a slowdown in compression latency causing
active/hot memory to be reclaimed and immediately faulted back in? Does better
zswap compression efficiency correlate with more cold memory being reclaimed
as mTHP? How does the reclaim path efficiency gained by improving zswap_store
mTHP performance correlate with zswap utilization and memory savings? I have
found these counters useful in understanding some of these characteristics.
I also believe it helps to account for the number of mTHP stored in different
compression tiers: for example, how many mTHP were stored in zswap vs. rejected
and written to the backing swap device. This could help, say, in provisioning
zswap memory, and in understanding the impact of zswap compress-path latency
on scaling.
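As a purely hypothetical illustration: if hugepages-64kB/stats/zswpout read 900
while hugepages-64kB/stats/swpout read 100 for a run, roughly 90% of the 64kB
folio swapouts landed in zswap and 10% fell through to the backing device;
tracking that ratio over time would indicate whether the zswap pool is
provisioned appropriately for the workload.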
Another interesting characteristic that mTHP zswpout accounting could help us
understand is compressor incompressibility and/or zpool fragmentation, by
making it possible to better correlate the zswap/reject_* sysfs counters with
the mTHP [z]swpout stats.
I look forward to inputs from you and others on the direction and next steps.
Thanks,
Kanchana
>
> > In any case, it looks like all that would be needed is a call to
> > count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the
> > general case.
> >
> > I will make this change in v2, depending on whether or not the
> > separation of zswpout vs. non-zswap swpout is recommended for
> > mTHP.
> >
> > >
> > >
> > > >
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> > > > ---
> > > > include/linux/vm_event_item.h | 15 +++++++++++++++
> > > > mm/vmstat.c | 15 +++++++++++++++
> > > > 2 files changed, 30 insertions(+)
> > > >
> > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > > > index 747943bc8cc2..2451bcfcf05c 100644
> > > > --- a/include/linux/vm_event_item.h
> > > > +++ b/include/linux/vm_event_item.h
> > > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > > > THP_ZERO_PAGE_ALLOC,
> > > > THP_ZERO_PAGE_ALLOC_FAILED,
> > > > THP_SWPOUT,
> > > > +#ifdef CONFIG_ZSWAP
> > > > + ZSWPOUT_PMD_THP_FOLIO,
> > > > +#endif
> > > > THP_SWPOUT_FALLBACK,
> > > > #endif
> > > > #ifdef CONFIG_MEMORY_BALLOON
> > > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > > > ZSWPIN,
> > > > ZSWPOUT,
> > > > ZSWPWB,
> > > > + ZSWPOUT_4KB_FOLIO,
> > > > +#ifdef CONFIG_THP_SWAP
> > > > + mTHP_ZSWPOUT_8kB,
> > > > + mTHP_ZSWPOUT_16kB,
> > > > + mTHP_ZSWPOUT_32kB,
> > > > + mTHP_ZSWPOUT_64kB,
> > > > + mTHP_ZSWPOUT_128kB,
> > > > + mTHP_ZSWPOUT_256kB,
> > > > + mTHP_ZSWPOUT_512kB,
> > > > + mTHP_ZSWPOUT_1024kB,
> > > > + mTHP_ZSWPOUT_2048kB,
> > > > +#endif
> > >
> > > This implementation hardcodes assumptions about the page size being 4KB,
> > > but page sizes can vary, and so can the THP orders?
> >
> > Agreed, will address in v2.
> >
> > >
> > > > #endif
> > > > #ifdef CONFIG_X86
> > > > DIRECT_MAP_LEVEL2_SPLIT,
> > > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > > index 8507c497218b..0e66c8b0c486 100644
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = {
> > > > "thp_zero_page_alloc",
> > > > "thp_zero_page_alloc_failed",
> > > > "thp_swpout",
> > > > +#ifdef CONFIG_ZSWAP
> > > > + "zswpout_pmd_thp_folio",
> > > > +#endif
> > > > "thp_swpout_fallback",
> > > > #endif
> > > > #ifdef CONFIG_MEMORY_BALLOON
> > > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = {
> > > > "zswpin",
> > > > "zswpout",
> > > > "zswpwb",
> > > > + "zswpout_4kb_folio",
> > > > +#ifdef CONFIG_THP_SWAP
> > > > + "mthp_zswpout_8kb",
> > > > + "mthp_zswpout_16kb",
> > > > + "mthp_zswpout_32kb",
> > > > + "mthp_zswpout_64kb",
> > > > + "mthp_zswpout_128kb",
> > > > + "mthp_zswpout_256kb",
> > > > + "mthp_zswpout_512kb",
> > > > + "mthp_zswpout_1024kb",
> > > > + "mthp_zswpout_2048kb",
> > > > +#endif
> > >
> > > The issue here is that the number of THP orders
> > > can vary across different platforms.
> >
> > Agreed, will address in v2.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > > #endif
> > > > #ifdef CONFIG_X86
> > > > "direct_map_level2_splits",
> > > > --
> > > > 2.27.0
> > > >
>
> Thanks
> Barry