Message-ID: <PH7PR11MB81218BA4B8D8CC5F57570455C9EAA@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Mon, 13 Oct 2025 17:47:13 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org"
<sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
folio in batches.

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Sent: Friday, October 3, 2025 12:11 PM
> To: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@...el.com>
> Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> large folio in batches.
>
>
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > Sent: Wednesday, October 1, 2025 9:19 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > hannes@...xchg.org; nphamcs@...il.com;
> chengming.zhou@...ux.dev;
> > > usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> > > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > <kristen.c.accardi@...el.com>; Gomes, Vinicius
> > <vinicius.gomes@...el.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > > <vinodh.gopal@...el.com>
> > > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > > large folio in batches.
> [...]
> > >
> > > > + */
> > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > +                                                           unsigned int nr_entries,
> > > > +                                                           gfp_t gfp)
> > > > +{
> > > > +        return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > >
> > > We currently use kmem_cache_alloc_node() in
> zswap_entry_cache_alloc()
> > to
> > > allocate the entry on the same node as the compressed page. We use
> > > entry_to_nid() to get the node for LRU operations.
> > >
> > > This breaks that assumption.
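[For context as I reason about this below: the mainline pattern being
referred to is roughly the following, paraphrased from mm/zswap.c (the
exact code may differ slightly across trees):

static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
{
        return kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
}

static int entry_to_nid(struct zswap_entry *entry)
{
        /* The LRU code derives the node from the page backing the entry. */
        return page_to_nid(virt_to_page(entry));
}

i.e., entry_to_nid() reports the node the entry object actually landed on.
With kmem_cache_alloc_node() that tracks the compressed page's node,
whereas kmem_cache_alloc_bulk() takes no node hint.]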
> >
> > You bring up a good point. I was looking at the code in slub.c, and my
> > understanding thus far is that both bulk allocations and
> > kmem_cache_alloc_node() allocations are made from a per-CPU "cpu_slab"
> > that SLUB allocates.
> >
> > IIUC, the concern you are raising is that in the mainline, the entry is
> > allocated on the same node as the compressed page and gets added to that
> > node's LRU list. IOW, the entry is added to the LRU of the node to which
> > the compressed page belongs.
> >
> > With this patch, kmem_cache_alloc_bulk() will create the entry on the
> > per-CPU slab of the CPU on which zswap_store() is called, and the entry
> > will be added to the LRU of that per-CPU slab's NUMA node. Hence, the end
> > result could be that a page's zswap_entry ends up on a different NUMA
> > node/memcg than the page's own NUMA node.
> >
> > This is my thinking as to how this will impact the zswap shrinker:
> >
> > 1) memcg shrinker: if the memcg the entry ends up in is on the
> >    zswap_list_lru, the entry will be written back.
> > 2) Global shrinker: will cycle through all memcgs that have pages in the
> >    zswap_list_lru, and the entry will be written back.
> >
> > Based on this, it is not clear to me if there is a problem, and I would
> > like to request you, Nhat and others to provide insights as well.
> >
> > Interestingly, most of the code in slub.c has
> > unlikely(!node_match(slab, node)). Does this imply some higher-level mm
> > slab allocation requirement?
> >
> > I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> > think this would be more correct.
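If we decide that per-page, node-aware allocation is the more correct
option, a minimal sketch of what I have in mind is below
(zswap_alloc_entries_on_node() is a hypothetical helper name used for
illustration, not part of the patch):

/*
 * Allocate each entry individually on @node_id so that entry_to_nid()
 * continues to match the compressed page's node. On failure, release
 * the partially allocated objects with the bulk free API.
 */
static int zswap_alloc_entries_on_node(struct zswap_entry **entries,
                                       unsigned int nr_entries, int node_id)
{
        unsigned int i;

        for (i = 0; i < nr_entries; i++) {
                entries[i] = kmem_cache_alloc_node(zswap_entry_cache,
                                                   GFP_KERNEL, node_id);
                if (unlikely(!entries[i])) {
                        zswap_reject_kmemcache_fail++;
                        kmem_cache_free_bulk(zswap_entry_cache, i,
                                             (void **)entries);
                        return -ENOMEM;
                }
        }

        return 0;
}

This would keep the node assumption intact, at the cost of giving up the
bulk-allocation fast path.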
>
> I wanted to share some updates and summarize my understanding from
> looking some more at slub.c. The "zswap_entry_cache" kmem_cache has slab
> memory created for each node. The main change with v12 is:
>
> before:
> The page being compressed and its zswap_entry are in the same memcg, and
> thus implicitly on the same slab node.
>
> v12:
> The page being compressed and its zswap_entry could be in different
> memcgs if the process that owns the page gets migrated to a different
> node.
>
> Impact to the zswap LRU list vis-à-vis the memcg shrinker:
>
> before:
> The original NUMA node's memcg [hierarchy] would need to face memory
> pressure for the page's zswap_entry to be written back. Hence it is
> possible that the node on which the process is currently running may not
> see the benefit of memory being freed up.
>
> v12:
> The memcg on the node on which the process was running when the page was
> compressed would have to face memory pressure for the page's zswap_entry
> to be written back. That node will see the benefit of the memory freed up
> by writeback. If the process has since migrated to a different node than
> the one it was running on when the page was compressed, we have the same
> issue as "before".
>
> Is my understanding correct? Please let me know if I am missing something.
>
> The bulk allocation does improve batching performance, especially with 64K
> folios:
>
> kernel_compile, 64K folios, deflate-iaa:
> ========================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   874.74     821.59                    870.82
> sys_sec                  3,834.35   3,791.12                  3,794.06
> ----------------------------------------------------------------------
>
> kernel_compile, 64K folios, zstd:
> =================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   925.08     853.14                    899.40
> sys_sec                  5,318.65   5,172.23                  5,415.20
> ----------------------------------------------------------------------
>
> kernel_compile, PMD folios, deflate-iaa:
> ========================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   808.10     794.85                    809.33
> sys_sec                  4,351.01   4,266.95                  4,169.07
> ----------------------------------------------------------------------
>
> kernel_compile, PMD folios, zstd:
> =================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   848.06     845.42                    836.06
> sys_sec                  5,898.58   5,741.31                  5,795.75
> ----------------------------------------------------------------------
>
> Based on this, and assuming my understanding of the v12 change's impact
> on the zswap shrinker is correct, I am not sure the change in this
> assumption is necessarily a concern.
>
> >
> > >
> > > > +}
> > > > +
> > > > +static __always_inline void zswap_entries_cache_free_batch(void **entries,
> > > > +                                                            unsigned int nr_entries)
> > > > +{
> > > > +        kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> > > > +}
> [...]
> > > > +static bool zswap_store_pages(struct folio *folio,
> > > > +                              long start,
> > > > +                              long end,
> > > > +                              struct obj_cgroup *objcg,
> > > > +                              struct zswap_pool *pool,
> > > > +                              int node_id,
> > > > +                              bool folio_wb)
> > > >  {
> > > > -        swp_entry_t page_swpentry = page_swap_entry(page);
> > > > -        struct zswap_entry *entry, *old;
> > > > -
> > > > -        /* allocate entry */
> > > > -        entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > > > -        if (!entry) {
> > > > -                zswap_reject_kmemcache_fail++;
> > > > -                return false;
> > > > +        struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > > > +        u8 i, store_fail_idx = 0, nr_pages = end - start;
> > > > +
> > > > +        VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
> > > > +
> > > > +        if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> > > > +                                                      nr_pages, GFP_KERNEL))) {
> > > > +                for (i = 0; i < nr_pages; ++i) {
> > > > +                        entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > > > +
> > > > +                        if (unlikely(!entries[i])) {
> > > > +                                zswap_reject_kmemcache_fail++;
> > > > +                                /*
> > > > +                                 * While handling this error, we only need to
> > > > +                                 * call zswap_entries_cache_free_batch() for
> > > > +                                 * entries[0 .. i-1].
> > > > +                                 */
> > > > +                                nr_pages = i;
> > > > +                                goto store_pages_failed;
> > > > +                        }
> > >
> > > Is it okay to use kmem_cache_free_bulk() to free slab objects that were
> > > not allocated with kmem_cache_alloc_bulk()?
> >
> > I recall verifying that it should be Ok, but can check again.
> >
>
> I verified the code again, and yes, it is Ok for slab objects allocated by
> either kmem_cache_alloc_bulk() or kmem_cache_alloc_node() to be freed by
> calling kmem_cache_free_bulk(). kmem_cache_free_bulk() opportunistically
> builds freelists from the objects passed in and "bulk transfers" them to
> the slabs' freelists, with less synchronization overhead than calling
> kmem_cache_free() per object.
>
> I verified this by running usemem30 with the following code change in
> zswap_store_pages(), and saw no issues with deflate-iaa/zstd:
>
> for each of nr_pages: kmem_cache_alloc_node()
>
> kmem_cache_free_bulk()
>
> kmem_cache_alloc_bulk()
>
> kmem_cache_free_bulk()
>
> kmem_cache_alloc_bulk()
>
> [proceed]
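To make the above sequence concrete, the throwaway verification hack
inside zswap_store_pages() was along these lines, using the patch's own
wrappers (error handling for the per-page allocations omitted for brevity;
this is not part of v12):

        /* 1) Allocate each entry individually on node_id ... */
        for (i = 0; i < nr_pages; ++i)
                entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);

        /* 2) ... and free the node-allocated objects with the bulk free API. */
        zswap_entries_cache_free_batch((void **)entries, nr_pages);

        /* 3) Allocate and free the same count via the bulk APIs. */
        if (zswap_entries_cache_alloc_batch((void **)entries, nr_pages,
                                            GFP_KERNEL))
                zswap_entries_cache_free_batch((void **)entries, nr_pages);

        /* 4) Do the final bulk allocation that the store path actually
         *    uses (falling back to the per-page path on failure, as in
         *    the patch).
         */
        if (unlikely(!zswap_entries_cache_alloc_batch((void **)entries,
                                                      nr_pages, GFP_KERNEL))) {
                /* ... existing per-page fallback ... */
        }

        /* [proceed with the rest of zswap_store_pages()] */

In each combination, objects allocated with kmem_cache_alloc_node() were
freed by kmem_cache_free_bulk() without issue.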
>
> Yosry, please let me know how I should proceed, and whether there are
> other experiments I can run to verify this change to bulk alloc/free and
> address any remaining concerns.
Hi Yosry,

Just wanted to check in to get your suggestions on next steps as I start
working on v13.

Thanks for taking the time to provide code review comments!

Best regards,
Kanchana
>
> Thanks,
> Kanchana