Message-ID: <PH7PR11MB81210004ECD59264FF37B767C901A@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Wed, 3 Sep 2025 18:00:39 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Barry Song <21cnbao@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosry.ahmed@...ux.dev" <yosry.ahmed@...ux.dev>,
"nphamcs@...il.com" <nphamcs@...il.com>, "chengming.zhou@...ux.dev"
<chengming.zhou@...ux.dev>, "usamaarif642@...il.com"
<usamaarif642@...il.com>, "ryan.roberts@....com" <ryan.roberts@....com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if
the compressor supports batching.
> -----Original Message-----
> From: Barry Song <21cnbao@...il.com>
> Sent: Saturday, August 30, 2025 1:41 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosry.ahmed@...ux.dev; nphamcs@...il.com;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> ryan.roberts@....com; ying.huang@...ux.alibaba.com; akpm@...ux-
> foundation.org; senozhatsky@...omium.org; linux-crypto@...r.kernel.org;
> herbert@...dor.apana.org.au; davem@...emloft.net;
> clabbe@...libre.com; ardb@...nel.org; ebiggers@...gle.com;
> surenb@...gle.com; Accardi, Kristen C <kristen.c.accardi@...el.com>;
> Gomes, Vinicius <vinicius.gomes@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
>
> > > >
> > > > I am not sure I understand this rationale, but I do want to reiterate
> > > > that the patch-set implements a simple set of rules/design choices
> > > > to provide a batching framework for software and hardware compressors,
> > > > which has shown good performance improvements with both, while
> > > > unifying the zswap_store()/zswap_compress() code paths for both.
> > >
> > > I’m really curious: if ZSWAP_MAX_BATCH_SIZE = 8 and
> > > compr_batch_size = 4, why wouldn’t batch_size = 8 and
> > > compr_batch_size = 4 perform better than batch_size = 4 and
> > > compr_batch_size = 4?
> > >
> > > In short, I’d like the case of compr_batch_size == 1 to be treated the same
> > > as compr_batch_size == 2, 4, etc., since you can still see performance
> > > improvements when ZSWAP_MAX_BATCH_SIZE = 8 and compr_batch_size == 1,
> > > as batching occurs even outside compression.
> > >
> > > Therefore, I would expect batch_size == 8 and compr_batch_size == 2 to
> > > perform better than when both are 2.
> > >
> > > The only thing preventing this from happening is that compr_batch_size
> > > might be 5, 6, or 7, which are not powers of two?
> >
> > It would be interesting to see if a generalization of pool->compr_batch_size
> > being a factor "N" (where N > 1) of ZSWAP_MAX_BATCH_SIZE yields better
> > performance than the current set of rules. However, we would still need to
> > handle the case where it is not, as you mention, which might still necessitate
> > the use of a distinct pool->batch_size to avoid re-calculating this dynamically,
> > since this information doesn't change after pool creation.
> >
> > The current implementation gives preference to the algorithm to determine
> > not just the batch compression step-size, but also the working-set size for
> > other zswap processing for the batch, i.e., bulk allocation of entries,
> > zpool writes, etc. The algorithm's batch-size is what zswap uses for the latter
> > (the zswap_store_pages() in my patch-set). This has been shown to work
> > well.
> >
> > To change this design to be driven instead by ZSWAP_MAX_BATCH_SIZE
> > always (while handling a non-factor pool->compr_batch_size) requires more
> > data gathering. I am inclined to keep the existing implementation, and
> > we can continue to improve upon it if that's OK with you.
>
> Right, I have no objection at this stage. I’m just curious—since some hardware
> now supports HW compression with only one queue, and in the future may
> increase to two or four queues but not many overall—whether batch_size ==
> compr_batch_size is always the best rule.
>
> BTW, is HW compression always better than software? For example, when
> kswapd, proactive reclamation, and direct reclamation all run simultaneously,
> the CPU-based approach can leverage multiple CPUs to perform compression
> in parallel. But if the hardware only provides a limited number of queues,
> software might actually perform better. An extreme case is when multiple
> threads are running MADV_PAGEOUT at the same time.
These are great questions; we'll need to run more experiments to answer them
and understand the trade-offs. The good thing is that the zswap architecture
proposed in this patch-set is flexible enough to allow us to do so with minor
changes to how we set up the two zswap_pool data members
(pool->compr_batch_size and pool->store_batch_size), along the lines of the
sketch below.
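For concreteness, here is a minimal sketch of the kind of setup being
discussed. The zswap_pool member names follow the patch-set, but the helper
name (zswap_pool_init_batch_sizes) is hypothetical, and the policy shown is
only one variant of the rules under discussion, not the final design:

/*
 * Illustrative sketch only: the sizing policy below is one variant of
 * the rules being discussed in this thread, not the patch-set's final
 * design.
 */
static void zswap_pool_init_batch_sizes(struct zswap_pool *pool,
					unsigned int compr_batch_size)
{
	/*
	 * Step-size for each call into the compressor, capped at
	 * ZSWAP_MAX_BATCH_SIZE.
	 */
	pool->compr_batch_size = min(compr_batch_size,
				     (unsigned int)ZSWAP_MAX_BATCH_SIZE);

	/*
	 * Working-set size for the rest of the zswap_store_pages()
	 * processing of a batch: bulk allocation of entries, zpool
	 * writes, etc. Barry's question above is whether this should
	 * be ZSWAP_MAX_BATCH_SIZE whenever compr_batch_size divides
	 * it, rather than only when compr_batch_size == 1.
	 */
	if (pool->compr_batch_size > 1)
		pool->store_batch_size = pool->compr_batch_size;
	else
		pool->store_batch_size = ZSWAP_MAX_BATCH_SIZE;
}

zswap_store() would then process a folio in units of pool->store_batch_size,
with zswap_compress() splitting each unit into sub-batches of
pool->compr_batch_size for the compressor.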
One of the next steps we plan to explore is integrating batching for
hardware parallelism with the kcompressd work, as per Nhat's suggestion. I
believe the hardware can be used with multiple compression threads from all
these sources. There are many details to be worked out, but I think this can
be done by striking the right balance between cost amortization, hardware
parallelism, overlapping compute between the CPU and the accelerator, etc.
Another important point is that our hardware roadmap continues to evolve,
consistently improving compression ratios, lowering both compression and
decompression latency, and boosting overall throughput.
To summarize: with hardware compression acceleration, and with batching to
take advantage of parallel hardware compress/decompress, we can improve the
reclaim latency for a given CPU thread. When combined with compression ratio
improvements, we save more memory for equivalent performance, and/or need to
reclaim less via proactive or direct reclaim (reclaim latency improvements
result in less memory pressure), improving workload performance. This can
make a significant impact on contended systems, where CPU threads for reclaim
come at a cost that hardware compression can help offset. Hence, I can only
see upside, but we'll need to prove this out :).
Thanks,
Kanchana
>
> I’m not opposing your current patchset, just sharing some side thoughts :-)
>
> Thanks
> Barry