Message-ID: <SJ0PR11MB5678F262108A0ED22A84698BC9362@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Tue, 3 Dec 2024 22:24:17 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
CC: Herbert Xu <herbert@...dor.apana.org.au>, Nhat Pham <nphamcs@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
the crypto_alg supports batching.
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Sent: Tuesday, December 3, 2024 2:18 PM
> To: Yosry Ahmed <yosryahmed@...gle.com>
> Cc: Herbert Xu <herbert@...dor.apana.org.au>; Nhat Pham
> <nphamcs@...il.com>; linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> akpm@...ux-foundation.org; linux-crypto@...r.kernel.org;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Feghali, Wajdi K <wajdi.k.feghali@...el.com>;
> Gopal, Vinodh <vinodh.gopal@...el.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@...el.com>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@...gle.com>
> > Sent: Tuesday, December 3, 2024 1:44 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > Cc: Herbert Xu <herbert@...dor.apana.org.au>; Nhat Pham
> > <nphamcs@...il.com>; linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > hannes@...xchg.org; chengming.zhou@...ux.dev;
> > usamaarif642@...il.com; ryan.roberts@....com; ying.huang@...el.com;
> > 21cnbao@...il.com; akpm@...ux-foundation.org;
> > linux-crypto@...r.kernel.org; davem@...emloft.net; clabbe@...libre.com;
> > ardb@...nel.org; ebiggers@...gle.com; surenb@...gle.com; Accardi,
> > Kristen C <kristen.c.accardi@...el.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> > if the crypto_alg supports batching.
> >
> > On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@...el.com> wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@...dor.apana.org.au>
> > > > Sent: Tuesday, December 3, 2024 12:01 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > > Cc: Nhat Pham <nphamcs@...il.com>; linux-kernel@...r.kernel.org;
> > > > linux-mm@...ck.org; hannes@...xchg.org; yosryahmed@...gle.com;
> > > > chengming.zhou@...ux.dev; usamaarif642@...il.com;
> > > > ryan.roberts@....com; ying.huang@...el.com; 21cnbao@...il.com;
> > > > akpm@...ux-foundation.org; linux-crypto@...r.kernel.org;
> > > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > > <kristen.c.accardi@...el.com>; Feghali, Wajdi K
> > > > <wajdi.k.feghali@...el.com>;
> > > > Gopal, Vinodh <vinodh.gopal@...el.com>
> > > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching
> > > > resources if the crypto_alg supports batching.
> > > >
> > > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > > >
> > > > > > Why do we need this "can_batch" field? IIUC, this can be
> > > > > > determined from the compressor internal fields itself, no?
> > > > > >
> > > > > > acomp_has_async_batching(acomp);
> > > > > >
> > > > > > Is this just for convenience, or is this actually an expensive
> > > > > > thing to compute?
> > > > >
> > > > > Thanks for your comments. This is a good question. I wanted to
> > > > > avoid implying that batching resources have been allocated for the
> > > > > cpu based only on what acomp_has_async_batching() returns. It is
> > > > > possible that the cpu onlining code ran into an -ENOMEM error on
> > > > > any particular cpu. In this case, I set pool->can_batch to "false",
> > > > > mainly for convenience, so that zswap can be somewhat insulated
> > > > > from migration. I agree that this may not be the best solution;
> > > > > whether or not batching is enabled can be determined directly, just
> > > > > before the call to crypto_acomp_batch_compress(), based on:
> > > > >
> > > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > > >
> > > > With ahash request chaining, the idea is to accumulate as much
> > > > data as you can before you provide it to the Crypto API. The
> > > > API is responsible for dividing it up if the underlying driver
> > > > is only able to handle one request at a time.
> > > >
> > > > So that would be the ideal model to use for compression as well.
> > > > Provide as much data as you can and let the API handle the case
> > > > where the data needs to be divided up.
> > >
> > > Thanks for this suggestion! This sounds like a clean way to handle the
> > > batching/sequential compress/decompress within the crypto API as long
> > > as it can be contained in the crypto acompress layer.
> > > If the zswap maintainers don't have any objections, I can look into the
> > > feasibility of doing this.
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
>
> If I understand Herbert's suggestion correctly, I think what he meant was
> that we always allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers in zswap
> (say, 8) during cpu onlining. The acomp_has_async_batching() API can be
> used to determine whether to allocate more than one acomp_req and
> crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_wait
> for the request chaining, with the goal of understanding performance wrt
> the existing implementation of crypto_acomp_batch_compress()).
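>
> For illustration, here is a minimal sketch of what that onlining
> allocation could look like. The helper name, the acomp_ctx field layout,
> and the deferral of error unwinding to the caller are all assumptions
> for the sketch, not actual patch code:
>
> static int zswap_batch_resources_alloc(struct crypto_acomp_ctx *acomp_ctx,
>                                        int cpu)
> {
>         /* Allocate multiple reqs/waits only if the algorithm can batch. */
>         unsigned int nr_reqs = acomp_has_async_batching(acomp_ctx->acomp) ?
>                                SWAP_CRYPTO_BATCH_SIZE : 1;
>         unsigned int i;
>
>         /* One dst buffer per batch slot, regardless of batching support. */
>         for (i = 0; i < SWAP_CRYPTO_BATCH_SIZE; ++i) {
>                 acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
>                                                      GFP_KERNEL,
>                                                      cpu_to_node(cpu));
>                 if (!acomp_ctx->buffers[i])
>                         return -ENOMEM; /* caller unwinds */
>         }
>
>         for (i = 0; i < nr_reqs; ++i) {
>                 acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
>                 acomp_ctx->waits[i] = kmalloc_node(sizeof(struct crypto_wait),
>                                                    GFP_KERNEL,
>                                                    cpu_to_node(cpu));
>                 if (!acomp_ctx->reqs[i] || !acomp_ctx->waits[i])
>                         return -ENOMEM; /* caller unwinds */
>                 crypto_init_wait(acomp_ctx->waits[i]);
>         }
>
>         acomp_ctx->nr_reqs = nr_reqs;
>         return 0;
> }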
> In zswap_store_folio(), we process the large folio in batches of 8 pages
> and call "crypto_acomp_batch_compress()" for each batch. Based on earlier
> discussions in this thread, it might make sense to add a bool option to
> crypto_acomp_batch_compress() as follows:
>
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
>                                                struct crypto_wait *waits[],
>                                                struct page *pages[],
>                                                u8 *dsts[],
>                                                unsigned int dlens[],
>                                                int errors[],
>                                                int nr_pages,
>                                                bool parallel);
>
> zswap would acquire the per-cpu acomp_ctx->mutex, and pass
> (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel"
> parameter. This tells crypto_acomp_batch_compress() whether or not
> SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and "waits".
>
> If we have multiple "reqs" (parallel == true), we use request chaining
> (or the existing asynchronous poll implementation) for IAA batching. If
> (parallel == false), crypto_acomp_batch_compress() will look something
> like this:
>
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
>                                                struct crypto_wait *waits[],
>                                                struct page *pages[],
>                                                u8 *dsts[],
>                                                unsigned int dlens[],
>                                                int errors[],
>                                                int nr_pages,
>                                                bool parallel)
> {
>         if (!parallel) {
>                 struct scatterlist input, output;
>                 int i;
>
>                 for (i = 0; i < nr_pages; ++i) {
>                         /*
>                          * For pages[i], dsts[i], dlens[i]: borrow the
>                          * first half of zswap_compress()'s functionality,
>                          * reusing reqs[0]/waits[0] for each page in turn.
>                          */
>                         sg_init_table(&input, 1);
>                         sg_set_page(&input, pages[i], PAGE_SIZE, 0);
>                         sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
>                         acomp_request_set_params(reqs[0], &input, &output,
>                                                  PAGE_SIZE, dlens[i]);
>
>                         errors[i] = crypto_wait_req(
>                                         crypto_acomp_compress(reqs[0]),
>                                         waits[0]);
>                         if (errors[i])
>                                 return false;
>                         dlens[i] = reqs[0]->dlen;
>                 }
>
>                 /*
>                  * At this point we would have sequentially compressed the
>                  * batch. zswap_store_folio() can process the buffers and
>                  * dlens using common code for batching and non-batching
>                  * compressors.
>                  */
>                 return true;
>         }
>
>         /* parallel == true: request chaining / async poll path for IAA,
>          * elided here.
>          */
>         return true;
> }
>
> IIUC, this suggestion appears to be along the lines of using common
> code in zswap as far as possible, for compressors that do and do not
> support batching. Herbert can correct me if I am wrong.
>
> If this is indeed the case, the memory penalty for software compressors
> would be:
> 1) Pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
>    zswap_cpu_comp_prepare().
> 2) Allocating SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens
>    in zswap_store_folio().
>
> This would be the additional memory cost of having common code paths in
> zswap for compressors that do and do not support batching.
Alternatively, we could always use request chaining, even for software
compressors, at a larger per-cpu memory penalty, by allocating
SWAP_CRYPTO_BATCH_SIZE # of reqs/waits by default. I don't know if this
would have functional issues, given that the chain of requests would be
processed sequentially for software compressors (basically, all requests
are added to a list), but maybe this is what Herbert is suggesting (not
sure).
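
For reference, a rough sketch of what always using request chaining could
look like. This assumes an acomp analogue of ahash request chaining;
acomp_request_chain() below is an assumed name mirroring the ahash API,
and the helper itself is hypothetical:

static int zswap_compress_chained(struct page *pages[], u8 *dsts[],
                                  unsigned int dlens[], int nr_pages,
                                  struct crypto_acomp_ctx *acomp_ctx)
{
        struct scatterlist inputs[SWAP_CRYPTO_BATCH_SIZE];
        struct scatterlist outputs[SWAP_CRYPTO_BATCH_SIZE];
        int i, err;

        for (i = 0; i < nr_pages; ++i) {
                sg_init_table(&inputs[i], 1);
                sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
                sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
                acomp_request_set_params(acomp_ctx->reqs[i], &inputs[i],
                                         &outputs[i], PAGE_SIZE, PAGE_SIZE);
                /* Chain reqs[1..n-1] behind the head request (assumed API). */
                if (i)
                        acomp_request_chain(acomp_ctx->reqs[i],
                                            acomp_ctx->reqs[0]);
        }

        /*
         * Submit only the head request; the API processes the chain, in
         * parallel on hardware like IAA, sequentially for software
         * compressors.
         */
        err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                              acomp_ctx->waits[0]);
        if (err)
                return err;

        for (i = 0; i < nr_pages; ++i)
                dlens[i] = acomp_ctx->reqs[i]->dlen;

        return 0;
}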
Thanks,
Kanchana
>
> Thanks,
> Kanchana
>
> >
> > It sounds nice on the surface, but this implies that we have to
> > allocate folio_nr_pages() buffers in zswap, essentially an allocation
> > the same size as the folio itself. While the allocation does not need
> > to be contiguous, making a large number of allocations in the reclaim
> > path is definitely not something we want. For a 2M THP, we'd need to
> > allocate 2M in zswap_store().
> >
> > If we choose to keep preallocating, assuming the maximum THP size is
> > 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> > memory.
> >
> > I feel like I am missing something.
> >
> > >
> > > Thanks,
> > > Kanchana
> > >
> > > >
> > > > Cheers,
> > > > --
> > > > Email: Herbert Xu <herbert@...dor.apana.org.au>
> > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt