Message-ID: <CAJD7tkatD8Qw582C4gOsHRNgN3G7Qx=CxzV=FExhvroCaCAW6Q@mail.gmail.com>
Date: Mon, 6 Jan 2025 16:58:33 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, hannes@...xchg.org,
nphamcs@...il.com, chengming.zhou@...ux.dev, usamaarif642@...il.com,
ryan.roberts@....com, 21cnbao@...il.com, akpm@...ux-foundation.org,
linux-crypto@...r.kernel.org, herbert@...dor.apana.org.au,
davem@...emloft.net, clabbe@...libre.com, ardb@...nel.org,
ebiggers@...gle.com, surenb@...gle.com, kristen.c.accardi@...el.com,
wajdi.k.feghali@...el.com, vinodh.gopal@...el.com
Subject: Re: [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if
the crypto_alg supports batching.
On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
<kanchana.p.sridhar@...el.com> wrote:
>
> This patch does the following:
>
> 1) Defines ZSWAP_MAX_BATCH_SIZE to denote the maximum number of acomp_ctx
> batching resources (acomp_reqs and buffers) to allocate if the zswap
> compressor supports batching. Currently, ZSWAP_MAX_BATCH_SIZE is set to
> 8U.
>
> 2) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> "struct crypto_acomp_ctx" to contain the number of resources that will
> be allocated in the cpu hotplug onlining code.
>
> 3) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> crypto_acomp created for the zswap pool (in other words, the zswap
> compression algorithm) has registered implementations for
> batch_compress() and batch_decompress().
This is an implementation detail that is not visible to the zswap
code. Please do not refer to batch_compress() and batch_decompress()
here, just mention that we check if the compressor supports batching.
> If so, it will query the
> crypto_acomp for the maximum batch size supported by the compressor, and
> set "nr_reqs" to the minimum of this compressor-specific max batch size
> and ZSWAP_MAX_BATCH_SIZE. Finally, it will allocate "nr_reqs"
> reqs/buffers, and set the acomp_ctx->nr_reqs accordingly.
>
> 4) If the crypto_acomp does not support batching, "nr_reqs" defaults to 1.
General note: some implementation details are obvious from the code
and do not need to be explained in the commit log. It's mostly useful
to explain what you are doing at a high level, and why you are doing
it.
In this case, we should mainly describe that we are adding support for
the per-CPU acomp_ctx to track multiple compression/decompression
requests but are not actually using more than one request yet. Mention
that followup changes will actually utilize this to batch
compression/decompression of multiple pages, and highlight important
implementation details (such as ZSWAP_MAX_BATCH_SIZE limiting the
amount of extra memory we are using for this, and that there is no
extra memory usage for compressors that do not use batching).
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> ---
> mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 90 insertions(+), 32 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 9718c33f8192..99cd78891fd0 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -78,6 +78,13 @@ static bool zswap_pool_reached_full;
>
> #define ZSWAP_PARAM_UNSET ""
>
> +/*
> + * For compression batching of large folios:
> + * Maximum number of acomp compress requests that will be processed
> + * in a batch, iff the zswap compressor supports batching.
> + */
Please mention that this limit exists because we preallocate that many
requests and buffers per CPU, so a higher limit means higher memory
usage.
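Something along these lines, for example (just a suggestion, feel free
to reword):

	/*
	 * Maximum number of acomp compress requests that will be processed
	 * in a batch, iff the zswap compressor supports batching. This is
	 * also the number of acomp requests and buffers preallocated per
	 * CPU for such compressors, so a larger value means more memory.
	 */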
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
> static int zswap_setup(void);
>
> /* Enable/disable zswap */
> @@ -143,9 +150,10 @@ bool zswap_never_enabled(void)
>
> struct crypto_acomp_ctx {
> struct crypto_acomp *acomp;
> - struct acomp_req *req;
> + struct acomp_req **reqs;
> + u8 **buffers;
> + unsigned int nr_reqs;
> struct crypto_wait wait;
> - u8 *buffer;
> struct mutex mutex;
> bool is_sleepable;
> };
> @@ -818,49 +826,88 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> struct crypto_acomp *acomp;
> - struct acomp_req *req;
> - int ret;
> + unsigned int nr_reqs = 1;
> + int ret = -ENOMEM;
> + int i, j;
>
> mutex_init(&acomp_ctx->mutex);
> -
> - acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> - if (!acomp_ctx->buffer)
> - return -ENOMEM;
> + acomp_ctx->nr_reqs = 0;
>
> acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> if (IS_ERR(acomp)) {
> pr_err("could not alloc crypto acomp %s : %ld\n",
> pool->tfm_name, PTR_ERR(acomp));
> - ret = PTR_ERR(acomp);
> - goto acomp_fail;
> + return PTR_ERR(acomp);
> }
> acomp_ctx->acomp = acomp;
> acomp_ctx->is_sleepable = acomp_is_async(acomp);
>
> - req = acomp_request_alloc(acomp_ctx->acomp);
> - if (!req) {
> - pr_err("could not alloc crypto acomp_request %s\n",
> - pool->tfm_name);
> - ret = -ENOMEM;
> + /*
> + * Create the necessary batching resources if the crypto acomp alg
> + * implements the batch_compress and batch_decompress API.
No mention of the internal implementation of acomp_has_async_batching() please.
> + */
> + if (acomp_has_async_batching(acomp)) {
> + nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));
> + pr_info_once("Creating acomp_ctx with %d reqs/buffers for batching since crypto acomp\n%s has registered batch_compress() and batch_decompress().\n",
> + nr_reqs, pool->tfm_name);
This will only be printed once, so if the compressor changes, the
information will no longer be up-to-date on all CPUs. I think we
should just drop it.
> + }
> +
> + acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
Can we use kcalloc_node() here?
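i.e. something like (untested):

	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(*acomp_ctx->buffers),
					  GFP_KERNEL, cpu_to_node(cpu));

It also zero-initializes the array, which helps with the cleanup
suggestion further down.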
> + if (!acomp_ctx->buffers)
> + goto buf_fail;
> +
> + for (i = 0; i < nr_reqs; ++i) {
> + acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> + if (!acomp_ctx->buffers[i]) {
> + for (j = 0; j < i; ++j)
> + kfree(acomp_ctx->buffers[j]);
> + kfree(acomp_ctx->buffers);
> + ret = -ENOMEM;
> + goto buf_fail;
> + }
> + }
> +
> + acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
Ditto.
> + if (!acomp_ctx->reqs)
> goto req_fail;
> +
> + for (i = 0; i < nr_reqs; ++i) {
> + acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> + if (!acomp_ctx->reqs[i]) {
> + pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> + i, pool->tfm_name);
> + for (j = 0; j < i; ++j)
> + acomp_request_free(acomp_ctx->reqs[j]);
> + kfree(acomp_ctx->reqs);
> + ret = -ENOMEM;
> + goto req_fail;
> + }
> }
> - acomp_ctx->req = req;
>
> + /*
> + * The crypto_wait is used only in fully synchronous, i.e., with scomp
> + * or non-poll mode of acomp, hence there is only one "wait" per
> + * acomp_ctx, with callback set to reqs[0], under the assumption that
> + * there is at least 1 request per acomp_ctx.
> + */
> crypto_init_wait(&acomp_ctx->wait);
> /*
> * if the backend of acomp is async zip, crypto_req_done() will wakeup
> * crypto_wait_req(); if the backend of acomp is scomp, the callback
> * won't be called, crypto_wait_req() will return without blocking.
> */
> - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> + acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
> crypto_req_done, &acomp_ctx->wait);
>
> + acomp_ctx->nr_reqs = nr_reqs;
> return 0;
>
> req_fail:
> + for (i = 0; i < nr_reqs; ++i)
> + kfree(acomp_ctx->buffers[i]);
> + kfree(acomp_ctx->buffers);
The cleanup code is all over the place. Sometimes it's done in the
loops allocating the memory and sometimes here. It's a bit hard to
follow. Please have all the cleanups here. You can just initialize the
arrays to 0s, and then if the array is not-NULL you can free any
non-NULL elements (kfree() will handle NULLs gracefully).
There may even be potential for code reuse with zswap_cpu_comp_dead().
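Roughly something like this (a rough, untested sketch, assuming the
arrays are allocated zeroed as suggested above, and with the
allocation loops just doing "goto fail" on error; the label name is
only illustrative):

	fail:
		/* Arrays start zeroed, so only free what was allocated. */
		if (acomp_ctx->reqs) {
			for (i = 0; i < nr_reqs; ++i)
				if (acomp_ctx->reqs[i])
					acomp_request_free(acomp_ctx->reqs[i]);
			kfree(acomp_ctx->reqs);
		}
		if (acomp_ctx->buffers) {
			/* kfree() handles NULL entries gracefully. */
			for (i = 0; i < nr_reqs; ++i)
				kfree(acomp_ctx->buffers[i]);
			kfree(acomp_ctx->buffers);
		}
		crypto_free_acomp(acomp_ctx->acomp);
		return ret;

which is essentially what zswap_cpu_comp_dead() does, hence the
potential for reuse.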
> +buf_fail:
> crypto_free_acomp(acomp_ctx->acomp);
> -acomp_fail:
> - kfree(acomp_ctx->buffer);
> return ret;
> }
>
> @@ -870,11 +917,22 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>
> if (!IS_ERR_OR_NULL(acomp_ctx)) {
> - if (!IS_ERR_OR_NULL(acomp_ctx->req))
> - acomp_request_free(acomp_ctx->req);
> + int i;
> +
> + for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> + if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> + acomp_request_free(acomp_ctx->reqs[i]);
> + kfree(acomp_ctx->reqs);
> +
> + for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> + kfree(acomp_ctx->buffers[i]);
> + kfree(acomp_ctx->buffers);
> +
> if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> crypto_free_acomp(acomp_ctx->acomp);
> - kfree(acomp_ctx->buffer);
> +
> + acomp_ctx->nr_reqs = 0;
> + acomp_ctx = NULL;
> }
>
> return 0;
> @@ -897,7 +955,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
> mutex_lock(&acomp_ctx->mutex);
>
> - dst = acomp_ctx->buffer;
> + dst = acomp_ctx->buffers[0];
> sg_init_table(&input, 1);
> sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -907,7 +965,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> * giving the dst buffer with enough length to avoid buffer overflow.
> */
> sg_init_one(&output, dst, PAGE_SIZE * 2);
> - acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> + acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
>
> /*
> * it maybe looks a little bit silly that we send an asynchronous request,
> @@ -921,8 +979,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> * but in different threads running on different cpu, we have different
> * acomp instance, so multiple threads can do (de)compression in parallel.
> */
> - comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> - dlen = acomp_ctx->req->dlen;
> + comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> + dlen = acomp_ctx->reqs[0]->dlen;
> if (comp_ret)
> goto unlock;
>
> @@ -975,20 +1033,20 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> */
> if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
> !virt_addr_valid(src)) {
> - memcpy(acomp_ctx->buffer, src, entry->length);
> - src = acomp_ctx->buffer;
> + memcpy(acomp_ctx->buffers[0], src, entry->length);
> + src = acomp_ctx->buffers[0];
> zpool_unmap_handle(zpool, entry->handle);
> }
>
> sg_init_one(&input, src, entry->length);
> sg_init_table(&output, 1);
> sg_set_folio(&output, folio, PAGE_SIZE, 0);
> - acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> + acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
> + BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
> + BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
> mutex_unlock(&acomp_ctx->mutex);
>
> - if (src != acomp_ctx->buffer)
> + if (src != acomp_ctx->buffers[0])
> zpool_unmap_handle(zpool, entry->handle);
> }
>
> --
> 2.27.0
>