Message-ID: <SJ0PR11MB5678A2F2802F780DB32C9A98C95C2@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Thu, 7 Nov 2024 22:32:43 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Johannes Weiner <hannes@...xchg.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "yosryahmed@...gle.com"
<yosryahmed@...gle.com>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
"21cnbao@...il.com" <21cnbao@...il.com>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "linux-crypto@...r.kernel.org"
<linux-crypto@...r.kernel.org>, "herbert@...dor.apana.org.au"
<herbert@...dor.apana.org.au>, "davem@...emloft.net" <davem@...emloft.net>,
"clabbe@...libre.com" <clabbe@...libre.com>, "ardb@...nel.org"
<ardb@...nel.org>, "ebiggers@...gle.com" <ebiggers@...gle.com>,
"surenb@...gle.com" <surenb@...gle.com>, "Accardi, Kristen C"
<kristen.c.accardi@...el.com>, "zanussi@...nel.org" <zanussi@...nel.org>,
"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in
zswap_store() of large folios.
> -----Original Message-----
> From: Johannes Weiner <hannes@...xchg.org>
> Sent: Thursday, November 7, 2024 10:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> yosryahmed@...gle.com; nphamcs@...il.com;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> ryan.roberts@....com; Huang, Ying <ying.huang@...el.com>;
> 21cnbao@...il.com; akpm@...ux-foundation.org; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; zanussi@...nel.org; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
>
> On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> > If the system has Intel IAA, and if sysctl vm.compress-batching is set to
> > "1", zswap_store() will call crypto_acomp_batch_compress() to compress up
> > to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> > the multiple compress engines available in IAA hardware.
> >
> > On platforms with multiple IAA devices per socket, compress jobs from all
> > cores in a socket will be distributed among all IAA devices on the socket
> > by the iaa_crypto driver.
> >
> > With deflate-iaa configured as the zswap compressor, and with sysctl
> > vm.compress-batching enabled, the first time zswap_store() has to
> > swapout a large folio on any given cpu, it will allocate the
> > pool->acomp_batch_ctx resources on that cpu, and set pool->can_batch_comp
> > to BATCH_COMP_ENABLED. It will then proceed to call the main
> > __zswap_store_batch_core() compress batching function. Subsequent calls to
> > zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx by
> > checking the pool->can_batch_comp status.
> >
> > Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on an
> > as-needed basis, to reduce memory footprint cost. The cost is not incurred
> > on cores that never get to swapout a large folio.
> >
> > This patch introduces the main __zswap_store_batch_core() function for
> > compress batching. This interface represents the extensible compress
> > batching architecture that can potentially be called with a batch of
> > any-order folios from shrink_folio_list(). In other words, although
> > zswap_store() calls __zswap_store_batch_core() with exactly one large folio
> > in this patch, we can reuse this interface to reclaim a batch of folios, to
> > significantly improve the reclaim path efficiency due to IAA's parallel
> > compression capability.
> >
> > The newly added functions that implement batched stores follow the
> > general structure of zswap_store() of a large folio. Some amount of
> > restructuring and optimization is done to minimize failure points
> > for a batch, fail early and maximize the zswap store pipeline occupancy
> > with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
> > folios. This is intended to maximize reclaim throughput with the IAA
> > hardware parallel compressions.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>
> > ---
> > include/linux/zswap.h | 84 ++++++
> >  mm/zswap.c            | 625 ++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 709 insertions(+)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 9ad27ab3d222..6d3ef4780c69 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
> > atomic_long_t nr_disk_swapins;
> > };
> >
> > +/*
> > + * struct zswap_store_sub_batch_page:
> > + *
> > + * This represents one "zswap batching element", namely, the
> > + * attributes associated with a page in a large folio that will
> > + * be compressed and stored in zswap. The term "batch" is reserved
> > + * for a conceptual "batch" of folios that can be sent to
> > + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> > + * a collection of "zswap batching elements", i.e., an array of
> > + * "struct zswap_store_sub_batch_page *".
> > + *
> > + * The zswap compress sub-batch size is specified by
> > + * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
> > + * platform has Intel IAA. This means zswap can store a large folio
> > + * by creating sub-batches of up to 8 pages and compressing this
> > + * batch using IAA to parallelize the 8 compress jobs in hardware.
> > + * For e.g., a 64KB folio can be compressed as 2 sub-batches of
> > + * 8 pages each. This can significantly improve the zswap_store()
> > + * performance for large folios.
> > + *
> > + * Although the page itself is represented directly, the structure
> > + * adds a "u8 batch_idx" to represent an index for the folio in a
> > + * conceptual "batch of folios" that can be passed to zswap_store().
> > + * Conceptually, this allows for up to 256 folios that can be passed
> > + * to zswap_store(). If this conceptual number of folios sent to
> > + * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
> > + */
> > +struct zswap_store_sub_batch_page {
> > + u8 batch_idx;
> > + swp_entry_t swpentry;
> > + struct obj_cgroup *objcg;
> > + struct zswap_entry *entry;
> > + int error; /* folio error status. */
> > +};
> > +
> > +/*
> > + * struct zswap_store_pipeline_state:
> > + *
> > + * This stores state during IAA compress batching of (conceptually, a
> > + * batch of) folios. The term pipelining in this context refers to
> > + * breaking down the batch of folios being reclaimed into sub-batches of
> > + * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
> > + * sub-batch. This concept could be further evolved to use overlap of CPU
> > + * computes with IAA computes. For instance, we could stage the
> > + * post-compress computes for sub-batch "N-1" to happen in parallel with
> > + * IAA batch compression of sub-batch "N".
> > + *
> > + * We begin by developing the concept of compress batching. Pipelining with
> > + * overlap can be future work.
> > + *
> > + * @errors: The errors status for the batch of reclaim folios passed in from
> > + * a higher mm layer such as swap_writepage().
> > + * @pool: A valid zswap_pool.
> > + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
> > + * @sub_batch: This is an array that represents the sub-batch of up to
> > + * SWAP_CRYPTO_BATCH_SIZE pages that are being stored
> > + * in zswap.
> > + * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
> > + *             page being compressed.
> > + * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
> > + *              obtained after crypto_acomp_poll() returns completion status,
> > + *              for each page being compressed.
> > + * @comp_errors: Compression errors for each page being compressed.
> > + * @nr_comp_pages: Total number of pages in @sub_batch.
> > + *
> > + * Note:
> > + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> > + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> > + * u8 members (except @comp_dsts) need to become u16.
> > + */
> > +struct zswap_store_pipeline_state {
> > + int *errors;
> > + struct zswap_pool *pool;
> > + struct crypto_acomp_ctx *acomp_ctx;
> > + struct zswap_store_sub_batch_page *sub_batch;
> > + struct page **comp_pages;
> > + u8 **comp_dsts;
> > + unsigned int *comp_dlens;
> > + int *comp_errors;
> > + u8 nr_comp_pages;
> > +};
>
> Why are these in the public header?
Thanks Johannes, for the detailed code review comments! Yes, these don't
need to belong in the public header. I will move them to zswap.c.
>
> > unsigned long zswap_total_pages(void);
> > bool zswap_store(struct folio *folio);
> > bool zswap_load(struct folio *folio);
> > @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
> > #else
> >
> > struct zswap_lruvec_state {};
> > +struct zswap_store_sub_batch_page {};
> > +struct zswap_store_pipeline_state {};
> >
> > static inline bool zswap_store(struct folio *folio)
> > {
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2af736e38213..538aac3fb552 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
> > char *tfm_name,
> > unsigned int nr_reqs);
> >
> > +static bool __zswap_store_batch_core(
> > + int node_id,
> > + struct folio **folios,
> > + int *errors,
> > + unsigned int nr_folios);
> > +
>
> Please reorder the functions to avoid forward decls.
Sure.
>
> > /*********************************
> > * pool functions
> > **********************************/
> > @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
> > return -EINVAL;
> > }
> >
> > +/*
> > + * Modified to use the IAA compress batching framework implemented in
> > + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> > + * The batching code is intended to significantly improve folio store
> > + * performance over the sequential code.
>
> This isn't helpful, please delete.
Ok.
>
> > bool zswap_store(struct folio *folio)
> > {
> > long nr_pages = folio_nr_pages(folio);
> > @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
> > bool ret = false;
> > long index;
> >
> > + /*
> > + * Improve large folio zswap_store() latency with IAA compress batching,
> > + * if this is enabled by setting sysctl vm.compress-batching to "1".
> > + * If enabled, the large folio's pages are compressed in parallel in
> > + * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in
> > + * the large folio is compressed sequentially.
> > + * the large folio is compressed sequentially.
> > + */
>
> Same here. Reduce to "Try to batch compress large folios, fall back to
> processing individual subpages if that fails."
Ok.
>
> > + if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
> > + pool = zswap_pool_current_get();
>
> There is an existing zswap_pool_current_get() in zswap_store(), please
> reorder the sequence so you don't need to add an extra one.
Ok, will do this. I was trying to make the code less messy, but will find
a cleaner way.
>
> > + if (!pool) {
> > + pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n");
>
> This is unnecessary.
Ok.
>
> > + goto sequential_store;
> > + }
> > +
> > + if (zswap_pool_can_batch(pool)) {
>
> This function is introduced in another patch, where it isn't
> used. Please add functions and callers in the same patch.
Ok. Unintended side effects of trying to break down the changes
into smaller commits. Will address this in v4.
>
> > + int error = -1;
> > + bool store_batch = __zswap_store_batch_core(
> > + folio_nid(folio),
> > + &folio, &error, 1);
> > +
> > + if (store_batch) {
> > + zswap_pool_put(pool);
> > + if (!error)
> > + ret = true;
> > + return ret;
> > + }
> > + }
>
> Please don't future proof code like this, only implement what is
> strictly necessary for the functionality in this patch. You're only
> adding a single caller with nr_folios=1, so it shouldn't be a
> parameter, and the function shouldn't have a that batch_idx loop.
Ok.
>
> > + zswap_pool_put(pool);
> > + }
> > +
> > +sequential_store:
> > +
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
> > return ret;
> > }
> >
> > +/*
> > + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> > + * u8 stack variables in the next several functions, to u16.
> > + */
> > +
> > +/*
> > + * Propagate the "sbp" error condition to other batch elements belonging to
> > + * the same folio as "sbp".
> > + */
> > +static __always_inline void zswap_store_propagate_errors(
> > + struct zswap_store_pipeline_state *zst,
> > + u8 error_batch_idx)
> > +{
>
> Please observe surrounding coding style on how to wrap >80 col
> function signatures.
I see. Ok.
>
> Don't use __always_inline unless there is a clear, spelled out
> performance reason. Since it's an error path, that's doubtful.
The motivation was incompressible pages, but sure, will address in v4.
>
> Please use a consistent namespace for all this:
>
> CONFIG_ZSWAP_BATCH
> zswap_batch_store()
> zswap_batch_alloc_entries()
> zswap_batch_add_folios()
> zswap_batch_compress()
>
> etc.
>
> Again, order to avoid forward decls.
>
> Try to keep the overall sequence of events between zswap_store() and
> zswap_batch_store() similar as much as possible for readability.
Definitely.
Thanks,
Kanchana