Message-ID: <SJ0PR11MB5678EE2AC8DCCD7E7E8B0AD2C95C2@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Thu, 7 Nov 2024 22:50:34 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Johannes Weiner <hannes@...xchg.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "yosryahmed@...gle.com"
<yosryahmed@...gle.com>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
"21cnbao@...il.com" <21cnbao@...il.com>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "linux-crypto@...r.kernel.org"
<linux-crypto@...r.kernel.org>, "herbert@...dor.apana.org.au"
<herbert@...dor.apana.org.au>, "davem@...emloft.net" <davem@...emloft.net>,
"clabbe@...libre.com" <clabbe@...libre.com>, "ardb@...nel.org"
<ardb@...nel.org>, "ebiggers@...gle.com" <ebiggers@...gle.com>,
"surenb@...gle.com" <surenb@...gle.com>, "Accardi, Kristen C"
<kristen.c.accardi@...el.com>, "zanussi@...nel.org" <zanussi@...nel.org>,
"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in
zswap_store() of large folios.
> -----Original Message-----
> From: Johannes Weiner <hannes@...xchg.org>
> Sent: Thursday, November 7, 2024 10:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> yosryahmed@...gle.com; nphamcs@...il.com;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> ryan.roberts@....com; Huang, Ying <ying.huang@...el.com>;
> 21cnbao@...il.com; akpm@...ux-foundation.org; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; zanussi@...nel.org; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
>
> On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> > +static void zswap_zpool_store_sub_batch(
> > +        struct zswap_store_pipeline_state *zst)
>
> There is a zswap_store_sub_batch() below, which does something
> completely different. Naming is hard, but please invest a bit more
> time into this to make this readable.
Thanks, Johannes, for the comments. Yes, I agree the naming could
be better.
>
> > +{
> > +        u8 i;
> > +
> > +        for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +                struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
> > +                struct zpool *zpool;
> > +                unsigned long handle;
> > +                char *buf;
> > +                gfp_t gfp;
> > +                int err;
> > +
> > +                /* Skip pages that had compress errors. */
> > +                if (sbp->error)
> > +                        continue;
> > +
> > +                zpool = zst->pool->zpool;
> > +                gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > +                if (zpool_malloc_support_movable(zpool))
> > +                        gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > +                err = zpool_malloc(zpool, zst->comp_dlens[i], gfp, &handle);
> > +
> > +                if (err) {
> > +                        if (err == -ENOSPC)
> > +                                zswap_reject_compress_poor++;
> > +                        else
> > +                                zswap_reject_alloc_fail++;
> > +
> > +                        /*
> > +                         * An error should be propagated to other pages of the
> > +                         * same folio in the sub-batch, and zpool resources for
> > +                         * those pages (in sub-batch order prior to this zpool
> > +                         * error) should be de-allocated.
> > +                         */
> > +                        zswap_store_propagate_errors(zst, sbp->batch_idx);
> > +                        continue;
> > +                }
> > +
> > +                buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +                memcpy(buf, zst->comp_dsts[i], zst->comp_dlens[i]);
> > +                zpool_unmap_handle(zpool, handle);
> > +
> > +                sbp->entry->handle = handle;
> > +                sbp->entry->length = zst->comp_dlens[i];
> > +        }
> > +}
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(swp_entry_t page_swpentry,
> > +                              struct zswap_entry *entry)
> > +{
> > +        struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
> > +                                           swp_offset(page_swpentry),
> > +                                           entry, GFP_KERNEL);
> > +        if (xa_is_err(old)) {
> > +                int err = xa_err(old);
> > +
> > +                WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > +                zswap_reject_alloc_fail++;
> > +                return false;
> > +        }
> > +
> > +        /*
> > +         * We may have had an existing entry that became stale when
> > +         * the folio was redirtied and now the new version is being
> > +         * swapped out. Get rid of the old.
> > +         */
> > +        if (old)
> > +                zswap_entry_free(old);
> > +
> > +        return true;
> > +}
> > +
> > +static void zswap_batch_compress_post_proc(
> > +        struct zswap_store_pipeline_state *zst)
> > +{
> > +        int nr_objcg_pages = 0, nr_pages = 0;
> > +        struct obj_cgroup *objcg = NULL;
> > +        size_t compressed_bytes = 0;
> > +        u8 i;
> > +
> > +        zswap_zpool_store_sub_batch(zst);
> > +
> > +        for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +                struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
> > +
> > +                if (sbp->error)
> > +                        continue;
> > +
> > +                if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
> > +                        zswap_store_propagate_errors(zst, sbp->batch_idx);
> > +                        continue;
> > +                }
> > +
> > +                /*
> > +                 * The entry is successfully compressed and stored in the tree,
> > +                 * there is no further possibility of failure. Grab refs to the
> > +                 * pool and objcg. These refs will be dropped by
> > +                 * zswap_entry_free() when the entry is removed from the tree.
> > +                 */
> > +                zswap_pool_get(zst->pool);
> > +                if (sbp->objcg)
> > +                        obj_cgroup_get(sbp->objcg);
> > +
> > +                /*
> > +                 * We finish initializing the entry while it's already in xarray.
> > +                 * This is safe because:
> > +                 *
> > +                 * 1. Concurrent stores and invalidations are excluded by folio
> > +                 *    lock.
> > +                 *
> > +                 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +                 *    The publishing order matters to prevent writeback from seeing
> > +                 *    an incoherent entry.
> > +                 */
> > +                sbp->entry->pool = zst->pool;
> > +                sbp->entry->swpentry = sbp->swpentry;
> > +                sbp->entry->objcg = sbp->objcg;
> > +                sbp->entry->referenced = true;
> > +                if (sbp->entry->length) {
> > +                        INIT_LIST_HEAD(&sbp->entry->lru);
> > +                        zswap_lru_add(&zswap_list_lru, sbp->entry);
> > +                }
> > +
> > +                if (!objcg && sbp->objcg) {
> > +                        objcg = sbp->objcg;
> > +                } else if (objcg && sbp->objcg && (objcg != sbp->objcg)) {
> > +                        obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +                        count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +                        compressed_bytes = 0;
> > +                        nr_objcg_pages = 0;
> > +                        objcg = sbp->objcg;
> > +                }
> > +
> > +                if (sbp->objcg) {
> > +                        compressed_bytes += sbp->entry->length;
> > +                        ++nr_objcg_pages;
> > +                }
> > +
> > +                ++nr_pages;
> > +        } /* for sub-batch pages. */
> > +
> > +        if (objcg) {
> > +                obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +                count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +        }
> > +
> > +        atomic_long_add(nr_pages, &zswap_stored_pages);
> > +        count_vm_events(ZSWPOUT, nr_pages);
> > +}
> > +
> > +static void zswap_store_sub_batch(struct zswap_store_pipeline_state *zst)
> > +{
> > +        u8 i;
> > +
> > +        for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +                zst->comp_dsts[i] = zst->acomp_ctx->buffers[i];
> > +                zst->comp_dlens[i] = PAGE_SIZE;
> > +        } /* for sub-batch pages. */
> > +
> > +        /*
> > +         * Batch compress sub-batch "N". If IAA is the compressor, the
> > +         * hardware will compress multiple pages in parallel.
> > +         */
> > +        zswap_compress_batch(zst);
> > +
> > +        zswap_batch_compress_post_proc(zst);
>
> The control flow here is a mess. Keep loops over the same batch at the
> same function level. IOW, pull the nr_comp_pages loop out of
> zswap_batch_compress_post_proc() and call the function from the loop.
I see. Got it.
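In other words, the caller would own the per-page loop and the helper would
finish one page, roughly along these lines (zswap_store_finish_page() is
just a placeholder name, not something from the patch):

        zswap_compress_batch(zst);
        zswap_zpool_store_sub_batch(zst);

        for (i = 0; i < zst->nr_comp_pages; ++i) {
                struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];

                if (sbp->error)
                        continue;

                /* xarray store, LRU add, refs and stats for this one page. */
                if (!zswap_store_finish_page(zst, sbp))
                        zswap_store_propagate_errors(zst, sbp->batch_idx);
        }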
>
> Also give it a more descriptive name. If that's hard to do, then
> you're probably doing too many different things in it. Create
> functions for a specific purpose, don't carve up sequences at
> arbitrary points.
>
> My impression after trying to read this is that the existing
> zswap_store() sequence could be a subset of the batched store, where
> you can reuse most code to get the pool, charge the cgroup, allocate
> entries, store entries, bump the stats etc. for both cases. Alas, your
> naming choices make it a bit difficult to be sure.
Apologies for the naming choices; I will fix them. As I tried to explain
in the commit log, my goal was to minimize failure points, since each
failure point requires unwinding state, which adds latency. Towards this
goal, I allocate all entries upfront and fail early, so that there is less
state to unwind later; the upfront work done for the batch (e.g.
zswap_alloc_entries()) is needed in any case.
This is where the trade-offs of treating the existing zswap_store()
sequence as a subset of the batched store are not very clear to me. I
tried to optimize the batched store for batching, while following the
logical structure of zswap_store().
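For context, the fail-early part is essentially this (the exact signature
is only illustrative):

        /*
         * Allocate all zswap_entrys for the folio before any zpool,
         * xarray or LRU state is created, so this failure path has
         * nothing to unwind.
         */
        if (!zswap_alloc_entries(folio, zst))
                return false;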
>
> Please explore this direction. Don't worry about the CONFIG symbol for
> now, we can still look at this later.
Definitely, I will think some more about this.
>
> Right now, it's basically
>
>         if (special case)
>                 lots of duplicative code in slightly different order
>         regular store sequence
>
> and that isn't going to be maintainable.
>
> Look for a high-level sequence that makes sense for both cases. E.g.:
>
>         if (!zswap_enabled)
>                 goto check_old;
>
>         get objcg
>
>         check limits
>
>         allocate memcg list lru
>
>         for each batch {
>                 for each entry {
>                         allocate entry
>                         acquire objcg ref
>                         acquire pool ref
>                 }
>                 compress
>                 for each entry {
>                         store in tree
>                         add to lru
>                         bump stats and counters
>                 }
>         }
>
>         put objcg
>
>         return true;
>
> check_error:
>         ...
>
> and then set up the two loops such that they also make sense when the
> folio is just a single page.
Thanks for this suggestion! I will explore this kind of structure. I hope
the explanations above clarify why I pursued the existing batching
structure. One other thing I wanted to add is the "future proofing" you
alluded to earlier (which I will fix): many of my design choices were
motivated by minimizing latency gaps in the batch compress pipeline (e.g.
from unwinding state on errors) when a reclaim batch of any-order folios
is potentially sent to zswap.
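To make the direction concrete, here is a rough sketch of the structure I
will try, following your outline (the helper names below are placeholders,
and the single-page case would just be a batch of one):

        for (start = 0; start < nr_pages; start += batch_size) {
                /* per entry: allocate entry, acquire objcg + pool refs */
                if (!zswap_store_prepare_batch(folio, zst, start, batch_size))
                        goto check_error;

                /* one call per sub-batch; IAA compresses the pages in parallel */
                zswap_compress_batch(zst);

                /* per entry: store in tree, add to LRU, bump stats and counters */
                for (i = 0; i < batch_size; i++) {
                        if (!zswap_store_finish_page(zst, start + i))
                                goto check_error;
                }
        }

With batch_size == 1 and a software compressor, this should reduce to
essentially the current zswap_store() sequence.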
Thanks again,
Kanchana