[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20250807235427.4743-1-sj@kernel.org>
Date: Thu, 7 Aug 2025 16:54:27 -0700
From: SeongJae Park <sj@...nel.org>
To: Nhat Pham <nphamcs@...il.com>
Cc: SeongJae Park <sj@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
David Hildenbrand <david@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Yosry Ahmed <yosry.ahmed@...ux.dev>,
kernel-team@...a.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
Takero Funaki <flintglass@...il.com>
Subject: Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
On Thu, 7 Aug 2025 16:03:54 -0700 Nhat Pham <nphamcs@...il.com> wrote:
> On Thu, Aug 7, 2025 at 11:16 AM SeongJae Park <sj@...nel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given page,
> > the page is swapped out to the backing swap device. This behavior
> > breaks the zswap's writeback LRU order, and hence users can experience
> > unexpected latency spikes. If the page is compressed without failure,
> > but results in a size of PAGE_SIZE, the LRU order is kept, but the
> > decompression overhead for loading the page back on the later access is
> > unnecessary.
>
> Maybe add the writeback disabled story - we added to get around this
> latency issue, but that solution was not satisfactory either: wasting
> cpu cycles retrying incompressible pages, especially when we're under
> reclaim/memory pressure, and we've reclaimed most if not all other
> sources.
>
> This problem was pointed out by Yosry:
>
> https://lore.kernel.org/all/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com/
Thank you for sharing the detailed context, I will add this history with the
link to the last paragraph of this section, or the 'Related Works' section.
>
> >
> > Keep the LRU order and optimize unnecessary decompression overheads in
> > those cases, by storing the original content as-is in zpool. The length
> > field of zswap_entry will be set appropriately, as PAGE_SIZE, Hence
> > whether it is saved as-is or not (whether decompression is unnecessary)
> > is identified by 'zswap_entry->length = PAGE_SIZE'.
> >
> > Because the uncompressed data is saved in zpool, same to the compressed
> > ones, this introduces no change in terms of memory management including
> > movability and migratability of involved pages.
> >
> > This change is also not increasing per zswap entry metadata overhead.
> > But as the number of incompressible pages increases, total zswap
> > metadata overhead is proportionally increased. The overhead should not
> > be problematic in usual cases, since the zswap metadata for single zswap
> > entry is much smaller than PAGE_SIZE, and in common zswap use cases
> > there should be a sufficient amount of compressible pages. Also it can
> > be mitigated by the zswap writeback.
> >
> > When the writeback is disabled, the additional overhead could be
> > problematic. For the case, keep the current behavior that just returns
> > the failure and let swap_writeout() put the page back to the active LRU
> > list in the case. It is known to be suboptimal when the incompressible
> > pages are cold, since the incompressible pages will continuously be
> > tried to be zswapped out, and burn CPU cycles for compression attempts
> > that will anyway fails. One imaginable solution for the problem is
> > reusing the swapped-out page and its struct page to store in the zswap
> > pool. But that's out of the scope of this patch.
> >
> > Tests
> > -----
[...]
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -976,8 +976,25 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > */
> > comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > dlen = acomp_ctx->req->dlen;
> > - if (comp_ret)
> > - goto unlock;
> > +
> > + /*
> > + * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > + * save the content as is without a compression, to keep the LRU order
> > + * of writebacks. If writeback is disabled, reject the page since it
> > + * only adds metadata overhead. swap_writeout() will put the page back
> > + * to the active LRU list in the case.
> > + */
> > + if (comp_ret || dlen >= PAGE_SIZE) {
> > + if (mem_cgroup_zswap_writeback_enabled(
> > + folio_memcg(page_folio(page)))) {
> > + comp_ret = 0;
> > + dlen = PAGE_SIZE;
> > + memcpy_from_page(dst, page, 0, dlen);
>
> I wonder if we can save one memcpy here. Would it be safe to map the page:
>
> dst = kmap_local_page(page);
Sounds good, I will do so in the next version.
>
> then, after we're done with storing (i.e after zpool_obj_write()), do:
>
> if (dlen = PAGE_SIZE)
> kunmap(dst);
>
> (don't forget to unmap the page in the failure paths too!)
Sure, I think the 'unlock' label would be appropriate part to add the unmap
code. I also got your correction of s/kunmap/kunmap_local() in the reply. I
will not miss that on the next version.
>
> > + } else {
> > + comp_ret = comp_ret ? : -EINVAL;
>
> Does this keep the value of comp_ret if comp_ret != 0 lol.
Yes.
> Syntax
> looks weird to me.
I agree it could look weird. I prefer keeping the code in a way more familiar
to ones who will keep maintain the file. So I will modify this as below, in
the next version.
comp_ret ? comp_ret : -EINVAL
>
> > + goto unlock;
> > + }
> > + }
>
> Also, can we fix the counter value?
>
> I assume we want:
>
> else if (comp_ret || dlen = PAGE_SIZE)
> zswap_reject_compress_fail++;
>
> or something like that.
I'm not very clearly getting your point.
I was thinking we should increase the counter if we "reject" the page (does not
save the content in the zpool) due to failing at compressing the page's content
into a size smaller than PAGE_SIZE. This patch implements the behavior.
Am I missing a mis-implementation of the behavior in this patch, or the
behavior is not what you think it should be? More elaboration of your point
would be helpful for me.
>
> And what happened to the incompressible page stored counter? :)
I don't get what counter you are asking about. Could you please elaborate?
>
>
> >
> > zpool = pool->zpool;
> > gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1012,6 +1029,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> > obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> >
> > + /* zswap entries of PAGE_SIZE are not compressed. */
> of length?
Thank you for this nice suggestion, I will change the comment so, in the next
version.
>
>
> > + if (entry->length = PAGE_SIZE) {
> > + memcpy_to_folio(folio, 0, obj, entry->length);
> > + zpool_obj_read_end(zpool, entry->handle, obj);
> > + acomp_ctx_put_unlock(acomp_ctx);
> > + return true;
> > + }
> > +
> > /*
> > * zpool_obj_read_begin() might return a kmap address of highmem when
> > * acomp_ctx->buffer is not used. However, sg_init_one() does not
> >
> > base-commit: 2ec534125ae474292175ae198483c545eac2161d
> > --
> > 2.39.5
>
Thank you for your nice review and comments :)
Thanks,
SJ
Powered by blists - more mailing lists