linux-kernel - Re: [External] Re: [PATCH] mm: zswap: fix the lack of page lru flag in zswap_writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKEwX=PC3C-PrWAH3XiYGyR4ujqBJQBBX6uRa2jXKCy9VMyRCQ@mail.gmail.com>
Date: Tue, 2 Jan 2024 15:27:03 -0800
From: Nhat Pham <nphamcs@...il.com>
To: Zhongkun He <hezhongkun.hzk@...edance.com>
Cc: akpm@...ux-foundation.org, hannes@...xchg.org, yosryahmed@...gle.com, 
	sjenning@...hat.com, ddstreet@...e.org, vitaly.wool@...sulko.com, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Chris Li <chrisl@...nel.org>
Subject: Re: [External] Re: [PATCH] mm: zswap: fix the lack of page lru flag
 in zswap_writeback_entry

On Tue, Jan 2, 2024 at 3:39 AM Zhongkun He <hezhongkun.hzk@...edance.com> wrote:
>
> On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham <nphamcs@...il.com> wrote:
> >
> > On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He
> > <hezhongkun.hzk@...edance.com> wrote:
> > >
> >
> > My apologies for the delayed response. I have a couple of questions.
> >
> > > The zswap_writeback_entry() will add a page to the swap cache, decompress
> > > the entry data into the page, and issue a bio write to write the page back
> > > to the swap device. Move the page to the tail of lru list  through
> > > SetPageReclaim(page) and folio_rotate_reclaimable().
> > >
> > > Currently, about half of the pages will fail to move to the tail of lru
> >
> > May I ask what's the downstream effect of this? i.e so what if it
> > fails? And yes, as Andrew pointed out, it'd be nice if the patch
> > changelog spells out any observable or measurable change from the
> > user's POV.
> >
>
> The swap cache page used to decompress zswap_entry should be
> moved  to the tail of the inactive list after end_writeback, We can release
> them in time.Just like the following code in zswap_writeback_entry().
>
>    /* move it to the tail of the inactive list after end_writeback */
>    SetPageReclaim(page);
>
> After the writeback is over, the function of
> folio_rotate_reclaimable() will fail
> because the page is not in the LRU list but in some of the cpu folio batches.
> Therefore we did not achieve the goal of setting SetPageReclaim(page), and
> the pages could not be free in time.
>
> > > list because there is no LRU flag in page which is not in the LRU list but
> > > the cpu_fbatches. So fix it.
> >
> > This sentence is a bit confusing to me. Does this mean the page
> > currently being processed for writeback is not in the LRU list
> > (!PageLRU(page)), but IN one of the cpu folio batches? Which makes
> > folio_rotate_reclaimable() fails on this page later on in the
> > _swap_writepage() path? (hence the necessity of lru_add_drain()?)
> >
>
> Yes, exactly.
>
> > Let me know if I'm misunderstanding the intention of this patch. I
> > know it's a bit pedantic, but spelling things out (ideally in the
> > changelog itself) will help the reviewers, as well as future
> > contributors who want to study the codebase and make changes to it.
> >
>
> Sorry,my bad.
>
> > >
> > > Signed-off-by: Zhongkun He <hezhongkun.hzk@...edance.com>
> >
> > Thanks and look forward to your response,
> > Nhat
> >
> > P/S: Have a nice holiday season and happy new year!
>
> Here are the steps and results of the performance test：
> 1:zswap+ zram (simplified model with on IO)
> 2:disabel zswap/parameters/same_filled_pages_enabled (stress have same pages)
> 3:stress --vm 1 --vm-bytes 2g --vm-hang 0   (2Gi anon pages)
> 4: In order to quickly release zswap_entry, I used the previous
>      patch (I will send it again later).
> https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@bytedance.com/
>
> Performance result：
>    reclaim 1Gi zswap_entry
>
> time echo 1 > writeback_time_threshold
> (will release the zswap_entry,  not been accessed for more than 1 seconds )
>
> Base                                 With this patch
> real    0m1.015s               real    0m1.043s
> user    0m0.000s              user    0m0.001s
> sys     0m1.013s               sys     0m1.040s
> So no obvious performance regression was found.

That's around 2.7% increase in real time, no? Admittedly, this
micro-benchmark is too small to conclude either way, but the data
doesn't seem to be in your favor.

I'm a bit concerned about the overhead here, given that with this
patch we will drain the per-cpu batch on every written-back entry.
That's quite a high frequency, especially since we're moving towards
more writeback (either with the new zswap shrinker, or your time
threshold-based writeback mechanism). For instance, there seems to be
some (local/per-cpu) locking involved, no? Could there be some form of
lock contentions there (especially since with the new shrinker, you
can have a lot of concurrent writeback contexts?)

Furthermore, note that a writeback from zswap to swap saves less
memory than a writeback from memory to swap, so the effect of the
extra overhead will be even more pronounced. That is, the amount of
extra work done (from this change) to save one unit of memory would be
even larger than if we call lru_add_drain() every time we swap out a
page (from memory -> swap). And I'm pretty sure we don't call
lru_add_drain() every time we swap out a page - I believe we call
lru_add_drain() every time we perform a shrink action. For e.g, in
shrink_inactive_list(). That's much coarser in granularity than here.

Also, IIUC, the more often we perform lru_add_drain(), the less
batching effect we will obtain. IOW, the overhead of maintaining the
batch will become higher relative to the performance gains from
batching.

Maybe I'm missing something - could you walk me through how
lru_add_drain() is fine here, from this POV? Thanks!

>
> After writeback, we perform the following steps to release the memory again
> echo 1g > memory.reclaim
>
> Base:
>                      total        used        recalim        total        used
> Mem:           38Gi       2.5Gi       ---->             38Gi       1.5Gi
> Swap:         5.0Gi       1.0Gi       ---->              5Gi        1.5Gi
> used memory  -1G   swap +0.5g
> It means that  half of the pages are failed to move to the tail of lru list,
> So we need to release an additional 0.5Gi anon pages to swap space.
>
> With this patch:
>                      total        used        recalim        total        used
> Mem:           38Gi       2.6Gi       ---->             38Gi       1.6Gi
> Swap:         5.0Gi       1.0Gi       ---->              5Gi        1Gi
>
> used memory  -1Gi,  swap +0Gi
> It means that we release all the pages which have been add to the tail of
> lru list in zswap_writeback_entry() and folio_rotate_reclaimable().
>

OTOH, this suggests that we're onto something. Swap usage seems to
decrease quite a bit. Sounds like a real problem that this patch is
tackling.
(Please add this benchmark result to future changelog. It'll help
demonstrate the problem).

I'm inclined to ack this patch, but it'd be nice if you can assuage my
concerns above (with some justification and/or larger benchmark).

(Or perhaps, we have to drain, but less frequently/higher up the stack?)

Thanks,
Nhat

>
> Thanks for your time Nhat and Andrew. Happy New Year!