linux-kernel - Re: [PATCH v2] mm: zswap: shrink until can accept

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJD7tkY55Z9n7Ue-4+a691t4YJAs+0e7gEZGocF7cp197gL+Dg@mail.gmail.com>
Date:   Fri, 26 May 2023 11:15:31 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Domenico Cerasuolo <cerasuolodomenico@...il.com>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        sjenning@...hat.com, ddstreet@...e.org, vitaly.wool@...sulko.com,
        kernel-team@...com
Subject: Re: [PATCH v2] mm: zswap: shrink until can accept

On Fri, May 26, 2023 at 11:10 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Fri, May 26, 2023 at 07:39:55PM +0200, Domenico Cerasuolo wrote:
> > This update addresses an issue with the zswap reclaim mechanism, which
> > hinders the efficient offloading of cold pages to disk, thereby
> > compromising the preservation of the LRU order and consequently
> > diminishing, if not inverting, its performance benefits.
> >
> > The functioning of the zswap shrink worker was found to be inadequate,
> > as shown by basic benchmark test. For the test, a kernel build was
> > utilized as a reference, with its memory confined to 1G via a cgroup and
> > a 5G swap file provided. The results are presented below, these are
> > averages of three runs without the use of zswap:
> >
> > real 46m26s
> > user 35m4s
> > sys 7m37s
> >
> > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > system), the results changed to:
> >
> > real 56m4s
> > user 35m13s
> > sys 8m43s
> >
> > written_back_pages: 18
> > reject_reclaim_fail: 0
> > pool_limit_hit:1478
> >
> > Besides the evident regression, one thing to notice from this data is
> > the extremely low number of written_back_pages and pool_limit_hit.
> >
> > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > when zswap is completely full, doesn't account for a particular
> > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > still above the acceptance threshold. Once we include the rejections due
> > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > 1478 to a significant 21578266.
> >
> > Zswap is stuck in an undesirable state where it rejects pages because
> > it's above the acceptance threshold, yet fails to attempt memory
> > reclaimation. This happens because the shrink work is only queued when
> > zswap_frontswap_store detects that it's full and the work itself only
> > reclaims one page per run.
> >
> > This state results in hot pages getting written directly to disk,
> > while cold ones remain memory, waiting only to be invalidated. The LRU
> > order is completely broken and zswap ends up being just an overhead
> > without providing any benefits.
> >
> > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > pages until the acceptance threshold is met and b) the task is also
> > enqueued when zswap is not full but still above the threshold.
> >
> > Testing this suggested update showed much better numbers:
> >
> > real 36m37s
> > user 35m8s
> > sys 9m32s
> >
> > written_back_pages: 10459423
> > reject_reclaim_fail: 12896
> > pool_limit_hit: 75653
> >
> > V2:
> > - loop against == -EAGAIN rather than != -EINVAL and also break the loop
> > on MAX_RECLAIM_RETRIES (thanks Yosry)
> > - cond_resched() to ensure that the loop doesn't burn the cpu (thanks
> > Vitaly)
> >
> > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@...il.com>
> > ---
> >  mm/zswap.c | 15 ++++++++++++---
> >  1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59da2a415fbb..f953dceaab34 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/workqueue.h>
> >
> >  #include "swap.h"
> > +#include "internal.h"
> >
> >  /*********************************
> >  * statistics
> > @@ -587,9 +588,17 @@ static void shrink_worker(struct work_struct *w)
> >  {
> >       struct zswap_pool *pool = container_of(w, typeof(*pool),
> >                                               shrink_work);
> > +     int ret, failures = 0;
> >
> > -     if (zpool_shrink(pool->zpool, 1, NULL))
> > -             zswap_reject_reclaim_fail++;
> > +     do {
> > +             ret = zpool_shrink(pool->zpool, 1, NULL);
> > +             if (ret) {
> > +                     zswap_reject_reclaim_fail++;
> > +                     failures++;
> > +             }
> > +             cond_resched();
> > +     } while (!zswap_can_accept() && ret == -EAGAIN &&
> > +              failures < MAX_RECLAIM_RETRIES);
>
> It should also loop on !ret, right?
>
> AFAIU Yosry's suggestion was that instead of breaking only on -EINVAL,
> it should break on all failures but -EAGAIN. But it should still keep
> going if the shrink was successful and the pool cannot accept yet.
>
> Basically, something like this?
>
>         do {
>                 ret = zpool_shrink(pool->zpool, 1, NULL);
>                 if (ret) {
>                         zswap_reject_reclaim_fail++;
>                         if (ret != -EAGAIN)
>                                 break;
>                         if (++failures == MAX_RECLAIM_RETRIES)
>                                 break;
>                 }
>                 cond_resched();
>         } while (!zswap_can_accept());

Yes, that's what I meant. Otherwise if shrink is successful we end up
doing 1 page only, which is exactly what we are trying to avoid here.