Message-ID: <CAKEwX=MunYMKQXsV58vBXROKnJFDhViCpQgC7RnrLExa_U=n2g@mail.gmail.com>
Date:   Fri, 3 Nov 2023 12:24:27 -0700
From:   Nhat Pham <nphamcs@...il.com>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     akpm@...ux-foundation.org, tj@...nel.org, lizefan.x@...edance.com,
        hannes@...xchg.org, cerasuolodomenico@...il.com,
        sjenning@...hat.com, ddstreet@...e.org, vitaly.wool@...sulko.com,
        mhocko@...nel.org, roman.gushchin@...ux.dev, shakeelb@...gle.com,
        muchun.song@...ux.dev, hughd@...gle.com, corbet@....net,
        konrad.wilk@...cle.com, senozhatsky@...omium.org, rppt@...nel.org,
        linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        david@...t.cz
Subject: Re: [RFC PATCH v3] zswap: memcontrol: implement zswap writeback disabling

On Thu, Nov 2, 2023 at 6:13 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Thu, Nov 2, 2023 at 4:42 PM Nhat Pham <nphamcs@...il.com> wrote:
> >
> > During our experiment with zswap, we sometimes observe swap IOs due to
> > occasional zswap store failures and writebacks-to-swap. These swapping
> > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > to save memory and improve performance where possible.
> >
> > This patch adds the option to disable this behavior entirely: do not
> > write back to the backing swap device when a zswap store attempt fails,
> > and do not write pages in the zswap pool back to the backing swap
> > device (both when the pool is full, and when the new zswap shrinker is
> > called).
> >
> > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > cgroup file. By default, writeback to the swap device is enabled, which is
> > the previous behavior.
> >
> > Note that this is subtly different from setting memory.swap.max to 0, as
> > it still allows for pages to be stored in the zswap pool (which itself
> > consumes swap space in its current form).
> >
> > Suggested-by: Johannes Weiner <hannes@...xchg.org>
> > Signed-off-by: Nhat Pham <nphamcs@...il.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 11 +++++++
> >  Documentation/admin-guide/mm/zswap.rst  |  6 ++++
> >  include/linux/memcontrol.h              | 12 ++++++++
> >  include/linux/zswap.h                   |  6 ++++
> >  mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
> >  mm/page_io.c                            |  6 ++++
> >  mm/shmem.c                              |  3 +-
> >  mm/zswap.c                              | 14 +++++++++
> >  8 files changed, 94 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 606b2e0eac4b..18c4171392ea 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back.
> >         limit, it will refuse to take any more stores before existing
> >         entries fault back in or are written out to disk.
> >
> > +  memory.zswap.writeback
> > +       A read-write single value file which exists on non-root
> > +       cgroups.  The default value is "1".
> > +
> > +       When this is set to 0, all swapping attempts to swapping devices
> > +       are disabled. This included both zswap writebacks, and swapping due
> > +       to zswap store failure.
> > +
> > +       Note that this is subtly different from setting memory.swap.max to
> > +       0, as it still allows for pages to be written to the zswap pool.
> > +
> >    memory.pressure
> >         A read-only nested-keyed file.
> >
> > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> > index 522ae22ccb84..b987e58edb70 100644
> > --- a/Documentation/admin-guide/mm/zswap.rst
> > +++ b/Documentation/admin-guide/mm/zswap.rst
> > @@ -153,6 +153,12 @@ attribute, e. g.::
> >
> >  Setting this parameter to 100 will disable the hysteresis.
> >
> > +Some users cannot tolerate the swapping that comes with zswap store failures
> > +and zswap writebacks. Swapping can be disabled entirely (without disabling
> > +zswap itself) on a cgroup-basis as follows:
> > +
> > +       echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
> > +
> >  When there is a sizable amount of cold memory residing in the zswap pool, it
> >  can be advantageous to proactively write these cold pages to swap and reclaim
> >  the memory for other use cases. By default, the zswap shrinker is disabled.
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 95f6c9e60ed1..e51eafdf2a15 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -219,6 +219,12 @@ struct mem_cgroup {
> >
> >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> >         unsigned long zswap_max;
> > +
> > +       /*
> > +        * Prevent pages from this memcg from being written back from zswap to
> > +        * swap, and from being swapped out on zswap store failures.
> > +        */
> > +       bool zswap_writeback;
> >  #endif
> >
> >         unsigned long soft_limit;
> > @@ -1931,6 +1937,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> >  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> >  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> >  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
> >  #else
> >  static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> >  {
> > @@ -1944,6 +1951,11 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
> >                                              size_t size)
> >  {
> >  }
> > +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > +{
> > +       /* if zswap is disabled, do not block pages going to the swapping device */
> > +       return true;
> > +}
> >  #endif
> >
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index cbd373ba88d2..b4997e27a74b 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -35,6 +35,7 @@ void zswap_swapoff(int type);
> >  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> >  void zswap_lruvec_state_init(struct lruvec *lruvec);
> >  void zswap_lruvec_swapin(struct page *page);
> > +bool is_zswap_enabled(void);
> >  #else
> >
> >  struct zswap_lruvec_state {};
> > @@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {}
> >  static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> >  static inline void zswap_lruvec_init(struct lruvec *lruvec) {}
> >  static inline void zswap_lruvec_swapin(struct page *page) {}
> > +
> > +static inline bool is_zswap_enabled(void)
> > +{
> > +       return false;
> > +}
> >  #endif
> >
> >  #endif /* _LINUX_ZSWAP_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e43b5aba8efc..8a6aadcc103c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> >         memcg->zswap_max = PAGE_COUNTER_MAX;
> > +       WRITE_ONCE(memcg->zswap_writeback, true);
>
> Generally LGTM, just one question.
>
> Would it be more convenient if the initial value is inherited from the
> parent (the root starts with true)?
>
> I can see this being useful if we want to set it to false on the
> entire machine or on a parent cgroup, we can set it before creating
> any children instead of setting it to 0 every time we create a new
> cgroup.

I'm not 100% sure about the benefit, and I don't have a strong opinion one
way or the other, but this sounds like a nice-to-have to me, and a relatively
low-cost one (both in effort and at runtime).

Propagating the change every time we modify the memory.zswap.writeback
value of an ancestor might be prone to data races (and costly, depending on
how big the cgroup subtree is), but this is just a one-time, per-cgroup
propagation (at cgroup creation time).
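
To make the semantics concrete, here is a rough userspace sketch of what
I have in mind. It is only a model, not kernel code: struct memcg_model and
memcg_model_init() are made-up names for illustration. In the actual patch,
I assume the equivalent change would sit next to the WRITE_ONCE() in
mem_cgroup_css_alloc(), using the parent derived from parent_css.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; only the fields needed here. */
struct memcg_model {
	struct memcg_model *parent;	/* NULL for the root cgroup */
	bool zswap_writeback;
};

/*
 * Creation-time inheritance: the new cgroup copies its parent's current
 * value exactly once. The root (no parent) starts with writeback enabled.
 * Later writes to the parent are deliberately NOT propagated to children.
 */
static void memcg_model_init(struct memcg_model *memcg,
			     struct memcg_model *parent)
{
	memcg->parent = parent;
	memcg->zswap_writeback = parent ? parent->zswap_writeback : true;
}
```

With this, a cgroup created under a parent that has writeback disabled
starts out disabled too, and flipping the parent afterwards leaves existing
children untouched, so there is no subtree walk to worry about.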

Can anyone come up with a failure case for this change, or a reason why it
might be a bad idea?

Thanks for the suggestion, Yosry!
