lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkY8iPBo99+1gdsSRMNDu4jkVKz8rb=W+xk9=GE0y=kSuw@mail.gmail.com>
Date:   Thu, 2 Nov 2023 13:27:24 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Nhat Pham <nphamcs@...il.com>
Cc:     akpm@...ux-foundation.org, tj@...nel.org, lizefan.x@...edance.com,
        hannes@...xchg.org, cerasuolodomenico@...il.com,
        sjenning@...hat.com, ddstreet@...e.org, vitaly.wool@...sulko.com,
        mhocko@...nel.org, roman.gushchin@...ux.dev, shakeelb@...gle.com,
        muchun.song@...ux.dev, hughd@...gle.com, corbet@....net,
        konrad.wilk@...cle.com, senozhatsky@...omium.org, rppt@...nel.org,
        linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        david@...t.cz
Subject: Re: [RFC PATCH v2] zswap: memcontrol: implement zswap writeback disabling

On Thu, Nov 2, 2023 at 1:02 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
>
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
>
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior.
>
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
>
> Suggested-by: Johannes Weiner <hannes@...xchg.org>
> Signed-off-by: Nhat Pham <nphamcs@...il.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 11 +++++++
>  Documentation/admin-guide/mm/zswap.rst  |  6 ++++
>  include/linux/memcontrol.h              | 17 +++++++++++
>  mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
>  mm/page_io.c                            |  6 ++++
>  mm/shmem.c                              |  3 +-
>  mm/zswap.c                              |  9 ++++++
>  7 files changed, 88 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 606b2e0eac4b..18c4171392ea 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back.
>         limit, it will refuse to take any more stores before existing
>         entries fault back in or are written out to disk.
>
> +  memory.zswap.writeback
> +       A read-write single value file which exists on non-root
> +       cgroups.  The default value is "1".
> +
> +       When this is set to 0, all swapping attempts to swapping devices
> +       are disabled. This included both zswap writebacks, and swapping due
> +       to zswap store failure.
> +
> +       Note that this is subtly different from setting memory.swap.max to
> +       0, as it still allows for pages to be written to the zswap pool.
> +
>    memory.pressure
>         A read-only nested-keyed file.
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 522ae22ccb84..b987e58edb70 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,12 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +Some users cannot tolerate the swapping that comes with zswap store failures
> +and zswap writebacks. Swapping can be disabled entirely (without disabling
> +zswap itself) on a cgroup-basis as follows:
> +
> +       echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
> +
>  When there is a sizable amount of cold memory residing in the zswap pool, it
>  can be advantageous to proactively write these cold pages to swap and reclaim
>  the memory for other use cases. By default, the zswap shrinker is disabled.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 95f6c9e60ed1..e3a3a06727dc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -219,6 +219,12 @@ struct mem_cgroup {
>
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         unsigned long zswap_max;
> +
> +       /*
> +        * Prevent pages from this memcg from being written back from zswap to
> +        * swap, and from being swapped out on zswap store failures.
> +        */
> +       bool zswap_writeback;
>  #endif
>
>         unsigned long soft_limit;
> @@ -1615,6 +1621,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  {
>         return 0;
>  }
> +
> +static inline bool mem_cgroup_swap_disk_enabled(struct mem_cgroup *memcg)
> +{
> +       return false;
> +}
> +

This seems to be a leftover from a prior version.

>  #endif /* CONFIG_MEMCG */
>
>  static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
> @@ -1931,6 +1943,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
>  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
>  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
>  #else
>  static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  {
> @@ -1944,6 +1957,10 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
>                                              size_t size)
>  {
>  }
> +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       return false;
> +}
>  #endif
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e43b5aba8efc..b68c613c23a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         memcg->zswap_max = PAGE_COUNTER_MAX;
> +       WRITE_ONCE(memcg->zswap_writeback, true);
>  #endif
>         page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
>         if (parent) {
> @@ -8177,6 +8178,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>         rcu_read_unlock();
>  }
>
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       return cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg
> +                       && READ_ONCE(memcg->zswap_writeback);
> +}
> +
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>                               struct cftype *cft)
>  {
> @@ -8209,6 +8216,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +static int zswap_writeback_show(struct seq_file *m, void *v)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +       seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
> +       return 0;
> +}
> +
> +static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
> +                               char *buf, size_t nbytes, loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       int zswap_writeback;
> +       ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
> +
> +       if (parse_ret)
> +               return parse_ret;
> +
> +       if (zswap_writeback != 0 && zswap_writeback != 1)
> +               return -EINVAL;
> +
> +       WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
> +       return nbytes;
> +}
> +
>  static struct cftype zswap_files[] = {
>         {
>                 .name = "zswap.current",
> @@ -8221,6 +8253,12 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_max_show,
>                 .write = zswap_max_write,
>         },
> +       {
> +               .name = "zswap.writeback",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = zswap_writeback_show,
> +               .write = zswap_writeback_write,
> +       },
>         { }     /* terminate */
>  };
>  #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..5e606f1aa2f6 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 folio_end_writeback(folio);
>                 return 0;
>         }
> +
> +       if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
> +               folio_mark_dirty(folio);
> +               return AOP_WRITEPAGE_ACTIVATE;
> +       }
> +

I am not a fan of this, because it will disable using disk swap if
"zswap_writeback" is disabled, even if zswap is disabled or the page
was never in zswap. The term zswap_writeback makes no sense here tbh.

I am still hoping someone else will suggest better semantics, because
honestly I can't think of anything. Perhaps something like
memory.swap.zswap_only or memory.swap.types which accepts a string
(e.g. "zswap"/"all",..).

Don't take my suggestions strongly because I am not very fond of them.

Can anyone else come back with better naming/semantics for "use zswap
but nothing else when swapping"?

>         __swap_writepage(&folio->page, wbc);
>         return 0;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index cab053831fea..e5044678de8b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>
>                 mutex_unlock(&shmem_swaplist_mutex);
>                 BUG_ON(folio_mapped(folio));
> -               swap_writepage(&folio->page, wbc);
> -               return 0;
> +               return swap_writepage(&folio->page, wbc);
>         }
>
>         mutex_unlock(&shmem_swaplist_mutex);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 260e01180ee0..42a478d1a21f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -590,6 +590,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>         struct zswap_pool *pool = shrinker->private_data;
>         bool encountered_page_in_swapcache = false;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(sc->memcg))
> +               return SHRINK_STOP;
> +
>         nr_protected =
>                 atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>         lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> @@ -620,6 +623,9 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>         unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return 0;
> +
>  #ifdef CONFIG_MEMCG_KMEM
>         cgroup_rstat_flush(memcg->css.cgroup);
>         nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> @@ -935,6 +941,9 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>         struct zswap_pool *pool;
>         int nid, shrunk = 0;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return -EINVAL;
> +
>         /*
>          * Skip zombies because their LRUs are reparented and we would be
>          * reclaiming from the parent instead of the dead memcg.
> --
> 2.34.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ