linux-kernel - Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkZJttvpYs4mgjL3pt8-jkX0fnWRJP7hVBZmm=i_Ef3Abg@mail.gmail.com>
Date:   Tue, 30 May 2023 09:52:36 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Nhat Pham <nphamcs@...il.com>
Cc:     akpm@...ux-foundation.org, hannes@...xchg.org,
        cerasuolodomenico@...il.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, sjenning@...hat.com,
        ddstreet@...e.org, vitaly.wool@...sulko.com, kernel-team@...a.com
Subject: Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0

On Tue, May 30, 2023 at 9:22 AM Nhat Pham <nphamcs@...il.com> wrote:
>
> Before storing a page, zswap first checks if the number of stored pages
> exceeds the limit specified by memory.zswap.max, for each cgroup in the
> hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> triggered and short-circuits the store attempt.
>
> However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> will allow future store attempts from processes in this cgroup to
> succeed. Furthermore, this create a pathological behavior in a system
> where some cgroups have memory.zswap.max = 0 and some do not: the
> processes in the former cgroups, under memory pressure, will evict pages
> stored by the latter continually, until the need for swap ceases or the
> pool becomes empty.
>
> As a result of this, we observe a disproportionate amount of zswap
> writeback and a perpetually small zswap pool in our experiments, even
> though the pool limit is never hit.
>
> This patch fixes the issue by rejecting zswap store attempt without
> shrinking the pool when memory.zswap.max is 0.
>
> Fixes: f4840ccfca25 ("zswap: memcg accounting")
> Signed-off-by: Nhat Pham <nphamcs@...il.com>
> ---
>  include/linux/memcontrol.h | 6 +++---
>  mm/memcontrol.c            | 8 ++++----
>  mm/zswap.c                 | 9 +++++++--
>  3 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 222d7370134c..507bed3a28b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>  #endif /* CONFIG_MEMCG_KMEM */
>
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
>  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
>  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
>  #else
> -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  {
> -       return true;
> +       return 0;
>  }
>  static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
>                                            size_t size)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4b27e245a055..09aad0e6f2ea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
>   * spending cycles on compression when there is already no room left
>   * or zswap is disabled altogether somewhere in the hierarchy.
>   */
> -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  {
>         struct mem_cgroup *memcg, *original_memcg;
> -       bool ret = true;
> +       int ret = 0;
>
>         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>                 return true;
> @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>                 if (max == PAGE_COUNTER_MAX)
>                         continue;
>                 if (max == 0) {
> -                       ret = false;
> +                       ret = -ENODEV;
>                         break;
>                 }
>
> @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>                 pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
>                 if (pages < max)
>                         continue;
> -               ret = false;
> +               ret = -ENOMEM;
>                 break;
>         }
>         mem_cgroup_put(original_memcg);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..7b13dc865438 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>         }
>
>         objcg = get_obj_cgroup_from_page(page);
> -       if (objcg && !obj_cgroup_may_zswap(objcg))
> -               goto shrink;
> +       if (objcg) {
> +               ret = obj_cgroup_may_zswap(objcg);
> +               if (ret == -ENODEV)
> +                       goto reject;
> +               if (ret == -ENOMEM)
> +                       goto shrink;
> +       }

I wonder if we should just make this:

if (objcg && !obj_cgroup_may_zswap(objcg))
        goto reject;

Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
zswap pool will only help if we happen to writeback a page from the
same memcg that hit its limit. Keep in mind that we will only
writeback one page every time we observe that the limit is hit (even
with Domenico's patch, because zswap_can_accept() should be true).

On a system with a handful of memcgs,
it seems likely that we wrongfully writeback pages from other memcgs
because of this. Achieving nothing for this memcg, while hurting
others. OTOH, without invoking writeback when the limit is hit, the
memcg will just not be able to use zswap until some pages are
faulted back in or invalidated.

I am not sure which is better, just thinking out loud.

Seems like this can be solved by having per-memcg LRUs, or at least
providing an argument to the shrinker of which memcg to reclaim from.
This would only be possible when the LRU is moved to zswap.


>
>         /* reclaim space if needed */
>         if (zswap_is_full()) {
> --
> 2.34.1
>
>