linux-kernel - Re: [PATCH] memcg: allow exiting tasks to write back data to swap


Message-ID: <CAJD7tkZ9gSxdPUCgz_NaHSDPTC+HEhxNRbinp619sNSshScJ0A@mail.gmail.com>
Date: Wed, 11 Dec 2024 09:30:24 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Rik van Riel <riel@...riel.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Shakeel Butt <shakeel.butt@...ux.dev>, 
	Muchun Song <muchun.song@...ux.dev>, Andrew Morton <akpm@...ux-foundation.org>, 
	cgroups@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	kernel-team@...a.com, Nhat Pham <nphamcs@...il.com>
Subject: Re: [PATCH] memcg: allow exiting tasks to write back data to swap

On Wed, Dec 11, 2024 at 9:20 AM Rik van Riel <riel@...riel.com> wrote:
>
> On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@...riel.com> wrote:
> > >
> > > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@...riel.com> wrote:
> > > > >
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > > >         if (!zswap_is_enabled())
> > > > >                 return true;
> > > > >
> > > > > +       /*
> > > > > +        * Always allow exiting tasks to push data to swap. A process in
> > > > > +        * the middle of exit cannot get OOM killed, but may need to push
> > > > > +        * uncompressible data to swap in order to get the cgroup memory
> > > > > +        * use below the limit, and make progress with the exit.
> > > > > +        */
> > > > > +       if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > > > +               return true;
> > > > > +
> > > >
> > > > I have a few questions:
> > > > (a) If the task is being OOM killed it should be able to charge memory
> > > > beyond memory.max, so why do we need to get the usage down below the
> > > > limit?
> > > >
> > > If it is a kernel directed memcg OOM kill, that is
> > > true.
> > >
> > > However, if the exit comes from somewhere else,
> > > like a userspace oomd kill, we might not hit that
> > > code path.
> >
> > Why do we treat dying tasks differently based on the source of the
> > kill?
> >
> Are you saying we should fail allocations for
> every dying task, and add a check for PF_EXITING
> in here?

I am asking, not really suggesting anything :)

Does it matter from the kernel perspective if the task is dying due to
a kernel OOM kill or a userspace SIGKILL?

>
>
>         if (unlikely(task_in_memcg_oom(current)))
>                 goto nomem;
>
>
> > > However, we don't know until the attempted zswap write
> > > whether the memory is compressible, and whether doing
> > > a bunch of zswap writes will help us bring our memcg
> > > down below its memory.max limit.
> >
> > If we are at memory.max (or memory.zswap.max), we can't compress pages
> > into zswap anyway, regardless of their compressibility.
> >
> Wait, this is news to me.
>
> This seems like something we should fix, rather
> than live with, since compressing the data to
> a smaller size could bring us below memory.max.
>
> Is this "cannot compress when at memory.max"
> behavior intentional, or just a side effect of
> how things happen to be?
>
> Won't the allocations made from zswap_store
> ignore the memory.max limit because PF_MEMALLOC
> is set?

My bad, obj_cgroup_may_zswap() only checks the zswap limit, not
memory.max. Please ignore this.
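
The distinction can be illustrated with a minimal userspace sketch. It is
not the kernel implementation: struct toy_memcg, TOY_LIMIT_UNSET and
toy_may_zswap() are hypothetical stand-ins, and the hierarchy walk is
simplified. It only shows that a check of this shape compares zswap usage
against the zswap limit and never consults memory.max.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_LIMIT_UNSET ((unsigned long)-1)   /* models "max": no limit set */

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct toy_memcg {
	struct toy_memcg *parent;   /* NULL for the root cgroup */
	unsigned long zswap_pages;  /* pages currently charged to zswap */
	unsigned long zswap_max;    /* models memory.zswap.max, in pages */
	unsigned long mem_pages;    /* total charged pages */
	unsigned long mem_max;      /* models memory.max, in pages */
};

/*
 * Walk from the given cgroup up to (but not including) the root and
 * refuse the store if any level is at or above its zswap limit.
 * mem_pages and mem_max are deliberately never read here.
 */
static bool toy_may_zswap(const struct toy_memcg *memcg)
{
	assert(memcg != NULL);

	for (; memcg->parent != NULL; memcg = memcg->parent) {
		if (memcg->zswap_max == TOY_LIMIT_UNSET)
			continue;               /* no zswap limit at this level */
		if (memcg->zswap_pages >= memcg->zswap_max)
			return false;           /* zswap limit hit */
	}
	return true;
}
```

In this model a cgroup sitting exactly at its memory.max can still store
into zswap as long as its zswap usage is below the zswap limit.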

The scenario I described where we scan the LRUs needlessly is if the
*zswap limit* is hit, and writeback is disabled. I am guessing this is
not the case you're running into.
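
A rough sketch of that scenario, under simplifying assumptions: the enum
and the toy_swap_out() / toy_reclaim_scan() helpers are hypothetical and
only model the decision per page, not the real reclaim path. A page goes
to zswap if the zswap limit allows it and the data compresses, falls back
to the swap device if writeback is enabled, and otherwise stays in memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for one anonymous page during reclaim. */
enum toy_outcome { TOY_TO_ZSWAP, TOY_TO_SWAP, TOY_STUCK };

static enum toy_outcome toy_swap_out(bool zswap_limit_hit, bool compressible,
				     bool writeback_enabled)
{
	if (!zswap_limit_hit && compressible)
		return TOY_TO_ZSWAP;    /* stored compressed in zswap */
	if (writeback_enabled)
		return TOY_TO_SWAP;     /* written to the backing swap device */
	return TOY_STUCK;               /* nowhere to go; page stays resident */
}

/* Count how many of nr_pages identical pages one scan would reclaim. */
static unsigned long toy_reclaim_scan(unsigned long nr_pages,
				      bool zswap_limit_hit, bool compressible,
				      bool writeback_enabled)
{
	unsigned long reclaimed = 0;

	for (unsigned long i = 0; i < nr_pages; i++) {
		if (toy_swap_out(zswap_limit_hit, compressible,
				 writeback_enabled) != TOY_STUCK)
			reclaimed++;
	}
	return reclaimed;
}
```

With the zswap limit hit and writeback disabled, every page scanned ends
up in the "stuck" case, so the scan does work without freeing anything.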

So yeah my only outstanding question is the one above about handling
userspace OOM kills differently.

Thanks for bearing with me.