linux-kernel - Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim


Message-ID: <CAHS8izOuT_-p-N1xPApi+BPJQ+P--2YVSUeiWBROGvGinN0vcg@mail.gmail.com>
Date:   Tue, 13 Dec 2022 11:29:45 -0800
From:   Mina Almasry <almasrymina@...gle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        "Huang, Ying" <ying.huang@...el.com>, Tejun Heo <tj@...nel.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Jonathan Corbet <corbet@....net>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Yosry Ahmed <yosryahmed@...gle.com>, weixugc@...gle.com,
        fvdl@...gle.com, bagasdotme@...il.com, cgroups@...r.kernel.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim

On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@...e.com> wrote:
>
> On Tue 13-12-22 14:30:40, Johannes Weiner wrote:
> > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> [...]
> > > After this discussion, I think the solution may be to use different
> > > interfaces for "proactive demote" and "proactive reclaim".  That is,
> > > reconsider "memory.demote".  In this way, we will always uncharge the
> > > cgroup for "memory.reclaim".  This avoids the possible confusion there.
> > > And, because demotion is considered aging, we don't need to disable
> > > demotion for "memory.reclaim", just don't count it.
> >
> > Hm, so in summary:
> >
> > 1) memory.reclaim would demote and reclaim like today, but it would
> >    change to only count reclaimed pages against the goal.
> >
> > 2) memory.demote would only demote.
> >

If the above 2 points are agreeable then yes, this sounds good to me
and does address our use case.
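
To make the accounting in point 1 concrete, here is a rough sketch. It
is a toy Python model, not kernel code: the function name, the result
type and the per-batch (reclaimed, demoted) inputs are all made up for
illustration. The only point is that demoted pages do not count
against the goal.

```python
from dataclasses import dataclass


@dataclass
class ReclaimResult:
    reclaimed: int  # pages actually freed (uncharged from the cgroup)
    demoted: int    # pages moved to a lower tier (still charged)


def proactive_reclaim(goal_pages, batches):
    """Toy model of the proposed memory.reclaim accounting.

    `batches` is a hypothetical list of (reclaimed, demoted) page counts,
    one per scan pass. Only reclaimed pages count toward `goal_pages`.
    """
    reclaimed = 0
    demoted = 0
    for batch_reclaimed, batch_demoted in batches:
        if reclaimed >= goal_pages:
            break  # goal met by reclaim alone; demotion never counted
        reclaimed += batch_reclaimed
        demoted += batch_demoted
    return ReclaimResult(reclaimed, demoted)
```

With a goal of 100 pages and a first pass of (40, 60), a model that
counted demotion would stop after that pass; this one keeps going
until reclaim alone reaches the goal.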

> >    a) What if the demotion targets are full? Would it reclaim or fail?
> >

Wei will chime in if he disagrees, but I think we _require_ that it
fail rather than fall back to reclaim. The interface is asking for
demotion, and is called memory.demote. For such an interface to fall
back to reclaim would be very confusing to userspace and may trigger
reclaim on a high priority job that we want to shield from proactive
reclaim.
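
A rough sketch of the behavior I mean, again as a toy Python model
rather than kernel code. The function and its parameters are
hypothetical, and the error code is an assumption: nothing in this
thread settles it, and EAGAIN is used here only by analogy with what
memory.reclaim returns today when it reclaims less than requested.

```python
import errno


def demote(pages_requested, free_pages_on_targets):
    """Toy model of the proposed memory.demote: demote or fail.

    Returns the number of pages demoted. If the demotion targets cannot
    hold the request, raise instead of reclaiming anything.
    """
    if pages_requested > free_pages_on_targets:
        # No fallback to reclaim: the caller asked for demotion only.
        # EAGAIN is an illustrative choice, not a decided interface.
        raise OSError(errno.EAGAIN, "demotion targets are full")
    return pages_requested
```

The caller sees a failure and can decide for itself whether to write
to memory.reclaim instead.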

> > 3) Would memory.reclaim and memory.demote still need nodemasks?

memory.demote will need a nodemask, for sure. Today the nodemask would
be useful if there is a specific node in the top tier that is
overloaded and we want to reduce the pressure by demoting. In the
future there will be N tiers and the nodemask says which tier to
demote from.

I don't think memory.reclaim would need a nodemask anymore. At least,
I no longer see a use for it in our case.

> >    Would
> >    they return -EINVAL if a) memory.reclaim gets passed only toptier
> >    nodes or b) memory.demote gets passed any lasttier nodes?
>

Honestly it would be great if memory.reclaim could force reclaim from
top tier nodes. It breaks the aging pipeline, yes, but if the user is
specifically asking for that because they decided it's a good idea in
their use case, then the kernel should comply IMO. Not a strict
requirement for us. Wei will chime in if he disagrees.

memory.demote returning -EINVAL for lasttier nodes makes sense to me.
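
As a sketch of that check (a toy Python model, not kernel code; the
function name and the way tiers are passed in are made up for
illustration):

```python
import errno


def validate_demote_nodes(nodes, lasttier_nodes):
    """Toy model of the nodemask check discussed for memory.demote.

    Returns 0 if the request is acceptable, or -EINVAL if any requested
    node is in the last tier (there is nowhere to demote to from there).
    """
    if set(nodes) & set(lasttier_nodes):
        return -errno.EINVAL
    return 0
```

For example, with nodes 2 and 3 in the last tier, a request for nodes
{0, 1} passes and a request for nodes {1, 2} is rejected.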

> I would also add
> 4) Do we want to allow to control the demotion path (e.g. which node to
>    demote from and to) and how to achieve that?

We care deeply about specifying which node to demote _from_. That
would be a node that is approaching memory pressure and that we want
proactive savings from. So far I haven't seen any reason to control
which nodes to demote _to_. The kernel deciding that based on the
aging pipeline and the node distances sounds good to me. Obviously
someone else may find that useful.
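
Roughly, the split I have in mind looks like the sketch below: the
user names the source node, and the target is picked for them. This is
a toy Python model, not how the kernel actually selects targets; the
function, the distance table layout and the tie-break by node id are
all assumptions for illustration.

```python
def pick_demotion_target(src, lower_tier_nodes, distance):
    """Toy model: pick the nearest lower-tier node for a source node.

    `distance` is a hypothetical nested dict, distance[src][dst], in the
    spirit of the NUMA distance table. Returns None if there is no
    lower tier to demote to.
    """
    candidates = [n for n in lower_tier_nodes if n != src]
    if not candidates:
        return None
    # Nearest node wins; ties are broken by the lower node id.
    return min(candidates, key=lambda n: (distance[src][n], n))
```

Userspace only supplies `src` here; everything about the destination
stays a kernel policy decision.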

> 5) Is the demotion api restricted to multi-tier systems or any numa
>    configuration allowed as well?
>

Demotion will of course not work on single-tier systems. The
interface may return a failure on such systems or not be available
at all.

> --
> Michal Hocko
> SUSE Labs