lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 12 Dec 2022 23:48:57 -0800
From:   Wei Xu <weixugc@...gle.com>
To:     "Huang, Ying" <ying.huang@...el.com>
Cc:     Mina Almasry <almasrymina@...gle.com>,
        Michal Hocko <mhocko@...e.com>, Tejun Heo <tj@...nel.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Yosry Ahmed <yosryahmed@...gle.com>, fvdl@...gle.com,
        bagasdotme@...il.com, cgroups@...r.kernel.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim

On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying <ying.huang@...el.com> wrote:
>
> Mina Almasry <almasrymina@...gle.com> writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@...e.com> wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko <mhocko@...e.com>
> >> > Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this statge but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so the
> >> the user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussion, I think the solution maybe use different
> interfaces for "proactive demote" and "proactive reclaim".  That is,
> reconsider "memory.demote".  In this way, we will always uncharge the
> cgroup for "memory.reclaim".  This avoid the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

+1 on memory.demote.

> Best Regards,
> Huang, Ying
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ