Message-ID: <Y2i8UbaOjGyqwJQ6@feng-clx>
Date: Mon, 7 Nov 2022 16:05:37 +0800
From: Feng Tang <feng.tang@...el.com>
To: Michal Hocko <mhocko@...e.com>
CC: "Huang, Ying" <ying.huang@...el.com>,
Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
Waiman Long <longman@...hat.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Hansen, Dave" <dave.hansen@...el.com>,
"Chen, Tim C" <tim.c.chen@...el.com>,
"Yin, Fengwei" <fengwei.yin@...el.com>
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion
On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote:
> > > OK, then let's stop any complicated solution right here. Let's
> > > start simple with a per-mm flag to disable demotion of an address
> > > space. Should there ever be a real demand for a more fine-grained
> > > solution, let's go further, but I do not think we want a half-baked
> > > solution without real use cases.
> >
> > Yes, the concern about the high cost for mempolicy from you and Yang is
> > valid.
> >
> > How about the cpuset part?
>
> Cpusets fall into the same bucket as per-task mempolicies wrt costs.
> Getting a cpuset requires knowing all tasks associated with a page.
> Or am I just missing any magic? And no, a memcg->cpuset association is
> not a proper solution at all.
No, you are not missing anything. It's really difficult to find a
solution that covers all the holes. The patch is actually a best-effort
approach: it tries to cover the cgroup v2 + memory controller enabled
case, which we think is a common use case for newer platforms with
tiered memory.
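
To illustrate the idea concretely (all names below are made up for
illustration only, they are not the actual kernel interfaces nor the
patch itself): when reclaim picks a demotion target for a page, also
check that node against the allowed mems derived from the page
owner's cpuset, and skip demotion when the target is not allowed.
A rough stand-alone user-space sketch:

/*
 * Hypothetical, stand-alone user-space sketch of "respect an allowed
 * nodemask when picking a demotion target"; not the actual kernel code.
 */
#include <stdio.h>
#include <stdbool.h>

#define NO_NODE	(-1)

typedef unsigned long nodemask_t;	/* toy stand-in for the kernel's nodemask_t */

static bool node_allowed(nodemask_t mask, int node)
{
	return mask & (1UL << node);
}

/* toy stand-in for next_demotion_node(): DRAM 0 -> PMEM 2, DRAM 1 -> PMEM 3 */
static int next_demotion_node(int node)
{
	switch (node) {
	case 0: return 2;
	case 1: return 3;
	default: return NO_NODE;
	}
}

/*
 * Pick a demotion target for a page on @node, but only if that target
 * is in @cpuset_mems (e.g. the cpuset.mems of the page's owner).
 * Return NO_NODE to skip demotion instead of breaking the binding.
 */
static int demotion_target_allowed(int node, nodemask_t cpuset_mems)
{
	int target = next_demotion_node(node);

	if (target == NO_NODE || !node_allowed(cpuset_mems, target))
		return NO_NODE;
	return target;
}

int main(void)
{
	nodemask_t mems = (1UL << 0) | (1UL << 2);	/* cpuset.mems = "0,2" */

	printf("demote from node 0 -> %d\n", demotion_target_allowed(0, mems)); /* 2 */
	printf("demote from node 1 -> %d\n", demotion_target_allowed(1, mems)); /* -1, skipped */
	return 0;
}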
> > We've got bug reports from different channels
> > about using cpuset+docker to control memory placement in memory tiering
> > system, leading to 2 commits solving them:
> >
> > 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
> > cpuset_init_smp()")
> > https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@intel.com/
> >
> > 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
> > bail out early")
> > https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@intel.com/
> >
> > From these bug reports, I think it's reasonable to say there are quite
> > some real world users using cpuset+docker+memory-tiering-system.
>
> I don't think anybody is questioning the existence of those use cases.
> The primary question is whether any of them really require any
> non-trivial (read: nodemask-aware) demotion policies. In other words,
> do we know of cpuset policy setups where demotion fallbacks are
> (partially) excluded?
For cpuset NUMA memory binding, there are two possible use cases:

* The user wants cpuset to bind some important containers to faster
  memory tiers for better latency/performance (here simply disabling
  demotion should work, like your per-mm flag solution).

* The user wants to bind to a set of physically close nodes (like a
  faster CPU+DRAM node and its slower PMEM node). With the initial
  demotion code, our HW has a 1:1 demotion/promotion pair for each
  DRAM node and its closest PMEM node, so the user's binding works
  fine. But there are many other kinds of memory tiering systems from
  other vendors, e.g. systems with many CPU-less DRAM nodes, and
  Aneesh's patchset[1] created a more general tiering interface,
  where IIUC each tier has a nodemask and an upper tier can demote to
  the whole lower tier, so the demotion path becomes an N:N mapping.
  For that case, fine-grained cpuset node binding needs this handling
  (a rough sketch of the idea follows the link below).
[1]. https://lore.kernel.org/lkml/20220818131042.113280-1-aneesh.kumar@linux.ibm.com/
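
To make the N:N case a bit more concrete, here is a rough user-space
sketch (the tier layout and helper names are made up for illustration,
not Aneesh's actual interfaces): each tier carries a nodemask, and a
demotion target is picked from the intersection of the next lower
tier's nodemask and the task's cpuset mems, so the binding is still
respected even when one source node has several possible targets.

/*
 * Hypothetical user-space sketch of N:N tier demotion restricted by a
 * cpuset nodemask; everything here is made up for illustration.
 */
#include <stdio.h>

#define MAX_NODES	8
#define NO_NODE		(-1)

typedef unsigned long nodemask_t;

struct memory_tier {
	nodemask_t nodes;	/* nodes belonging to this tier */
};

/* toy layout: tier 0 = DRAM nodes 0-1, tier 1 = PMEM nodes 2-3 */
static const struct memory_tier tiers[] = {
	{ .nodes = (1UL << 0) | (1UL << 1) },
	{ .nodes = (1UL << 2) | (1UL << 3) },
};
#define NR_TIERS	((int)(sizeof(tiers) / sizeof(tiers[0])))

static int tier_of_node(int node)
{
	for (int t = 0; t < NR_TIERS; t++)
		if (tiers[t].nodes & (1UL << node))
			return t;
	return -1;
}

/*
 * Demote from @node to any node in the next lower tier, but only to
 * nodes that are also in @cpuset_mems, so the cpuset binding stays
 * intact even though one source node has several possible targets.
 */
static int demotion_node(int node, nodemask_t cpuset_mems)
{
	int tier = tier_of_node(node);
	nodemask_t candidates;

	if (tier < 0 || tier + 1 >= NR_TIERS)
		return NO_NODE;

	candidates = tiers[tier + 1].nodes & cpuset_mems;
	for (int n = 0; n < MAX_NODES; n++)
		if (candidates & (1UL << n))
			return n;	/* first allowed lower-tier node */
	return NO_NODE;
}

int main(void)
{
	nodemask_t mems = (1UL << 1) | (1UL << 3);	/* cpuset.mems = "1,3" */

	printf("demote from node 1 -> %d\n", demotion_node(1, mems));	/* 3 */
	printf("demote from node 0 -> %d\n", demotion_node(0, mems));	/* 3, still allowed */
	return 0;
}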
Thanks,
Feng
> --
> Michal Hocko
> SUSE Labs