linux-kernel - Re: [PATCH v3 2/2] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED

Message-ID: <61054afa-9f18-45f1-987d-e6f242012096@linux.ibm.com>
Date: Mon, 25 Mar 2024 10:32:18 +0530
From: Donet Tom <donettom@...ux.ibm.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Aneesh Kumar <aneesh.kumar@...nel.org>,
        Michal Hocko <mhocko@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Mel Gorman <mgorman@...e.de>, Feng Tang <feng.tang@...el.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Zijlstra
 <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        Rik van Riel <riel@...riel.com>, Johannes Weiner <hannes@...xchg.org>,
        Matthew Wilcox <willy@...radead.org>, Vlastimil Babka <vbabka@...e.cz>,
        Dan Williams <dan.j.williams@...el.com>,
        Hugh Dickins <hughd@...gle.com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH v3 2/2] mm/numa_balancing:Allow migrate on protnone
 reference with MPOL_PREFERRED_MANY policy


On 3/25/24 08:18, Huang, Ying wrote:
> Donet Tom <donettom@...ux.ibm.com> writes:
>
>> On 3/22/24 14:02, Huang, Ying wrote:
>>> Donet Tom <donettom@...ux.ibm.com> writes:
>>>
>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
>>>> nodes") added support for migrate on protnone reference with MPOL_BIND
>>>> memory policy. This allowed numa fault migration when the executing node
>>>> is part of the policy mask for MPOL_BIND. This patch extends migration
>>>> support to MPOL_PREFERRED_MANY policy.
>>>>
>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
>>>> the kernel should not allocate pages from the slower memory tier via
>>>> allocation control zonelist fallback. Instead, we should move cold pages
>>>> from the faster memory node via memory demotion. For a page allocation,
>>>> kswapd is only woken up after we try to allocate pages from all nodes in
>>>> the allocation zone list. This implies that, without using memory
>>>> policies, we will end up allocating hot pages in the slower memory tier.
>>>>
>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
>>>> allocation control when we have memory tiers in the system. With
>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
>>>> of faster memory nodes. When we fail to allocate pages from the faster
>>>> memory node, kswapd would be woken up, allowing demotion of cold pages
>>>> to slower memory nodes.
>>>>
>>>> With the current kernel, such usage of memory policies implies we can't
>>>> do page promotion from a slower memory tier to a faster memory tier
>>>> using numa fault. This patch fixes this issue.
>>>>
>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>>> mask, we allow numa migration to the executing nodes. If the executing
>>>> node is not in the policy node mask, we do not allow numa migration.
>>> Can we provide more information about this?  I suggest to use an
>>> example, for instance, pages may be distributed among multiple sockets
>>> unexpectedly.
>> Thank you for your suggestion. However, this commit message explains all the scenarios.
> Yes.  The commit message is correct and covers many cases.  What I
> suggested is to describe why we do that.  An example cannot cover all
> possibilities, but it is easy to understand.  For example, something
> like the below?
>
> For example, on a 2-socket system, there are N0, N1, N2 in socket 0 and
> N3 in socket 1.  N0, N1, N3 have fast memory and CPUs, while N2 has slow
> memory and no CPU.  For a workload, we may use MPOL_PREFERRED_MANY with
> a nodemask with N0 and N1 set because the workload runs on the CPUs of
> socket 0 most of the time.  Then, even if the workload runs on CPUs of N3
> occasionally, we will not try to migrate the workload pages from N2 to
> N3 because users may want to avoid cross-socket access as much as
> possible in the long term.
>
>> For example, consider a system with 3 NUMA nodes (N0, N1 and N6).
>> N0 and N1 are tier 1 DRAM nodes and N6 is a tier 2 PMEM node.
>>
>> Scenario 1: The process is executing on N1,
>>              If the executing node is in the policy node mask,
>>              Curr Loc Pages - The NUMA node where the page is present (folio node)
>> ==================================================================================
>> Process      Policy          Curr Loc Pages                 Observations
>> -----------------------------------------------------------------------------------
>> N1           N0 N1 N6              N0                   Pages Migrated from N0 to N1
>> N1           N0 N1 N6              N6                   Pages Migrated from N6 to N1
>> N1           N0 N1                 N1                   Pages Migrated from N1 to N6
> Pages are not Migrating ?

Sorry, this is a mistake. In this case, pages are not migrated.

Thanks
Donet.

>
>> N1           N0 N1                 N6                   Pages Migrated from N6 to N1
>> ------------------------------------------------------------------------------------
>> Scenario 2:  The process is executing on N1,
>>               If the executing node is NOT in the policy node mask,
>>               Curr Loc Pages - The NUMA node where the page is present (folio node)
>> ===================================================================================
>> Process       Policy       Curr Loc Pages       Observations
>> -----------------------------------------------------------------------------------
>> N1            N0 N6             N0              Pages are not Migrating
>> N1            N0 N6             N6              Pages are not Migrating
>> N1            N0                N0              Pages are not Migrating
>> ------------------------------------------------------------------------------------
>>
>> Scenario 3: The process is executing on N1,
>>              If the executing node and the folio node are NOT in the policy node mask,
>>              Curr Loc Pages - The NUMA node where the page is present (folio node)
>> ====================================================================================
>> Thread    Policy       Curr Loc Pages           Observations
>> ------------------------------------------------------------------------------------
>> N1          N0               N6                 Pages are not Migrating
>> N1          N6               N0                 Pages are not Migrating
>> ------------------------------------------------------------------------------------
>>
>> We can conclude that even if the pages are distributed among multiple sockets,
>> if the executing node is in the policy node mask, we allow numa migration to the
>> executing nodes. If the executing node is not in the policy node mask,
>> we do not allow numa migration.
>>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
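
For readers of the archive, the rule discussed in this thread can be modelled as a small decision function. This is only an illustrative sketch of the behaviour described in the scenarios above (with the N1/N1 row corrected to "not migrating"), not the code from the patch. The names numa_fault_target(), node_in_mask(), NO_MIGRATION and the 64-bit mask representation are all hypothetical and chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES     64    /* assumption: node IDs fit in a 64-bit mask */
#define NO_MIGRATION  (-1)  /* hypothetical "leave the page where it is" value */

/* Hypothetical helper: true if 'node' is set in the policy node mask. */
bool node_in_mask(int node, uint64_t mask)
{
	assert(node >= 0 && node < MAX_NODES);
	return ((mask >> node) & 1u) != 0;
}

/*
 * Toy model of the MPOL_PREFERRED_MANY NUMA-fault rule from this thread:
 *  - executing node not in the policy mask -> no migration
 *  - page already on the executing node    -> no migration
 *  - otherwise                             -> migrate to the executing node
 * Returns the target node, or NO_MIGRATION.
 */
int numa_fault_target(int exec_node, int folio_node, uint64_t policy_mask)
{
	if (!node_in_mask(exec_node, policy_mask))
		return NO_MIGRATION;
	if (folio_node == exec_node)
		return NO_MIGRATION;
	return exec_node;
}
```

With a policy mask of N0 and N1, numa_fault_target(1, 6, mask) yields node 1 (promotion from N6 to N1), while the 2-socket example quoted above (executing on N3, page on N2, mask of N0 and N1) yields NO_MIGRATION.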