Message-ID: <54FB91E8-C415-439F-988C-292B85F9794D@nvidia.com>
Date: Thu, 15 Jul 2021 13:52:46 -0400
From: Zi Yan <ziy@...dia.com>
To: Huang Ying <ying.huang@...el.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
Dave Hansen <dave.hansen@...ux.intel.com>,
Yang Shi <shy828301@...il.com>,
Oscar Salvador <osalvador@...e.de>,
Michal Hocko <mhocko@...e.com>, Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Dan Williams <dan.j.williams@...el.com>,
David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH -V10 1/9] mm/numa: automatically generate node migration order
On 15 Jul 2021, at 1:51, Huang Ying wrote:
> From: Dave Hansen <dave.hansen@...ux.intel.com>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a node migration table. This allows creating a single
> migration target for each NUMA node so that the kernel can do NUMA
> page migrations instead of simply discarding colder pages. A node
> with no target is a "terminal node", so reclaim acts normally
> there. The migration target does not fundamentally _need_ to be a
> single node, but this implementation starts there to limit
> complexity.
>
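> For illustration, the table itself can be as simple as this (a
> minimal sketch consistent with the description above, not a quote
> of the patch; the changelog below notes that node_demotion[] is
> __read_mostly):
>
>         /*
>          * node_demotion[X] holds the next node on X's demotion
>          * path, or NUMA_NO_NODE if X is a terminal node.
>          */
>         static int node_demotion[MAX_NUMNODES] __read_mostly = {
>                 [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE
>         };
>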
> When memory fills up on a node, memory contents can be
> automatically migrated to another node. The biggest problems are
> knowing when to migrate and where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists. Those lists
> already tell us, when memory is full, where to look next. It would
> also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists. This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering. But, reuse the same data used
> to generate the page allocator fallback lists in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now. It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
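> As a sketch of how one entry might be chosen from that data (the
> helper name and exact logic here are illustrative, not quoted from
> the patch):
>
>         static int establish_migrate_target(int node, nodemask_t *used)
>         {
>                 /*
>                  * Reuse the allocator's notion of the "next best"
>                  * node.  @used accumulates nodes already consumed,
>                  * which is what breaks the A->B, B->A cycles noted
>                  * above.
>                  */
>                 int target = find_next_best_node(node, used);
>
>                 if (target == NUMA_NO_NODE)
>                         return NUMA_NO_NODE;
>
>                 node_demotion[node] = target;
>                 return target;
>         }
>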
> RCU is used to allow lock-less reads of node_demotion[] and to
> prevent demotion cycles from being observed. If multiple reads of
> node_demotion[] are performed, a single rcu_read_lock() must be
> held over all of the reads to ensure that no cycle is observed.
> Details are as follows.
>
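> A single-entry lookup following that rule might look like this (a
> sketch; next_demotion_node() is named in the changelog below, but
> its exact body may differ):
>
>         int next_demotion_node(int node)
>         {
>                 int target;
>
>                 /*
>                  * A single READ_ONCE() under one rcu_read_lock()
>                  * cannot observe a cycle; walks doing multiple
>                  * reads must hold the lock across all of them.
>                  */
>                 rcu_read_lock();
>                 target = READ_ONCE(node_demotion[node]);
>                 rcu_read_unlock();
>
>                 return target;
>         }
>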
> === What does RCU provide? ===
>
> Imaginge a simple loop which walks down the demotion path looking
s/Imaginge/Imagine
> for the last node:
>
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE) {
>                 terminal_node = node_demotion[terminal_node];
>         }
>
> The initial values are:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = NUMA_NO_NODE;
>
> and are updated to:
>
>         node_demotion[0] = NUMA_NO_NODE;
>         node_demotion[1] = 0;
>
> What guarantees that the cycle is not observed:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = 0;
>
> and would loop forever?
>
> With RCU, an rcu_read_lock()/rcu_read_unlock() pair can be placed
> around the loop. Since the write side does a synchronize_rcu(),
> any loop that observed the old contents is known to be complete
> before the synchronize_rcu() has completed.
>
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
>
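> A sketch of the disable step as described (the exact placement of
> synchronize_rcu() in the patch may differ):
>
>         static void disable_all_migrate_targets(void)
>         {
>                 int node;
>
>                 /* Point every node at "no target" first. */
>                 for_each_online_node(node)
>                         node_demotion[node] = NUMA_NO_NODE;
>
>                 /*
>                  * Wait for all in-flight readers.  Later readers
>                  * can see old+disabled or disabled+new entries,
>                  * but never a mix of old and new that could form
>                  * a cycle.
>                  */
>                 synchronize_rcu();
>         }
>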
> === What does READ_ONCE() provide? ===
>
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[]. This ensures that any
> updates are *eventually* observed.
>
> Consider the above loop again. The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) once, never go back to memory, and thus *permanently*
> observe stale values for node_demotion[].
>
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
>
> https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
>
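> Putting the pieces together, the walk from the example above would
> be written like this (an illustrative sketch, not quoted from the
> patch):
>
>         rcu_read_lock();
>         terminal_node = start_node;
>         for (;;) {
>                 /* Force a fresh load from memory each iteration. */
>                 int next = READ_ONCE(node_demotion[terminal_node]);
>
>                 if (next == NUMA_NO_NODE)
>                         break;
>                 terminal_node = next;
>         }
>         rcu_read_unlock();
>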
> This code is unused for now. It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@...ux.intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@...el.com>
> Reviewed-by: Yang Shi <shy828301@...il.com>
> Reviewed-by: Oscar Salvador <osalvador@...e.de>
> Cc: Michal Hocko <mhocko@...e.com>
> Cc: Wei Xu <weixugc@...gle.com>
> Cc: Zi Yan <ziy@...dia.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: Dan Williams <dan.j.williams@...el.com>
> Cc: David Hildenbrand <david@...hat.com>
>
> --
>
> Changes from 20210618:
> * Merge patches for data structure definition and initialization
> * Move RCU usage from the next patch in series per Zi's comments
>
> Changes from 20210302:
> * Fix typo in node_demotion[] comment
>
> Changes since 20200122:
> * Make node_demotion[] __read_mostly
> * Add big node_demotion[] comment
>
> Changes in July 2020:
> - Remove loop from next_demotion_node() and get_online_mems().
> This means that the node returned by next_demotion_node()
> might now be offline, but the worst case is that the
> allocation fails. That's fine since it is transient.
> ---
> mm/internal.h | 5 ++
> mm/migrate.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 2 +-
> 3 files changed, 222 insertions(+), 1 deletion(-)
LGTM. Reviewed-by: Zi Yan <ziy@...dia.com>
--
Best Regards,
Yan, Zi