Date:   Thu, 15 Jul 2021 13:52:46 -0400
From:   Zi Yan <ziy@...dia.com>
To:     Huang Ying <ying.huang@...el.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Yang Shi <shy828301@...il.com>,
        Oscar Salvador <osalvador@...e.de>,
        Michal Hocko <mhocko@...e.com>, Wei Xu <weixugc@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Dan Williams <dan.j.williams@...el.com>,
        David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH -V10 1/9] mm/numa: automatically generate node migration order

On 15 Jul 2021, at 1:51, Huang Ying wrote:

> From: Dave Hansen <dave.hansen@...ux.intel.com>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a node migration table. This allows creating a single migration
> target for each NUMA node so that the kernel can do NUMA page
> migrations instead of simply discarding colder pages. A node with no
> target is a "terminal node", so reclaim acts normally there.  The
> migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
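>
> In reduced form (a sketch; the patch's actual declaration carries
> a much longer comment), the table is one next-hop entry per node,
> with every entry starting out terminal:
>
>         /* Next node to demote to, or NUMA_NO_NODE for terminal nodes. */
>         static int node_demotion[MAX_NUMNODES] __read_mostly =
>                 {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};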
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists.  Those lists
> already tell us, if memory is full, where to look next.  It would
> also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
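>
> On a two-node system, for example, the fallback lists are roughly:
>
>         node 0: [0, 1]
>         node 1: [1, 0]
>
> so naively demoting along them would bounce pages between nodes
> 0 and 1 indefinitely.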
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
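>
> A reduced sketch of the idea (single pass only; the patch's real
> __set_migration_target_nodes() is more careful, e.g. about
> CPU-less nodes, and build_demotion_order() here is a made-up name):
>
>         static void build_demotion_order(void)
>         {
>                 nodemask_t used = NODE_MASK_NONE;
>                 int node;
>
>                 for_each_online_node(node) {
>                         /*
>                          * A source can never become a later target,
>                          * so cycles cannot form.
>                          */
>                         node_set(node, used);
>                         /* Best remaining node by firmware distance. */
>                         node_demotion[node] =
>                                 find_next_best_node(node, &used);
>                 }
>         }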
>
> RCU is used to allow lock-less reads of node_demotion[] and to
> prevent demotion cycles from being observed.  If multiple reads
> of node_demotion[] are performed, a single rcu_read_lock() must
> be held over all reads to ensure no cycles are observed.  Details
> are as follows.
>
> === What does RCU provide? ===
>
> Imaginge a simple loop which walks down the demotion path looking

s/Imaginge/Imagine

> for the last node:
>
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE) {
>                 terminal_node = node_demotion[terminal_node];
>         }
>
> The initial values are:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = NUMA_NO_NODE;
>
> and are updated to:
>
>         node_demotion[0] = NUMA_NO_NODE;
>         node_demotion[1] = 0;
>
> What guarantees that the cycle is not observed:
>
>         node_demotion[0] = 1;
>         node_demotion[1] = 0;
>
> and would loop forever?
>
> With RCU, an rcu_read_lock()/rcu_read_unlock() pair can be placed
> around the loop.  Since the write side does a synchronize_rcu(),
> any loop that observed the old contents is known to have completed
> before the synchronize_rcu() returns.
>
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
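>
> Schematically, the reader and writer then pair up like this
> (simplified from the actual code):
>
>         /* Reader: hold rcu_read_lock() across the whole walk. */
>         rcu_read_lock();
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE)
>                 terminal_node = node_demotion[terminal_node];
>         rcu_read_unlock();
>
>         /* Writer: */
>         disable_all_migrate_targets();  /* every entry -> NUMA_NO_NODE */
>         synchronize_rcu();              /* wait for old readers to finish */
>         __set_migration_target_nodes(); /* publish the new ordering */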
>
> === What does READ_ONCE() provide? ===
>
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[].  This ensures that any
> updates are *eventually* observed.
>
> Consider the above loop again.  The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers), never go back to memory, and thus *permanently*
> observe stale values for node_demotion[].
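>
> Rewriting the walk with READ_ONCE() forces a fresh load from
> node_demotion[] on every iteration (illustrative form):
>
>         int next;
>
>         terminal_node = start_node;
>         while ((next = READ_ONCE(node_demotion[terminal_node])) !=
>                NUMA_NO_NODE)
>                 terminal_node = next;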
>
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
>
> 	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
>
> This code is unused for now.  It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@...ux.intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@...el.com>
> Reviewed-by: Yang Shi <shy828301@...il.com>
> Reviewed-by: Oscar Salvador <osalvador@...e.de>
> Cc: Michal Hocko <mhocko@...e.com>
> Cc: Wei Xu <weixugc@...gle.com>
> Cc: Zi Yan <ziy@...dia.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: Dan Williams <dan.j.williams@...el.com>
> Cc: David Hildenbrand <david@...hat.com>
>
> --
>
> Changes from 20210618:
>  * Merge patches for data structure definition and initialization
>  * Move RCU usage from the next patch in series per Zi's comments
>
> Changes from 20210302:
>  * Fix typo in node_demotion[] comment
>
> Changes since 20200122:
>  * Make node_demotion[] __read_mostly
>  * Add big node_demotion[] comment
>
> Changes in July 2020:
>  - Remove loop from next_demotion_node() and get_online_mems().
>    This means that the node returned by next_demotion_node()
>    might now be offline, but the worst case is that the
>    allocation fails.  That's fine since it is transient.
> ---
>  mm/internal.h   |   5 ++
>  mm/migrate.c    | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c |   2 +-
>  3 files changed, 222 insertions(+), 1 deletion(-)

LGTM. Reviewed-by: Zi Yan <ziy@...dia.com>


--
Best Regards,
Yan, Zi
