Message-ID: <3291d101-e362-4690-0f9d-048e13d3be03@linux.alibaba.com>
Date: Wed, 26 Jun 2019 19:57:19 -0700
From: Yang Shi <yang.shi@...ux.alibaba.com>
To: mhocko@...e.com, mgorman@...hsingularity.net, riel@...riel.com,
hannes@...xchg.org, akpm@...ux-foundation.org,
dave.hansen@...el.com, keith.busch@...el.com,
dan.j.williams@...el.com, fengguang.wu@...el.com, fan.du@...el.com,
ying.huang@...el.com, ziy@...dia.com
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [v3 RFC PATCH 0/9] Migrate mode for node reclaim with
heterogeneous memory hierarchy
Hi folks,
Any comment on this version?
Thanks,
Yang
On 6/13/19 4:29 PM, Yang Shi wrote:
> With Dave Hansen's patches merged into Linus's tree
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA
> node effectively and efficiently is worth exploring.
>
> There have been a couple of proposals posted on the mailing list [1] [2] [3].
>
> I already posted two versions of patchset for demoting/promoting memory pages
> between DRAM and PMEM before this topic was discussed at LSF/MM 2019
> (https://lwn.net/Articles/787418/). I do appreciate all the great suggestions
> from the community. This updated version implements most of what was discussed;
> please see the design section below for the details.
>
>
> Changelog
> =========
> v2 --> v3:
> * Introduced "migrate mode" for node reclaim. Just do demotion when
> "migrate mode" is specified per Michal Hocko and Mel Gorman.
> * Introduced "migrate target" concept for VM per Mel Gorman. The memory nodes
> which are under DRAM in the hierarchy (i.e. lower bandwidth, higher latency,
> larger capacity and cheaper than DRAM) are considered as "migrate target"
> nodes. When "migrate mode" is on, memory reclaim would demote pages to
> the "migrate target" nodes.
> * Dropped "twice access" promotion patch per Michal Hocko.
> * Changed the subject for the patchset to reflect the update.
> * Rebased to 5.2-rc1.
>
> v1 --> v2:
> * Dropped the default allocation node mask. The memory placement restriction
> could be achieved by mempolicy or cpuset.
> * Dropped the new mempolicy since its semantic is not that clear yet.
> * Dropped PG_Promote flag.
> * Defined N_CPU_MEM nodemask for the nodes which have both CPU and memory.
> * Extended page_check_references() to implement "twice access" check for
> anonymous page in NUMA balancing path.
> * Reworked the memory demotion code.
>
> v2: https://lore.kernel.org/linux-mm/1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com/
> v1: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/
>
>
> Design
> ======
> With the development of new memory technologies, we could have cheaper and
> larger memory devices on the system, which may have higher latency and lower
> bandwidth than DRAM, e.g. PMEM. Such devices could be used as persistent
> storage or as volatile memory.
>
> It fits into the memory hierarchy as a second tier memory. The patchset
> tries to explore an approach to utilize such memory to improve the memory
> placement. Basically, the patchset tries to achieve this goal by doing
> memory promotion/demotion via NUMA balancing and memory reclaim.
>
> Introduce a new "migrate" mode for node reclaim. When DRAM has memory
> pressure, demote pages to PMEM via node reclaim path if "migrate" mode is
> on. Then NUMA balancing will promote pages back to DRAM once they are
> referenced again. Memory pressure on the PMEM node would push its inactive
> pages to disk via swap.
>
> Introduce "primary" node and "migrate target" node concepts for VM (patch 1/9
> and 2/9). The "primary" node is the node which has both CPU and memory. The
> "migrate target" node is a cpuless node that sits below DRAM in the memory
> hierarchy (e.g. PMEM, which has lower bandwidth, higher latency,
> larger capacity and is cheaper than DRAM). The firmware is effectively going
> to enforce "cpu-less" nodes for any memory range that has differentiated
> performance from the conventional memory pool, or differentiated performance
> for a specific initiator.
>
> Defined an "N_CPU_MEM" nodemask for the "primary" nodes in order to distinguish
> them from cpuless nodes (memory only, i.e. PMEM nodes) and memoryless nodes
> (some architectures, e.g. Power, may have memoryless nodes).
>
> It is a little bit hard to find a suitable "migrate target" node since this
> needs the firmware to expose the physical characteristics of the memory
> devices. I'm not quite sure what the best way is, or whether it is ready to
> use now. Since PMEM is the only such device available for now, retrieving the
> information from SRAT sounds like the easiest way. We may figure out a
> better way in the future.
>
> The promotion/demotion happens only between "primary" nodes and "migrate target"
> nodes: no migration between "migrate target" nodes themselves, only promotion
> from "migrate target" nodes to "primary" nodes and demotion from "primary"
> nodes to "migrate target" nodes. This guarantees there are no cycles of memory
> demotion or promotion.
>
> According to the discussion at LFS/MM 2019, "there should only be one node to
> which pages could be migrated". So reclaim code just tries to demote the pages
> to the closest "migrate target" node and only tries once. Otherwise "if all
> nodes in the system were on a fallback list, a page would have to move through
> every possible option - each RAM-based node and each persistent-memory node -
> before actually being reclaimed. It would be necessary to maintain the history
> of where each page has been, and would be likely to disrupt other workloads on
> the system". This is what the v2 patchset did, so v3 keeps doing it the same
> way.
>
> The demotion code moves all the migration candidate pages into one single list,
> then migrate them together (including THP). This would improve the efficiency
> of migration according to Zi Yan's research. If the migration fails, the
> unmigrated pages will be put back on the LRU.
>
> Use the most optimistic GFP flags to allocate pages on the "migrate target"
> node.
>
> To reduce the failure rate of demotion, check whether the "migrate target"
> node is contended. If it is, just swap instead of migrating. If migration
> fails with -ENOMEM, mark the node as contended. The contended flag is cleared
> once the node gets balanced again.
>
> For now "migrate" mode is not compatible with cpuset and mempolicy since it
> is hard to get the process's task_struct from struct page. The cpuset and
> process's mempolicy are stored in task_struct instead of mm_struct.
>
> Anonymous pages only for the time being, since NUMA balancing can't promote
> unmapped page cache. Page cache can be demoted easily, but promotion is an
> open question; it may be done via mark_page_accessed().
>
> Added vmstat counters for pgdemote_kswapd, pgdemote_direct and
> numa_pages_promoted.
>
> There are definitely still a lot of details that need to be sorted out. Any
> comment is welcome.
>
>
> Test
> ====
> The stress test was done with mmtests plus application workloads (e.g.
> sysbench, grep, etc).
>
> Generate memory pressure by running mmtest's usemem-stress-numa-compact,
> then run other applications as workload to stress the promotion and demotion
> path. The machine was still alive after the stress test had been running for
> ~30 hours. The /proc/vmstat also shows:
>
> ...
> pgdemote_kswapd 3316563
> pgdemote_direct 1930721
> ...
> numa_pages_promoted 81838
>
>
> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
> [2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
> [3]: https://lore.kernel.org/linux-mm/20190404071312.GD12864@dhcp22.suse.cz/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d
>
>
> Yang Shi (9):
> mm: define N_CPU_MEM node states
> mm: Introduce migrate target nodemask
> mm: page_alloc: make find_next_best_node return migration target node
> mm: migrate: make migrate_pages() return nr_succeeded
> mm: vmscan: demote anon DRAM pages to migration target node
> mm: vmscan: don't demote for memcg reclaim
> mm: vmscan: check if the demote target node is contended or not
> mm: vmscan: add page demotion counter
> mm: numa: add page promotion counter
>
> Documentation/sysctl/vm.txt | 6 +++
> drivers/acpi/numa.c | 12 +++++
> drivers/base/node.c | 4 ++
> include/linux/gfp.h | 12 +++++
> include/linux/migrate.h | 6 ++-
> include/linux/mmzone.h | 3 ++
> include/linux/nodemask.h | 4 +-
> include/linux/vm_event_item.h | 3 ++
> include/linux/vmstat.h | 1 +
> include/trace/events/migrate.h | 3 +-
> mm/compaction.c | 3 +-
> mm/debug.c | 1 +
> mm/gup.c | 4 +-
> mm/huge_memory.c | 4 ++
> mm/internal.h | 23 ++++++++
> mm/memory-failure.c | 7 ++-
> mm/memory.c | 4 ++
> mm/memory_hotplug.c | 10 +++-
> mm/mempolicy.c | 7 ++-
> mm/migrate.c | 33 ++++++++----
> mm/page_alloc.c | 20 +++++--
> mm/vmscan.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
> mm/vmstat.c | 14 ++++-
> 23 files changed, 323 insertions(+), 47 deletions(-)