Message-ID: <ef7d952f-a0c2-3947-a5bf-f6694acfdb02@linux.alibaba.com>
Date: Thu, 4 Apr 2019 17:32:15 -0700
From: Yang Shi <yang.shi@...ux.alibaba.com>
To: ziy@...dia.com, Dave Hansen <dave.hansen@...ux.intel.com>,
Keith Busch <keith.busch@...el.com>,
Fengguang Wu <fengguang.wu@...el.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Cc: Daniel Jordan <daniel.m.jordan@...cle.com>,
Michal Hocko <mhocko@...nel.org>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>,
John Hubbard <jhubbard@...dia.com>,
Mark Hairgrove <mhairgrove@...dia.com>,
Nitin Gupta <nigupta@...dia.com>,
Javier Cabezas <jcabezas@...dia.com>,
David Nellans <dnellans@...dia.com>
Subject: Re: [RFC PATCH 00/25] Accelerate page migration and use memcg for
PMEM management
On 4/3/19 7:00 PM, Zi Yan wrote:
> From: Zi Yan <ziy@...dia.com>
>
> Thanks to Dave Hansen's patches, PMEM can now be used as part of system memory,
> exposed as NUMA nodes. How to use PMEM along with normal DRAM remains an open
> problem. Several patchsets posted on the mailing list propose using page
> migration to move pages between PMEM and DRAM based on the Linux page
> replacement policy [1,2,3]. Some important problems are not addressed in these
> patches:
> 1. Page migration in Linux does not provide high enough throughput for us to
>    fully exploit PMEM or other use cases.
> 2. Linux page replacement runs too infrequently to distinguish hot and cold
>    pages.
>
> I am trying to attack the problems with this patch series. This is not a final
> solution, but I would like to gather more feedback and comments from the mailing
> list.
>
> Page migration throughput problem
> ====
>
> For example, in my recent email [4], I gave the page migration throughput numbers
> for different page migrations, none of which can achieve > 2.5GB/s throughput
> (the throughput is measured around kernel functions: migrate_pages() and
> migrate_page_copy()):
>
>                            | migrate_pages() | migrate_page_copy()
> migrating single 4KB page: | 0.312GB/s       | 1.385GB/s
> migrating 512 4KB pages:   | 0.854GB/s       | 1.983GB/s
> migrating single 2MB THP:  | 2.387GB/s       | 2.481GB/s
>
> In reality, microbenchmarks show that Intel PMEM can provide ~65GB/s read
> throughput and ~16GB/s write throughput [5], which are much higher than
> the throughput achieved by Linux page migration.
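
To put the gap above into numbers, here is a small arithmetic sketch. It uses only the figures quoted in the cover letter: the best migrate_pages() result (single 2MB THP) and the approximate PMEM read and write throughput from [5]. The constant and function names are made up for illustration.

```python
# Figures quoted in the cover letter, in GB/s.
BEST_MIGRATE_PAGES = 2.387  # single 2MB THP, measured around migrate_pages()
PMEM_READ = 65.0            # approximate read throughput, from [5]
PMEM_WRITE = 16.0           # approximate write throughput, from [5]


def gap(device_gbps, migration_gbps):
    """Return how many times faster the device is than the migration path."""
    return device_gbps / migration_gbps


read_gap = gap(PMEM_READ, BEST_MIGRATE_PAGES)
write_gap = gap(PMEM_WRITE, BEST_MIGRATE_PAGES)
print(f"read: {read_gap:.1f}x, write: {write_gap:.1f}x")
```

So even the fastest existing case leaves the migration path several times slower than the quoted PMEM write throughput, and far slower than the quoted read throughput.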
>
> In addition, it is also desirable to use page migration to move data
> between high-bandwidth memory and DRAM, like IBM Summit, which exposes
> high-performance GPU memories as NUMA nodes [6]. This requires even higher page
> migration throughput.
>
> In this patch series, I propose four different ways of improving page migration
> throughput (mostly on 2MB THP migration):
> 1. multi-threaded page migration: Patch 03 to 06.
> 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
> 3. concurrent (batched) page migration: Patch 09, 10, and 11.
> 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
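
The idea behind approach 1 can be illustrated in userspace: split one 2MB huge page into contiguous chunks and let each worker copy its own chunk. This is only a sketch of the partitioning, assuming equal-sized chunks. It is not the kernel code from the series, the helper names (chunk_ranges, copy_multithreaded) are hypothetical, and Python threads will not show the throughput gains reported in this thread.

```python
from concurrent.futures import ThreadPoolExecutor

THP_SIZE = 2 * 1024 * 1024  # one 2MB transparent huge page


def chunk_ranges(total, nthreads):
    """Split [0, total) into nthreads contiguous (start, end) ranges."""
    base, extra = divmod(total, nthreads)
    ranges, start = [], 0
    for i in range(nthreads):
        # The first `extra` chunks take one additional byte each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def copy_multithreaded(src, dst, nthreads):
    """Copy src into dst, one contiguous chunk per worker thread."""
    if len(src) != len(dst):
        raise ValueError("source and destination sizes differ")
    src_view, dst_view = memoryview(src), memoryview(dst)

    def copy_chunk(bounds):
        start, end = bounds
        dst_view[start:end] = src_view[start:end]

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # list() waits for all chunks and re-raises any worker exception.
        list(pool.map(copy_chunk, chunk_ranges(len(src), nthreads)))


src = bytearray(bytes(range(256)) * (THP_SIZE // 256))
dst = bytearray(THP_SIZE)
copy_multithreaded(src, dst, nthreads=4)
```

The chunks do not overlap, so the workers need no locking among themselves. The cost of starting and joining the workers is the part that would need to be small for this to pay off on a single page.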
>
> Here are some throughput numbers showing clear throughput improvements on
> a two-socket NUMA machine with two Xeon E5-2650 v3 @ 2.30GHz and a 19.2GB/s
> bandwidth QPI link (the same machine as mentioned in [4]):
>
>                                    | migrate_pages() | migrate_page_copy()
> => migrating single 2MB THP        | 2.387GB/s       | 2.481GB/s
> 2-thread single THP migration      | 3.478GB/s       | 3.704GB/s
> 4-thread single THP migration      | 5.474GB/s       | 6.054GB/s
> 8-thread single THP migration      | 7.846GB/s       | 9.029GB/s
> 16-thread single THP migration     | 7.423GB/s       | 8.464GB/s
> 16-ch. DMA single THP migration    | 4.322GB/s       | 4.536GB/s
>
> 2-thread 16-THP migration          | 3.610GB/s       | 3.838GB/s
> 2-thread 16-THP batched migration  | 4.138GB/s       | 4.344GB/s
> 4-thread 16-THP migration          | 6.385GB/s       | 7.031GB/s
> 4-thread 16-THP batched migration  | 7.382GB/s       | 8.072GB/s
> 8-thread 16-THP migration          | 8.039GB/s       | 9.029GB/s
> 8-thread 16-THP batched migration  | 9.023GB/s       | 10.056GB/s
> 16-thread 16-THP migration         | 8.137GB/s       | 9.137GB/s
> 16-thread 16-THP batched migration | 9.907GB/s       | 11.175GB/s
>
> 1-thread 16-THP exchange           | 4.135GB/s       | 4.225GB/s
> 2-thread 16-THP batched exchange   | 7.061GB/s       | 7.325GB/s
> 4-thread 16-THP batched exchange   | 9.729GB/s       | 10.237GB/s
> 8-thread 16-THP batched exchange   | 9.992GB/s       | 10.533GB/s
> 16-thread 16-THP batched exchange  | 9.520GB/s       | 10.056GB/s
>
> => migrating 512 4KB pages         | 0.854GB/s       | 1.983GB/s
> 1-thread 512-4KB batched exchange  | 1.271GB/s       | 3.433GB/s
> 2-thread 512-4KB batched exchange  | 1.240GB/s       | 3.190GB/s
> 4-thread 512-4KB batched exchange  | 1.255GB/s       | 3.823GB/s
> 8-thread 512-4KB batched exchange  | 1.336GB/s       | 3.921GB/s
> 16-thread 512-4KB batched exchange | 1.334GB/s       | 3.897GB/s
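
One way to read the single-THP rows above is to convert them into speedup and per-thread efficiency relative to the single-threaded baseline. The sketch below uses only the migrate_pages() column as quoted. The variable names are illustrative, and "efficiency" here simply means speedup divided by thread count.

```python
# Throughput measured around migrate_pages(), in GB/s, copied from the
# single 2MB THP rows of the table above.
BASELINE = 2.387  # single-threaded migration
MULTI_THREADED = {2: 3.478, 4: 5.474, 8: 7.846, 16: 7.423}

# Speedup over the baseline, and speedup per thread.
speedups = {n: gbps / BASELINE for n, gbps in MULTI_THREADED.items()}
efficiency = {n: speedups[n] / n for n in speedups}

for n in sorted(speedups):
    print(f"{n:2d} threads: {speedups[n]:.2f}x speedup, "
          f"{efficiency[n]:.0%} parallel efficiency")

# Thread count with the highest measured throughput.
best = max(speedups, key=speedups.get)
```

On these numbers the single-THP case peaks at 8 threads and drops slightly at 16, so adding threads beyond that point does not help on this machine.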
>
> Concerns were raised about how to avoid CPU resource competition between
> page migration and user applications, and about power awareness.
> The multi-threaded ktask patch series that Daniel Jordan recently posted
> could be a solution [8].
>
>
> Infrequent page list update problem
> ====
>
> Page lists are currently updated by calling shrink_list() under memory
> pressure, which might not be frequent enough to keep track of hot and cold
> pages: the first time shrink_list() is called, all pages are on the active
> lists, and the reference bits on the pages might not reflect their up-to-date
> access status. But we also do not want to periodically shrink the global page
> lists, which adds unnecessary overhead to the whole system. So I propose to
> actively shrink page lists on the memcg we are interested in.
>
> Patch 18 to 25 add a new system call to shrink page lists on a given
> application's memcg and migrate pages between two NUMA nodes. This isolates the
> impact from the rest of the system. To share DRAM among different applications,
> Patch 18 and 19 add a per-node memcg size limit, so you can limit the memory
> usage for particular NUMA node(s).

This is a little confusing to me. Is it entirely up to the user to decide
when to call the syscall to shrink the page lists? If so, how would the user
know when the timing is right? Could you please elaborate on the use case?

Thanks,
Yang

>
>
> Patch structure
> ====
> 1. multi-threaded page migration: Patch 01 to 06.
> 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
> 3. concurrent (batched) page migration: Patch 09, 10, and 11.
> 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
> 5. per-node size limit in memcg: Patch 18 and 19.
> 6. actively shrink page lists and perform page migration in given memcg: Patch 20 to 25.
>
>
> Any comment is welcome.
>
> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
> [2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
> [3]: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/
> [4]: https://lore.kernel.org/linux-mm/6A903D34-A293-4056-B135-6FA227DE1828@nvidia.com/
> [5]: https://www.storagereview.com/supermicro_superserver_with_intel_optane_dc_persistent_memory_first_look_review
> [6]: https://www.ibm.com/thought-leadership/summit-supercomputer/
> [7]: https://lore.kernel.org/linux-mm/20190215220856.29749-1-zi.yan@sent.com/
> [8]: https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/
>
> Zi Yan (25):
> mm: migrate: Change migrate_mode to support combination migration
> modes.
> mm: migrate: Add mode parameter to support future page copy routines.
> mm: migrate: Add a multi-threaded page migration function.
> mm: migrate: Add copy_page_multithread into migrate_pages.
> mm: migrate: Add vm.accel_page_copy in sysfs to control page copy
> acceleration.
> mm: migrate: Make the number of copy threads adjustable via sysctl.
> mm: migrate: Add copy_page_dma to use DMA Engine to copy pages.
> mm: migrate: Add copy_page_dma into migrate_page_copy.
> mm: migrate: Add copy_page_lists_dma_always to support copy a list of
> pages.
> mm: migrate: copy_page_lists_mt() to copy a page list using
> multi-threads.
> mm: migrate: Add concurrent page migration into move_pages syscall.
> exchange pages: new page migration mechanism: exchange_pages()
> exchange pages: add multi-threaded exchange pages.
> exchange pages: concurrent exchange pages.
> exchange pages: exchange anonymous page and file-backed page.
> exchange page: Add THP exchange support.
> exchange page: Add exchange_page() syscall.
> memcg: Add per node memory usage&max stats in memcg.
> mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit.
> memory manage: Add memory manage syscall.
> mm: move update_lru_sizes() to mm_inline.h for broader use.
> memory manage: active/inactive page list manipulation in memcg.
> memory manage: page migration based page manipulation between NUMA
> nodes.
> memory manage: limit migration batch size.
> memory manage: use exchange pages to memory manage to improve
> throughput.
>
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> fs/aio.c | 12 +-
> fs/f2fs/data.c | 6 +-
> fs/hugetlbfs/inode.c | 4 +-
> fs/iomap.c | 4 +-
> fs/ubifs/file.c | 4 +-
> include/linux/cgroup-defs.h | 1 +
> include/linux/exchange.h | 27 +
> include/linux/highmem.h | 3 +
> include/linux/ksm.h | 4 +
> include/linux/memcontrol.h | 67 ++
> include/linux/migrate.h | 12 +-
> include/linux/migrate_mode.h | 8 +
> include/linux/mm_inline.h | 21 +
> include/linux/sched/coredump.h | 1 +
> include/linux/sched/sysctl.h | 3 +
> include/linux/syscalls.h | 10 +
> include/uapi/linux/mempolicy.h | 9 +-
> kernel/sysctl.c | 47 +
> mm/Makefile | 5 +
> mm/balloon_compaction.c | 2 +-
> mm/compaction.c | 22 +-
> mm/copy_page.c | 708 +++++++++++++++
> mm/exchange.c | 1560 ++++++++++++++++++++++++++++++++
> mm/exchange_page.c | 228 +++++
> mm/internal.h | 113 +++
> mm/ksm.c | 35 +
> mm/memcontrol.c | 80 ++
> mm/memory_manage.c | 649 +++++++++++++
> mm/mempolicy.c | 38 +-
> mm/migrate.c | 621 ++++++++++++-
> mm/vmscan.c | 115 +--
> mm/zsmalloc.c | 2 +-
> 33 files changed, 4261 insertions(+), 162 deletions(-)
> create mode 100644 include/linux/exchange.h
> create mode 100644 mm/copy_page.c
> create mode 100644 mm/exchange.c
> create mode 100644 mm/exchange_page.c
> create mode 100644 mm/memory_manage.c
>
> --
> 2.7.4