Date:	Mon,  6 May 2013 21:56:57 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	qemu-devel@...gnu.org, linux-kernel@...r.kernel.org
Cc:	Isaku Yamahata <yamahata@...inux.co.jp>,
	Juan Quintela <quintela@...hat.com>,
	Orit Wasserman <owasserm@...hat.com>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Anthony Liguori <aliguori@...ibm.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Hugh Dickins <hughd@...gle.com>
Subject: [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()

Hello everyone,

this is a patchset to implement two new kernel features:
MADV_USERFAULT and remap_anon_pages.

The combination of the two features is what I would propose to
implement postcopy live migration, and in general demand paging of
remote memory, hosted in different cloud nodes with KSM. It might also
be used without virt to offload parts of memory to different nodes
using some userland library and a network memory manager.

Postcopy live migration is currently implemented using a chardevice,
which remains open for the whole VM lifetime and all virtual memory
then becomes owned by the chardevice and it's not anonymous anymore.

http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg05274.html

The main drawback of the chardevice design is that all nice Linux MM
features (like swapping/THP/KSM/automatic-NUMA-balancing) are disabled
if the guest physical memory doesn't remain in anonymous memory. This
is entirely solved by this alternative kernel solution. In fact
remap_anon_pages will move THP pages natively, without any THP split,
by just updating two pmd pointers if alignment and length permit.
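
The alignment and length condition mentioned above can be sketched as a
small userland check. This is only an illustration, not code from the
patchset: it assumes 2 MiB pmd-sized huge pages (as on x86-64 with 4 KiB
base pages), and both the ASSUMED_HPAGE_PMD_SIZE constant and the
range_is_pmd_aligned() helper are hypothetical names. The checks the
kernel actually performs may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB pmd-sized huge pages (x86-64, 4 KiB base pages). */
#define ASSUMED_HPAGE_PMD_SIZE ((uintptr_t)2 * 1024 * 1024)

/*
 * Hypothetical helper (not part of the patchset): returns true when a
 * move of len bytes from src to dst covers only whole, pmd-aligned
 * huge pages, so that in principle no THP split would be needed.
 */
static bool range_is_pmd_aligned(uintptr_t dst, uintptr_t src, size_t len)
{
    const uintptr_t mask = ASSUMED_HPAGE_PMD_SIZE - 1;

    return len != 0 &&
           (dst & mask) == 0 &&
           (src & mask) == 0 &&
           (len & mask) == 0;
}
```

A caller that wants the pmd-level fast path would presumably allocate
both the source and destination buffers with huge page alignment and
transfer data in multiples of that size.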

The other bonus is that MADV_USERFAULT and remap_anon_pages are
implemented in the MM core and remap_anon_pages furthermore provides a
functionality similar to what is already available for filebacked
pages with remap_file_pages. That is usually more maintainable than
having MM parts in a chardevice.

In addition to asking for review of the internals, this also needs
review of the user APIs, as both features are userland-visible changes.

MADV_USERFAULT is only enabled for anonymous mappings so far but it
could be extended. To be strict, -EINVAL is returned if run on
non-anonymous mappings (where it would currently be a noop).

The remap_anon_pages syscall API is not vectored, as I expect it to be used
for demand paging only (where there can be just one faulting range per
fault) or for large ranges where vectoring isn't going to provide
performance advantages.

The current behavior of remap_anon_pages is very strict to avoid any
chance of memory corruption going unnoticed, and it will return
-EFAULT at the first sign of something unexpected (like a page already
mapped in the destination pmd/pte, potentially signaling a userland
thread race condition with two threads userfaulting on the same
destination address). mremap is not strict like that: it would drop
the destination range silently and it would succeed in such a
condition. So on the API side, I wonder if I should add a flag to
remap_anon_pages to provide non-strict behavior more similar to
mremap. OTOH not providing the permissive mremap behavior may actually
be better to force userland to be strict and be sure it knows what it
is doing (otherwise it should use mremap in the first place?).
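
The permissive mremap behavior described above can be observed from
userland with existing APIs. The following is a minimal sketch, assuming
Linux with glibc; the function name mremap_over_mapped_destination() is
made up for this illustration. It moves one anonymous page on top of
another that is already mapped and populated, using
MREMAP_MAYMOVE | MREMAP_FIXED, and reports what ends up at the
destination. It does not exercise remap_anon_pages, which per the text
above would return -EFAULT in the equivalent situation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Maps two anonymous pages, fills the source with 'A' and the
 * destination with 'B', then moves the source over the destination.
 * Returns the first byte found at the destination after the move,
 * or -1 if any mapping call failed.
 */
static int mremap_over_mapped_destination(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *src = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dst = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED) {
        if (src != MAP_FAILED)
            munmap(src, page);
        if (dst != MAP_FAILED)
            munmap(dst, page);
        return -1;
    }

    memset(src, 'A', page);
    memset(dst, 'B', page);

    /* MREMAP_FIXED unmaps whatever was at dst; no error is reported. */
    char *moved = mremap(src, page, page,
                         MREMAP_MAYMOVE | MREMAP_FIXED, dst);
    if (moved == MAP_FAILED) {
        munmap(src, page);
        munmap(dst, page);
        return -1;
    }

    int first = moved[0];
    munmap(moved, page);
    return first;
}
```

The 'B' contents previously at the destination are discarded without
any error, which is the kind of race a strict remap_anon_pages is meant
to surface to userland rather than hide.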

Comments welcome, thanks!
Andrea

Andrea Arcangeli (4):
  mm: madvise MADV_USERFAULT
  mm: rmap preparation for remap_anon_pages
  mm: swp_entry_swapcount
  mm: sys_remap_anon_pages

 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/x86/syscalls/syscall_32.tbl       |   1 +
 arch/x86/syscalls/syscall_64.tbl       |   1 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 include/linux/huge_mm.h                |   6 +
 include/linux/mm.h                     |   1 +
 include/linux/mm_types.h               |   2 +-
 include/linux/swap.h                   |   6 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/mman-common.h |   3 +
 kernel/sys_ni.c                        |   1 +
 mm/fremap.c                            | 440 +++++++++++++++++++++++++++++++++
 mm/huge_memory.c                       | 158 ++++++++++--
 mm/madvise.c                           |  16 ++
 mm/memory.c                            |  10 +
 mm/rmap.c                              |   9 +
 mm/swapfile.c                          |  13 +
 19 files changed, 667 insertions(+), 15 deletions(-)
