Date:   Tue, 30 May 2017 17:43:26 +0200
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Arnd Bergmann <arnd@...db.de>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Pavel Emelyanov <xemul@...tuozzo.com>,
        linux-mm <linux-mm@...ck.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> The sysctl for the mapcount can be increased, right? I also assume that
> those vmas will get merged after the post copy is done.

Assuming you enlarge the sysctl to the worst possible case, with a
64bit address space you can have billions of VMAs if you're migrating
4T of RAM and you're unlucky and the address space gets fragmented.
The unswappable kernel memory overhead would be relatively large
(i.e. a dozen gigabytes of RAM in the vm_area_struct slab), and each
find_vma operation would need to walk ~40 steps across that large vma
rbtree. There's a reason the sysctl exists. Not to mention that all
those unnecessary vma mangling operations would have to be protected
by the mmap_sem held for writing.
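
To put rough numbers on the VMA count and the rbtree depth, here is a
minimal sketch. It assumes 4 KiB pages and the pathological case where
every page ends up in its own VMA; the helper names are made up for
illustration and are not kernel functions. It only computes the generic
bounds on the height of a red-black tree, not what find_vma would
actually walk on a given workload.

```c
#include <stdint.h>

/* floor(log2(n)) for n >= 1, using integer shifts only. */
static unsigned floor_log2(uint64_t n)
{
    unsigned bits = 0;

    while (n >>= 1)
        bits++;
    return bits;
}

/* ceil(log2(n)) for n >= 1. */
static unsigned ceil_log2(uint64_t n)
{
    return n <= 1 ? 0 : floor_log2(n - 1) + 1;
}

/* Assumption: worst-case fragmentation, one VMA per page. */
static uint64_t worst_case_vmas(uint64_t bytes, uint64_t page_size)
{
    return bytes / page_size;
}

/* Fewest levels any binary tree with n nodes can have. */
static unsigned tree_min_levels(uint64_t n)
{
    return ceil_log2(n + 1);
}

/* A red-black tree with n nodes is never taller than 2*log2(n+1). */
static unsigned rbtree_max_levels(uint64_t n)
{
    return 2 * ceil_log2(n + 1);
}
```

With worst_case_vmas(4ULL << 40, 4096) the count is 2^30 VMAs, and the
two bounds come out at 31 and 62 levels, so a typical lookup depth in
the region of 40 steps sits between them.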

Avoiding a ton of vmas, by enabling vma-less pte mangling on a single
large vma while only taking mmap_sem for reading during all the pte
mangling, is one of the primary design motivations for userfaultfd.

> I understand that part but it sounds awfully one purpose thing to me.
> Are we going to add other MADVISE_RESET_$FOO to clear other flags just
> because we can race in this specific use case?

Those already exist, see for example MADV_NORMAL, which clears
VM_RAND_READ and VM_SEQ_READ after MADV_SEQUENTIAL or MADV_RANDOM has
set them.

Or MADV_DOFORK after MADV_DONTFORK, MADV_DODUMP after MADV_DONTDUMP, etc.
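
As a userland illustration of those existing set/clear pairs, here is a
minimal sketch, assuming a Linux system with glibc. The function name
is hypothetical; it only checks that each advice value and its
counterpart are accepted on an anonymous mapping, it does not inspect
the resulting vm_flags.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Apply each "set" advice followed by the advice that undoes it.
 * Returns 0 if every madvise() call succeeded, -1 otherwise.
 */
static int madvise_set_then_clear(void)
{
    /* Each row: { advice that sets a flag, advice that clears it }. */
    static const int pairs[][2] = {
        { MADV_SEQUENTIAL, MADV_NORMAL },
        { MADV_RANDOM,     MADV_NORMAL },
        { MADV_DONTFORK,   MADV_DOFORK },
        { MADV_DONTDUMP,   MADV_DODUMP },
    };
    size_t len = (size_t)sysconf(_SC_PAGESIZE) * 4;
    int rc = 0;
    void *p;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++) {
        if (madvise(p, len, pairs[i][0]) != 0 ||
            madvise(p, len, pairs[i][1]) != 0) {
            rc = -1;
            break;
        }
    }

    munmap(p, len);
    return rc;
}
```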

> But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> enable/disable thp. Doesn't that sound little bit too much for a single
> feature to you?

MADV_NOHUGEPAGE doesn't mean clearing the flag set with
MADV_HUGEPAGE. MADV_NOHUGEPAGE disables THP on the region if the
global sysfs "enabled" tunable is set to "always". MADV_HUGEPAGE
enables THP if the global "enabled" sysfs tunable is set to "madvise".
Both MADV_NOHUGEPAGE and MADV_HUGEPAGE are needed to leverage the
three-way "never", "madvise", "always" setting of the global tunable.
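
The interaction between the global setting and the two per-region
flags can be sketched as a small truth table. This is a simplified
model, not the kernel's code: the enum and function names are
hypothetical, and it ignores prctl and every other eligibility check.

```c
#include <stdbool.h>

/* Global sysfs "enabled" setting (hypothetical names). */
enum thp_global { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Per-region state: no flag, VM_HUGEPAGE set, or VM_NOHUGEPAGE set. */
enum thp_region { THP_REGION_DEFAULT, THP_REGION_HUGEPAGE, THP_REGION_NOHUGEPAGE };

/* Simplified model of whether THP may be used on a region. */
static bool thp_allowed(enum thp_global global, enum thp_region region)
{
    switch (global) {
    case THP_ALWAYS:
        /* On everywhere unless the region opted out. */
        return region != THP_REGION_NOHUGEPAGE;
    case THP_MADVISE:
        /* Off everywhere unless the region opted in. */
        return region == THP_REGION_HUGEPAGE;
    case THP_NEVER:
    default:
        return false;
    }
}
```

In this model THP_REGION_DEFAULT is the only state that simply follows
the global setting, and neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE moves
a region back into it.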

The "madvise" global tunable exists for when you want to save RAM and
you don't care much about performance, while still allowing apps like
QEMU, where no memory is lost by enabling THP, to use THP.

There's no way to clear either of those two flags and bring back the
default behavior of the global sysfs tunable, so MADV_CLR_HUGEPAGE is
not redundant, at the very least.
