linux-kernel - Re: [LSF/MM TOPIC][LSF/MM,ATTEND] shared TLB, hugetlb reservations


Message-ID: <20170314183706.GO27056@redhat.com>
Date:   Tue, 14 Mar 2017 19:37:06 +0100
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     lsf-pc@...ts.linux-foundation.org, linux-mm@...ck.org,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Dr. David Alan Gilbert" <dgilbert@...hat.com>,
        qemu-devel@...gnu.org, Mike Rapoport <rppt@...ux.vnet.ibm.com>
Subject: Re: [LSF/MM TOPIC][LSF/MM,ATTEND] shared TLB, hugetlb reservations

Hello,

On Wed, Mar 08, 2017 at 05:30:55PM -0800, Mike Kravetz wrote:
> On 01/10/2017 03:02 PM, Mike Kravetz wrote:
> > Another more concrete topic is hugetlb reservations.  Michal Hocko
> > proposed the topic "mm patches review bandwidth", and brought up the
> > related subject of areas in need of attention from an architectural
> > POV.  I suggested that hugetlb reservations was one such area.  I'm
> > guessing it was introduced to solve a rather concrete problem.  However,
> > over time additional hugetlb functionality was added and the
> > capabilities of the reservation code was stretched to accommodate.
> > It would be good to step back and take a look at the design of this
> > code to determine if a rewrite/redesign is necessary.  Michal suggested
> > documenting the current design/code as a first step.  If people think
> > this is worth discussion at the summit, I could put together such a
> > design before the gathering.
> 
> I attempted to put together a design/overview of how hugetlb reservations
> currently work.  Hopefully, this will be useful.

Another area of hugetlbfs that is not clear is the status of
MADV_REMOVE and the behavior of fallocate punch hole that deviates
from more standard shmem semantics. That might also be a topic of
interest related to your hugetlbfs topic and marginally related to
userfaultfd.

The current status for anon, shmem and hugetlbfs is as follows:

MADV_DONTNEED works: anon, !VM_SHARED shmem
MADV_DONTNEED doesn't work: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED
MADV_DONTNEED works but not guaranteed to fault: shmem VM_SHARED

MADV_REMOVE works: shmem VM_SHARED, hugetlbfs VM_SHARED
MADV_REMOVE doesn't work: anon, shmem !VM_SHARED, hugetlbfs !VM_SHARED

fallocate punch hole works: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED,
	  	     	    shmem VM_SHARED
fallocate punch hole doesn't work: anon, shmem !VM_SHARED

So what happens in qemu is:

anon			-> MADV_DONTNEED

shmem !VM_SHARED	-> MADV_DONTNEED (fallocate punch hole wouldn't zap
			   private pages, but it does on hugetlbfs)

shmem VM_SHARED		-> fallocate punch hole (MADV_REMOVE would
      			   work too)

hugetlbfs !VM_SHARED	-> fallocate punch hole (works for hugetlbfs
			   but not for shmem !VM_SHARED)

hugetlbfs VM_SHARED	-> fallocate punch hole (MADV_REMOVE would work too)
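
The dispatch above can be sketched in C roughly as follows. This is
only an illustration of the table, not qemu's actual code: the names
ram_kind, discard_call, pick_discard_call and discard_range are
hypothetical, and the caller is assumed to pass ranges aligned to the
page size of the backing memory (the huge page size for hugetlbfs).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>         /* fallocate() */
#include <linux/falloc.h>  /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>      /* madvise(), MADV_DONTNEED */
#include <sys/types.h>     /* off_t */

/* Hypothetical names for illustration, not taken from qemu. */
enum ram_kind { RAM_ANON, RAM_SHMEM, RAM_HUGETLBFS };
enum discard_call { USE_MADV_DONTNEED, USE_PUNCH_HOLE };

/* Encodes the table above: which call zaps each kind of memory. */
static enum discard_call pick_discard_call(enum ram_kind kind, bool shared)
{
    if (kind == RAM_ANON)
        return USE_MADV_DONTNEED;
    if (kind == RAM_SHMEM && !shared)
        return USE_MADV_DONTNEED;
    /* shmem VM_SHARED and both hugetlbfs cases */
    return USE_PUNCH_HOLE;
}

/* Zap [addr, addr + len), which maps fd at file offset `offset`. */
static int discard_range(enum ram_kind kind, bool shared,
                         void *addr, int fd, off_t offset, size_t len)
{
    if (pick_discard_call(kind, shared) == USE_MADV_DONTNEED)
        return madvise(addr, len, MADV_DONTNEED);

    /* The file-backed path needs the fd and file offset, not just addr. */
    assert(fd >= 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

Note how the second branch cannot work from the virtual address alone:
the caller has to keep the file descriptor and file offset of every
mapping around.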

This means qemu has to carry around information on the type of memory
it got from the initial memblock setup, so at live migration time it
can zap the memory with the right call. (NOTE: such memory is not
generated by userfaultfd UFFDIO_COPY, but it was allocated and mapped
and it must be zapped well before calling userfaultfd the first time).

To do this, qemu uses fstatfs to find out which kind of memory it's
dealing with and then picks the matching call.
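
A minimal sketch of such an fstatfs check, assuming the filesystem
magic numbers from <linux/magic.h> are enough to tell the cases apart.
The enum and function names are hypothetical, and treating a negative
fd as anonymous memory is an assumption of this example:

```c
#include <linux/magic.h>  /* HUGETLBFS_MAGIC, TMPFS_MAGIC */
#include <sys/vfs.h>      /* fstatfs(), struct statfs */

/* Hypothetical names for illustration, not taken from qemu. */
enum mem_backing {
    BACKING_ANON,
    BACKING_SHMEM,
    BACKING_HUGETLBFS,
    BACKING_OTHER,
};

/* Map a superblock magic number to the kind of backing memory. */
static enum mem_backing backing_from_magic(unsigned int magic)
{
    if (magic == HUGETLBFS_MAGIC)
        return BACKING_HUGETLBFS;
    if (magic == TMPFS_MAGIC)
        return BACKING_SHMEM;
    return BACKING_OTHER;
}

/* Classify the memory behind fd; fd < 0 is assumed to mean anon. */
static enum mem_backing backing_from_fd(int fd)
{
    struct statfs fs;

    if (fd < 0)
        return BACKING_ANON;
    if (fstatfs(fd, &fs) != 0)
        return BACKING_OTHER;  /* errno is left set by fstatfs() */
    return backing_from_magic((unsigned int)fs.f_type);
}
```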

In short it'd be better to have something like a generic MADV_REMOVE
that guarantees a non-present fault after it succeeds, no matter what
kind of memory is mapped in the virtual range that has to be
zapped. The above is far from ideal from a userland developer's
perspective.

Overall, fallocate punch hole covers the most cases, so to keep the
code simpler MADV_REMOVE ironically ends up never being used, even
though it provides a friendlier API than fallocate to qemu. The files
are always mapped and the older code only dealt with virtual addresses
(before hugetlbfs and shmem entered the equation). Ideally qemu wants
to call the same madvise regardless of whether the memory is anon,
shmem or hugetlbfs, without having to carry around file descriptors,
file offsets and superblock types.

It's also not clear why MADV_DONTNEED doesn't work for hugetlbfs
!VM_SHARED mappings and why fallocate punch hole is also zapping
private cow-like pages from !VM_SHARED mappings (although if it
didn't, it would be impossible to zap those... so it's good luck it
does).

Thanks,
Andrea

PS. CC'ed also qemu-devel in case it may help clarify why things are
implemented the way they are in the postcopy live migration
hugetlbfs/shmem support and in the future patches for shmem/hugetlbfs
share=on.