linux-kernel - Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning

Message-ID: <66a285fc-e54e-4247-8801-e7e17ad795a6@redhat.com>
Date: Thu, 20 Jun 2024 20:53:07 +0200
From: David Hildenbrand <david@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Fuad Tabba <tabba@...gle.com>, Christoph Hellwig <hch@...radead.org>,
 John Hubbard <jhubbard@...dia.com>, Elliot Berman
 <quic_eberman@...cinc.com>, Andrew Morton <akpm@...ux-foundation.org>,
 Shuah Khan <shuah@...nel.org>, Matthew Wilcox <willy@...radead.org>,
 maz@...nel.org, kvm@...r.kernel.org, linux-arm-msm@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 linux-kselftest@...r.kernel.org, pbonzini@...hat.com
Subject: Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning

On 20.06.24 18:36, Jason Gunthorpe wrote:
> On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
> 
>> If we could disallow pinning any shared pages, that would make life a lot
>> easier, but I think there were reasons for why we might require it. To
>> convert shared->private, simply unmap that folio (only the shared parts
>> could possibly be mapped) from all user page tables.
> 
> IMHO it should be reasonable to make it work like ZONE_MOVABLE and
> FOLL_LONGTERM. Making a shared page private is really no different
> from moving it.
> 
> And if you have built a VMM that uses VMA mapped shared pages and
> short-term pinning then you should really also ensure that the VM is
> aware when the pins go away. For instance if you are doing some virtio
> thing with O_DIRECT pinning then the guest will know the pins are gone
> when it observes virtio completions.
> 
> In this way making private is just like moving, we unmap the page and
> then drive the refcount to zero, then move it.

Yes, but here is the catch: what if a single shared subpage of a large 
folio is (validly) longterm pinned and you want to convert another 
shared subpage to private?

Sure, we can unmap the whole large folio (including all shared parts) 
before the conversion, just like we would do for migration. But we 
cannot detect that nobody pinned that subpage that we want to convert to 
private.

Core-mm is not, and will not, track pins per subpage.
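
To illustrate the problem, here is a toy userspace model (not kernel code; all toy_* names are made up for this sketch). It assumes a single pin counter for the whole large folio, so the subpage index is lost at pin time and the only question we can answer is whether any subpage might be pinned:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: one pin counter for the whole large folio. */
struct toy_folio {
	size_t nr_pages;       /* number of 4k subpages */
	unsigned int pincount; /* pins on any subpage all land here */
};

/* Pinning a subpage only bumps the folio-wide counter; idx is not recorded. */
static void toy_pin_subpage(struct toy_folio *folio, size_t idx)
{
	(void)idx;
	folio->pincount++;
}

static void toy_unpin_subpage(struct toy_folio *folio, size_t idx)
{
	(void)idx;
	folio->pincount--;
}

/* The best available answer: "might someone have *any* subpage pinned?" */
static bool toy_subpage_maybe_pinned(const struct toy_folio *folio, size_t idx)
{
	(void)idx;
	return folio->pincount > 0;
}
```

In this model, a longterm pin on subpage 3 makes subpage 7 look pinned as well, which is exactly why the conversion of subpage 7 cannot be validated.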

So I only see two options:

a) Disallow long-term pinning. That means we can, with a bit of waiting,
    always convert subpages shared->private after unmapping them and
    waiting for the short-term pins to go away. Not too bad, and we
    already have other mechanisms that disallow long-term pinning
    (especially writable fs ones!).

b) Expose the large folio as multiple 4k folios to the core-mm.

b) would look as follows: we allocate a gigantic page from the (hugetlb) 
reserve into guest_memfd. Then, we break it down into individual 4k 
folios by splitting/demoting the folio. We make sure that all 4k folios 
are unmovable (raised refcount). We keep tracking internally that these 
4k folios comprise a single large gigantic page.

Core-mm can now track GUP pins and page table mappings for us per small 
folio (previously per subpage), without any modifications.

Once we unmap the gigantic page from guest_memfd, we reconstruct the 
gigantic page and hand it back to the reserve (only possible once all 
pins are gone).
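
A rough userspace sketch of the bookkeeping in b), again a toy model and not kernel code: the toy_* names and TOY_NR_PAGES are hypothetical, and the model simplifies by treating every extra reference on a small folio as a pin. guest_memfd holds one reference per 4k folio to keep it unmovable, so a refcount of exactly 1 means "nobody else is using this page":

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8 /* tiny stand-in for the 4k pages of a gigantic page */

/* Toy model: guest_memfd remembers that these small folios belong together. */
struct toy_gmem_page {
	unsigned int refcount[TOY_NR_PAGES]; /* one counter per 4k folio */
};

/* Split: each 4k folio gets one reference held by guest_memfd (unmovable). */
static void toy_split(struct toy_gmem_page *g)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		g->refcount[i] = 1;
}

/* Pins are now tracked per 4k folio. */
static void toy_pin(struct toy_gmem_page *g, size_t idx)
{
	g->refcount[idx]++;
}

static void toy_unpin(struct toy_gmem_page *g, size_t idx)
{
	g->refcount[idx]--;
}

/* shared->private of one page only depends on that page's own counter. */
static bool toy_can_convert(const struct toy_gmem_page *g, size_t idx)
{
	return g->refcount[idx] == 1;
}

/* Reconstructing the gigantic page requires all extra references gone. */
static bool toy_can_reconstruct(const struct toy_gmem_page *g)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		if (g->refcount[i] != 1)
			return false;
	return true;
}
```

With a pin on page 3, toy_can_convert() still succeeds for page 7, while toy_can_reconstruct() fails until that pin is dropped.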

We can still map the whole thing into the KVM guest+iommu using a single 
large unit, because guest_memfd knows the origin/relationship of these 
pages. But we would only map individual pages into user page tables 
(unless we use large VM_PFNMAP mappings, but then also pinning would not 
work, so that's likely also not what we want).

The downside is that we won't benefit from vmemmap optimizations for 
large folios from hugetlb, and have more tracking overhead when mapping 
individual pages into user page tables.

OTOH, maybe we really *need* per-page tracking and this might be the 
simplest way forward, making GUP and friends just work naturally with it.

> 
>>> I'm kind of surprised the CC folks don't want the same thing for
>>> exactly the same reason. It is much easier to recover the huge
>>> mappings for the S2 in the presence of shared holes if you track it
>>> this way. Even CC will have this problem, to some degree, too.
>>
>> Precisely! RH (and therefore, me) is primarily interested in existing
>> guest_memfd users at this point ("CC"), and I don't see an easy way to get
>> that running with huge pages in the existing model reasonably well ...
> 
> IMHO it is an important topic so I'm glad you are thinking about it.

Thank my manager ;)

> 
> There is definitely some overlap here where if you do teach
> guest_memfd about huge pages then you must also provide a way to map
> the fragments of them that have become shared. I think there is little
> option here unless you double allocate and/or destroy the performance
> properties of the huge pages.

Right, and that's not what we want.

> 
> It is just the nature of our system that shared pages must be in VMAs
> and must be copy_to/from_user/GUP'able/etc.

Right. Longterm GUP is not a real requirement.

-- 
Cheers,

David / dhildenb