Message-ID: <a809ed94-e3b8-4f1e-964b-44a049de8a81@redhat.com>
Date: Thu, 17 Oct 2024 17:02:22 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, tabba@...gle.com,
quic_eberman@...cinc.com, roypat@...zon.co.uk, jgg@...dia.com,
rientjes@...gle.com, fvdl@...gle.com, jthoughton@...gle.com,
seanjc@...gle.com, pbonzini@...hat.com, zhiquan1.li@...el.com,
fan.du@...el.com, jun.miao@...el.com, isaku.yamahata@...el.com,
muchun.song@...ux.dev, erdemaktas@...gle.com, vannapurve@...gle.com,
qperret@...gle.com, jhubbard@...dia.com, willy@...radead.org,
shuah@...nel.org, brauner@...nel.org, bfoster@...hat.com,
kent.overstreet@...ux.dev, pvorel@...e.cz, rppt@...nel.org,
richard.weiyang@...il.com, anup@...infault.org, haibo1.xu@...el.com,
ajones@...tanamicro.com, vkuznets@...hat.com,
maciej.wieczor-retman@...el.com, pgonda@...gle.com, oliver.upton@...ux.dev,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, kvm@...r.kernel.org,
linux-kselftest@...r.kernel.org
Subject: Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a
struct kvm_gmem_private
On 16.10.24 22:16, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
>> On 16.10.24 01:42, Ackerley Tng wrote:
>>> Peter Xu <peterx@...hat.com> writes:
>>>
>>>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>>>> Peter Xu <peterx@...hat.com> writes:
>>>>>
>>>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>>>> property of the guest_memfd's memory contents.
>>>>>>>
>>>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>>>> boolean.
>>>>>>>
>>>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>>>> private pages, and at other stages there could be more shared pages.
>>>>>>>
>>>>>>> This is likely to be replaced by a better data structure in a future
>>>>>>> revision to better support ranges.
>>>>>>>
>>>>>>> Also store struct kvm_gmem_hugetlb in the inode as a pointer, in
>>>>>>> inode->i_mapping->i_private_data.
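
As an aside, for readers following along: the "presence indicates
faultable" scheme from the quoted commit message can be modeled in a few
lines of userspace C. This is only an illustration of the polarity
choice, with a bitmap standing in for the xarray and function names of
my own invention, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for the faultability xarray: a bitmap over a small range
 * of gfns. A set bit models the presence of the special "FAULT" entry
 * in the real xarray.
 */
#define NR_GFNS 64
static unsigned long long faultable_map;

static void gmem_set_faultable(unsigned int gfn, bool faultable)
{
	if (faultable)
		faultable_map |= 1ULL << gfn;	/* store FAULT entry */
	else
		faultable_map &= ~(1ULL << gfn);	/* erase entry */
}

static bool gmem_is_faultable(unsigned int gfn)
{
	return faultable_map & (1ULL << gfn);
}
```

With this polarity, a shared->private conversion erases the entry
first, so a later userspace fault observes "not faultable" before the
memory attributes are committed; flipping the polarity would just swap
which lifecycle stage needs more entries, as the commit message notes.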
>>>>>>
>>>>>> Could you help explain the difference between faultability v.s. the
>>>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE? Not sure if I'm the only one who's
>>>>>> confused, otherwise might be good to enrich the commit message.
>>>>>
>>>>> Thank you for this question, I'll add this to the commit message in the
>>>>> next revision if Fuad's patch set [1] doesn't make it first.
>>>>>
>>>>> Reason (a): To elaborate on the explanation in [1],
>>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>>>> private or shared, and faultability is whether the page is allowed to be
>>>>> faulted in by userspace.
>>>>>
>>>>> These two are similar but may not be the same thing. In pKVM, pKVM
>>>>> cannot trust userspace's configuration of private/shared, and other
>>>>> information will go into determining the private/shared setting in
>>>>> faultability.
>>>>
>>>> It makes sense to me that the kernel has the right to decide which page is
>>>> shared / private. No matter if it's for pKVM or CoCo, I believe the normal
>>>> case is most / all pages are private, until some requests to share them for
>>>> special purposes (like DMA). But that'll need to be initiated as a request
>>>> from the guest not the userspace hypervisor.
>>>
>>> For TDX, the plan is that the guest will request the page to be remapped
>>> as shared or private, and the handler for that request will exit to
>>> the userspace VMM.
>>>
>>> The userspace VMM will then do any necessary coordination (e.g. for a
>>> shared to private conversion it may need to unpin pages from DMA), and
>>> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
>>> the guest's requested conversion. This is where
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
>>>
>>> Patch 38 [1] updates
>>> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
>>> demonstrate the usage flow for x86.
>>>
>>> Fuad will be in a better position to explain the flow for pKVM.
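
For the x86 flow described above, the userspace VMM's "agreement" step
boils down to one KVM_SET_MEMORY_ATTRIBUTES call. A minimal sketch of
building its argument (struct layout and the PRIVATE bit paraphrased
from the 6.8-era uAPI; the helper name is mine, so double-check against
your <linux/kvm.h>):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local copy of the uAPI definitions, paraphrased from <linux/kvm.h>;
 * verify against your kernel headers before relying on them.
 */
struct kvm_memory_attributes {
	uint64_t address;
	uint64_t size;
	uint64_t attributes;
	uint64_t flags;
};
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)

/*
 * Build the argument a userspace VMM would pass to
 * KVM_SET_MEMORY_ATTRIBUTES to agree to a shared->private conversion
 * of [gfn, gfn + nr_pages). 4 KiB pages assumed.
 */
static struct kvm_memory_attributes to_private(uint64_t gfn, uint64_t nr_pages)
{
	struct kvm_memory_attributes attrs = {
		.address = gfn << 12,
		.size = nr_pages << 12,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags = 0,
	};
	return attrs;
}
```

The actual call would then be ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES,
&attrs); as Ackerley notes further down, issuing it for a range that is
already private is simply a no-op in the kernel.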
>>>
>>>> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
>>>> planned to be used in the future. Currently it's always set at least in
>>>> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>>>>
>>>> If it's driven by the userspace side of the hypervisor, I wonder when
>>>> should the user app request some different value it already was, if the
>>>> kernel already has an answer in this case. It made me even more confused,
>>>> as we have this in the API doc:
>>>>
>>>> Note, there is no "get" API. Userspace is responsible for
>>>> explicitly tracking the state of a gfn/page as needed.
>>>>
>>>> And I do wonder whether we will still need some API just to query whether
>>>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>>>> private status of a guest page). I guess that's not directly relevant to
>>>> the faultability to be introduced here, but if you or anyone know please
>>>> kindly share, I'd love to learn about it.
>>>
>>> The userspace VMM will track the initial shared/private state, in the
>>> sense that when the VM is created, the mem_attr_array is initialized
>>> such that the guest pages are all shared.
>>>
>>> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
>>> it should record all changes so it knows what the state is in the
>>> kernel.
>>>
>>> Even if userspace VMM doesn't record the state properly, if the
>>> KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
>>> (e.g. setting an already private page to private), it will just be a
>>> no-op in the kernel.
>>>
>>>>>
>>>>> Perhaps Fuad can elaborate more here.
>>>>>
>>>>> Reason (b): In this patch series (mostly focus on x86 first), we're
>>>>> using faultability to prevent any future faults before checking that
>>>>> there are no mappings.
>>>>>
>>>>> Having a different xarray from mem_attr_array allows us to disable
>>>>> faulting before committing to changing mem_attr_array. Please see
>>>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>>>
>>>>> We're not completely sure about the effectiveness of using faultability
>>>>> to block off future faults here, in future revisions we may be using a
>>>>> different approach. The folio_lock() is probably important if we need to
>>>>> check mapcount. Please let me know if you have any ideas!
>>>>>
>>>>> The starting point of having a different xarray was pKVM's requirement
>>>>> of having separate xarrays, and we later realized that the xarray could
>>>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>>>> second xarray? Not sure as of now.
>>>>
>>>> Just had a quick look at patch 27:
>>>>
>>>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>>>
>>>> I'm not yet sure what's protecting from faultability being modified against
>>>> a concurrent fault().
>>>>
>>>> I wonder whether one can use the folio lock to serialize that, so that one
>>>> needs to take the folio lock to modify/lookup the folio's faultability,
>>>> then it may naturally match with the fault() handler design, where
>>>> kvm_gmem_get_folio() needs to lock the page first.
>>>>
>>>> But then kvm_gmem_is_faultable() will need to also be called only after the
>>>> folio is locked to avoid races.
>>>
>>> My bad. In our rush to get this series out before LPC, the patch series
>>> was not organized very well. Patch 39 [2] adds the
>>> lock. filemap_invalidate_lock_shared() should make sure that faulting
>>> doesn't race with faultability updates.
>>>
>>>>>> The latter is per-slot, so one level higher, however I don't think it's a
>>>>>> common use case for mapping the same gmemfd in multiple slots anyway for
>>>>>> KVM (besides corner cases like live upgrade). So perhaps this is not about
>>>>>> layering but something else? For example, any use case where PRIVATE and
>>>>>> FAULTABLE can be reported with different values.
>>>>>>
>>>>>> Another higher level question is, is there any plan to support non-CoCo
>>>>>> context for 1G?
>>>>>
>>>>> I believe guest_memfd users are generally in favor of eventually using
>>>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>>>> in the case of CoCo) page support.
>>>>>
>>>>> However, core-mm's fault path does not support mapping at anything
>>>>> higher than the PMD level (other than hugetlb_fault(), which the
>>>>> community wants to move away from), so core-mm wouldn't be able to map
>>>>> 1G pages taken from HugeTLB.
>>>>
>>>> Have you looked at vm_operations_struct.huge_fault()? Or maybe you're
>>>> referring to some other challenges?
>>>>
>>>
>>> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
>>> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
>>> core-mm using vm_operations_struct.huge_fault().
>>
>>
>> Just to clarify a bit for Peter: as has been discussed previously, there are
>> rather big difference between CoCo and non-CoCo VMs.
>>
>> In CoCo VMs, the primary portion of all pages are private, and they are not
>> mapped into user space. Only a handful of pages are commonly shared and
>> mapped into user space.
>>
>> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
>> mapped into user space from where KVM will consume them.
>>
>>
>> Installing pmd/pud mappings into user space (recall: shared memory only) is
>> currently not really a requirement for CoCo VMs, and therefore not the focus
>> of this work.
>>
>> Further, it's currently considered to be incompatible with getting in-place
>> private<->share conversion on *page* granularity right, as we will be
>> exposing huge/gigantic folios via individual small folios to core-MM.
>> Mapping a PMD/PUD into core-mm, that is composed of multiple folios is not
>> going to fly, unless using a PFNMAP, which has been briefly discussed as
>> well, but disregarded so far (no page pinning support).
>>
>> So in the context of this work here, huge faults and PUD/PMD *user space
>> page tables* do not apply.
>>
>> For non-CoCo VMs there is no in-place conversion problem. One could use the
>> same CoCo implementation, but without user space pud/pmd mappings. KVM and
>> VFIO would have to consume this memory via the guest_memfd in memslots
>> instead of via the user space mappings to more easily get PMD/PUD mappings
>> into the secondary MMU. And the downsides would be sacrificing the vmemmap
>
> Is there chance that when !CoCo will be supported, then external modules
> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
I expect this at least initially to be the case. At some point, we might
see a transition to fd+offset for some interfaces.
I recall that there was a similar discussion when specifying "shared"
memory in a KVM memory slot that will be backed by a guest_memfd:
initially, this would be via VA and not via guest_memfd+offset. I recall
Sean and James want it to stay that way (sorry if I am wrong!), and
James might require that to get the fancy uffd mechanism flying.
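
Concretely, the "VA for shared, fd+offset for private" split already
shows up in the memslot uAPI. A sketch of filling one such slot (struct
layout and flag paraphrased from the KVM_SET_USER_MEMORY_REGION2 uAPI
as merged around 6.8; helper name is mine, so check <linux/kvm.h>):

```c
#include <assert.h>
#include <stdint.h>

/* Paraphrased from the KVM_SET_USER_MEMORY_REGION2 uAPI; verify
 * against your kernel's <linux/kvm.h>. */
struct kvm_userspace_memory_region2 {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;	/* shared pages: plain VA, GUP-able */
	uint64_t guest_memfd_offset;	/* private pages: fd + offset */
	uint32_t guest_memfd;
	uint32_t pad1;
	uint64_t pad2[14];
};
#define KVM_MEM_GUEST_MEMFD (1UL << 2)

/*
 * One slot carrying both views: KVM consumes private pages via
 * guest_memfd+offset, while shared pages still come from the VA
 * mapping -- which is what lets VFIO and friends keep GUPing VAs as
 * before.
 */
static struct kvm_userspace_memory_region2
make_slot(uint32_t slot, uint64_t gpa, uint64_t size, uint64_t hva, int gmem_fd)
{
	struct kvm_userspace_memory_region2 r = {
		.slot = slot,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = hva,
		.guest_memfd_offset = 0,
		.guest_memfd = (uint32_t)gmem_fd,
	};
	return r;
}
```

A transition to fd+offset for other interfaces would then mean handing
them the (guest_memfd, guest_memfd_offset) pair instead of the VA.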
>
> To support CoCo, I understand gmem+offset is required all over the places.
> However in a non-CoCo context, I wonder whether the other modules are
> required to stick with gmem+offset, or they can reuse the old VA ways,
> because how it works can fundamentally be the same as before, except that
> the folios now will be managed by gmemfd.
>
> I think the good thing with such approach is when developing CoCo support
> for all these modules, there's less constraints / concerns to be compatible
> with non-CoCo use case, also it'll make it even easier to be used in
> production before all CoCo facilities ready, as most infrastructures are
> already around and being used for years if VA can be mapped and GUPed like
> before.
Right, but even if most interfaces support guest_memfd+offset, things
like DIRECT_IO to shared guest memory will require VA+GUP (someone
brought that up at LPC).
--
Cheers,
David / dhildenb