Message-ID: <1d243dde-2ddf-4875-890d-e6bb47931e40@redhat.com>
Date: Wed, 16 Oct 2024 10:45:43 +0200
From: David Hildenbrand <david@...hat.com>
To: Ackerley Tng <ackerleytng@...gle.com>, Peter Xu <peterx@...hat.com>
Cc: tabba@...gle.com, quic_eberman@...cinc.com, roypat@...zon.co.uk,
 jgg@...dia.com, rientjes@...gle.com, fvdl@...gle.com, jthoughton@...gle.com,
 seanjc@...gle.com, pbonzini@...hat.com, zhiquan1.li@...el.com,
 fan.du@...el.com, jun.miao@...el.com, isaku.yamahata@...el.com,
 muchun.song@...ux.dev, erdemaktas@...gle.com, vannapurve@...gle.com,
 qperret@...gle.com, jhubbard@...dia.com, willy@...radead.org,
 shuah@...nel.org, brauner@...nel.org, bfoster@...hat.com,
 kent.overstreet@...ux.dev, pvorel@...e.cz, rppt@...nel.org,
 richard.weiyang@...il.com, anup@...infault.org, haibo1.xu@...el.com,
 ajones@...tanamicro.com, vkuznets@...hat.com,
 maciej.wieczor-retman@...el.com, pgonda@...gle.com, oliver.upton@...ux.dev,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, kvm@...r.kernel.org,
 linux-kselftest@...r.kernel.org
Subject: Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a
 struct kvm_gmem_private

On 16.10.24 01:42, Ackerley Tng wrote:
> Peter Xu <peterx@...hat.com> writes:
> 
>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>> Peter Xu <peterx@...hat.com> writes:
>>>
>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>> property of the guest_memfd's memory contents.
>>>>>
>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>> boolean.
>>>>>
>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>> private pages, and at other stages there could be more shared pages.
>>>>>
>>>>> This is likely to be replaced by a better data structure in a future
>>>>> revision to better support ranges.
>>>>>
>>>>> Also store struct kvm_gmem_hugetlb as a pointer in struct
>>>>> kvm_gmem_private, which is stored at inode->i_mapping->i_private_data.
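
To make the xarray idea above a bit more concrete, here is a minimal
sketch (helper names and the magic constant are made up for illustration;
this is not the actual patch code) with presence of an entry meaning
"faultable":

#include <linux/xarray.h>

/* Stand-in for the "FAULT" special value mentioned in the commit message. */
#define KVM_GMEM_FAULTABLE_MAGIC        0xf4

/* Presence of any entry at the index marks that page index as faultable. */
static bool kvm_gmem_index_is_faultable(struct xarray *faultability,
                                        pgoff_t index)
{
        return xa_load(faultability, index) != NULL;
}

/* Store the magic value (rather than a plain bool) to mark an index
 * faultable; xa_err() turns a failed store into a -errno. */
static int kvm_gmem_mark_faultable(struct xarray *faultability, pgoff_t index)
{
        void *old = xa_store(faultability, index,
                             xa_mk_value(KVM_GMEM_FAULTABLE_MAGIC), GFP_KERNEL);

        return xa_err(old);
}
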
>>>>
>>>> Could you help explain the difference between faultability v.s. the
>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>>>> confused, otherwise might be good to enrich the commit message.
>>>
>>> Thank you for this question, I'll add this to the commit message in the
>>> next revision, if Fuad's patch set [1] doesn't make it in first.
>>>
>>> Reason (a): To elaborate on the explanation in [1],
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>> private or shared, and faultability is whether the page is allowed to be
>>> faulted in by userspace.
>>>
>>> These two are similar but may not be the same thing. In pKVM, for
>>> example, the hypervisor cannot trust userspace's configuration of
>>> private/shared, and other information goes into determining the
>>> private/shared setting in faultability.
>>
>> It makes sense to me that the kernel has the right to decide which page is
>> shared / private.  Whether it's pKVM or CoCo, I believe the normal case is
>> that most / all pages are private, until something requests to share them
>> for special purposes (like DMA).  But that'll need to be initiated as a
>> request from the guest, not the userspace hypervisor.
> 
> For TDX, the plan is that the guest will request the page to be remapped
> as shared or private, and the handler for that request will exit to
> the userspace VMM.
> 
> The userspace VMM will then do any necessary coordination (e.g. for a
> shared to private conversion it may need to unpin pages from DMA), and
> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
> the guest's requested conversion. This is where
> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
> 
> Patch 38 [1] updates
> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
> demonstrate the usage flow for x86.
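
As a rough userspace-side illustration of that last step (function name is
made up, error handling omitted; the struct and ioctl come from
<linux/kvm.h>), the VMM's agreement boils down to something like:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Mark [gpa, gpa + size) private once the guest has requested the
 * conversion and the VMM has finished its own coordination (e.g.
 * unpinning pages used for DMA). */
static int vmm_set_range_private(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_memory_attributes attrs = {
                .address    = gpa,
                .size       = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}
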
> 
> Fuad will be in a better position to explain the flow for pKVM.
> 
>> I must confess I have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is planned
>> to be used in the future. Currently it's always set (at least in QEMU) if
>> gmemfd is enabled, so it doesn't tell me anything yet.
>>
>> If it's driven by the userspace side of the hypervisor, I wonder when the
>> user app should request a value different from the current one, if the
>> kernel already has an answer in this case.  It makes me even more
>> confused, as we have this in the API doc:
>>
>>          Note, there is no "get" API.  Userspace is responsible for
>>          explicitly tracking the state of a gfn/page as needed.
>>
>> And I do wonder whether we will still need some API just to query whether
>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>> private status of a guest page).  I guess that's not directly relevant to
>> the faultability to be introduced here, but if you or anyone else knows,
>> please share; I'd love to learn about it.
> 
> The userspace VMM will track the initial shared/private state, in the
> sense that when the VM is created, the mem_attr_array is initialized
> such that the guest pages are all shared.
> 
> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
> it should record all changes so it knows what the state is in the
> kernel.
> 
> Even if the userspace VMM doesn't record the state properly, a
> KVM_SET_MEMORY_ATTRIBUTES call that requests no change (e.g. setting an
> already-private page to private) will just be a no-op in the kernel.
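
Purely as an illustration of that bookkeeping (hypothetical structure, one
bit per gfn, updated whenever KVM_SET_MEMORY_ATTRIBUTES succeeds):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical VMM-side tracker: KVM offers no "get" API for memory
 * attributes, so the VMM remembers what it last set for each gfn. */
struct gfn_attr_tracker {
        uint64_t *private_bitmap;       /* one bit per guest page frame */
        uint64_t nr_gfns;
};

static void tracker_record_private(struct gfn_attr_tracker *t, uint64_t gfn,
                                   bool is_private)
{
        if (is_private)
                t->private_bitmap[gfn / 64] |= 1ULL << (gfn % 64);
        else
                t->private_bitmap[gfn / 64] &= ~(1ULL << (gfn % 64));
}

static bool tracker_is_private(const struct gfn_attr_tracker *t, uint64_t gfn)
{
        return t->private_bitmap[gfn / 64] & (1ULL << (gfn % 64));
}
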
> 
>>>
>>> Perhaps Fuad can elaborate more here.
>>>
>>> Reason (b): In this patch series (mostly focused on x86 first), we're
>>> using faultability to prevent any future faults before checking that
>>> there are no mappings.
>>>
>>> Having a different xarray from mem_attr_array allows us to disable
>>> faulting before committing to changing mem_attr_array. Please see
>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>
>>> We're not completely sure about the effectiveness of using faultability
>>> to block off future faults here; in future revisions we may use a
>>> different approach. The folio_lock() is probably important if we need to
>>> check the mapcount. Please let me know if you have any ideas!
>>>
>>> The starting point of having a different xarray was pKVM's requirement
>>> of having separate xarrays, and we later realized that the xarray could
>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>> second xarray? Not sure as of now.
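
To illustrate the ordering in reason (b) (a loose sketch, the helpers are
made up and this is not the code from [2]): faulting is disabled first,
then the "no mappings" check runs, and only on success is the
mem_attr_array change committed.

static int kvm_gmem_convert_to_private_sketch(struct inode *inode,
                                              pgoff_t start, pgoff_t end)
{
        int ret;

        filemap_invalidate_lock(inode->i_mapping);

        /* Step 1: block future user space faults on the range. */
        kvm_gmem_set_faultable(inode, start, end, false);       /* hypothetical */

        /* Step 2: verify nothing is currently mapped before committing. */
        ret = kvm_gmem_check_no_mappings(inode, start, end);    /* hypothetical */
        if (ret)
                /* Roll back: the range stays shared and faultable. */
                kvm_gmem_set_faultable(inode, start, end, true);

        filemap_invalidate_unlock(inode->i_mapping);

        /* Step 3 (not shown): on success, mem_attr_array is updated. */
        return ret;
}
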
>>
>> Just had a quick look at patch 27:
>>
>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>
>> I'm not yet sure what protects faultability from being modified
>> concurrently with a fault().
>>
>> I wonder whether one can use the folio lock to serialize that, so that one
>> needs to take the folio lock to modify/lookup the folio's faultability,
>> then it may naturally match with the fault() handler design, where
>> kvm_gmem_get_folio() needs to lock the page first.
>>
>> But then kvm_gmem_is_faultable() will also need to be called only after
>> the folio is locked to avoid races.
> 
> My bad. In our rush to get this series out before LPC, the patch series
> was not organized very well. Patch 39 [2] adds the
> lock. filemap_invalidate_lock_shared() should make sure that faulting
> doesn't race with faultability updates.
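
In other words (rough sketch, hypothetical helpers, not the actual patch
code): the fault path takes the invalidate lock shared, while faultability
updates take it exclusively, so the lookup and the fault cannot race with
a conversion.

static vm_fault_t kvm_gmem_fault_sketch(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);
        vm_fault_t ret = VM_FAULT_SIGBUS;

        /* Faultability updates hold this lock exclusively, so the check
         * below and the folio mapping stay consistent with conversions. */
        filemap_invalidate_lock_shared(inode->i_mapping);
        if (kvm_gmem_is_faultable(inode, vmf->pgoff))   /* hypothetical */
                ret = kvm_gmem_fault_locked(vmf);       /* hypothetical */
        filemap_invalidate_unlock_shared(inode->i_mapping);

        return ret;
}
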
> 
>>>> The latter is per-slot, so one level higher; however, I don't think it's
>>>> a common use case to map the same gmemfd into multiple slots anyway for
>>>> KVM (besides corner cases like live upgrade).  So perhaps this is not
>>>> about layering but something else?  For example, any use case where
>>>> PRIVATE and FAULTABLE can be reported with different values.
>>>>
>>>> Another higher level question is, is there any plan to support non-CoCo
>>>> context for 1G?
>>>
>>> I believe guest_memfd users are generally in favor of eventually using
>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>> in the case of CoCo) page support.
>>>
>>> However, core-mm's fault path does not support mapping at anything
>>> higher than the PMD level (other than hugetlb_fault(), which the
>>> community wants to move away from), so core-mm wouldn't be able to map
>>> 1G pages taken from HugeTLB.
>>
>> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
>> referring to some other challenges?
>>
> 
> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
> core-mm using vm_operations_struct.huge_fault().


Just to clarify a bit for Peter: as has been discussed previously, there
are rather big differences between CoCo and non-CoCo VMs.

In CoCo VMs, the vast majority of pages are private, and they are not
mapped into user space. Only a handful of pages are commonly shared and
mapped into user space.

In non-CoCo VMs, all pages are shared and (for the time being) all pages 
are mapped into user space from where KVM will consume them.


Installing pmd/pud mappings into user space (recall: shared memory only) 
is currently not really a requirement for CoCo VMs, and therefore not 
the focus of this work.

Further, it's currently considered to be incompatible with getting
in-place private<->shared conversion at *page* granularity right, as we
will be exposing huge/gigantic folios to core-MM via individual small
folios. Mapping a PMD/PUD that is composed of multiple folios into
core-MM is not going to fly, unless we use a PFNMAP, which has been
briefly discussed as well but disregarded so far (no page pinning
support).

So in the context of this work here, huge faults and PUD/PMD *user space 
page tables* do not apply.

For non-CoCo VMs there is no in-place conversion problem. One could use
the same CoCo implementation, but without user space pud/pmd mappings.
KVM and VFIO would have to consume this memory via the guest_memfd in
memslots instead of via the user space mappings, to more easily get
PMD/PUD mappings into the secondary MMU. The downsides would be
sacrificing the vmemmap optimization and PMD/PUD user space mappings;
the benefit would be being able to easily map only parts of a
huge/gigantic page into user space.


So I consider pmd/pud user space mappings for non-CoCo an independent 
work item, not something that is part of the current effort of 
huge/gigantic pages with in-place conversion at page granularity for 
CoCo VMs.


More information is available in the recordings of the bi-weekly upstream
MM meetings and in the LPC talks, where most of this has been discussed.

-- 
Cheers,

David / dhildenb

