[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240807113514068-0700.eberman@hu-eberman-lv.qualcomm.com>
Date: Wed, 7 Aug 2024 12:06:33 -0700
From: Elliot Berman <quic_eberman@...cinc.com>
To: Patrick Roy <roypat@...zon.co.uk>
CC: Andrew Morton <akpm@...ux-foundation.org>,
Paolo Bonzini
<pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>,
Fuad Tabba
<tabba@...gle.com>, David Hildenbrand <david@...hat.com>,
<qperret@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
<linux-coco@...ts.linux.dev>, <linux-arm-msm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<kvm@...r.kernel.org>, James Gowans <jgowans@...zon.com>,
"Kalyazin, Nikita"
<kalyazin@...zon.co.uk>,
"Manwaring, Derek" <derekmn@...zon.com>,
"Cali,
Marco" <xmarcalx@...zon.co.uk>
Subject: Re: [PATCH RFC 3/4] mm: guest_memfd: Add option to remove guest
private memory from direct map
On Wed, Aug 07, 2024 at 11:57:35AM +0100, Patrick Roy wrote:
> On Wed, 2024-08-07 at 07:48 +0100, Patrick Roy wrote:
> > On Tue, 2024-08-06 at 21:13 +0100, Elliot Berman wrote:
> >> On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote:
> >>> On the other hand, as Paolo pointed out in my patches [1], just using a
> >>> page flag to track direct map presence for gmem is not enough. We
> >>> actually need to keep a refcount in folio->private to keep track of how
> >>> many different actors request a folio's direct map presence (in the
> >>> specific case in my patch series, it was different pfn_to_gfn_caches for
> >>> the kvm-clock structures of different vcpus, which the guest can place
> >>> into the same gfn). While this might not be a concern for the the
> >>> pKVM/Gunyah case, where the guest dictates memory state, it's required
> >>> for the non-CoCo case where KVM/userspace can set arbitrary guest gfns
> >>> to shared if it needs/wants to access them for whatever reason. So for
> >>> this we'd need to have PG_private=1 mean "direct map entry restored" (as
> >>> if PG_private=0, there is no folio->private).
> >>>
> >>> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315
> >>>
> >>
> >> I wonder if we can use the folio refcount itself, assuming we can rely
> >> on refcount == 1 means we can do shared->private conversion.
> >>
> >> In gpc_map_gmem, we convert private->shared. There's no problem here in
> >> the non-CoCo case.
> >>
> >> In gpc_unmap, we *try* to convert back from shared->private. If
> >> refcount>2, then the conversion would fail. The last gpc_unmap would be
> >> able to successfully convert back to private.
> >>
> >> Do you see any concerns with this approach?
> >
> > The gfn_to_pfn_cache does not keep an elevated refcount on the cached
> > page, and instead responds to MMU notifiers to detect whether the cached
> > translation has been invalidated, iirc. So the folio refcount will
> > not reflect the number of gpcs holding that folio.
> >
Ah, fair enough. This is kinda like a GUP pin which would prevent us
from making page private, but without the pin part.
[...]
> >>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>> {
> >>>> + unsigned long gmem_flags = (unsigned long)file->private_data;
> >>>> struct inode *inode = file_inode(file);
> >>>> struct guest_memfd_operations *ops = inode->i_private;
> >>>> struct folio *folio;
> >>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> >>>> goto out_err;
> >>>> }
> >>>>
> >>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> >>>> + r = guest_memfd_folio_private(folio);
> >>>> + if (r)
> >>>> + goto out_err;
> >>>> + }
> >>>> +
> >>>
> >>> How does a caller of guest_memfd_grab_folio know whether a folio needs
> >>> to be removed from the direct map? E.g. how can a caller know ahead of
> >>> time whether guest_memfd_grab_folio will return a freshly allocated
> >>> folio (which thus needs to be removed from the direct map), vs a folio
> >>> that already exists and has been removed from the direct map (probably
> >>> fine to remove from direct map again), vs a folio that already exists
> >>> and is currently re-inserted into the direct map for whatever reason
> >>> (must not remove these from the direct map, as other parts of
> >>> KVM/userspace probably don't expect the direct map entries to disappear
> >>> from underneath them). I couldn't figure this one out for my series,
> >>> which is why I went with hooking into the PG_uptodate logic to always
> >>> remove direct map entries on freshly allocated folios.
> >>>
> >>
> >> gmem_flags come from the owner. If the caller (in non-CoCo case) wants
>
> Ah, oops, I got it mixed up with the new `flags` parameter.
>
> >> to restore the direct map right away, it'd have to be a direct
> >> operation. As an optimization, we could add option that asks for page in
> >> "shared" state. If allocating new page, we can return it right away
> >> without removing from direct map. If grabbing existing folio, it would
> >> try to do the private->shared conversion.
>
> My concern is more with the implicit shared->private conversion that
> happens on every call to guest_memfd_grab_folio (and thus
> kvm_gmem_get_pfn) when grabbing existing folios. If something else
> marked the folio as shared, then we cannot punch it out of the direct
> map again until that something is done using the folio (when working on
> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
> temporarily marked shared, as I was seeing panics because of this). And
> if the folio is currently private, there's nothing to do. So either way,
> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
> folios.
>
What I did could be documented/commented better.
If ops->accessible() is *not* provided, all guest_memfd allocations will
immediately remove from direct map and treat them immediately like guest
private (goal is to match what KVM does today on tip).
If ops->accessible() is provided, then guest_memfd allocations start
as "shared" and KVM/Gunyah need to do the shared->private conversion
when they want to do the private conversion on the folio. "Shared" is
the default because that is effectively a no-op.
For the non-CoCo case you're interested in, we'd have the
ops->accessible() provided and we wouldn't pull out the direct map from
gpc.
Thanks,
Elliot
Powered by blists - more mailing lists