lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a7c5bfc0-1648-4ae1-ba08-e706596e014b@redhat.com>
Date: Wed, 7 Aug 2024 18:12:00 +0200
From: David Hildenbrand <david@...hat.com>
To: Elliot Berman <quic_eberman@...cinc.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Paolo Bonzini <pbonzini@...hat.com>, Sean Christopherson
 <seanjc@...gle.com>, Fuad Tabba <tabba@...gle.com>,
 Patrick Roy <roypat@...zon.co.uk>, qperret@...gle.com,
 Ackerley Tng <ackerleytng@...gle.com>, linux-coco@...ts.linux.dev,
 linux-arm-msm@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, kvm@...r.kernel.org
Subject: Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages

On 06.08.24 19:14, Elliot Berman wrote:
> On Tue, Aug 06, 2024 at 03:51:22PM +0200, David Hildenbrand wrote:
>>> -	if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
>>> +	if (!ops->accessible && (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)) {
>>>    		r = guest_memfd_folio_private(folio);
>>>    		if (r)
>>>    			goto out_err;
>>> @@ -107,6 +109,82 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
>>>    }
>>>    EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);
>>> +int guest_memfd_make_inaccessible(struct file *file, struct folio *folio)
>>> +{
>>> +	unsigned long gmem_flags = (unsigned long)file->private_data;
>>> +	unsigned long i;
>>> +	int r;
>>> +
>>> +	unmap_mapping_folio(folio);
>>> +
>>> +	/**
>>> +	 * We can't use the refcount. It might be elevated due to
>>> +	 * guest/vcpu trying to access same folio as another vcpu
>>> +	 * or because userspace is trying to access folio for same reason
>>
>> As discussed, that's insufficient. We really have to drive the refcount to 1
>> -- the single reference we expect.
>>
>> What is the exact problem you are running into here? Who can just grab a
>> reference and maybe do nasty things with it?
>>
> 
> Right, I remember we had discussed it. The problem I faced was if 2
> vcpus fault on same page, they would race to look up the folio in
> filemap, increment refcount, then try to lock the folio. One of the
> vcpus wins the lock, while the other waits. The vcpu that gets the
> lock vcpu will see the elevated refcount.
> 
> I was in middle of writing an explanation why I think this is best
> approach and realized I think it should be possible to do
> shared->private conversion and actually have single reference. There
> would be some cost to walk through the allocated folios and convert them
> to private before any vcpu runs. The approach I had gone with was to
> do conversions as late as possible.

We certainly have to support conversion while the VCPUs are running.

The VCPUs might be able to avoid grabbing a folio reference for the 
conversion and only do the folio_lock(): as long as we have a guarantee 
that we will disallow freeing the folio in gmem, for example, by syncing 
against FALLOC_FL_PUNCH_HOLE.

So if we can rely on the "gmem" reference to the folio that cannot go 
away while we do what we do, we should be fine.

<random though>

Meanwhile, I was thinking if we would want to track the references we
hand out to "safe" users differently.

Safe references would only be references that would survive a 
private<->shared conversion, like KVM MMU mappings maybe?

KVM would then have to be thought to return these gmem references 
differently.

The idea would be to track these "safe" references differently 
(page->private?) and only allow dropping *our* guest_memfd reference if 
all these "safe" references are gone. That is, FALLOC_FL_PUNCH_HOLE 
would also fail if there are any "safe" reference remaining.

<\random though>

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ