linux-kernel - Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240816164546141-0700.eberman@hu-eberman-lv.qualcomm.com>
Date: Fri, 16 Aug 2024 16:52:46 -0700
From: Elliot Berman <quic_eberman@...cinc.com>
To: David Hildenbrand <david@...hat.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, Fuad Tabba <tabba@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Paolo Bonzini
	<pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Patrick Roy
	<roypat@...zon.co.uk>, <qperret@...gle.com>,
        <linux-coco@...ts.linux.dev>, <linux-arm-msm@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
        <kvm@...r.kernel.org>
Subject: Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages

On Sat, Aug 17, 2024 at 12:03:50AM +0200, David Hildenbrand wrote:
> On 16.08.24 23:52, Ackerley Tng wrote:
> > David Hildenbrand <david@...hat.com> writes:
> > 
> > > On 16.08.24 19:45, Ackerley Tng wrote:
> > > > 
> > > > <snip>
> > > > 
> > > > IIUC folio_lock() isn't a prerequisite for taking a refcount on the
> > > > folio.
> > > 
> > > Right, to do folio_lock() you only have to guarantee that the folio
> > > cannot get freed concurrently. So you piggyback on another reference
> > > (you hold indirectly).
> > > 
> > > > 
> > > > Even if we are able to figure out a "safe" refcount, and check that the
> > > > current refcount == "safe" refcount before removing from direct map,
> > > > what's stopping some other part of the kernel from taking a refcount
> > > > just after the check happens and causing trouble with the folio's
> > > > removal from direct map?
> > > 
> > > Once the page was unmapped from user space, and there were no additional
> > > references (e.g., GUP, whatever), any new references can only be
> > > (should, unless BUG :) ) temporary speculative references that should
> > > not try accessing page content, and that should back off if the folio is
> > > not deemed interesting or cannot be locked. (e.g., page
> > > migration/compaction/offlining).
> > 
> > I thought about it again - I think the vmsplice() cases are taken care
> > of once we check that the folios are not mapped into userspace, since
> > vmsplice() reads from a mapping.
> > 
> > splice() reads from the fd directly, but that's taken care since
> > guest_memfd doesn't have a .splice_read() handler.
> > 
> > Reading /proc/pid/mem also requires the pages to first be mapped, IIUC,
> > otherwise the pages won't show up, so checking that there are no more
> > mappings to userspace takes care of this.
> 
> You have a misconception.
> 
> You can map pages to user space, GUP them, and then unmap them from user
> space. A GUP reference can outlive your user space mappings, easily.
> 
> So once there is a raised refcount, it could as well just be from vmsplice,
> or a pending reference from /proc/pid/mem, O_DIRECT, ...
> 
> > 
> > > 
> > > Of course, there are some corner cases (kgdb, hibernation, /proc/kcore),
> > > but most of these can be dealt with in one way or the other (make these
> > > back off and not read/write page content, similar to how we handled it
> > > for secretmem).
> > 
> > Does that really leave us with these corner cases? And so perhaps we
> > could get away with just taking the folio_lock() to keep away the
> > speculative references? So something like
> > 
> >    1. Check that the folio is not mapped and not pinned.
> 
> To do that, you have to lookup the folio first. That currently requires a
> refcount increment, even if only temporarily. Maybe we could avoid that, if
> we can guarantee that we are the only one modifying the pageache here, and
> we sync against that ourselves.
> 
> >    2. folio_lock() all the folios about to be removed from direct map
> >    -- With the lock, all other accesses should be speculative --
> >    3. Check that the refcount == "safe" refcount
> >        3a. Unlock and return to userspace with -EAGAIN
> >    4. Remove from direct map
> >    5. folio_unlock() all those folios
> > 
> > Perhaps a very naive question: can the "safe" refcount be statically
> > determined by walking through the code and counting where refcount is
> > expected to be incremented?
> 
> 
> Depends on how we design it. But if you hand out "safe" references to KVM
> etc, you'd have to track that -- and how often -- somehow. At which point we
> are at "increment/decrement" safe reference to track that for you.
>

Just a status update: I've gotten the "safe" reference counter
implementation working for Gunyah now. It feels a bit flimsy because
we're juggling 3 reference counters*, but it seems like the right thing
to do after all the discussions here. It's passing all the Gunyah unit
tests I have which have so far been pretty good at finding issues.

I need to clean up the patches now and I'm aiming to have it out for RFC
next week.

* folio refcount, "accessible" refcount, and "safe" refcount

Thanks,
Elliot