linux-kernel - Re: [PATCH] vfio/type1: Unpin zero pages

Message-ID: <20220907095552.336c8f34.alex.williamson@redhat.com>
Date:   Wed, 7 Sep 2022 09:55:52 -0600
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     David Hildenbrand <david@...hat.com>,
        "Tian, Kevin" <kevin.tian@...el.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "lpivarc@...hat.com" <lpivarc@...hat.com>,
        "Liu, Jingqi" <jingqi.liu@...el.com>,
        "Lu, Baolu" <baolu.lu@...el.com>
Subject: Re: [PATCH] vfio/type1: Unpin zero pages

On Wed, 7 Sep 2022 09:48:59 -0300
Jason Gunthorpe <jgg@...pe.ca> wrote:

> On Wed, Sep 07, 2022 at 11:00:21AM +0200, David Hildenbrand wrote:
> > > > I do wonder if that's a real issue, though. One approach would be to
> > > > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
> > > > for a while. Of course, that only works if we assume that such pinned
> > > > zeropages are only extremely rarely longterm-pinned for a single VM
> > > > instance by VFIO.  
> > > 
> > > I'm confused, doesn't vfio increment the memlock for every page of VA
> > > it pins? Why would it matter if the page was COW'd or not? It is
> > > already accounted for today as though it was a unique page.
> > > 
> > > IOW if we add FOLL_FORCE it won't change the value of the memlock.  
> > 
> > > I only briefly skimmed over the code; Alex might be able to provide more
> > details and correct me if I'm wrong:
> > 
> > vfio_pin_pages_remote() contains a comment:
> > 
> > "Reserved pages aren't counted against the user, externally pinned pages are
> > already counted against the user."
> > 
> > is_invalid_reserved_pfn() should return "true" for the shared zeropage and
> > prevent us from accounting it via vfio_lock_acct(). Otherwise,
> > vfio_find_vpfn() seems to be in place to avoid double-accounting pages.  
> 
> is_invalid_reserved_pfn() is supposed to return 'true' for PFNs that
> cannot be returned from pin_user_pages():
> 
> /*
>  * Some mappings aren't backed by a struct page, for example an mmap'd
>  * MMIO range for our own or another device.  These use a different
>  * pfn conversion and shouldn't be tracked as locked pages.
>  * For compound pages, any driver that sets the reserved bit in head
>  * page needs to set the reserved bit in all subpages to be safe.
>  */
> static bool is_invalid_reserved_pfn(unsigned long pfn)
> 
> What it is talking about by 'different pfn conversion' is the
> follow_fault_pfn() path, not the PUP path.
> 
> So, it is some way for VFIO to keep track of when a pfn was returned
> by PUP vs follow_fault_pfn(), because it treats those two paths quite
> differently.
> 
> I lost track of what the original cause of this bug is - however AFAIK
> pin_user_pages() used to succeed when the zero page is mapped.

It does currently, modulo getting broken occasionally.

> No other PUP user calls this follow_fault_pfn() hacky path, and we
> expect things like O_DIRECT to work properly even when reading from VA
> that has the zero page mapped.

The zero page shouldn't take that path; we get the pages via PUP.

> So, if we go back far enough in the git history we will find a case
> where PUP is returning something for the zero page, and that something
> caused is_invalid_reserved_pfn() == false since VFIO did work at some
> point.

Can we assume that?  It takes a while for a refcount leak on the zero
page to cause an overflow.  My assumption is that it's never worked: we
pin zero pages, don't account them against the locked memory limits
because our is_invalid_reserved_pfn() test returns true, and therefore
we don't unpin them.
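
To make the suspected leak concrete, here is a minimal user-space sketch of
the asymmetry described above. It is a toy model, not the vfio code: every
name in it (toy_page, toy_pin(), toy_unpin(), toy_leaked_refs()) is
hypothetical, and the "reserved" flag merely stands in for whatever makes
is_invalid_reserved_pfn() return true for the zero page. It assumes pinning
always takes a reference, while accounting and unpinning are both skipped
for reserved pages.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_page {
	long refcount;	/* stands in for the page reference count */
	bool reserved;	/* stands in for the reserved-page test */
};

/* Plays the role is_invalid_reserved_pfn() has in the discussion. */
static bool toy_is_invalid_reserved(const struct toy_page *page)
{
	return page->reserved;
}

/* Pin: always takes a reference, accounts only non-reserved pages. */
static void toy_pin(struct toy_page *page, long *locked)
{
	page->refcount++;
	if (!toy_is_invalid_reserved(page))
		(*locked)++;
}

/* Unpin: skips reserved pages, so their reference is never dropped. */
static void toy_unpin(struct toy_page *page, long *locked)
{
	if (toy_is_invalid_reserved(page))
		return;
	page->refcount--;
	(*locked)--;
}

/* Run `cycles` pin/unpin rounds and return the leaked references. */
static long toy_leaked_refs(bool reserved, long cycles)
{
	struct toy_page page = { .refcount = 1, .reserved = reserved };
	long locked = 0;

	for (long i = 0; i < cycles; i++) {
		toy_pin(&page, &locked);
		toy_unpin(&page, &locked);
	}
	return page.refcount - 1;
}
```

Under these assumptions, toy_leaked_refs(true, 1000) leaks one reference
per cycle, while toy_leaked_refs(false, 1000) leaks none. That matches the
point that an overflow would only show up after a very large number of
pins.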

> IMHO we should simply go back to the historical behavior - make
> is_invalid_reserved_pfn() check for the zero_pfn and return
> false. Meaning that PUP returned it.

We've never explicitly tested for zero_pfn and, as David notes,
accounting the zero page against the user's locked memory limits has
user-visible consequences.  VMs that worked with a specific locked
memory limit may no longer work.  Logically, this seems to be the one
case of duplicate accounting that we get right relative to the user's
locked memory limit and the current implementation of pinning the zero
page.  We're not locking any resources that aren't effectively already
locked.  Thanks,

Alex
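
The user-visible consequence mentioned above can be illustrated with a
small sketch. This is an assumption-laden toy, not vfio's accounting code:
toy_fits_limit() and its parameters are hypothetical, and the page counts
in the usage note are made-up numbers chosen only to show the effect. It
compares a mapping against a locked memory limit with and without counting
the pages backed by the shared zero page.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: would a mapping of total_pages, of which
 * zero_pages are backed by the shared zero page, fit under a limit of
 * limit_pages?  count_zero_pages selects whether zero-page mappings
 * are charged against the limit.
 */
static bool toy_fits_limit(long total_pages, long zero_pages,
			   long limit_pages, bool count_zero_pages)
{
	long accounted = count_zero_pages ? total_pages
					  : total_pages - zero_pages;

	return accounted <= limit_pages;
}
```

For example, with 1000 mapped pages, 100 of them zero pages, and a limit
of 950 pages, toy_fits_limit(1000, 100, 950, false) is true while
toy_fits_limit(1000, 100, 950, true) is false. A VM sized against the
first behavior would stop working under the second.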