linux-kernel - Re: [PATCH] vfio/type1: Unpin zero pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220907125627.0579e592.alex.williamson@redhat.com>
Date:   Wed, 7 Sep 2022 12:56:27 -0600
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     David Hildenbrand <david@...hat.com>,
        "Tian, Kevin" <kevin.tian@...el.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "lpivarc@...hat.com" <lpivarc@...hat.com>,
        "Liu, Jingqi" <jingqi.liu@...el.com>,
        "Lu, Baolu" <baolu.lu@...el.com>
Subject: Re: [PATCH] vfio/type1: Unpin zero pages

On Wed, 7 Sep 2022 13:40:52 -0300
Jason Gunthorpe <jgg@...pe.ca> wrote:

> On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote:
> 
> > > So, if we go back far enough in the git history we will find a case
> > > where PUP is returning something for the zero page, and that something
> > > caused is_invalid_reserved_pfn() == false since VFIO did work at some
> > > point.  
> > 
> > Can we assume that?  It takes a while for a refcount leak on the zero
> > page to cause an overflow.  My assumption is that it's never worked, we
> > pin zero pages, don't account them against the locked memory limits
> > because our is_invalid_reserved_pfn() test returns true, and therefore
> > we don't unpin them.  
> 
> Oh, you think it has been buggy forever? That is not great..
>  
> > > IHMO we should simply go back to the historical behavior - make
> > > is_invalid_reserved_pfn() check for the zero_pfn and return
> > > false. Meaning that PUP returned it.  
> > 
> > We've never explicitly tested for zero_pfn and as David notes,
> > accounting the zero page against the user's locked memory limits has
> > user visible consequences.  VMs that worked with a specific locked
> > memory limit may no longer work.    
> 
> Yes, but the question is if that is a strict ABI we have to preserve,
> because if you take that view it also means because VFIO has this
> historical bug that David can't fix the FOLL_FORCE issue either.
> 
> If the view holds for memlock then it should hold for cgroups
> also. This means the kernel can never change anything about
> GFP_KERNEL_ACCOUNT allocations because it might impact userspace
> having set a tight limit there.
> 
> It means we can't fix the bug that VFIO is using the wrong storage for
> memlock.
> 
> It means qemu can't change anything about how it sets up this memory,
> ie Kevin's idea to change the ordering.
> 
> On the other hand the "abi break" we are talking about is that a user
> might have to increase a configured limit in a config file after a
> kernel upgrade.
> 
> IDK what consensus exists here, I've never heard of anyone saying
> these limits are a strict ABI like this.. I think at least for cgroup
> that would be so invasive as to immediately be off the table.

I thought we'd already agreed that we were stuck with locked_vm for
type1 and any compatibility mode of type1 due to this.  Native iommufd
support can do the right thing since userspace will need to account for
various new usage models anyway.

I've raised the issue with David for the zero page accounting, but I
don't know what the solution is.  libvirt automatically adds a 1GB
fudge factor to the VM locked memory limits to account for things like
ROM mappings, or at least the non-zeropage backed portion of those
ROMs.  I think that most management tools have adopted similar, so the
majority of users shouldn't notice.  However this won't cover all
users, so we certainly risk breaking userspace if we introduce hard
page accounting of zero pages.

I think David suggested possibly allowing some degree of exceeding
locked memory limits for zero page COWs.  Potentially type1 could do
this as well; special case handling with some heuristically determined,
module parameter defined limit.  We might also consider whether we
could just ignore zero page mappings, maybe with a optional "strict"
mode module option to generate an errno on such mappings.  Thanks,

Alex