Message-ID: <7ed97088-b4f0-e36f-c935-87cd1e94c574@redhat.com>
Date: Wed, 7 Sep 2022 21:14:54 +0200
From: David Hildenbrand <david@...hat.com>
To: Alex Williamson <alex.williamson@...hat.com>,
Jason Gunthorpe <jgg@...pe.ca>
Cc: "Tian, Kevin" <kevin.tian@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"lpivarc@...hat.com" <lpivarc@...hat.com>,
"Liu, Jingqi" <jingqi.liu@...el.com>,
"Lu, Baolu" <baolu.lu@...el.com>
Subject: Re: [PATCH] vfio/type1: Unpin zero pages
On 07.09.22 20:56, Alex Williamson wrote:
> On Wed, 7 Sep 2022 13:40:52 -0300
> Jason Gunthorpe <jgg@...pe.ca> wrote:
>
>> On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote:
>>
>>>> So, if we go back far enough in the git history, we will find a case
>>>> where PUP returned something for the zero page, and that something
>>>> made is_invalid_reserved_pfn() return false, since VFIO did work at
>>>> some point.
>>>
>>> Can we assume that? It takes a while for a refcount leak on the zero
>>> page to cause an overflow. My assumption is that it's never worked, we
>>> pin zero pages, don't account them against the locked memory limits
>>> because our is_invalid_reserved_pfn() test returns true, and therefore
>>> we don't unpin them.
>>
>> Oh, you think it has been buggy forever? That is not great..
>>
>>>> IMHO we should simply go back to the historical behavior - make
>>>> is_invalid_reserved_pfn() check for the zero_pfn and return
>>>> false. Meaning that PUP returned it.
>>>
>>> We've never explicitly tested for zero_pfn and as David notes,
>>> accounting the zero page against the user's locked memory limits has
>>> user visible consequences. VMs that worked with a specific locked
>>> memory limit may no longer work.
>>
>> Yes, but the question is if that is a strict ABI we have to preserve,
>> because if you take that view it also means because VFIO has this
>> historical bug that David can't fix the FOLL_FORCE issue either.
>>
>> If the view holds for memlock then it should hold for cgroups
>> also. This means the kernel can never change anything about
>> GFP_KERNEL_ACCOUNT allocations because it might impact userspace
>> having set a tight limit there.
>>
>> It means we can't fix the bug that VFIO is using the wrong storage for
>> memlock.
>>
>> It means qemu can't change anything about how it sets up this memory,
>> ie Kevin's idea to change the ordering.
>>
>> On the other hand the "abi break" we are talking about is that a user
>> might have to increase a configured limit in a config file after a
>> kernel upgrade.
>>
>> IDK what consensus exists here, I've never heard of anyone saying
>> these limits are a strict ABI like this.. I think at least for cgroup
>> that would be so invasive as to immediately be off the table.
>
> I thought we'd already agreed that we were stuck with locked_vm for
> type1 and any compatibility mode of type1 due to this. Native iommufd
> support can do the right thing since userspace will need to account for
> various new usage models anyway.
>
> I've raised the issue with David for the zero page accounting, but I
> don't know what the solution is. libvirt automatically adds a 1GB
> fudge factor to the VM locked memory limits to account for things like
> ROM mappings, or at least the non-zeropage backed portion of those
> ROMs. I think that most management tools have adopted similar, so the
> majority of users shouldn't notice. However this won't cover all
> users, so we certainly risk breaking userspace if we introduce hard
> page accounting of zero pages.
>
> I think David suggested possibly allowing some degree of exceeding
> locked memory limits for zero page COWs. Potentially type1 could do
> this as well; special case handling with some heuristically determined,
> module parameter defined limit. We might also consider whether we
> could just ignore zero page mappings, maybe with an optional "strict"
> mode module option to generate an errno on such mappings. Thanks,
So far I have played with two ideas:

a) allow slightly exceeding the limit and warn
b) a vfio kernel parameter to control the zero-page behavior (old vs.
new)

I certainly have in mind that we might need some toggle for vfio.
--
Thanks,
David / dhildenb