Message-ID: <7ed97088-b4f0-e36f-c935-87cd1e94c574@redhat.com>
Date: Wed, 7 Sep 2022 21:14:54 +0200
From: David Hildenbrand <david@...hat.com>
To: Alex Williamson <alex.williamson@...hat.com>,
Jason Gunthorpe <jgg@...pe.ca>
Cc: "Tian, Kevin" <kevin.tian@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"lpivarc@...hat.com" <lpivarc@...hat.com>,
"Liu, Jingqi" <jingqi.liu@...el.com>,
"Lu, Baolu" <baolu.lu@...el.com>
Subject: Re: [PATCH] vfio/type1: Unpin zero pages
On 07.09.22 20:56, Alex Williamson wrote:
> On Wed, 7 Sep 2022 13:40:52 -0300
> Jason Gunthorpe <jgg@...pe.ca> wrote:
>
>> On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote:
>>
>>>> So, if we go back far enough in the git history, we will find a case
>>>> where PUP returned something for the zero page, and that something
>>>> made is_invalid_reserved_pfn() return false, since VFIO did work at
>>>> some point.
>>>
>>> Can we assume that? It takes a while for a refcount leak on the zero
>>> page to cause an overflow. My assumption is that it's never worked, we
>>> pin zero pages, don't account them against the locked memory limits
>>> because our is_invalid_reserved_pfn() test returns true, and therefore
>>> we don't unpin them.
>>
>> Oh, you think it has been buggy forever? That is not great..
>>
>>>> IMHO we should simply go back to the historical behavior - make
>>>> is_invalid_reserved_pfn() check for the zero_pfn and return
>>>> false. Meaning that PUP returned it.
>>>
>>> We've never explicitly tested for zero_pfn and as David notes,
>>> accounting the zero page against the user's locked memory limits has
>>> user visible consequences. VMs that worked with a specific locked
>>> memory limit may no longer work.
>>
>> Yes, but the question is if that is a strict ABI we have to preserve,
>> because if you take that view it also means because VFIO has this
>> historical bug that David can't fix the FOLL_FORCE issue either.
>>
>> If the view holds for memlock then it should hold for cgroups
>> also. This means the kernel can never change anything about
>> GFP_KERNEL_ACCOUNT allocations because it might impact userspace
>> having set a tight limit there.
>>
>> It means we can't fix the bug that VFIO is using the wrong storage for
>> memlock.
>>
>> It means qemu can't change anything about how it sets up this memory,
>> ie Kevin's idea to change the ordering.
>>
>> On the other hand the "abi break" we are talking about is that a user
>> might have to increase a configured limit in a config file after a
>> kernel upgrade.
>>
>> IDK what consensus exists here, I've never heard of anyone saying
>> these limits are a strict ABI like this.. I think at least for cgroup
>> that would be so invasive as to immediately be off the table.
>
> I thought we'd already agreed that we were stuck with locked_vm for
> type1 and any compatibility mode of type1 due to this. Native iommufd
> support can do the right thing since userspace will need to account for
> various new usage models anyway.
>
> I've raised the issue with David for the zero page accounting, but I
> don't know what the solution is. libvirt automatically adds a 1GB
> fudge factor to the VM locked memory limits to account for things like
> ROM mappings, or at least the non-zeropage backed portion of those
> ROMs. I think that most management tools have adopted similar, so the
> majority of users shouldn't notice. However this won't cover all
> users, so we certainly risk breaking userspace if we introduce hard
> page accounting of zero pages.
>
> I think David suggested possibly allowing some degree of exceeding
> locked memory limits for zero page COWs. Potentially type1 could do
> this as well; special case handling with some heuristically determined,
> module parameter defined limit. We might also consider whether we
> could just ignore zero page mappings, maybe with an optional "strict"
> mode module option to generate an errno on such mappings. Thanks,
So far I have played with two ideas:

a) allow slightly exceeding the limit and warn
b) a vfio kernel parameter to control the zero-page behavior (old vs.
new)

I certainly have in mind that we might need some toggle for vfio.
--
Thanks,
David / dhildenb