Message-ID: <f7b7a7b0-9404-6b0f-99b5-346af041a479@oracle.com>
Date: Tue, 28 Jun 2022 09:54:19 -0400
From: Steven Sistare <steven.sistare@...cle.com>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: Alex Williamson <alex.williamson@...hat.com>,
lizhe.67@...edance.com, cohuck@...hat.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, lizefan.x@...edance.com
Subject: Re: [PATCH] vfio: remove useless judgement
On 6/28/2022 9:03 AM, Jason Gunthorpe wrote:
> On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
>> For cpr, old qemu directly exec's new qemu, so task does not change.
>>
>> To support fork+exec, the ownership test needs to be deleted or modified.
>>
>> Pinned page accounting is another issue, as the parent counts pins in its
>> mm->locked_vm. If the child unmaps, it cannot simply decrement its own
>> mm->locked_vm counter.
>
> It is fine already:
>
> 	mm = async ? get_task_mm(dma->task) : dma->task->mm;
> 	if (!mm)
> 		return -ESRCH; /* process exited */
>
> 	ret = mmap_write_lock_killable(mm);
> 	if (!ret) {
> 		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
> 					  dma->lock_cap);
>
> Each 'dma' already stores a pointer to the mm that sourced it and only
> manipulates the counter in that mm. AFAICT 'current' is not used
> during unmap.
Ah yes, no problem then.
Limits become looser, though, as the child can pin an additional RLIMIT_MEMLOCK
worth of pages. That is the natural consequence of mm->locked_vm being a per-process
counter, but it is probably not what the application wants. Another argument for
switching to user->locked_vm.
>> As you and I have discussed, the count is also wrong in the direct
>> exec model, because exec clears mm->locked_vm.
>
> Really? Yikes, I thought exec would generate a new mm?
Yes, exec creates a new mm with locked_vm = 0. The old locked_vm count is dropped
on the floor. The existing dma points to the same task, but task->mm has changed,
and dma->task->mm->locked_vm is 0. An unmap ioctl drives it negative.
I have prototyped a few possible fixes. One changes vfio to use user->locked_vm.
Another switches to mm->pinned_vm and preserves it during exec. A third preserves
mm->locked_vm across exec, but that is not practical, because mm->locked_vm mixes
vfio pins and mlocks; the mlock component must be cleared during exec, and we don't
have a separate count for it.
>> I am thinking vfio could count pins in struct user locked_vm to handle both
>> models. The user struct and its count would persist across direct exec,
>> and be shared by parent and child for fork+exec. However, that does change
>> the RLIMIT_MEMLOCK value that applications must set, because the limit must
>> accommodate vfio plus other sub-systems that count in user->locked_vm, which
>> includes io_uring, skbuff, xdp, and perf. Plus, the limit must accommodate all
>> processes of that user, not just a single process.
>
> We discussed this, for iommufd we are currently planning to go this
> way and will See How it Goes.
Yes, I have followed that thread with interest.
- Steve