Message-ID: <70a2a878-a9d7-4257-b100-f125a9ee0e71@redhat.com>
Date: Fri, 13 Jun 2025 18:08:02 +0200
From: David Hildenbrand <david@...hat.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: lizhe.67@...edance.com, kvm@...r.kernel.org,
 linux-kernel@...r.kernel.org, peterx@...hat.com
Subject: Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large
 folio

On 13.06.25 16:16, Alex Williamson wrote:
> On Fri, 13 Jun 2025 15:37:40 +0200
> David Hildenbrand <david@...hat.com> wrote:
> 
>> On 13.06.25 08:29, lizhe.67@...edance.com wrote:
>>> On Thu, 12 Jun 2025 16:32:39 -0600, alex.williamson@...hat.com wrote:
>>>    
>>>>>    drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
>>>>>    1 file changed, 41 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 28ee4b8d39ae..2f6c0074d7b3 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>>>>    	return true;
>>>>>    }
>>>>>    
>>>>> -static int put_pfn(unsigned long pfn, int prot)
>>>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>>>>    {
>>>>> -	if (!is_invalid_reserved_pfn(pfn)) {
>>>>> -		struct page *page = pfn_to_page(pfn);
>>>>> +	unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>>>> +}
>>>>>    
>>>>> -		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>>>> -		return 1;
>>>>> +/*
>>>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>>>> + */
>>>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>>>> +{
>>>>> +	if (!is_invalid_reserved_pfn(pfn)) {
>>>>> +		_put_pfns(pfn_to_page(pfn), npages, prot);
>>>>> +		return npages;
>>>>>    	}
>>>>>    	return 0;
>>>>>    }
>>>>>    
>>>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>>>> +{
>>>>> +	return put_pfns(pfn, 1, prot);
>>>>> +}
>>>>> +
>>>>>    #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>>>    
>>>>>    static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>>>> @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>>>>    				    unsigned long pfn, unsigned long npage,
>>>>>    				    bool do_accounting)
>>>>>    {
>>>>> -	long unlocked = 0, locked = 0;
>>>>> -	long i;
>>>>> +	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>>>>    
>>>>> -	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>>>>> -		if (put_pfn(pfn++, dma->prot)) {
>>>>> -			unlocked++;
>>>>> -			if (vfio_find_vpfn(dma, iova))
>>>>> -				locked++;
>>>>> +	while (npage) {
>>>>> +		struct folio *folio;
>>>>> +		struct page *page;
>>>>> +		long step = 1;
>>>>> +
>>>>> +		if (is_invalid_reserved_pfn(pfn))
>>>>> +			goto next;
>>>>> +
>>>>> +		page = pfn_to_page(pfn);
>>>>> +		folio = page_folio(page);
>>>>> +
>>>>> +		if (!folio_test_large(folio)) {
>>>>> +			_put_pfns(page, 1, dma->prot);
>>>>> +		} else {
>>>>> +			step = min_t(long, npage,
>>>>> +				folio_nr_pages(folio) -
>>>>> +				folio_page_idx(folio, page));
>>>>> +			_put_pfns(page, step, dma->prot);
>>>>>    		}
>>>>> +
>>>>> +		unlocked += step;
>>>>> +next:
>>>>
>>>> Usage of @step is inconsistent, and the goto isn't really necessary
>>>> either; how about:
>>>>
>>>> 	while (npage) {
>>>> 		unsigned long step = 1;
>>>>
>>>> 		if (!is_invalid_reserved_pfn(pfn)) {
>>>> 			struct page *page = pfn_to_page(pfn);
>>>> 			struct folio *folio = page_folio(page);
>>>> 			long nr_pages = folio_nr_pages(folio);
>>>>
>>>> 			if (nr_pages > 1)
>>>> 				step = min_t(long, npage,
>>>> 					nr_pages -
>>>> 					folio_page_idx(folio, page));
>>>>
>>>> 			_put_pfns(page, step, dma->prot);
>>>> 			unlocked += step;
>>>> 		}
>>>>   
>>>
>>> That's great. This implementation is much better.
>>>
>>> I'm a bit uncertain about the best type to use for the 'step'
>>> variable here. I've been trying to keep things consistent with the
>>> put_pfn() function, so I set the type of the second parameter of
>>> _put_pfns() to 'int' (we pass 'step' as the second argument to
>>> _put_pfns()).
>>>
>>> Using unsigned long for 'step' should work here, since the number
>>> of pages in a large folio currently falls within the range an int
>>> can represent. However, the implicit conversion to _put_pfns()'s
>>> 'int' parameter still carries a potential truncation risk that we
>>> need to be mindful of.
>>>    
>>>>> +		pfn += step;
>>>>> +		iova += PAGE_SIZE * step;
>>>>> +		npage -= step;
>>>>>    	}
>>>>>    
>>>>>    	if (do_accounting)
>>>>
>>>> AIUI, the idea is that we know we have npage contiguous pfns and we
>>>> currently test invalid/reserved, call pfn_to_page(), call
>>>> unpin_user_pages_dirty_lock(), and test vpfn for each individually.
>>>>
>>>> This instead wants to batch the vpfn accounted pfns using the range
>>>> helper added for the mapping patch,
>>>
>>> Yes. We use vpfn_pages() just to track the locked pages.
>>>    
>>>> infer that continuous pfns have the
>>>> same invalid/reserved state, that the pages are sequential, and
>>>> that we can use the end of the folio to mark any inflection points
>>>> in those assumptions.  Do I have that correct?
>>>
>>> Yes. I think we're definitely on the same page here.
>>>    
>>>> I think this could be split into two patches, one simply batching the
>>>> vpfn accounting and the next introducing the folio dependency.  The
>>>> contributions of each to the overall performance improvement would be
>>>> interesting.
>>>
>>> I've made an initial attempt, and here are the two patches after
>>> splitting them up.
>>>
>>> 1. batch-vpfn-accounting-patch:
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 28ee4b8d39ae..c8ddcee5aa68 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>>    				    unsigned long pfn, unsigned long npage,
>>>    				    bool do_accounting)
>>>    {
>>> -	long unlocked = 0, locked = 0;
>>> +	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>>    	long i;
>>>    
>>> -	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>>> -		if (put_pfn(pfn++, dma->prot)) {
>>> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
>>> +		if (put_pfn(pfn++, dma->prot))
>>>    			unlocked++;
>>> -			if (vfio_find_vpfn(dma, iova))
>>> -				locked++;
>>> -		}
>>> -	}
>>>    
>>>    	if (do_accounting)
>>>    		vfio_lock_acct(dma, locked - unlocked, true);
>>> -----------------
>>>
>>> 2. large-folio-optimization-patch:
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index c8ddcee5aa68..48c2ba4ba4eb 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>>    	return true;
>>>    }
>>>    
>>> -static int put_pfn(unsigned long pfn, int prot)
>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>>    {
>>> -	if (!is_invalid_reserved_pfn(pfn)) {
>>> -		struct page *page = pfn_to_page(pfn);
>>> +	unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>> +}
>>>    
>>> -		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>> -		return 1;
>>> +/*
>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>> + */
>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>> +{
>>> +	if (!is_invalid_reserved_pfn(pfn)) {
>>> +		_put_pfns(pfn_to_page(pfn), npages, prot);
>>> +		return npages;
>>>    	}
>>>    	return 0;
>>>    }
>>>    
>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>> +{
>>> +	return put_pfns(pfn, 1, prot);
>>> +}
>>> +
>>>    #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>    
>>>    static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>> @@ -806,11 +817,28 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>>    				    bool do_accounting)
>>>    {
>>>    	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>> -	long i;
>>>    
>>> -	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
>>> -		if (put_pfn(pfn++, dma->prot))
>>> -			unlocked++;
>>> +	while (npage) {
>>> +		long step = 1;
>>> +
>>> +		if (!is_invalid_reserved_pfn(pfn)) {
>>> +			struct page *page = pfn_to_page(pfn);
>>> +			struct folio *folio = page_folio(page);
>>> +			long nr_pages = folio_nr_pages(folio);
>>> +
>>> +			if (nr_pages > 1)
>>> +				step = min_t(long, npage,
>>> +					nr_pages -
>>> +					folio_page_idx(folio, page));
>>> +
>>> +			_put_pfns(page, step, dma->prot);
>>
>> I'm confused: why do we batch pages by looking at the folio, only to
>> then pass the pages into unpin_user_page_range_dirty_lock?
>>
>> Why does the folio relationship matter at all here?
>>
>> Aren't we making the same mistake that we are jumping over pages we
>> shouldn't be jumping over, because we assume they belong to that folio?
> 
> That's my concern as well.  On the mapping side we had an array of
> pages from gup and we tested each page in the gup array relative to the
> folio pages.  I think that's because the gup array could have
> non-sequential pages, but aiui the folio should have sequential
> pages(?).  Here I think we're trying to assume that sequential pfns
> results in sequential pages and folios should have sequential pages, so
> the folio just gives us a point to look for changes in invalid/reserved.
> 
> Is that valid?  Thanks,

Oh, it is!

I thought we would be iterating pages, but if we are iterating PFNs it 
should be fine.

-- 
Cheers,

David / dhildenb

