netdev - Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <028506a9-d5f8-04e9-7427-5a39e0772987@oracle.com>
Date:   Wed, 20 Jul 2022 19:16:35 +0100
From:   Joao Martins <joao.m.martins@...cle.com>
To:     Alex Williamson <alex.williamson@...hat.com>
Cc:     Yishai Hadas <yishaih@...dia.com>, jgg@...dia.com,
        saeedm@...dia.com, kvm@...r.kernel.org, netdev@...r.kernel.org,
        kuba@...nel.org, kevin.tian@...el.com, leonro@...dia.com,
        maorg@...dia.com, cohuck@...hat.com
Subject: Re: [PATCH V2 vfio 05/11] vfio: Add an IOVA bitmap support

On 7/20/22 17:47, Alex Williamson wrote:
> On Wed, 20 Jul 2022 02:57:24 +0100
> Joao Martins <joao.m.martins@...cle.com> wrote:
>> On 7/19/22 20:01, Alex Williamson wrote:
>>> On Thu, 14 Jul 2022 11:12:45 +0300
>>> Yishai Hadas <yishaih@...dia.com> wrote:
>>>> From: Joao Martins <joao.m.martins@...cle.com>

[snip]

>>>> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
>>>> new file mode 100644
>>>> index 000000000000..c474c351634a
>>>> --- /dev/null
>>>> +++ b/include/linux/iova_bitmap.h
>>>> @@ -0,0 +1,46 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>>>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>>>> + */
>>>> +
>>>> +#ifndef _IOVA_BITMAP_H_
>>>> +#define _IOVA_BITMAP_H_
>>>> +
>>>> +#include <linux/highmem.h>
>>>> +#include <linux/mm.h>
>>>> +#include <linux/uio.h>
>>>> +
>>>> +struct iova_bitmap {
>>>> +	unsigned long iova;
>>>> +	unsigned long pgshift;
>>>> +	unsigned long start_offset;
>>>> +	unsigned long npages;
>>>> +	struct page **pages;
>>>> +};
>>>> +
>>>> +struct iova_bitmap_iter {
>>>> +	struct iova_bitmap dirty;
>>>> +	u64 __user *data;
>>>> +	size_t offset;
>>>> +	size_t count;
>>>> +	unsigned long iova;
>>>> +	unsigned long length;
>>>> +};
>>>> +
>>>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
>>>> +			  unsigned long length, u64 __user *data);
>>>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
>>>> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
>>>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
>>>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
>>>> +void iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
>>>> +int iova_bitmap_iter_get(struct iova_bitmap_iter *iter);
>>>> +void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
>>>> +void iova_bitmap_init(struct iova_bitmap *bitmap,
>>>> +		      unsigned long base, unsigned long pgshift);
>>>> +unsigned int iova_bitmap_set(struct iova_bitmap *dirty,
>>>> +			     unsigned long iova,
>>>> +			     unsigned long length);
>>>> +
>>>> +#endif  
>>>
>>> No relevant comments, no theory of operation.  I found this really
>>> difficult to review and the page handling is still not clear to me.
>>> I'm not willing to take on maintainership of this code under
>>> drivers/vfio/ as is.   
>>
>> Sorry for the lack of comments/docs and lack of clearity in some of the
>> functions. I'll document all functions/fields and add a comment bloc at
>> the top explaining the theory on how it should be used/works, alongside
>> the improvements you suggested above.
>>
>> Meanwhile what is less clear for you on the page handling? We are essentially
>> calculating the number of pages based of @offset and @count and then preping
>> the iova_bitmap (@dirty) with the base IOVA and page offset. iova_bitmap_set()
>> then computes where is the should start setting bits, and then it kmap() each page
>> and sets the said bits. So far I am not caching kmap() kaddr,
>> so the majority of iova_bitmap_set() complexity comes from iterating over number
>> of bits to kmap and accounting to the offset that user bitmap address had.
> 
> It could have saved a lot of struggling through this code if it were
> presented as a windowing scheme to iterate over a user bitmap.
> 
> As I understand it more though, does the API really fit the expected use
> cases?  As presented here and used in the following patch, we map every
> section of the user bitmap, present that section to the device driver
> and ask them to mark dirty bits and atomically clear their internal
> tracker for that sub-range.  This seems really inefficient.
> 
So with either IOMMU and VFIO vendor driver the hardware may marshal their dirty
bits in entirely separate manners. On IOMMUs it is unbounded and PTEs format
vary, so there's no way but to walk all domain pagetables since the beginning of the
(mapped) IOVA range and check that every PTE is dirty or not and this is going to be
rather expensive, the next cost would be between 1) to copy bitmaps back and forth or
2) pin . 2)  it's cheaper if it is over 2M chunks (i.e. fewer atomics there) unless
we take the slow-path. On VFIO there's no intermediate storage for the driver,
and even we were going to preregister anything vendor we would have to copy MBs
of bitmaps to user memory (worst case e.g 32MiB per Tb). Although there's some
unefficiency on unnecessary pinning of potential non-dirty IOVA ranges if the user
doesn't mark anything dirty.

Trying to avoid copies as iova_bitmap, the main cost is in the pinning and
making it dependent on dirties (rather than windowing) means we could pin
individual pages of the bitmap more often (with efficiency being a bit more
tied to the VF workload or vIOMMU).

> Are we just counting on the fact that each 2MB window of dirty bitmap
> is 64GB of guest RAM (assuming 4KB pages) and there's likely something
> dirty there?
> 
Yes, and likely there's enough except when we get reports for very big
IOVA ranges when usually there's more than one iteration. 4K of user
memory would represent a section (128M).

> It seems like a more efficient API might be for us to call the device
> driver with an iterator object, which the device driver uses to call
> back into this bitmap helper to set specific iova+length ranges as
> dirty.

I can explore another variant. With some control over how it advances
the bitmap the driver could easily adjust the iova_bitmap as it see fit
without necessarily having to walk the whole bitmap memory while retaining
the same iova_bitmap general facility. The downside(?) would be that end
drivers (iommu driver, or vfio vendor driver) need to work (pin) with user
buffers rather than kernel managed pages.

> The iterator could still cache the kmap'd page (or pages) to
> optimize localized dirties, but we don't necessarily need to kmap and
> present every section of the bitmap to the driver. 
kmap_local_page() is cheap (IOW page_address(page)), unless its highmem (AIUI).

The expensive part for zerocopy approach is having to pin pages prior to have
iova_bitmap_set(). If the device is doing IOs scattered across a relatively
ram sections I am not sure how the caching will be efficient.

> Thanks,
>