[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b3916258-bd64-5cc8-7cbe-f7338a96bf58@oracle.com>
Date: Thu, 1 Sep 2022 20:39:40 +0100
From: joao.m.martins@...cle.com
To: Alex Williamson <alex.williamson@...hat.com>,
Yishai Hadas <yishaih@...dia.com>
Cc: jgg@...dia.com, saeedm@...dia.com, kvm@...r.kernel.org,
netdev@...r.kernel.org, kuba@...nel.org, kevin.tian@...el.com,
leonro@...dia.com, maorg@...dia.com, cohuck@...hat.com
Subject: Re: [PATCH V5 vfio 04/10] vfio: Add an IOVA bitmap support
On 01/09/2022 19:47, Alex Williamson wrote:
> On Thu, 1 Sep 2022 12:38:47 +0300
> Yishai Hadas <yishaih@...dia.com> wrote:
>
>> From: Joao Martins <joao.m.martins@...cle.com>
>>
>> The new facility adds a bunch of wrappers that abstract how an IOVA range
>> is represented in a bitmap that is granulated by a given page_size. So it
>> translates all the lifting of dealing with user pointers into its
>> corresponding kernel addresses backing said user memory into doing finally
>> the (non-atomic) bitmap ops to change various bits.
>>
>> The formula for the bitmap is:
>>
>> data[(iova / page_size) / 64] & (1ULL << (iova % 64))
>>
>> Where 64 is the number of bits in a unsigned long (depending on arch)
>>
>> It introduces an IOVA iterator that uses a windowing scheme to minimize the
>> pinning overhead, as opposed to pinning it on demand 4K at a time. Assuming
>> a 4K kernel page and 4K requested page size, we can use a single kernel
>> page to hold 512 page pointers, mapping 2M of bitmap, representing 64G of
>> IOVA space.
>>
>> An example usage of these helpers for a given @base_iova, @page_size,
>> @length and __user @data:
>>
>> bitmap = iova_bitmap_alloc(base_iova, page_size, length, data);
>> if (IS_ERR(bitmap))
>> return -ENOMEM;
>>
>> ret = iova_bitmap_for_each(bitmap, arg, dirty_reporter_fn);
>>
>> iova_bitmap_free(bitmap);
>>
>> An implementation of the lower end -- referred to above as
>> dirty_reporter_fn to exemplify -- that is tracking dirty bits would mark
>> an IOVA as dirty as following:
>>
>> iova_bitmap_set(bitmap, iova, page_size);
>>
>> Or a contiguous range (example two pages):
>>
>> iova_bitmap_set(bitmap, iova, 2 * page_size);
>>
>> The facility is intended to be used for user bitmaps representing dirtied
>> IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
>>
>> Signed-off-by: Joao Martins <joao.m.martins@...cle.com>
>> Signed-off-by: Yishai Hadas <yishaih@...dia.com>
>> ---
>> drivers/vfio/Makefile | 6 +-
>> drivers/vfio/iova_bitmap.c | 426 ++++++++++++++++++++++++++++++++++++
>> include/linux/iova_bitmap.h | 24 ++
>> 3 files changed, 454 insertions(+), 2 deletions(-)
>> create mode 100644 drivers/vfio/iova_bitmap.c
>> create mode 100644 include/linux/iova_bitmap.h
>>
>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
>> index 1a32357592e3..d67c604d0407 100644
>> --- a/drivers/vfio/Makefile
>> +++ b/drivers/vfio/Makefile
>> @@ -1,9 +1,11 @@
>> # SPDX-License-Identifier: GPL-2.0
>> vfio_virqfd-y := virqfd.o
>>
>> -vfio-y += vfio_main.o
>> -
>> obj-$(CONFIG_VFIO) += vfio.o
>> +
>> +vfio-y += vfio_main.o \
>> + iova_bitmap.o \
>> +
>> obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>> obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>> obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
>> new file mode 100644
>> index 000000000000..4211a16eb542
>> --- /dev/null
>> +++ b/drivers/vfio/iova_bitmap.c
>> @@ -0,0 +1,426 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>> + * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +#include <linux/iova_bitmap.h>
>> +#include <linux/mm.h>
>> +#include <linux/highmem.h>
>> +
>> +#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
>> +
>> +/*
>> + * struct iova_bitmap_map - A bitmap representing an IOVA range
>> + *
>> + * Main data structure for tracking mapped user pages of bitmap data.
>> + *
>> + * For example, for something recording dirty IOVAs, it will be provided a
>> + * struct iova_bitmap structure, as a general structure for iterating the
>> + * total IOVA range. The struct iova_bitmap_map, though, represents the
>> + * subset of said IOVA space that is pinned by its parent structure (struct
>> + * iova_bitmap).
>> + *
>> + * The user does not need to exact location of the bits in the bitmap.
>> + * From user perspective the only API available is iova_bitmap_set() which
>> + * records the IOVA *range* in the bitmap by setting the corresponding
>> + * bits.
>> + *
>> + * The bitmap is an array of u64 whereas each bit represents an IOVA of
>> + * range of (1 << pgshift). Thus formula for the bitmap data to be set is:
>> + *
>> + * data[(iova / page_size) / 64] & (1ULL << (iova % 64))
>> + */
>> +struct iova_bitmap_map {
>> + /* base IOVA representing bit 0 of the first page */
>> + unsigned long iova;
>> +
>> + /* page size order that each bit granules to */
>> + unsigned long pgshift;
>> +
>> + /* page offset of the first user page pinned */
>> + unsigned long pgoff;
>> +
>> + /* number of pages pinned */
>> + unsigned long npages;
>> +
>> + /* pinned pages representing the bitmap data */
>> + struct page **pages;
>> +};
>> +
>> +/*
>> + * struct iova_bitmap - The IOVA bitmap object
>> + *
>> + * Main data structure for iterating over the bitmap data.
>> + *
>> + * Abstracts the pinning work and iterates in IOVA ranges.
>> + * It uses a windowing scheme and pins the bitmap in relatively
>> + * big ranges e.g.
>> + *
>> + * The bitmap object uses one base page to store all the pinned pages
>> + * pointers related to the bitmap. For sizeof(struct page) == 64 it stores
>
> sizeof(struct page*) == 8?
>
yeah, we talk about pointers the actual struct page doesn't matter. I'll change that.
>> + * 512 struct pages which, if the base page size is 4K, it means 2M of bitmap
>
> s/struct pages/struct page pointers/
>
/me nods
>> + * data is pinned at a time. If the iova_bitmap page size is also 4K
>> + * then the range window to iterate is 64G.
>> + *
>> + * For example iterating on a total IOVA range of 4G..128G, it will walk
>> + * through this set of ranges:
>> + *
>> + * 4G - 68G-1 (64G)
>> + * 68G - 128G-1 (64G)
>> + *
>> + * An example of the APIs on how to use/iterate over the IOVA bitmap:
>> + *
>> + * bitmap = iova_bitmap_alloc(iova, length, page_size, data);
>> + * if (IS_ERR(bitmap))
>> + * return PTR_ERR(bitmap);
>> + *
>> + * ret = iova_bitmap_for_each(bitmap, arg, dirty_reporter_fn);
>> + *
>> + * iova_bitmap_free(bitmap);
>> + *
>> + * An implementation of the lower end (referred to above as
>> + * dirty_reporter_fn to exemplify), that is tracking dirty bits would mark
>> + * an IOVA as dirty as following:
>> + * iova_bitmap_set(bitmap, iova, page_size);
>> + * Or a contiguous range (example two pages):
>> + * iova_bitmap_set(bitmap, iova, 2 * page_size);
>
> This seems like it implies a stronger correlation to the
> iova_bitmap_alloc() page_size than actually exists. The implementation
> of the dirty_reporter_fn() may not know the reporting page_size. The
> value here is just a size_t and iova_bitmap handles the rest, right?
>
Correct.
The intent was to show an example of what the different usage have
an effect in the end bitmap data (1 page and then 2 pages). An alternative
would be:
An implementation of the lower end (referred to above as
dirty_reporter_fn to exemplify), that is tracking dirty bits would mark
an IOVA range spanning @iova_length as dirty, using the configured
@page_size:
iova_bitmap_set(bitmap, iova, iova_length)
But with a different length variable (i.e. iova_length) to avoid being confused with
the length in iova_bitmap_alloc right before this paragraph. But the example in the
patch looks a bit more clear on the outcomes to me personally.
>> + *
>> + * The internals of the object uses an index @mapped_base_index that indexes
>> + * which u64 word of the bitmap is mapped, up to @mapped_total_index.
>> + * Those keep being incremented until @mapped_total_index reaches while
>
> s/reaches/is reached/
>
/me nods
>> + * mapping up to PAGE_SIZE / sizeof(struct page*) maximum of pages.
>> + *
>> + * The IOVA bitmap is usually located on what tracks DMA mapped ranges or
>> + * some form of IOVA range tracking that co-relates to the user passed
>> + * bitmap.
>> + */
>> +struct iova_bitmap {
>> + /* IOVA range representing the currently mapped bitmap data */
>> + struct iova_bitmap_map mapped;
>> +
>> + /* userspace address of the bitmap */
>> + u64 __user *bitmap;
>> +
>> + /* u64 index that @mapped points to */
>> + unsigned long mapped_base_index;
>> +
>> + /* how many u64 can we walk in total */
>> + unsigned long mapped_total_index;
>> +
>> + /* base IOVA of the whole bitmap */
>> + unsigned long iova;
>> +
>> + /* length of the IOVA range for the whole bitmap */
>> + size_t length;
>> +};
>> +
>> +/*
>> + * Converts a relative IOVA to a bitmap index.
>> + * This function provides the index into the u64 array (bitmap::bitmap)
>> + * for a given IOVA offset.
>> + * Relative IOVA means relative to the bitmap::mapped base IOVA
>> + * (stored in mapped::iova). All computations in this file are done using
>> + * relative IOVAs and thus avoid an extra subtraction against mapped::iova.
>> + * The user API iova_bitmap_set() always uses a regular absolute IOVAs.
>> + */
>> +static unsigned long iova_bitmap_offset_to_index(struct iova_bitmap *bitmap,
>> + unsigned long iova)
>> +{
>> + unsigned long pgsize = 1 << bitmap->mapped.pgshift;
>> +
>> + return iova / (BITS_PER_TYPE(*bitmap->bitmap) * pgsize);
>> +}
>> +
>> +/*
>> + * Converts a bitmap index to a *relative* IOVA.
>> + */
>> +static unsigned long iova_bitmap_index_to_offset(struct iova_bitmap *bitmap,
>> + unsigned long index)
>> +{
>> + unsigned long pgshift = bitmap->mapped.pgshift;
>> +
>> + return (index * BITS_PER_TYPE(*bitmap->bitmap)) << pgshift;
>> +}
>> +
>> +/*
>> + * Returns the base IOVA of the mapped range.
>> + */
>> +static unsigned long iova_bitmap_mapped_iova(struct iova_bitmap *bitmap)
>> +{
>> + unsigned long skip = bitmap->mapped_base_index;
>> +
>> + return bitmap->iova + iova_bitmap_index_to_offset(bitmap, skip);
>> +}
>> +
>> +/*
>> + * Pins the bitmap user pages for the current range window.
>> + * This is internal to IOVA bitmap and called when advancing the
>> + * index (@mapped_base_index) or allocating the bitmap.
>> + */
>> +static int iova_bitmap_get(struct iova_bitmap *bitmap)
>> +{
>> + struct iova_bitmap_map *mapped = &bitmap->mapped;
>> + unsigned long npages;
>> + u64 __user *addr;
>> + long ret;
>> +
>> + /*
>> + * @mapped_base_index is the index of the currently mapped u64 words
>> + * that we have access. Anything before @mapped_base_index is not
>> + * mapped. The range @mapped_base_index .. @mapped_total_index-1 is
>> + * mapped but capped at a maximum number of pages.
>> + */
>> + npages = DIV_ROUND_UP((bitmap->mapped_total_index -
>> + bitmap->mapped_base_index) *
>> + sizeof(*bitmap->bitmap), PAGE_SIZE);
>> +
>> + /*
>> + * We always cap at max number of 'struct page' a base page can fit.
>> + * This is, for example, on x86 means 2M of bitmap data max.
>> + */
>> + npages = min(npages, PAGE_SIZE / sizeof(struct page *));
>> +
>> + /*
>> + * Bitmap address to be pinned is calculated via pointer arithmetic
>> + * with bitmap u64 word index.
>> + */
>> + addr = bitmap->bitmap + bitmap->mapped_base_index;
>> +
>> + ret = pin_user_pages_fast((unsigned long)addr, npages,
>> + FOLL_WRITE, mapped->pages);
>> + if (ret <= 0)
>> + return -EFAULT;
>> +
>> + mapped->npages = (unsigned long)ret;
>> + /* Base IOVA where @pages point to i.e. bit 0 of the first page */
>> + mapped->iova = iova_bitmap_mapped_iova(bitmap);
>> +
>> + /*
>> + * offset of the page where pinned pages bit 0 is located.
>> + * This handles the case where the bitmap is not PAGE_SIZE
>> + * aligned.
>> + */
>> + mapped->pgoff = offset_in_page(addr);
>> + return 0;
>> +}
>> +
>> +/*
>> + * Unpins the bitmap user pages and clears @npages
>> + * (un)pinning is abstracted from API user and it's done when advancing
>> + * the index or freeing the bitmap.
>> + */
>> +static void iova_bitmap_put(struct iova_bitmap *bitmap)
>> +{
>> + struct iova_bitmap_map *mapped = &bitmap->mapped;
>> +
>> + if (mapped->npages) {
>> + unpin_user_pages(mapped->pages, mapped->npages);
>> + mapped->npages = 0;
>> + }
>> +}
>> +
>> +/**
>> + * iova_bitmap_alloc() - Allocates an IOVA bitmap object
>> + * @iova: Start address of the IOVA range
>> + * @length: Length of the IOVA range
>> + * @page_size: Page size of the IOVA bitmap. It defines what each bit
>> + * granularity represents
>> + * @data: Userspace address of the bitmap
>> + *
>> + * Allocates an IOVA object and initializes all its fields including the
>> + * first user pages of @data.
>> + *
>> + * Return: A pointer to a newly allocated struct iova_bitmap
>> + * or ERR_PTR() on error.
>> + */
>> +struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
>> + unsigned long page_size, u64 __user *data)
>> +{
>> + struct iova_bitmap_map *mapped;
>> + struct iova_bitmap *bitmap;
>> + int rc;
>> +
>> + bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
>> + if (!bitmap)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + mapped = &bitmap->mapped;
>> + mapped->pgshift = __ffs(page_size);
>> + bitmap->bitmap = data;
>> + bitmap->mapped_total_index =
>> + iova_bitmap_offset_to_index(bitmap, length - 1) + 1;
>> + bitmap->iova = iova;
>> + bitmap->length = length;
>> + mapped->iova = iova;
>> + mapped->pages = (struct page **)__get_free_page(GFP_KERNEL);
>> + if (!mapped->pages) {
>> + rc = -ENOMEM;
>> + goto err;
>> + }
>> +
>> + rc = iova_bitmap_get(bitmap);
>> + if (rc)
>> + goto err;
>> + return bitmap;
>> +
>> +err:
>> + iova_bitmap_free(bitmap);
>> + return ERR_PTR(rc);
>> +}
>> +
>> +/**
>> + * iova_bitmap_free() - Frees an IOVA bitmap object
>> + * @bitmap: IOVA bitmap to free
>> + *
>> + * It unpins and releases pages array memory and clears any leftover
>> + * state.
>> + */
>> +void iova_bitmap_free(struct iova_bitmap *bitmap)
>> +{
>> + struct iova_bitmap_map *mapped = &bitmap->mapped;
>> +
>> + iova_bitmap_put(bitmap);
>> +
>> + if (mapped->pages) {
>> + free_page((unsigned long)mapped->pages);
>> + mapped->pages = NULL;
>> + }
>> +
>> + kfree(bitmap);
>> +}
>> +
>> +/*
>> + * Returns the remaining bitmap indexes from mapped_total_index to process for
>> + * the currently pinned bitmap pages.
>> + */
>> +static unsigned long iova_bitmap_mapped_remaining(struct iova_bitmap *bitmap)
>> +{
>> + unsigned long remaining;
>> +
>> + remaining = bitmap->mapped_total_index - bitmap->mapped_base_index;
>> + remaining = min_t(unsigned long, remaining,
>> + (bitmap->mapped.npages << PAGE_SHIFT) / sizeof(*bitmap->bitmap));
>> +
>> + return remaining;
>> +}
>> +
>> +/*
>> + * Returns the length of the mapped IOVA range.
>> + */
>> +static unsigned long iova_bitmap_mapped_length(struct iova_bitmap *bitmap)
>> +{
>> + unsigned long max_iova = bitmap->iova + bitmap->length - 1;
>> + unsigned long iova = iova_bitmap_mapped_iova(bitmap);
>> + unsigned long remaining;
>> +
>> + /*
>> + * iova_bitmap_mapped_remaining() returns a number of indexes which
>> + * when converted to IOVA gives us a max length that the bitmap
>> + * pinned data can cover. Afterwards, that is capped to
>> + * only cover the IOVA range in @bitmap::iova .. @bitmap::length.
>> + */
>> + remaining = iova_bitmap_index_to_offset(bitmap,
>> + iova_bitmap_mapped_remaining(bitmap));
>> +
>> + if (iova + remaining - 1 > max_iova)
>> + remaining -= ((iova + remaining - 1) - max_iova);
>> +
>> + return remaining;
>> +}
>> +
>> +/*
>> + * Returns true if there's not more data to iterate.
>> + */
>> +static bool iova_bitmap_done(struct iova_bitmap *bitmap)
>> +{
>> + return bitmap->mapped_base_index >= bitmap->mapped_total_index;
>> +}
>> +
>> +/*
>> + * Advances to the next range, releases the current pinned
>> + * pages and pins the next set of bitmap pages.
>> + * Returns 0 on success or otherwise errno.
>> + */
>> +static int iova_bitmap_advance(struct iova_bitmap *bitmap)
>> +{
>> + unsigned long iova = iova_bitmap_mapped_length(bitmap) - 1;
>> + unsigned long count = iova_bitmap_offset_to_index(bitmap, iova) + 1;
>> +
>> + bitmap->mapped_base_index += count;
>> +
>> + iova_bitmap_put(bitmap);
>> + if (iova_bitmap_done(bitmap))
>> + return 0;
>> +
>> + /* When advancing the index we pin the next set of bitmap pages */
>> + return iova_bitmap_get(bitmap);
>> +}
>> +
>> +/**
>> + * iova_bitmap_for_each() - Iterates over the bitmap
>> + * @bitmap: IOVA bitmap to iterate
>> + * @opaque: Additional argument to pass to the callback
>> + * @fn: Function that gets called for each IOVA range
>> + *
>> + * Helper function to iterate over bitmap data representing a portion of IOVA
>> + * space. It hides the complexity of iterating bitmaps and translating the
>> + * mapped bitmap user pages into IOVA ranges to process.
>> + *
>> + * Return: 0 on success, and an error on failure either upon
>> + * iteration or when the callback returns an error.
>> + */
>> +int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
>> + int (*fn)(struct iova_bitmap *bitmap,
>> + unsigned long iova, size_t length,
>> + void *opaque))
>
> It might make sense to typedef an iova_bitmap_fn_t in the header to use
> here.
>
OK, will do. I wasn't sure which style was preferred so went with simplest on
first take.
>> +{
>> + int ret = 0;
>> +
>> + for (; !iova_bitmap_done(bitmap) && !ret;
>> + ret = iova_bitmap_advance(bitmap)) {
>> + ret = fn(bitmap, iova_bitmap_mapped_iova(bitmap),
>> + iova_bitmap_mapped_length(bitmap), opaque);
>> + if (ret)
>> + break;
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +/**
>> + * iova_bitmap_set() - Records an IOVA range in bitmap
>> + * @bitmap: IOVA bitmap
>> + * @iova: IOVA to start
>> + * @length: IOVA range length
>> + *
>> + * Set the bits corresponding to the range [iova .. iova+length-1] in
>> + * the user bitmap.
>> + *
>> + * Return: The number of bits set.
>
> Is this relevant to the caller?
>
The thinking that number of bits was a way for caller to validate that
'some bits' was set, i.e. sort of error return value. But none of the callers
use it today, it's true. Suppose I should remove it, following bitmap_set()
returning void too.
>> + */
>> +unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
>> + unsigned long iova, size_t length)
>> +{
>> + struct iova_bitmap_map *mapped = &bitmap->mapped;
>> + unsigned long nbits = max(1UL, length >> mapped->pgshift), set = nbits;
>> + unsigned long offset = (iova - mapped->iova) >> mapped->pgshift;
>
> There's no sanity testing here that the caller provided an iova within
> the mapped ranged. Thanks,
>
Much of the bitmap helpers don't check that the offset is within the range
of the passed ulong array. So I followed the same thinking and the
caller is /provided/ with the range that the IOVA bitmap covers. The intention
was minimizing the number of operations given that this function sits on the
hot path. I can add this extra check.
> Alex
>
>> + unsigned long page_idx = offset / BITS_PER_PAGE;
>> + unsigned long page_offset = mapped->pgoff;
>> + void *kaddr;
>> +
>> + offset = offset % BITS_PER_PAGE;
>> +
>> + do {
>> + unsigned long size = min(BITS_PER_PAGE - offset, nbits);
>> +
>> + kaddr = kmap_local_page(mapped->pages[page_idx]);
>> + bitmap_set(kaddr + page_offset, offset, size);
>> + kunmap_local(kaddr);
>> + page_offset = offset = 0;
>> + nbits -= size;
>> + page_idx++;
>> + } while (nbits > 0);
>> +
>> + return set;
>> +}
>> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
>> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
>> new file mode 100644
>> index 000000000000..ab3b4fa6ac48
>> --- /dev/null
>> +++ b/include/linux/iova_bitmap.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>> + * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +#ifndef _IOVA_BITMAP_H_
>> +#define _IOVA_BITMAP_H_
>> +
>> +#include <linux/types.h>
>> +
>> +struct iova_bitmap;
>> +
>> +struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
>> + unsigned long page_size,
>> + u64 __user *data);
>> +void iova_bitmap_free(struct iova_bitmap *bitmap);
>> +int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *opaque,
>> + int (*fn)(struct iova_bitmap *bitmap,
>> + unsigned long iova, size_t length,
>> + void *opaque));
>> +unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
>> + unsigned long iova, size_t length);
>> +
>> +#endif
>
Powered by blists - more mailing lists