lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <010263ba-9740-4bcf-7af6-37e33e4612c0@kaod.org>
Date: Thu, 7 Sep 2023 19:34:34 +0200
From: Cédric Le Goater <clg@...d.org>
To: Joao Martins <joao.m.martins@...cle.com>
Cc: Yishai Hadas <yishaih@...dia.com>, alex.williamson@...hat.com,
 saeedm@...dia.com, kvm@...r.kernel.org, netdev@...r.kernel.org,
 kuba@...nel.org, kevin.tian@...el.com, leonro@...dia.com, maorg@...dia.com,
 cohuck@...hat.com, 'Avihai Horon' <avihaih@...dia.com>,
 Jason Gunthorpe <jgg@...dia.com>, Tarun Gupta <targupta@...dia.com>
Subject: Re: [PATCH V7 vfio 07/10] vfio/mlx5: Create and destroy page tracker
 object

On 9/7/23 18:33, Joao Martins wrote:
> On 07/09/2023 13:16, Cédric Le Goater wrote:
>> On 9/7/23 12:51, Joao Martins wrote:
>>> On 07/09/2023 10:56, Cédric Le Goater wrote:
>>>> On 9/6/23 13:51, Jason Gunthorpe wrote:
>>>>> On Wed, Sep 06, 2023 at 10:55:26AM +0200, Cédric Le Goater wrote:
>>>>>
>>>>>>> +    WARN_ON(node);
>>>>>>> +    log_addr_space_size = ilog2(total_ranges_len);
>>>>>>> +    if (log_addr_space_size <
>>>>>>> +        (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_min_addr_space)) ||
>>>>>>> +        log_addr_space_size >
>>>>>>> +        (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_max_addr_space))) {
>>>>>>> +        err = -EOPNOTSUPP;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>
>>>>>>
>>>>>> We are seeing an issue with dirty page tracking when doing migration
>>>>>> of an OVMF VM guest. The vfio-pci variant driver for the MLX5 VF
>>>>>> device complains when dirty page tracking is initialized from QEMU :
>>>>>>
>>>>>>      qemu-kvm: 0000:b1:00.2: Failed to start DMA logging, err -95 (Operation
>>>>>> not supported)
>>>>>>
>>>>>> The 64-bit computed range is  :
>>>>>>
>>>>>>      vfio_device_dirty_tracking_start nr_ranges 2 32:[0x0 - 0x807fffff],
>>>>>> 64:[0x100000000 - 0x3838000fffff]
>>>>>>
>>>>>> which seems to be too large for the HW. AFAICT, the MLX5 HW has a 42
>>>>>> bits address space limitation for dirty tracking (min is 12). Is it a
>>>>>> FW tunable or a strict limitation ?
>>>>>
>>>>> It would be good to explain where this is coming from, all devices
>>>>> need to make some decision on what address space ranges to track and I
>>>>> would say 2^42 is already pretty generous limit..
>>>>
>>>>
>>>> QEMU computes the DMA logging ranges for two predefined ranges: 32-bit
>>>> and 64-bit. In the OVMF case, QEMU includes in the 64-bit range, RAM
>>>> (at the lower part) and device RAM regions (at the top of the address
>>>> space). The size of that range can be bigger than the 2^42 limit of
>>>> the MLX5 HW for dirty tracking. QEMU is not making much effort to be
>>>> smart. There is room for improvement.
>>>>
>>>
>>> Interesting, we haven't reproduced this in our testing with OVMF multi-TB
>>> configs with these VFs. Could you share the OVMF base version you were using?
>>
>> edk2-ovmf-20230524-3.el9.noarch
>>
>> host is a :
>>          Architecture:            x86_64
>>        CPU op-mode(s):        32-bit, 64-bit
>>        Address sizes:         46 bits physical, 57 bits virtual
>>        Byte Order:            Little Endian
>>      CPU(s):                  48
>>        On-line CPU(s) list:   0-47
>>      Vendor ID:               GenuineIntel
>>        Model name:            Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
>>
>>
>>> or
>>> maybe we didn't triggered it considering the total device RAM regions would be
>>> small enough to fit the 32G PCI hole64 that is advertised that avoids a
>>> hypothetical relocation.
>>
>> You need RAM above 4G in the guest :
>>          100000000-27fffffff : System RAM
>>        237800000-2387fffff : Kernel code
>>        238800000-23932cfff : Kernel rodata
>>        239400000-239977cff : Kernel data
>>        23a202000-23b3fffff : Kernel bss
>>      380000000000-3807ffffffff : PCI Bus 0000:00
>>        380000000000-3800000fffff : 0000:00:03.0
>>          380000000000-3800000fffff : mlx5_core
> 
> Similar machine to yours, but in my 32G guests with older OVMF it's putting the
> PCI area after max-ram:
> 
> vfio_device_dirty_tracking_update section 0x0 - 0x9ffff -> update [0x0 - 0x9ffff]
> vfio_device_dirty_tracking_update section 0xc0000 - 0xcafff -> update [0x0 -
> 0xcafff]
> vfio_device_dirty_tracking_update section 0xcb000 - 0xcdfff -> update [0x0 -
> 0xcdfff]
> vfio_device_dirty_tracking_update section 0xce000 - 0xe7fff -> update [0x0 -
> 0xe7fff]
> vfio_device_dirty_tracking_update section 0xe8000 - 0xeffff -> update [0x0 -
> 0xeffff]
> vfio_device_dirty_tracking_update section 0xf0000 - 0xfffff -> update [0x0 -
> 0xfffff]
> vfio_device_dirty_tracking_update section 0x100000 - 0x7fffffff -> update [0x0 -
> 0x7fffffff]
> vfio_device_dirty_tracking_update section 0xfd000000 - 0xfdffffff -> update [0x0
> - 0xfdffffff]
> vfio_device_dirty_tracking_update section 0xfffc0000 - 0xffffffff -> update [0x0
> - 0xffffffff]
> vfio_device_dirty_tracking_update section 0x100000000 - 0x87fffffff -> update
> [0x100000000 - 0x87fffffff]
> vfio_device_dirty_tracking_update section 0x880000000 - 0x880001fff -> update
> [0x100000000 - 0x880001fff]
> vfio_device_dirty_tracking_update section 0x880003000 - 0x8ffffffff -> update
> [0x100000000 - 0x8ffffffff]
> 

and so the range is smaller.

The latest version of OVMF enables the dynamic mmio window which causes the issue.

>>
>> Activating the QEMU trace events shows quickly the issue :
>>
>>      vfio_device_dirty_tracking_update section 0x0 - 0x9ffff -> update [0x0 -
>> 0x9ffff]
>>      vfio_device_dirty_tracking_update section 0xa0000 - 0xaffff -> update [0x0 -
>> 0xaffff]
>>      vfio_device_dirty_tracking_update section 0xc0000 - 0xc3fff -> update [0x0 -
>> 0xc3fff]
>>      vfio_device_dirty_tracking_update section 0xc4000 - 0xdffff -> update [0x0 -
>> 0xdffff]
>>      vfio_device_dirty_tracking_update section 0xe0000 - 0xfffff -> update [0x0 -
>> 0xfffff]
>>      vfio_device_dirty_tracking_update section 0x100000 - 0x7fffffff -> update
>> [0x0 - 0x7fffffff]
>>      vfio_device_dirty_tracking_update section 0x80000000 - 0x807fffff -> update
>> [0x0 - 0x807fffff]
>>      vfio_device_dirty_tracking_update section 0x100000000 - 0x27fffffff ->
>> update [0x100000000 - 0x27fffffff]
>>      vfio_device_dirty_tracking_update section 0x383800000000 - 0x383800001fff ->
>> update [0x100000000 - 0x383800001fff]
>>      vfio_device_dirty_tracking_update section 0x383800003000 - 0x3838000fffff ->
>> update [0x100000000 - 0x3838000fffff]
>>
>> So that's nice. And with less RAM in the VM, 2G, migration should work though.
>>
>>> We could use do more than 2 ranges (or going back to sharing all ranges), or add
>>> a set of ranges that represents the device RAM without computing a min/max there
>>> (not sure we can figure that out from within the memory listener does all this
>>> logic);
>>
>> The listener is container based. May we could add one range per device
>> if we can identify a different owner per memory section.
>>
> 
> For brainstorm purposes ... Maybe something like this below. Should make your
> case work. As mentioned earlier in my case it's placed always at maxram+1, so
> makes no difference in having the "pci" range

yes.

That said, it looks good and it does handle the useless gap in the 64-bit range.
Thanks for doing it. We should explore the non OVMF case with this patch also.

C.


> 
> ------>8-------
> From: Joao Martins <joao.m.martins@...cle.com>
> Date: Thu, 7 Sep 2023 09:23:38 -0700
> Subject: [PATCH] vfio/common: Separate vfio-pci ranges
> 
> Signed-off-by: Joao Martins <joao.m.martins@...cle.com>
> ---
>   hw/vfio/common.c     | 48 ++++++++++++++++++++++++++++++++++++++++----
>   hw/vfio/trace-events |  2 +-
>   2 files changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f8b20aacc07c..f0b36a98c89a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -27,6 +27,7 @@
> 
>   #include "hw/vfio/vfio-common.h"
>   #include "hw/vfio/vfio.h"
> +#include "hw/vfio/pci.h"
>   #include "exec/address-spaces.h"
>   #include "exec/memory.h"
>   #include "exec/ram_addr.h"
> @@ -1424,6 +1425,8 @@ typedef struct VFIODirtyRanges {
>       hwaddr max32;
>       hwaddr min64;
>       hwaddr max64;
> +    hwaddr minpci;
> +    hwaddr maxpci;
>   } VFIODirtyRanges;
> 
>   typedef struct VFIODirtyRangesListener {
> @@ -1432,6 +1435,31 @@ typedef struct VFIODirtyRangesListener {
>       MemoryListener listener;
>   } VFIODirtyRangesListener;
> 
> +static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
> +                                     VFIOContainer *container)
> +{
> +    VFIOPCIDevice *pcidev;
> +    VFIODevice *vbasedev;
> +    VFIOGroup *group;
> +    Object *owner;
> +
> +    owner = memory_region_owner(section->mr);
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
> +                continue;
> +            }
> +            pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +            if (OBJECT(pcidev) == owner) {
> +                return true;
> +            }
> +        }
> +    }
> +
> +    return false;
> +}
> +
>   static void vfio_dirty_tracking_update(MemoryListener *listener,
>                                          MemoryRegionSection *section)
>   {
> @@ -1458,8 +1486,13 @@ static void vfio_dirty_tracking_update(MemoryListener
> *listener,
>        * would be an IOVATree but that has a much bigger runtime overhead and
>        * unnecessary complexity.
>        */
> -    min = (end <= UINT32_MAX) ? &range->min32 : &range->min64;
> -    max = (end <= UINT32_MAX) ? &range->max32 : &range->max64;
> +    if (!vfio_section_is_vfio_pci(section, dirty->container)) {
> +        min = (end <= UINT32_MAX) ? &range->min32 : &range->min64;
> +        max = (end <= UINT32_MAX) ? &range->max32 : &range->max64;
> +    } else {
> +        min = &range->minpci;
> +        max = &range->maxpci;
> +    }
> 
>       if (*min > iova) {
>           *min = iova;
> @@ -1485,6 +1518,7 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
>       memset(&dirty, 0, sizeof(dirty));
>       dirty.ranges.min32 = UINT32_MAX;
>       dirty.ranges.min64 = UINT64_MAX;
> +    dirty.ranges.minpci = UINT64_MAX;
>       dirty.listener = vfio_dirty_tracking_listener;
>       dirty.container = container;
> 
> @@ -1555,7 +1589,7 @@ vfio_device_feature_dma_logging_start_create(VFIOContainer
> *container,
>        * DMA logging uAPI guarantees to support at least a number of ranges that
>        * fits into a single host kernel base page.
>        */
> -    control->num_ranges = !!tracking->max32 + !!tracking->max64;
> +    control->num_ranges = !!tracking->max32 + !!tracking->max64 +
> !!tracking->maxpci;
>       ranges = g_try_new0(struct vfio_device_feature_dma_logging_range,
>                           control->num_ranges);
>       if (!ranges) {
> @@ -1574,11 +1608,17 @@
> vfio_device_feature_dma_logging_start_create(VFIOContainer *container,
>       if (tracking->max64) {
>           ranges->iova = tracking->min64;
>           ranges->length = (tracking->max64 - tracking->min64) + 1;
> +        ranges++;
> +    }
> +    if (tracking->maxpci) {
> +        ranges->iova = tracking->minpci;
> +        ranges->length = (tracking->maxpci - tracking->minpci) + 1;
>       }
> 
>       trace_vfio_device_dirty_tracking_start(control->num_ranges,
>                                              tracking->min32, tracking->max32,
> -                                           tracking->min64, tracking->max64);
> +                                           tracking->min64, tracking->max64,
> +                                           tracking->minpci, tracking->maxpci);
> 
>       return feature;
>   }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 444c15be47ee..ee5a44893334 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -104,7 +104,7 @@ vfio_known_safe_misalignment(const char *name, uint64_t
> iova, uint64_t offset_wi
>   vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t
> size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not
> aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>   vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64"
> - 0x%"PRIx64
>   vfio_device_dirty_tracking_update(uint64_t start, uint64_t end, uint64_t min,
> uint64_t max) "section 0x%"PRIx64" - 0x%"PRIx64" -> update [0x%"PRIx64" -
> 0x%"PRIx64"]"
> -vfio_device_dirty_tracking_start(int nr_ranges, uint64_t min32, uint64_t max32,
> uint64_t min64, uint64_t max64) "nr_ranges %d 32:[0x%"PRIx64" - 0x%"PRIx64"],
> 64:[0x%"PRIx64" - 0x%"PRIx64"]"
> +vfio_device_dirty_tracking_start(int nr_ranges, uint64_t min32, uint64_t max32,
> uint64_t min64, uint64_t max64, uint64_t minpci, uint64_t maxpci) "nr_ranges %d
> 32:[0x%"PRIx64" - 0x%"PRIx64"], 64:[0x%"PRIx64" - 0x%"PRIx64"], pci:[0x%"PRIx64"
> - 0x%"PRIx64"]"
>   vfio_disconnect_container(int fd) "close container->fd=%d"
>   vfio_put_group(int fd) "close group->fd=%d"
>   vfio_get_device(const char * name, unsigned int flags, unsigned int
> num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
> --
> 2.39.3
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ