lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <f0c7e66d-358e-4d03-b43e-4cd0796e495d@nvidia.com>
Date: Tue, 3 Sep 2024 10:16:59 +0200
From: Dragos Tatulea <dtatulea@...dia.com>
To: Eugenio Perez Martin <eperezma@...hat.com>
Cc: Lei Yang <leiyang@...hat.com>, Jason Wang <jasowang@...hat.com>,
 Michael Tsirkin <mst@...hat.com>, Si-Wei Liu <si-wei.liu@...cle.com>,
 virtualization@...ts.linux-foundation.org, linux-kernel@...r.kernel.org,
 Gal Pressman <gal@...dia.com>, Leon Romanovsky <leon@...nel.org>,
 kvm@...r.kernel.org, Parav Pandit <parav@...dia.com>,
 Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Saeed Mahameed <saeedm@...dia.com>
Subject: Re: [PATCH vhost v2 00/10] vdpa/mlx5: Parallelize device
 suspend/resume



On 03.09.24 10:10, Eugenio Perez Martin wrote:
> On Tue, Sep 3, 2024 at 9:48 AM Dragos Tatulea <dtatulea@...dia.com> wrote:
>>
>>
>>
>> On 03.09.24 09:40, Lei Yang wrote:
>>> On Mon, Sep 2, 2024 at 7:05 PM Dragos Tatulea <dtatulea@...dia.com> wrote:
>>>>
>>>> Hi Lei,
>>>>
>>>> On 02.09.24 12:03, Lei Yang wrote:
>>>>> Hi Dragos
>>>>>
>>>>> QE tested this series with mellanox nic, it failed with [1] when
>>>>> booting guest, and host dmesg also will print messages [2]. This bug
>>>>> can be reproduced boot guest with vhost-vdpa device.
>>>>>
>>>>> [1] qemu) qemu-kvm: vhost VQ 1 ring restore failed: -1: Operation not
>>>>> permitted (1)
>>>>> qemu-kvm: vhost VQ 0 ring restore failed: -1: Operation not permitted (1)
>>>>> qemu-kvm: unable to start vhost net: 5: falling back on userspace virtio
>>>>> qemu-kvm: vhost_set_features failed: Device or resource busy (16)
>>>>> qemu-kvm: unable to start vhost net: 16: falling back on userspace virtio
>>>>>
>>>>> [2] Host dmesg:
>>>>> [ 1406.187977] mlx5_core 0000:0d:00.2:
>>>>> mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
>>>>> [ 1406.189221] mlx5_core 0000:0d:00.2:
>>>>> mlx5_vdpa_compat_reset:3267:(pid 8506): performing device reset
>>>>> [ 1406.190354] mlx5_core 0000:0d:00.2:
>>>>> mlx5_vdpa_show_mr_leaks:573:(pid 8506) warning: mkey still alive after
>>>>> resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
>>>>> [ 1471.538487] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid
>>>>> 428): cmd[13]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause
>>>>> a leak of a command resource
>>>>> [ 1471.539486] mlx5_core 0000:0d:00.2: cb_timeout_handler:938:(pid
>>>>> 428): cmd[12]: MODIFY_GENERAL_OBJECT(0xa01) Async, timeout. Will cause
>>>>> a leak of a command resource
>>>>> [ 1471.540351] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid
>>>>> 8511) error: modify vq 0 failed, state: 0 -> 0, err: 0
>>>>> [ 1471.541433] mlx5_core 0000:0d:00.2: modify_virtqueues:1617:(pid
>>>>> 8511) error: modify vq 1 failed, state: 0 -> 0, err: -110
>>>>> [ 1471.542388] mlx5_core 0000:0d:00.2: mlx5_vdpa_set_status:3203:(pid
>>>>> 8511) warning: failed to resume VQs
>>>>> [ 1471.549778] mlx5_core 0000:0d:00.2:
>>>>> mlx5_vdpa_show_mr_leaks:573:(pid 8511) warning: mkey still alive after
>>>>> resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
>>>>> [ 1512.929854] mlx5_core 0000:0d:00.2:
>>>>> mlx5_vdpa_compat_reset:3267:(pid 8565): performing device reset
>>>>> [ 1513.100290] mlx5_core 0000:0d:00.2:
>>>>> mlx5_vdpa_show_mr_leaks:573:(pid 8565) warning: mkey still alive after
>>>>> resource delete: mr: 000000000c5ccca2, mkey: 0x40000000, refcount: 2
>>>>>
>>>
>>> Hi Dragos
>>>
>>>> Can you provide more details about the qemu version and the vdpa device
>>>> options used?
>>>>
>>>> Also, which FW version are you using? There is a relevant bug in FW
>>>> 22.41.1000 which was fixed in the latest FW (22.42.1000). Did you
>>>> encounter any FW syndromes in the host dmesg log?
>>>
>>> This problem has gone when I updated the firmware version to
>>> 22.42.1000, and I tested it with regression tests using mellanox nic,
>>> everything works well.
>>>
>>> Tested-by: Lei Yang <leiyang@...hat.com>
>> Good to hear. Thanks for the quick reaction.
>>
> 
> Is it possible to add a check so it doesn't use the async fashion in old FW?
> 
Unfortunately not, it would have been there otherwise.

Note that this affects only FW version 22.41.1000. Older versions are not
affected because VQ resume is not supported.

Thanks,
Dragos

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ