linux-kernel - [PATCH vhost v2 00/10] vdpa/mlx5: Parallelize device suspend/resume

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240816090159.1967650-1-dtatulea@nvidia.com>
Date: Fri, 16 Aug 2024 12:01:49 +0300
From: Dragos Tatulea <dtatulea@...dia.com>
To: "Michael S . Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>,
	Eugenio Perez Martin <eperezma@...hat.com>,
	<virtualization@...ts.linux-foundation.org>
CC: Dragos Tatulea <dtatulea@...dia.com>, Si-Wei Liu <si-wei.liu@...cle.com>,
	Saeed Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>,
	<kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Gal Pressman
	<gal@...dia.com>, Parav Pandit <parav@...dia.com>, Xuan Zhuo
	<xuanzhuo@...ux.alibaba.com>
Subject: [PATCH vhost v2 00/10] vdpa/mlx5: Parallelize device suspend/resume

This series parallelizes the mlx5_vdpa device suspend and resume
operations through the firmware async API. The purpose is to reduce live
migration downtime.

The series starts with changing the VQ suspend and resume commands
to the async API. After that, the switch is made to issue multiple
commands of the same type in parallel.

Then, the an additional improvement is added: keep the notifiers enabled
during suspend but make it a NOP. Upon resume make sure that the link
state is forwarded. This shaves around 30ms per device constant time.

Finally, use parallel VQ suspend and resume during the CVQ MQ command.

For 1 vDPA device x 32 VQs (16 VQPs), on a large VM (256 GB RAM, 32 CPUs
x 2 threads per core), the improvements are:

+-------------------+--------+--------+-----------+
| operation         | Before | After  | Reduction |
|-------------------+--------+--------+-----------|
| mlx5_vdpa_suspend | 37 ms  | 2.5 ms |     14x   |
| mlx5_vdpa_resume  | 16 ms  | 5 ms   |      3x   |
+-------------------+--------+--------+-----------+

---
v2:
- Changed to parallel VQ suspend/resume during CVQ MQ command.
  Support added in the last 2 patches.
- Made the fw async command more generic and moved it to resources.c.
  Did that because the following series (parallel mkey ops) needs this
  code as well.
  Dropped Acked-by from Eugenio on modified patches.
- Fixed kfree -> kvfree.
- Removed extra newline caught during review.
- As discussed in the v1, the series can be pulled in completely in
  the vhost tree [0]. The mlx5_core patch was reviewed by Tariq who is
  also a maintainer for mlx5_core.

[0] - https://lore.kernel.org/virtualization/6582792d-8db2-4bc0-bf3a-248fe5c8fc56@nvidia.com/T/#maefabb2fde5adfb322d16ca16ae64d540f75b7d2

Dragos Tatulea (10):
  net/mlx5: Support throttled commands from async API
  vdpa/mlx5: Introduce error logging function
  vdpa/mlx5: Introduce async fw command wrapper
  vdpa/mlx5: Use async API for vq query command
  vdpa/mlx5: Use async API for vq modify commands
  vdpa/mlx5: Parallelize device suspend
  vdpa/mlx5: Parallelize device resume
  vdpa/mlx5: Keep notifiers during suspend but ignore
  vdpa/mlx5: Small improvement for change_num_qps()
  vdpa/mlx5: Parallelize VQ suspend/resume for CVQ MQ command

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  21 +-
 drivers/vdpa/mlx5/core/mlx5_vdpa.h            |  22 +
 drivers/vdpa/mlx5/core/resources.c            |  73 ++++
 drivers/vdpa/mlx5/net/mlx5_vnet.c             | 396 +++++++++++-------
 4 files changed, 361 insertions(+), 151 deletions(-)

-- 
2.45.1