Message-ID: <20211020114514.560ce2fa.alex.williamson@redhat.com>
Date: Wed, 20 Oct 2021 11:45:14 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Yishai Hadas <yishaih@...dia.com>, bhelgaas@...gle.com,
saeedm@...dia.com, linux-pci@...r.kernel.org, kvm@...r.kernel.org,
netdev@...r.kernel.org, kuba@...nel.org, leonro@...dia.com,
kwankhede@...dia.com, mgurtovoy@...dia.com, maorg@...dia.com
Subject: Re: [PATCH V2 mlx5-next 14/14] vfio/mlx5: Use its own PCI
reset_done error handler
On Wed, 20 Oct 2021 13:46:29 -0300
Jason Gunthorpe <jgg@...dia.com> wrote:
> On Wed, Oct 20, 2021 at 11:46:07AM +0300, Yishai Hadas wrote:
>
> > What is the expectation for a reasonable delay? We may expect this
> > system WQ to run only short tasks and be very responsive.
>
> If the expectation is that qemu will see the error return and then
> turn around and issue an FLR followed by another state operation, then
> it does seem strange that there would be a delay.
>
> On the other hand, this doesn't seem that useful. If qemu tries to
> migrate and the device fails, then the migration operation is toast
> and possibly the device is wrecked. It can't really issue an FLR
> without coordinating with the VM, and it cannot resume the VM as the
> device is now irrecoverably messed up.
>
> If we look at this from a RAS perspective, what would be useful here
> is a way for qemu to request fail-safe migration data. This must
> always be available and cannot fail.
>
> When the fail-safe data is loaded into the device, it would trigger
> the device's built-in RAS features to co-ordinate with the VM driver
> and recover. Perhaps qemu would also have to inject an AER or
> something.
>
> Basically, instead of the device starting in an "empty, ready to use"
> state, it would start in a "failure detected, needs recovery" state.
The "fail-safe recovery state" is essentially the reset state of the
device. If a device enters an error state during migration, I would
think the ultimate recovery procedure would be to abort the migration,
send an AER to the VM, whereby the guest would trigger a reset, and
the RAS capabilities of the guest would handle failing over to a
multipath device, ejecting the failing device, etc.
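For illustration, roughly the following on the VMM side (a hypothetical
sketch; migration_abort() and inject_guest_aer() are placeholders for
whatever abort and AER-injection hooks the VMM provides, not existing
interfaces):

/*
 * Hypothetical VMM-side recovery when the device faults during
 * migration.  Neither helper below is a real interface; they
 * stand in for the VMM's migration-abort and AER-injection
 * machinery.
 */
static void recover_from_migration_fault(int device_fd)
{
        /* The migration data is suspect, abandon the attempt. */
        migration_abort();

        /*
         * Surface the failure to the guest as an uncorrectable
         * error so its RAS machinery (multipath failover, hot
         * eject, driver-initiated reset) can take over.
         */
        inject_guest_aer(device_fd);

        /*
         * The guest's error handling is then expected to trigger
         * a bus or function reset, which reaches the variant
         * driver's reset_done handler and returns the device to
         * its initial state.
         */
}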
However, regardless of the migration recovery strategy, userspace needs
a means to get the device back into an initial state in a deterministic
way without closing and re-opening the device (or polling for an
arbitrary length of time). That's the minimum viable product here.
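For example, the existing reset ioctl should be able to serve as that
deterministic path (minimal sketch, error handling elided):

#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Return the device to a known initial state without closing and
 * re-opening the device fd.  The variant driver's reset_done
 * handler must leave the device usable when this returns.
 */
static int vfio_recover_device(int device_fd)
{
        return ioctl(device_fd, VFIO_DEVICE_RESET);
}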
Thanks,
Alex