[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211025082857.4baa4794.alex.williamson@redhat.com>
Date: Mon, 25 Oct 2021 08:28:57 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Cornelia Huck <cohuck@...hat.com>,
Yishai Hadas <yishaih@...dia.com>, bhelgaas@...gle.com,
saeedm@...dia.com, linux-pci@...r.kernel.org, kvm@...r.kernel.org,
netdev@...r.kernel.org, kuba@...nel.org, leonro@...dia.com,
kwankhede@...dia.com, mgurtovoy@...dia.com, maorg@...dia.com,
"Dr. David Alan Gilbert" <dgilbert@...hat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver
for mlx5 devices
On Mon, 25 Oct 2021 09:29:38 -0300
Jason Gunthorpe <jgg@...dia.com> wrote:
> On Thu, Oct 21, 2021 at 03:47:29PM -0600, Alex Williamson wrote:
> > I recall that we previously suggested a very strict interpretation of
> > clearing the _RUNNING bit, but again I'm questioning if that's a real
> > requirement or simply a nice-to-have feature for some undefined
> > debugging capability. In raising the p2p DMA issue, we can see that a
> > hard stop independent of other devices is not really practical but I
> > also don't see that introducing a new state bit solves this problem any
> > more elegantly than proposed here. Thanks,
>
> I still disagree with this - the level of 'frozenness' of a device is
> something that belongs in the defined state exposed to userspace, not
> as a hidden internal state that userspace can't see.
>
> It makes the state transitions asymmetric between suspend/resume as
> resume does have a defined uAPI state for each level of frozeness and
> suspend does not.
>
> With the extra bit resume does:
>
> 0000, 0100, 1000, 0001
>
> And suspend does:
>
> 0001, 1001, 0010, 0000
>
> However, without the extra bit suspend is only
>
> 001, 010, 000
>
> With hidden state inside the 010
And what is the device supposed to do if it receives a DMA while in
this strictly defined stopped state? If it generates an unsupported
request, that can trigger a fatal platform error. If it silently drops
the DMA, then we have data loss. We're defining a catch-22 scenario
for drivers versus placing the onus on the user to quiesce the set of
devices in order to consider the migration status as valid. Thanks,
Alex
Powered by blists - more mailing lists