[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff61f675-a5a8-d1d6-a1bf-a5879914b4a9@amd.com>
Date: Wed, 9 Aug 2023 08:44:13 -0700
From: Brett Creeley <bcreeley@....com>
To: Alex Williamson <alex.williamson@...hat.com>,
Brett Creeley <brett.creeley@....com>
Cc: kvm@...r.kernel.org, netdev@...r.kernel.org, jgg@...dia.com,
yishaih@...dia.com, shameerali.kolothum.thodi@...wei.com,
kevin.tian@...el.com, horms@...nel.org, shannon.nelson@....com
Subject: Re: [PATCH v14 vfio 5/8] vfio/pds: Add VFIO live migration support
On 8/8/2023 3:27 PM, Alex Williamson wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Mon, 7 Aug 2023 13:57:52 -0700
> Brett Creeley <brett.creeley@....com> wrote:
> ...
>> +static int
>> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> + union pds_core_adminq_cmd cmd = {
>> + .lm_suspend_status = {
>> + .opcode = PDS_LM_CMD_SUSPEND_STATUS,
>> + .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> + },
>> + };
>> + struct device *dev = pds_vfio_to_dev(pds_vfio);
>> + union pds_core_adminq_comp comp = {};
>> + unsigned long time_limit;
>> + unsigned long time_start;
>> + unsigned long time_done;
>> + int err;
>> +
>> + time_start = jiffies;
>> + time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
>> + do {
>> + err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, &comp, true);
>> + if (err != -EAGAIN)
>> + break;
>> +
>> + msleep(SUSPEND_CHECK_INTERVAL_MS);
>> + } while (time_before(jiffies, time_limit));
>> +
>> + time_done = jiffies;
>> + dev_dbg(dev, "%s: vf%u: Suspend comp received in %d msecs\n", __func__,
>> + pds_vfio->vf_id, jiffies_to_msecs(time_done - time_start));
>> +
>> + /* Check the results */
>> + if (time_after_eq(time_done, time_limit)) {
>> + dev_err(dev, "%s: vf%u: Suspend comp timeout\n", __func__,
>> + pds_vfio->vf_id);
>> + err = -ETIMEDOUT;
>
> If the command completes successfully but exceeds the time limit
> this turns a success into a failure. Is that desired? Thanks,
>
> Alex
>
Yes, this is the desired behavior. Based on the testing we have done and
downtime expectations for live migration it seems like the suspend
operation timing out after 5 seconds is reasonable behavior. It could
succeed, but that could also be the case if the timeout value was 10,
20, 30, or more seconds. Even so, it seems like we don't want to keep
spinning in the driver forever and we also don't want to further
increase the VM downtime.
Thanks,
Brett
Powered by blists - more mailing lists