netdev - Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f1d31c66-4914-b5a4-4092-5e7a3f74ee76@amd.com>
Date: Fri, 16 Jun 2023 21:45:17 -0700
From: Brett Creeley <bcreeley@....com>
To: "Tian, Kevin" <kevin.tian@...el.com>,
 Brett Creeley <brett.creeley@....com>,
 "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
 "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
 "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
 "jgg@...dia.com" <jgg@...dia.com>, "yishaih@...dia.com"
 <yishaih@...dia.com>,
 "shameerali.kolothum.thodi@...wei.com" <shameerali.kolothum.thodi@...wei.com>
Cc: "shannon.nelson@....com" <shannon.nelson@....com>
Subject: Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support

On 6/16/2023 1:06 AM, Tian, Kevin wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
>> From: Brett Creeley <brett.creeley@....com>
>> Sent: Saturday, June 3, 2023 6:03 AM
>>
>> Add live migration support via the VFIO subsystem. The migration
>> implementation aligns with the definition from uapi/vfio.h and uses
>> the pds_core PF's adminq for device configuration.
>>
>> The ability to suspend, resume, and transfer VF device state data is
>> included along with the required admin queue command structures and
>> implementations.
>>
>> PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to
>> support
>> the VF device suspend operation.
>>
>> PDS_LM_CMD_RESUME is added to support the VF device resume operation.
>>
>> PDS_LM_CMD_STATUS is added to determine the exact size of the VF
>> device state data.
>>
>> PDS_LM_CMD_SAVE is added to get the VF device state data.
>>
>> PDS_LM_CMD_RESTORE is added to restore the VF device with the
>> previously saved data from PDS_LM_CMD_SAVE.
>>
>> PDS_LM_CMD_HOST_VF_STATUS is added to notify the device when
>> a migration is in/not-in progress from the host's perspective.
> 
> Here is 'the device' referring to the PF or VF?

Device is referring to the DSC/firmware not the function. I will clarify 
the wording here. Thanks.

> 
> and how would the device use this information?
> 
>> +
>> +static int pds_vfio_client_adminq_cmd(struct pds_vfio_pci_device *pds_vfio,
>> +                                   union pds_core_adminq_cmd *req,
>> +                                   size_t req_len,
>> +                                   union pds_core_adminq_comp *resp,
>> +                                   u64 flags)
>> +{
>> +     union pds_core_adminq_cmd cmd = {};
>> +     size_t cp_len;
>> +     int err;
>> +
>> +     /* Wrap the client request */
>> +     cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
>> +     cmd.client_request.client_id = cpu_to_le16(pds_vfio->client_id);
>> +     cp_len = min_t(size_t, req_len,
>> sizeof(cmd.client_request.client_cmd));
> 
> 'req_len' is kind of redundant. Looks all the callers use sizeof(req).

It does a memcpy based on the min size between req_len and the size of 
the request.

> 
>> +static int
>> +pds_vfio_suspend_wait_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_suspend_status = {
>> +                     .opcode = PDS_LM_CMD_SUSPEND_STATUS,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     unsigned long time_limit;
>> +     unsigned long time_start;
>> +     unsigned long time_done;
>> +     int err;
>> +
>> +     time_start = jiffies;
>> +     time_limit = time_start + HZ * SUSPEND_TIMEOUT_S;
>> +     do {
>> +             err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd,
>> sizeof(cmd),
>> +                                              &comp,
>> PDS_AQ_FLAG_FASTPOLL);
>> +             if (err != -EAGAIN)
>> +                     break;
>> +
>> +             msleep(SUSPEND_CHECK_INTERVAL_MS);
>> +     } while (time_before(jiffies, time_limit));
> 
> pds_vfio_client_adminq_cmd() has the exactly same mechanism
> with 5s timeout and 1ms poll interval when FASTPOLL is set.
> 
> probably you can introduce another flag to indicate retry on
> -EAGAIN and then handle it fully in pds_vfio_client_adminq_cmd()?

That's the entire purpose of this command. It uses 
pds_vfio_client_adminq_cmd() to poll the SUSPEND_STATUS. IMHO adding the 
polling mechanism in pds_vfio_client_adminq_cmd() and using it depending 
on a flag is just adding to the complexity and not offering any benefit. 
I plan to keep this function as is to separate the functionality. Thanks.

> 
>> +int pds_vfio_suspend_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_suspend = {
>> +                     .opcode = PDS_LM_CMD_SUSPEND,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +     int err;
>> +
>> +     dev_dbg(dev, "vf%u: Suspend device\n", pds_vfio->vf_id);
>> +
>> +     err = pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
>> &comp,
>> +                                      PDS_AQ_FLAG_FASTPOLL);
>> +     if (err) {
>> +             dev_err(dev, "vf%u: Suspend failed: %pe\n", pds_vfio->vf_id,
>> +                     ERR_PTR(err));
>> +             return err;
>> +     }
>> +
>> +     return pds_vfio_suspend_wait_device_cmd(pds_vfio);
>> +}
> 
> The logic in this function is very confusing.
> 
> PDS_LM_CMD_SUSPEND has a completion record:
> 
> +struct pds_lm_suspend_comp {
> +       u8     status;
> +       u8     rsvd;
> +       __le16 comp_index;
> +       union {
> +               __le64 state_size;
> +               u8     rsvd2[11];
> +       } __packed;
> +       u8     color;
> 
> Presumably this function can look at the completion record to know whether
> the suspend request succeeds.
> 
> Why do you require another wait_device step to query the suspend status?

The driver sends the initial suspend request to tell the DSC/firmware to 
suspend the VF's data/control path. The DSC/firmware will ack/nack the 
suspend request in the completion.

Then the driver polls the DSC/firmware to find when the VF's 
data/control path has been fully suspended. When the DSC/firmware isn't 
done suspending yet it will return -EAGAIN. Otherwise it will return 
success/failure.

I will add some comments clarifying these details.

> 
> and I have another question. Is it correct to hard-code the 5s timeout in
> the kernel w/o any input from the VMM? Note the guest has been stopped
> at this point then very likely the 5s timeout will kill any reasonable SLA which
> CSPs try to reach hard.

This gives the device a max of 5 seconds to suspend the VF's 
data/control path.

> 
> Ideally the VMM has an estimation how long a VM can be paused based on
> SLA, to-be-migrated state size, available network bandwidth, etc. and that
> hint should be passed to the kernel so any state transition which may violate
> that expectation can fail quickly to break the migration process and put the
> VM back to the running state.

For QEMU there is a parameter that can specify the downtime-limit that's 
used as you mentioned, but this does not include how long it takes the 
device to STOP (i.e. suspend the data/control path).

> 
> Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> 
>> +
>> +int pds_vfio_resume_device_cmd(struct pds_vfio_pci_device *pds_vfio)
>> +{
>> +     union pds_core_adminq_cmd cmd = {
>> +             .lm_resume = {
>> +                     .opcode = PDS_LM_CMD_RESUME,
>> +                     .vf_id = cpu_to_le16(pds_vfio->vf_id),
>> +             },
>> +     };
>> +     struct device *dev = pds_vfio_to_dev(pds_vfio);
>> +     union pds_core_adminq_comp comp = {};
>> +
>> +     dev_dbg(dev, "vf%u: Resume device\n", pds_vfio->vf_id);
>> +
>> +     return pds_vfio_client_adminq_cmd(pds_vfio, &cmd, sizeof(cmd),
>> &comp,
>> +                                       0);
> 
> 'resume' is also in the blackout phase when the guest is not running.
> 
> So presumably FAST_POLL should be set otherwise the max 256ms
> poll interval (PDSC_ADMINQ_MAX_POLL_INTERVAL) is really inefficient.

Yeah this is a good catch. I think setting fast_poll = true would be a 
good idea here. Thanks.

> 
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING && next ==
>> VFIO_DEVICE_STATE_RUNNING_P2P) {
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
>> +
>> PDS_LM_STA_IN_PROGRESS);
>> +             err = pds_vfio_suspend_device_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next ==
>> VFIO_DEVICE_STATE_RUNNING) {
>> +             err = pds_vfio_resume_device_cmd(pds_vfio);
>> +             if (err)
>> +                     return ERR_PTR(err);
>> +
>> +             pds_vfio_send_host_vf_lm_status_cmd(pds_vfio,
>> PDS_LM_STA_NONE);
>> +             return NULL;
>> +     }
>> +
>> +     if (cur == VFIO_DEVICE_STATE_STOP && next ==
>> VFIO_DEVICE_STATE_RUNNING_P2P)
>> +             return NULL;
>> +
>> +     if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next ==
>> VFIO_DEVICE_STATE_STOP)
>> +             return NULL;
> 
> I'm not sure whether P2P is actually supported here. By definition
> P2P means the device is stopped but still responds to p2p request
> from other devices. If you look at mlx example it uses different
> cmds between RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.
> 
> But in your case seems you simply move what is required in STOP
> into P2P. Probably you can just remove the support of P2P like
> hisilicon does.

In a previous review it was mentioned that P2P is more or less supported 
and this is how we are able to support it. Ideally we would not set the 
P2P feature and just implement the standard STOP/RUNNING states.

> 
>> +
>> +/**
>> + * struct pds_lm_comp - generic command completion
>> + * @status:  Status of the command (enum pds_core_status_code)
>> + * @rsvd:    Structure padding to 16 Bytes
>> + */
>> +struct pds_lm_comp {
>> +     u8 status;
>> +     u8 rsvd[15];
>> +};
> 
> not used. Looks most comp structures are defined w/o an user
> except struct pds_lm_status_comp.

I will look into this. Thanks.