linux-kernel - Re: [PATCH] vfio/pci: Support error recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 12 Dec 2016 21:49:01 +0800
From:   Cao jin <caoj.fnst@...fujitsu.com>
To:     <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>
CC:     <izumi.taku@...fujitsu.com>, <alex.williamson@...hat.com>,
        <mst@...hat.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery

Hi,
I have 2 solutions(high level design) came to me, please see if they are
acceptable, or which one is acceptable. Also have some questions.

1. block guest access during host recovery

   add new field error_recovering in struct vfio_pci_device to
   indicate host recovery status. aer driver in host will still do
   reset link

   - set error_recovering in vfio-pci driver's error_detected, used to
     block all kinds of user access(config space, mmio)
   - in order to solve concurrent issue of device resetting & user
     access, check device state[*] in vfio-pci driver's resume, see if
     device reset is done, if it is, then clear"error_recovering", or
     else new a timer, check device state periodically until device
     reset is done. (what if device reset don't end for a long time?)
   - In qemu, translate guest link reset to host link reset.
     A question here: we already have link reset in host, is a second
     link reset necessary? why?
 
   [*] how to check device state: reading certain config space
       register, check return value is valid or not(All F's)

2. skip link reset in aer driver of host kernel, for vfio-pci.
   Let user decide how to do serious recovery

   add new field "user_driver" in struct pci_dev, used to skip link
   reset for vfio-pci; add new field "link_reset" in struct
   vfio_pci_device to indicate link has been reset or not during
   recovery

   - set user_driver in vfio_pci_probe(), to skip link reset for
     vfio-pci in host.
   - (use a flag)block user access(config, mmio) during host recovery
     (not sure if this step is necessary)
   - In qemu, translate guest link reset to host link reset.
   - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
     is executed
   - In vfio-pci driver's resume, new a timer, check "link_reset" field
     periodically, if it is set in reasonable time, then clear it and
     delete timer, or else, vfio-pci driver will does the link reset!


A quick question:
I don't know how devices is divided into iommu groups, is it possible
for functions in a multi-function device to be split into different groups?

-- 
Sincerely,
Cao jin


On 11/27/2016 07:34 PM, Cao jin wrote:
> It is user space driver's or device-specific driver's(in guest) responsbility
> to do a serious recovery when error happened. Link-reset is one part of
> recovery, when pci device is assigned to VM via vfio, link-reset will do
> twice in host & guest separately, which will cause many trouble for a
> successful recovery, so, disable the vfio-pci's link-reset in aer driver
> in host, this is a keypoint for guest to do error recovery successfully.
> 
> CC: alex.williamson@...hat.com
> CC: mst@...hat.com
> Signed-off-by: Cao jin <caoj.fnst@...fujitsu.com>
> ---
> This is actually a RFC version(has debug lines left), and has minor changes in
> aer driver, so I think maybe it is better not to CC pci guys in this round.
> Later will do.
> 
>  drivers/pci/pcie/aer/aerdrv_core.c  | 12 ++++++-
>  drivers/vfio/pci/vfio_pci.c         | 63 +++++++++++++++++++++++++++++++++++--
>  drivers/vfio/pci/vfio_pci_private.h |  2 ++
>  3 files changed, 74 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 521e39c..289fb8e 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -496,7 +496,17 @@ static void do_recovery(struct pci_dev *dev, int severity)
>  			"error_detected",
>  			report_error_detected);
>  
> -	if (severity == AER_FATAL) {
> +	/* vfio-pci as a general meta driver, it actually couldn't do any real
> +	 * recovery for device. It is user space driver, or device-specific
> +	 * driver in guest who should take care of the serious error recovery,
> +	 * link reset actually is one part of whole recovery. Doing reset_link
> +	 * in aer driver of host kernel for vfio-pci devices will cause many
> +	 * trouble for user space driver or guest's device-specific driver,
> +	 * for example: the serious recovery often need to read register in
> +	 * config space, but if register reading happens during link-resetting,
> +	 * it is quite possible to return invalid value like all F's, which
> +	 * will result in unpredictable error. */
> +	if (severity == AER_FATAL && strcmp(dev->driver->name, "vfio-pci")) {
>  		result = reset_link(dev);
>  		if (result != PCI_ERS_RESULT_RECOVERED)
>  			goto failed;
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 712a849..aefd751 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -535,6 +535,15 @@ static long vfio_pci_ioctl(void *device_data,
>  	struct vfio_pci_device *vdev = device_data;
>  	unsigned long minsz;
>  
> +	if (vdev->aer_recovering && (cmd == VFIO_DEVICE_SET_IRQS ||
> +	    cmd == VFIO_DEVICE_RESET || cmd == VFIO_DEVICE_PCI_HOT_RESET)) {
> +		int ret;
> +		ret = wait_for_completion_interruptible(
> +			&vdev->aer_error_completion);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	if (cmd == VFIO_DEVICE_GET_INFO) {
>  		struct vfio_device_info info;
>  
> @@ -1117,6 +1126,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	vdev->irq_type = VFIO_PCI_NUM_IRQS;
>  	mutex_init(&vdev->igate);
>  	spin_lock_init(&vdev->irqlock);
> +	init_completion(&vdev->aer_error_completion);
>  
>  	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>  	if (ret) {
> @@ -1176,6 +1186,9 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>  {
>  	struct vfio_pci_device *vdev;
>  	struct vfio_device *device;
> +	u32 uncor_status = 0;
> +	unsigned int aer_cap_offset = 0;
> +	int ret;
>  
>  	device = vfio_device_get_from_dev(&pdev->dev);
>  	if (device == NULL)
> @@ -1187,10 +1200,30 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>  		return PCI_ERS_RESULT_DISCONNECT;
>  	}
>  
> +	/* get device's uncorrectable error status as soon as possible,
> +	 * and signal it to user space. The later we read it, the possibility
> +	 * the register value is mangled grows. */
> +	aer_cap_offset = pci_find_ext_capability(vdev->pdev, PCI_EXT_CAP_ID_ERR);
> +	ret = pci_read_config_dword(vdev->pdev, aer_cap_offset + 
> +                                    PCI_ERR_UNCOR_STATUS, &uncor_status);
> +        if (ret)
> +                return PCI_ERS_RESULT_DISCONNECT;
> +
> +	pr_err("device %d got AER detect notification. uncorrectable error status = 0x%x\n", pdev->devfn, uncor_status);//to be removed
>  	mutex_lock(&vdev->igate);
> +    
> +	vdev->aer_recovering = true;
> +	reinit_completion(&vdev->aer_error_completion);
> +
> +	/* suspend config space access from user space,
> +	 * when vfio-pci's error recovery process is on */
> +	pci_cfg_access_trylock(vdev->pdev);
>  
> -	if (vdev->err_trigger)
> -		eventfd_signal(vdev->err_trigger, 1);
> +	if (vdev->err_trigger && uncor_status) {
> +		pr_err("device %d signal uncor status to user space", pdev->devfn);//may be removed
> +		/* signal uncorrectable error status to user space */
> +		eventfd_signal(vdev->err_trigger, uncor_status);
> +        }
>  
>  	mutex_unlock(&vdev->igate);
>  
> @@ -1199,8 +1232,34 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>  	return PCI_ERS_RESULT_CAN_RECOVER;
>  }
>  
> +static void vfio_pci_aer_resume(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_device *vdev;
> +	struct vfio_device *device;
> +
> +	device = vfio_device_get_from_dev(&pdev->dev);
> +	if (device == NULL)
> +		return;
> +
> +	vdev = vfio_device_data(device);
> +	if (vdev == NULL) {
> +		vfio_device_put(device);
> +		return;
> +	}
> +
> +	/* vfio-pci's error recovery is done, time to
> +	 * resume pci config space's accesses */
> +	pci_cfg_access_unlock(vdev->pdev);
> +
> +	vdev->aer_recovering = false;
> +	complete_all(&vdev->aer_error_completion);
> +
> +	vfio_device_put(device);
> +}
> +
>  static const struct pci_error_handlers vfio_err_handlers = {
>  	.error_detected = vfio_pci_aer_err_detected,
> +	.resume         = vfio_pci_aer_resume,
>  };
>  
>  static struct pci_driver vfio_pci_driver = {
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8a7d546..ebf1041 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -83,6 +83,8 @@ struct vfio_pci_device {
>  	bool			bardirty;
>  	bool			has_vga;
>  	bool			needs_reset;
> +	bool			aer_recovering;
> +	struct completion	aer_error_completion;
>  	struct pci_saved_state	*pci_saved_state;
>  	int			refcnt;
>  	struct eventfd_ctx	*err_trigger;
>