linux-kernel - Re: [PATCH] vfio/pci: Support error recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Thu, 1 Dec 2016 21:38:31 +0800
From:   Cao jin <caoj.fnst@...fujitsu.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
CC:     <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <izumi.taku@...fujitsu.com>, <alex.williamson@...hat.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery



On 11/30/2016 09:46 AM, Michael S. Tsirkin wrote:
> On Mon, Nov 28, 2016 at 05:32:15PM +0800, Cao jin wrote:
>>
>>

>>>
>>>> +	if (severity == AER_FATAL && strcmp(dev->driver->name, "vfio-pci")) {
>>>
>>> You really want some flag in the device, or something similar.
>>> Also, how do we know driver is not going away at this point?
>>>
>>
>> I didn't think of this condition, and I don't quite follow how would driver
>> go away?(device has error happened, then is removed?)
> 
> Yes - hotplug request detected. Does something prevent this?
> 

I wasn't realized there would be possible to have hotplug happened
during error recovery. But, it seems is the aer driver's thing to
consider it?

>>>>   		result = reset_link(dev);
>>>>   		if (result != PCI_ERS_RESULT_RECOVERED)
>>>>   			goto failed;
>>
>>>> @@ -1187,10 +1200,30 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>>>   		return PCI_ERS_RESULT_DISCONNECT;
>>>>   	}
>>>>
>>>> +	/* get device's uncorrectable error status as soon as possible,
>>>> +	 * and signal it to user space. The later we read it, the possibility
>>>> +	 * the register value is mangled grows. */
>>>> +	aer_cap_offset = pci_find_ext_capability(vdev->pdev, PCI_EXT_CAP_ID_ERR);
>>>> +	ret = pci_read_config_dword(vdev->pdev, aer_cap_offset +
>>>> +                                    PCI_ERR_UNCOR_STATUS, &uncor_status);
>>>> +        if (ret)
>>>> +                return PCI_ERS_RESULT_DISCONNECT;
>>>> +
>>>> +	pr_err("device %d got AER detect notification. uncorrectable error status = 0x%x\n", pdev->devfn, uncor_status);//to be removed
>>>>   	mutex_lock(&vdev->igate);
>>>> +
>>>> +	vdev->aer_recovering = true;
>>>> +	reinit_completion(&vdev->aer_error_completion);
>>>> +
>>>> +	/* suspend config space access from user space,
>>>> +	 * when vfio-pci's error recovery process is on */
>>>
>>> what about access to memory etc? Do you need to suspend this as well?
>>>
>>
>> Yes, this question came into my mind a little bit, but I didn't see some
>> existing APIs like pci_cfg_access_xxx which can help to do this.(I am still
>> not familiar with kernel)
> 
> This isn't easy to do at all.
> 

I guess we can use completion in vfio_pci_rw to block all kinds of
access during vfio's error recovery. But from another perspective, now
that we have disabled reset_link in host, I think the access to device
could be performed normally. To me, the benefit of using pci_cfg_access
is: we won't bother to do "wait 3s to do guest's link reset"

> 
>>>> +	pci_cfg_access_trylock(vdev->pdev);
>>>
>>> If you trylock, you need to handle failure.
>>
>> try lock returns 0 if access is already locked, 1 otherwise. Is it necessary
>> to check its return value?
> 
> Locked by whom? You blissfully access as if it's locked by you.
> 
>>
>> -- 
>> Sincerely,
>> Cao jin
>>
> 
> 
> .
> 

-- 
Sincerely,
Cao jin