Message-ID: <dcfe1661-0779-6c50-6126-e9d0609e3911@linux.ibm.com>
Date: Thu, 20 Jan 2022 10:47:30 +0100
From: Pierre Morel <pmorel@...ux.ibm.com>
To: Matthew Rosato <mjrosato@...ux.ibm.com>, linux-s390@...r.kernel.org
Cc: alex.williamson@...hat.com, cohuck@...hat.com,
schnelle@...ux.ibm.com, farman@...ux.ibm.com,
borntraeger@...ux.ibm.com, hca@...ux.ibm.com, gor@...ux.ibm.com,
gerald.schaefer@...ux.ibm.com, agordeev@...ux.ibm.com,
frankja@...ux.ibm.com, david@...hat.com, imbrenda@...ux.ibm.com,
vneethv@...ux.ibm.com, oberpar@...ux.ibm.com, freude@...ux.ibm.com,
thuth@...hat.com, pasic@...ux.ibm.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 21/30] KVM: s390: pci: handle refresh of PCI translations

On 1/19/22 21:02, Matthew Rosato wrote:
> On 1/19/22 1:25 PM, Pierre Morel wrote:
>>
>>
>> On 1/19/22 17:39, Matthew Rosato wrote:
>>> On 1/19/22 4:29 AM, Pierre Morel wrote:
>>>>
>>>>
>>>> On 1/14/22 21:31, Matthew Rosato wrote:
>>> ...
>>>>> +static int dma_table_shadow(struct kvm_vcpu *vcpu, struct zpci_dev *zdev,
>>>>> +			    dma_addr_t dma_addr, size_t size)
>>>>> +{
>>>>> +	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
>>>>> +	struct kvm_zdev *kzdev = zdev->kzdev;
>>>>> +	unsigned long *entry, *gentry;
>>>>> +	int i, rc = 0, rc2;
>>>>> +
>>>>> +	if (!nr_pages || !kzdev)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	mutex_lock(&kzdev->ioat.lock);
>>>>> +	if (!zdev->dma_table || !kzdev->ioat.head[0]) {
>>>>> +		rc = -EINVAL;
>>>>> +		goto out_unlock;
>>>>> +	}
>>>>> +
>>>>> +	for (i = 0; i < nr_pages; i++) {
>>>>> +		gentry = dma_walk_guest_cpu_trans(vcpu, &kzdev->ioat, dma_addr);
>>>>> +		if (!gentry)
>>>>> +			continue;
>>>>> +		entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
>>>>> +
>>>>> +		if (!entry) {
>>>>> +			rc = -ENOMEM;
>>>>> +			goto out_unlock;
>>>>> +		}
>>>>> +
>>>>> +		rc2 = dma_shadow_cpu_trans(vcpu, entry, gentry);
>>>>> +		if (rc2 < 0) {
>>>>> +			rc = -EIO;
>>>>> +			goto out_unlock;
>>>>> +		}
>>>>> +		dma_addr += PAGE_SIZE;
>>>>> +		rc += rc2;
>>>>> +	}
>>>>> +
>>>>
>>>> In case of error, shouldn't we invalidate the shadow tables entries
>>>> we did validate until the error?
>>>
>>> Hmm, I don't think this is strictly necessary - the status returned
>>> should indicate the specified DMA range is now in an indeterminate
>>> state (putting the onus on the guest to take corrective action via a
>>> global refresh).
>>>
>>> In fact I think I screwed that up below in
>>> kvm_s390_pci_refresh_trans, the fabricated status should always be
>>> KVM_S390_RPCIT_INS_RES.
>>
>> OK
>>
>>>
>>>>
>>>>> +out_unlock:
>>>>> +	mutex_unlock(&kzdev->ioat.lock);
>>>>> +	return rc;
>>>>> +}
>>>>> +
>>>>> +int kvm_s390_pci_refresh_trans(struct kvm_vcpu *vcpu, unsigned long req,
>>>>> +			       unsigned long start, unsigned long size,
>>>>> +			       u8 *status)
>>>>> +{
>>>>> +	struct zpci_dev *zdev;
>>>>> +	u32 fh = req >> 32;
>>>>> +	int rc;
>>>>> +
>>>>> +	/* Make sure this is a valid device associated with this guest */
>>>>> +	zdev = get_zdev_by_fh(fh);
>>>>> +	if (!zdev || !zdev->kzdev || zdev->kzdev->kvm != vcpu->kvm) {
>>>>> +		*status = 0;
>>>>
>>>> Wouldn't it be useful to add some debug information here?
>>>> When would this appear?
>>>
>>> Yes, I agree -- One of the follow-ons I'd like to add after this
>>> series is s390dbf entries; this seems like a good spot for one.
>>>
>>> As to when this could happen; it should not under normal
>>> circumstances, but consider something like arbitrary function handles
>>> coming from the intercepted guest instruction. We need to ensure
>>> that the specified function 1) exists and 2) is associated with the
>>> guest issuing the refresh.
>>>
>>>>
>>>> Also if we have this error this looks like we have a VM problem,
>>>> shouldn't we treat this in QEMU and return -EOPNOTSUPP ?
>>>>
>>>
>>> Well, I'm not sure if we can really tell where the problem is (it
>>> could for example indicate a misbehaving guest, or a bug in our KVM
>>> tracking of hostdevs).
>>>
>>> The guest chose the function handle, and if we got here then that
>>> means it doesn't indicate that it's an emulated device, which means
>>> either we are using the assist and KVM should handle the intercept or
>>> we are not and userspace should handle it. But in both of those
>>> cases, there should be a host device and it should be associated with
>>> the guest.
>>
>> That is right if we cannot find an associated zdev = F(fh),
>> but the two other errors are KVM or QEMU errors AFAIU.
>
> I don't think we know for sure for any of the cases... For a
> well-behaved guest I agree with your assessment. However, the guest
> decides what fh to put into its refresh instruction and so a misbehaving
> guest could just pick arbitrary numbers for fh and circumstantially
> match some other host device. What if the guest just decided to try
> every single possible fh number in a loop with a refresh instruction?
> That's neither KVM nor QEMU's fault but can trip each of these cases.
>
> Consider the different cases:
>
> !zdev - Either the guest provided a bogus fh, KVM provided a bad fh via
> the VFIO ioctl which then QEMU fed into CLP or KVM provided the right fh
> via ioctl but QEMU clobbered it when providing it to the guest via CLP.
>
> !zdev->kzdev - Either the guest provided a bogus fh that just so
> happened to match a host fh that has no KVM association, or KVM or QEMU
> screwed up somewhere (as above or because we failed to make the KVM
> association somehow).
>
> kzdev->kvm != vcpu->kvm - Pretty much the same as above, but the
> matching device is actually in use by some other guest. Again it's
> possible that a misbehaving guest 'got lucky' with an arbitrary fh that
> happened to match a host fh with an existing KVM association -- or more
> likely that KVM or QEMU screwed up somewhere.

OK, I understand, and you are right. My mistake was to assume that
get_zdev_by_fh() returns a zdev associated with an FH that is valid for
the guest, whereas it returns a zdev associated with an FH that is valid
for the host.

If the comment had been placed after the get_zdev_by_fh() call and
before the test, I probably would not have made this mistake.
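
To make the three cases concrete, here is a minimal user-space sketch of
the check, with the comment placed between the lookup and the test. It
is not the kernel code: the struct definitions are reduced stand-ins,
and lookup_host_fh(), check_fh() and enum fh_check are hypothetical
names used only for illustration. The only point it shows is that the
lookup is host-wide, so the guest association has to be verified
afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the kernel types (assumption, not the real layout). */
struct kvm { int id; };
struct kvm_zdev { struct kvm *kvm; };
struct zpci_dev { uint32_t fh; struct kvm_zdev *kzdev; };

/* Hypothetical result codes, one per case discussed in the thread. */
enum fh_check { FH_OK, FH_NO_ZDEV, FH_NO_KZDEV, FH_OTHER_GUEST };

/* Hypothetical host-wide lookup: it searches all host devices and
 * knows nothing about which guest, if any, owns the match. */
static struct zpci_dev *lookup_host_fh(struct zpci_dev *devs, size_t n,
				       uint32_t fh)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].fh == fh)
			return &devs[i];
	return NULL;
}

enum fh_check check_fh(struct zpci_dev *devs, size_t n,
		       const struct kvm *guest, uint32_t fh)
{
	struct zpci_dev *zdev = lookup_host_fh(devs, n, fh);

	/*
	 * At this point the fh is at best valid for the host. Make sure
	 * the device is also associated with the guest that issued the
	 * request.
	 */
	if (!zdev)
		return FH_NO_ZDEV;	/* no host device has this fh */
	if (!zdev->kzdev)
		return FH_NO_KZDEV;	/* host device without KVM association */
	if (zdev->kzdev->kvm != guest)
		return FH_OTHER_GUEST;	/* device belongs to another guest */
	return FH_OK;
}

/* Example: one device per case, checked on behalf of guest g1. */
void example_usage(void)
{
	struct kvm g1 = { 1 }, g2 = { 2 };
	struct kvm_zdev k1 = { &g1 }, k2 = { &g2 };
	struct zpci_dev devs[] = { { 0x10, &k1 }, { 0x20, NULL }, { 0x30, &k2 } };

	assert(check_fh(devs, 3, &g1, 0x10) == FH_OK);
	assert(check_fh(devs, 3, &g1, 0x30) == FH_OTHER_GUEST);
}
```

A guest that loops over arbitrary fh values can reach any of the three
failure results in this sketch without KVM or QEMU being at fault.
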
>
>>
>>>
>>> I think if we decide to throw this to userspace in this event, QEMU
>>> needs some extra code to handle it (basically, if QEMU receives the
>>> intercept and the device is neither emulated nor using intercept mode
>>> then we must treat as an invalid handle as this intercept should have
>>> been handled by KVM)
>>
>> I do not want to start a discussion on this. I think we can leave it
>> like this for now and come back to it when we have a good idea of how
>> to handle this.
>> Maybe just add a /* TODO */
>
> OK, sure. In any of the above cases, we are certainly done in KVM
> anyway. Whether there's value in passing it onto userspace vs
> immediately giving an error, let's think about it.

No, I no longer think we should pass this on to userspace.
Sorry for the wrong idea.

--
Pierre Morel
IBM Lab Boeblingen