linux-kernel - Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <99c7585c-47c5-9995-3fe6-c75f412b3479@linux.ibm.com>
Date:   Tue, 15 Mar 2022 13:01:27 -0400
From:   Matthew Rosato <mjrosato@...ux.ibm.com>
To:     "Tian, Kevin" <kevin.tian@...el.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Alex Williamson <alex.williamson@...hat.com>
Cc:     "linux-s390@...r.kernel.org" <linux-s390@...r.kernel.org>,
        "cohuck@...hat.com" <cohuck@...hat.com>,
        "schnelle@...ux.ibm.com" <schnelle@...ux.ibm.com>,
        "farman@...ux.ibm.com" <farman@...ux.ibm.com>,
        "pmorel@...ux.ibm.com" <pmorel@...ux.ibm.com>,
        "borntraeger@...ux.ibm.com" <borntraeger@...ux.ibm.com>,
        "hca@...ux.ibm.com" <hca@...ux.ibm.com>,
        "gor@...ux.ibm.com" <gor@...ux.ibm.com>,
        "gerald.schaefer@...ux.ibm.com" <gerald.schaefer@...ux.ibm.com>,
        "agordeev@...ux.ibm.com" <agordeev@...ux.ibm.com>,
        "svens@...ux.ibm.com" <svens@...ux.ibm.com>,
        "frankja@...ux.ibm.com" <frankja@...ux.ibm.com>,
        "david@...hat.com" <david@...hat.com>,
        "imbrenda@...ux.ibm.com" <imbrenda@...ux.ibm.com>,
        "vneethv@...ux.ibm.com" <vneethv@...ux.ibm.com>,
        "oberpar@...ux.ibm.com" <oberpar@...ux.ibm.com>,
        "freude@...ux.ibm.com" <freude@...ux.ibm.com>,
        "thuth@...hat.com" <thuth@...hat.com>,
        "pasic@...ux.ibm.com" <pasic@...ux.ibm.com>,
        "joro@...tes.org" <joro@...tes.org>,
        "will@...nel.org" <will@...nel.org>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "corbet@....net" <corbet@....net>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v4 15/32] vfio: introduce KVM-owned IOMMU type

On 3/15/22 10:17 AM, Matthew Rosato wrote:
> On 3/15/22 3:57 AM, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@...dia.com>
>>> Sent: Tuesday, March 15, 2022 7:18 AM
>>>
>>> On Mon, Mar 14, 2022 at 04:50:33PM -0600, Alex Williamson wrote:
>>>
>>>>> +/*
>>>>> + * The KVM_IOMMU type implies that the hypervisor will control the
>>> mappings
>>>>> + * rather than userspace
>>>>> + */
>>>>> +#define VFIO_KVM_IOMMU            11
>>>>
>>>> Then why is this hosted in the type1 code that exposes a wide variety
>>>> of userspace interfaces?  Thanks,
>>>
>>> It is really badly named, this is the root level of a 2 stage nested
>>> IO page table, and this approach needed a special flag to distinguish
>>> the setup from the normal iommu_domain.
>>>
>>> If we do try to stick this into VFIO it should probably use the
>>> VFIO_TYPE1_NESTING_IOMMU instead - however, we would like to delete
>>> that flag entirely as it was never fully implemented, was never used,
>>> and isn't part of what we are proposing for IOMMU nesting on ARM
>>> anyhow. (So far I've found nobody to explain what the plan here was..)
>>>
>>> This is why I said the second level should be an explicit iommu_domain
>>> all on its own that is explicitly coupled to the KVM to read the page
>>> tables, if necessary.
>>>
>>> But I'm not sure that reading the userspace io page tables with KVM is
>>> even the best thing to do - the iommu driver already has the pinned
>>> memory, it would be faster and more modular to traverse the io page
>>> tables through the pfns in the root iommu_domain than by having KVM do
>>> the translations. Lets see what Matthew says..
>>>
>>
>> Reading this thread it's sort of like an optimization to software 
>> nesting.
> 
> Yes, we want to avoid breaking to userspace for a very frequent 
> operation (RPCIT / updating shadow mappings)
> 
>> If that is the case does it make more sense to complete the basic form
>> of software nesting first and then adds this optimization?
>>
>> The basic form would allow the userspace to create a special domain
>> type which points to a user/guest page table (like hardware nesting)
>> but doesn't install the user page table to the IOMMU hardware (unlike
>> hardware nesting). When receiving invalidate cmd from userspace > the 
>> iommu driver walks the user page table (1st-level) and the parent
>> page table (2nd-level) to generate a shadow mapping for the
>> invalidated range in the non-nested hardware page table of this
>> special domain type.
>>
>> Once that works what this series does just changes the matter of
>> how the invalidate cmd is triggered. Previously iommu driver receives
>> invalidate cmd from Qemu (via iommufd uAPI) while now receiving
>> the cmd from kvm (via iommufd kAPI) upon interception of RPCIT.
>>  From this angle once the connection between iommufd and kvm fd
>> is established there is even no direct talk between iommu driver and
>> kvm.
> 
> But something somewhere still needs to be responsible for 
> pinning/unpinning of the guest table entries upon each RPCIT 
> interception.  e.g. the RPCIT intercept can happen because the guest 
> wants to invalidate some old mappings or has generated some new mappings 
> over a range, so we must shadow the new mappings (by pinning the guest 
> entries and placing them in the host hardware table / unpinning 
> invalidated ones and clearing their entry in the host hardware table).
> 

OK, this got clarified by Jason in another thread: What I was missing 
here was an assumption that the 1st-level has already mapped and pinned 
all of guest physical address space; in that case there's no need to 
invoke pin/unpin operations against a kvm from within the iommu domain 
(this series as-is does not pin all of the guest physical address space; 
it does pins/unpins on-demand at RPCIT time)