Message-ID: <b72234f9-bc65-2032-7d1a-91028e12ea74@linux.ibm.com>
Date: Tue, 29 Nov 2022 08:51:55 -0500
From: Matthew Rosato <mjrosato@...ux.ibm.com>
To: Niklas Schnelle <schnelle@...ux.ibm.com>,
Robin Murphy <robin.murphy@....com>,
Gerd Bayer <gbayer@...ux.ibm.com>, iommu@...ts.linux.dev,
Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Jason Gunthorpe <jgg@...dia.com>,
Wenjia Zhang <wenjia@...ux.ibm.com>
Cc: Pierre Morel <pmorel@...ux.ibm.com>, linux-s390@...r.kernel.org,
borntraeger@...ux.ibm.com, hca@...ux.ibm.com, gor@...ux.ibm.com,
gerald.schaefer@...ux.ibm.com, agordeev@...ux.ibm.com,
svens@...ux.ibm.com, linux-kernel@...r.kernel.org,
Julian Ruess <julianr@...ux.ibm.com>
Subject: Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of
resource indication
On 11/29/22 7:00 AM, Niklas Schnelle wrote:
> On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
>> On 2022-11-16 17:16, Niklas Schnelle wrote:
>>> When RPCIT indicates that the underlying hypervisor has run out of
>>> resources it often means that its IOVA space is exhausted and IOVAs need
>>> to be freed before new ones can be created. By triggering a flush of the
>>> IOVA queue we can get the queued IOVAs freed and also get the new
>>> mapping established during the global flush.
>>
>> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
>> exhausted and fail the DMA API call before even getting as far as
>> iommu_map(), though? Or is there some less obvious limitation like a
>> maximum total number of distinct IOVA regions regardless of size?
>
> Well, yes and no. Your thinking is of course correct: if the advertised
> IOVA space can be fully utilized without exhausting hypervisor
> resources, we won't trigger this case. Sadly there are complications,
> though. The most obvious is that in QEMU/KVM the restriction of the
> IOVA space to what QEMU can actually have mapped at once was only added
> recently[0]; prior to that we would regularly go through this "I'm out
> of resources, free me some IOVAs" dance with our existing DMA API
> implementation, where it just triggers an early cycle of freeing all
> unused IOVAs followed by a global flush. On z/VM I know of no
> situations where this is triggered. That said, this signalling is

While the QEMU case made for an easily-reproducible scenario, the
indication was really provided to handle the case where you have
multiple pageable guests whose total memory, summed up, is overcommitted
relative to the hypervisor's resources. The intent is for the entire
advertised guest aperture to be usable, and generally speaking it is,
but it's possible to end up in a (poorly tuned) scenario where the
hypervisor is unable to pin additional pages (basically an OOM
condition). The hypervisor (qemu/kvm or z/VM) can then use this
indication as a cry for help: "stop what you're doing and flush your
queues immediately so I can unpin as much as possible". After that the
guest(s) can continue using their aperture.

This is unnecessary for the no-paging bare metal hypervisor because
there you aren't over-committing memory.
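
To make the "flush queued IOVAs, then retry" behaviour concrete, here is
a small standalone C sketch (illustration only, not the actual kernel
patch; every identifier in it is invented). A bounded "hypervisor"
mapping table stands in for the constrained pinning resources, a
deferred-free queue models the dma-iommu flush queue, and the map helper
drains that queue and retries once when the hypervisor reports it is out
of resources, mirroring what happens on an RPCIT out-of-resource
indication as described above.

/*
 * Illustrative simulation only -- not kernel code.  All names are
 * invented for this sketch.
 */
#include <stdbool.h>
#include <stdio.h>

#define HYP_SLOTS   4   /* "hypervisor" can only keep 4 mappings pinned */
#define FQ_DEPTH    8   /* guest-side deferred-free (flush) queue depth */

static bool hyp_slot_used[HYP_SLOTS];

/* Hypervisor map: fails ("out of resources") once all slots are pinned. */
static int hyp_map(unsigned long iova)
{
    for (int i = 0; i < HYP_SLOTS; i++) {
        if (!hyp_slot_used[i]) {
            hyp_slot_used[i] = true;
            printf("hyp: mapped IOVA 0x%lx in slot %d\n", iova, i);
            return i;
        }
    }
    printf("hyp: out of resources for IOVA 0x%lx\n", iova);
    return -1;
}

static void hyp_unmap(int slot)
{
    hyp_slot_used[slot] = false;
}

/* Guest-side flush queue: slots whose unmap has been deferred. */
static int fq[FQ_DEPTH];
static int fq_len;

static void fq_defer_unmap(int slot)
{
    fq[fq_len++] = slot;            /* freed lazily, like a flush queue entry */
}

static void fq_flush(void)
{
    printf("guest: flushing %d queued IOVAs\n", fq_len);
    for (int i = 0; i < fq_len; i++)
        hyp_unmap(fq[i]);
    fq_len = 0;
}

/*
 * Map helper mirroring the idea in the patch: on an out-of-resource
 * indication, flush the queued (deferred-free) IOVAs and retry once.
 */
static int guest_map(unsigned long iova)
{
    int slot = hyp_map(iova);

    if (slot < 0 && fq_len > 0) {
        fq_flush();
        slot = hyp_map(iova);       /* retry after the global flush */
    }
    return slot;
}

int main(void)
{
    int slots[HYP_SLOTS];

    /* Fill every hypervisor slot, then defer-unmap two of them. */
    for (int i = 0; i < HYP_SLOTS; i++)
        slots[i] = guest_map(0x1000ul * (i + 1));
    fq_defer_unmap(slots[0]);
    fq_defer_unmap(slots[1]);

    /* These would fail outright without the flush-and-retry path. */
    guest_map(0xa000);
    guest_map(0xb000);
    return 0;
}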