linux-kernel - Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot fills


Message-ID: <d36fc7ec-cefa-0805-8036-3aea1c44fba2@huawei.com>
Date:   Tue, 3 Nov 2020 17:56:42 +0000
From:   John Garry <john.garry@...wei.com>
To:     Robin Murphy <robin.murphy@....com>,
        "joro@...tes.org" <joro@...tes.org>
CC:     "xiyou.wangcong@...il.com" <xiyou.wangcong@...il.com>,
        Linuxarm <linuxarm@...wei.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        "chenxiang (M)" <chenxiang66@...ilicon.com>,
        "Leizhen (ThunderTown)" <thunder.leizhen@...wei.com>
Subject: Re: [PATCH v2 3/4] iommu/iova: Flush CPU rcache for when a depot
 fills

>> To summarize, the issue is that as time goes by, the CPU rcache and depot
>> rcache continue to grow. As such, IOVA RB tree access time also continues
>> to grow.
> 

Hi Robin,

> I'm struggling to see how this is not simply indicative of a leak
> originating elsewhere. 

It sounds like one, but I don't think it is.

> For the number of magazines to continually grow,
> it means IOVAs *of a particular size* are being freed faster than they
> are being allocated, while the only place that ongoing allocations
> should be coming from is those same magazines!

But that is not how the IOVA caching works. The cache size is not 
defined by how many DMA mappings we have at a given moment in time, or 
by the maximum we had at some earlier point. It just grows up to the 
limit at which all CPU and global depot rcaches are full.

Here's an artificial example of how the rcache can grow, which I hope 
helps to illustrate:
- consider a process which wants many DMA mappings active at a given 
point in time
- if we tie it to cpu0, the cpu0 rcache will grow to 128 * 2
- then tie it to cpu1, and the cpu1 rcache will grow to 128 * 2, so total 
CPU rcache = 2 * 128 * 2. The CPU rcache for cpu0 is not flushed - there 
is no maintenance for this.
- then tie it to cpu2, and the cpu2 rcache will grow to 128 * 2, so total 
CPU rcache = 3 * 128 * 2
- then cpu3, cpu4, and so on.
- We can do this for all CPUs in the system, so the total CPU rcache 
grows from zero -> #CPUs * 128 * 2. Yet no DMA mapping leaks.
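
The arithmetic in the list above can be reproduced with a toy model. 
This is only a sketch in Python, not kernel code: the names 
(run_burst, simulate, cpu_cache) are made up for illustration, each 
CPU is assumed to cache at most 128 * 2 IOVAs of one size, and the 
depot is ignored entirely.

```python
MAG_SIZE = 128                     # entries per magazine (the "128" above)
MAGS_PER_CPU = 2                   # magazines per CPU (the "* 2" above)
CPU_CAP = MAG_SIZE * MAGS_PER_CPU  # most IOVAs one CPU can keep cached


def run_burst(cpu_cache, cpu, n_mappings):
    """Map n_mappings on `cpu`, then unmap them all on the same CPU.

    Returns how many IOVAs had to be allocated from the tree.
    """
    from_tree = 0
    for _ in range(n_mappings):
        if cpu_cache[cpu] > 0:
            cpu_cache[cpu] -= 1    # served from this CPU's rcache
        else:
            from_tree += 1         # cache empty: allocate from the RB tree
    for _ in range(n_mappings):
        if cpu_cache[cpu] < CPU_CAP:
            cpu_cache[cpu] += 1    # freed IOVA stays cached on this CPU
        # else: would go to the depot or the tree (not modelled here)
    return from_tree


def simulate(n_cpus, n_mappings=1000):
    """Run one burst per CPU in turn; return total cached after each."""
    cpu_cache = [0] * n_cpus
    totals = []
    for cpu in range(n_cpus):
        run_burst(cpu_cache, cpu, n_mappings)
        totals.append(sum(cpu_cache))
    return totals


print(simulate(4))
```

After every burst there are zero active mappings in the model, yet 
the total cached count goes up by 128 * 2 for each CPU the workload 
has visited.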

Something similar can happen in normal use, where the scheduler 
relocates processes all over the CPUs in the system as time goes by, 
which causes the total rcache size to continue to grow. In addition, 
the global depot continues to grow very slowly as well. But when the 
global depot does fill, and we start to free magazines to make space 
(as is current policy), that is very slow and causes the performance 
drop.
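
As a rough illustration of where the slow path starts, here is a toy 
model in Python of the free path for one IOVA size on one CPU. It is 
a sketch under assumptions, not the kernel implementation: the class 
and field names are hypothetical, the depot capacity of 32 magazines 
is assumed rather than taken from this thread, and allocations are 
not modelled at all.

```python
MAG_SIZE = 128        # IOVAs per magazine, as in the example above
MAX_DEPOT_MAGS = 32   # ASSUMPTION: depot capacity, in magazines


class RcacheModel:
    """Toy model of the free path for one IOVA size class on one CPU."""

    def __init__(self):
        self.loaded = 0       # entries in the CPU's current magazine
        self.prev = 0         # entries in the CPU's spare magazine
        self.depot_mags = 0   # full magazines parked in the global depot
        self.tree_frees = 0   # IOVAs handed back to the RB tree (slow path)

    def free_iova(self):
        if self.loaded == MAG_SIZE:
            if self.prev < MAG_SIZE:
                # Spare magazine has room: swap it in.
                self.loaded, self.prev = self.prev, self.loaded
            elif self.depot_mags < MAX_DEPOT_MAGS:
                # Both CPU magazines full: park the full one in the depot.
                self.depot_mags += 1
                self.loaded = 0
            else:
                # Depot full too: a whole magazine goes back to the tree.
                self.tree_frees += MAG_SIZE
                self.loaded = 0
        self.loaded += 1

    def cached(self):
        return self.loaded + self.prev + self.depot_mags * MAG_SIZE


def frees_until_slow_path():
    """Free IOVAs one at a time until the first magazine hits the tree."""
    m = RcacheModel()
    n = 0
    while m.tree_frees == 0:
        m.free_iova()
        n += 1
    return n, m


n, m = frees_until_slow_path()
print(n, m.depot_mags, m.tree_frees)
```

In this model nothing goes back to the tree until the two CPU 
magazines and the whole depot are full; from then on, every further 
magazine that fills up costs 128 tree frees in one go.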

> 
> Now indeed that could happen over the short term if IOVAs are allocated
> and freed again in giant batches larger than the total global cache
> capacity, but that would show a cyclic behaviour - when activity starts,
> everything is first allocated straight from the tree, then when it ends
> the caches would get overwhelmed by the large burst of freeing and start
> having to release things back to the tree, but eventually that would
> stop once everything *is* freed, then when activity begins again the
> next round of allocating would inherently clear out all the caches
> before going anywhere near the tree. 

But there is no clearing. A CPU will keep its IOVAs cached indefinitely, 
even when there is no active DMA mapping present at all.

> To me the "steady decline"
> behaviour suggests that someone somewhere is making DMA unmap calls with
> a smaller size than they were mapped with (you tend to notice it quicker
> the other way round due to all the device errors and random memory
> corruption) - in many cases that would appear to work out fine from the
> driver's point of view, but would provoke exactly this behaviour in the
> IOVA allocator.
> 

Thanks,
John