linux-kernel - Re: [PATCH 2/2] iommu/iova: Improve restart logic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <b9417479-4e45-8b31-bc32-d20739c97a10@huawei.com>
Date:   Thu, 18 Mar 2021 16:07:11 +0000
From:   John Garry <john.garry@...wei.com>
To:     Robin Murphy <robin.murphy@....com>, <joro@...tes.org>
CC:     <vjitta@...eaurora.org>, <iommu@...ts.linux-foundation.org>,
        Linuxarm <linuxarm@...wei.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] iommu/iova: Improve restart logic

> 
> Well yeah, in your particular case you're allocating from a heavily 
> over-contended address space, so much of the time it is genuinely full. 
> Plus you're primarily churning one or two sizes of IOVA, so there's a 
> high chance that you will either allocate immediately from the cached 
> node (after a previous free), or search the whole space and fail. In 
> case it was missed, searching only some arbitrary subset of the space 
> before giving up is not a good behaviour for an allocator to have in 
> general.
> 
>>> So since the retry means that we search through the complete pfn 
>>> range most of the time (due to poor success rate), we should be able 
>>> to do a better job at maintaining an accurate max alloc size, by 
>>> calculating it from the range search, and not relying on max alloc 
>>> failed or resetting it frequently. Hopefully that would mean that 
>>> we're smarter about not trying the allocation.
>>
>> So I tried that out and we seem to be able to scrap back an 
>> appreciable amount of performance. Maybe 80% of original, with with 
>> another change, below.
> 
> TBH if you really want to make allocation more efficient I think there 
> are more radical changes that would be worth experimenting with, like 
> using some form of augmented rbtree to also encode the amount of free 
> space under each branch, or representing the free space in its own 
> parallel tree, or whether some other structure entirely might be a 
> better bet these days.
> 
> And if you just want to make your thing acceptably fast, now I'm going 
> to say stick a quirk somewhere to force the "forcedac" option on your 
> platform ;)
> 

Easier said than done :)

But still, I'd like to just be able to cache all IOVA sizes for my DMA 
engine, so we should not have to go near the RB tree often.

I have put together a series to allow upper limit of rcache range be 
increased per domain. So naturally that gives better performance than we 
originally had.

I don't want to prejudice the solution by saying what I think of it now, 
so will send it out...


> [...]
>>>>> @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
>>>>> iova_domain *iovad,
>>>>>           if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
>>>>>               high_pfn = limit_pfn;
>>>>>               low_pfn = retry_pfn;
>>>>> -            curr = &iovad->anchor.node;
>>>>> +            curr = iova_find_limit(iovad, limit_pfn);
>>
>>
>> I see that it is now applied. However, alternatively could we just add 
>> a zero-length 32b boundary marker node for the 32b pfn restart point?
> 
> That would need special cases all over the place to prevent the marker 
> getting merged into reservations or hit by lookups, and at worst break 
> the ordering of the tree if a legitimate node straddles the boundary. I 
> did consider having the insert/delete routines keep track of yet another 
> cached node for whatever's currently the first thing above the 32-bit 
> boundary, but I was worried that might be a bit too invasive.

Yeah, I did think of that. I don't think that it would have too much 
overhead.

> 
> FWIW I'm currently planning to come back to this again when I have a bit 
> more time, since the optimum thing to do (modulo replacing the entire 
> algorithm...) is actually to make the second part of the search 
> *upwards* from the cached node to the limit. Furthermore, to revive my 
> arch/arm conversion I think we're realistically going to need a 
> compatibility option for bottom-up allocation to avoid too many nasty 
> surprises, so I'd like to generalise things to tackle both concerns at 
> once.
> 

Thanks,
John