Message-ID: <8404e7ef-d929-bd39-a10d-f4053cc4bd3e@oracle.com>
Date: Tue, 18 Apr 2023 19:50:06 +0100
From: John Garry <john.g.garry@...cle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Vasant Hegde <vasant.hegde@....com>,
Robin Murphy <robin.murphy@....com>, joro@...tes.org,
will@...nel.org, iommu@...ts.linux.dev,
linux-kernel@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH v4] iommu: Optimise PCI SAC address trick
On 18/04/2023 18:36, Linus Torvalds wrote:
>> JFYI, since you are using NVMe, you could also alternatively try
>> something like what I did for some SCSI storage controller drivers to
>> limit the request_queue max_sectors soft limit, like:
> That patch is not only whitespace-damaged, it's randomly missing one
> '+' character
That was my copy-and-paste error.
> so it makes no sense even ignoring the whitespace
> problems. _and_ it has a nonsensical cast to 'unsigned int' which
> makes that 'min()' possibly do crazy and invalid things (ie imagine
> dma_opt_mapping_size() returning 4GB).
>
> You can't cast things to the smaller size just to get rid of a
> warning, for chrissake!
Yeah, sorry, I was just trying to show a very quick demo of how this can
be done.
Indeed, I could have mentioned that it would have been easier to test by
simply feeding a lower limit into /sys/block/<dev>/queue/max_sectors_kb.
>
> In fact, even without the cast, it seems entirely broken, since the
> fallback for dma_opt_mapping_size() is to return 0 (admittedly _that_
> case only happens with HAS_DMA=n).
>
> Finally, doing this inside the
>
> if (ctrl->max_hw_sectors) {
I think that this would be set for PCI NVMe controllers, which is what we
were interested in here. But, indeed, I could check for a better place to
set this.
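For reference, the sort of thing that quick demo was aiming at could be
written without the narrowing cast and without tripping over the zero
return, something like the untested sketch below (the helper name and its
placement are purely illustrative, not the patch I pasted):

#include <linux/dma-mapping.h>
#include <linux/minmax.h>

/* Illustrative only: clamp a driver's sectors limit by the IOMMU's
 * preferred mapping size, leaving it untouched when the helper reports
 * no preference (it returns 0 with HAS_DMA=n).
 */
static u32 clamp_sectors_to_dma_opt(struct device *dma_dev, u32 max_sectors)
{
	size_t opt = dma_opt_mapping_size(dma_dev);

	if (!opt)
		return max_sectors;

	/* Compare in 64 bits so a huge value (e.g. 4GB) cannot be
	 * truncated before min() is taken; the result is always
	 * <= max_sectors, so the u32 return type stays safe.
	 */
	return min_t(u64, max_sectors, opt >> 9);	/* bytes -> 512B sectors */
}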
>
> conditional seems entirely wrong, since any dma mapping limits would
> be entirely independent of any driver maximum hw size, and in fact
> *easier* to hit if the block device itself doesn't have any max
> limits.
>
> So please burn that patch in the darkest pits of hell and let's try to
> forget it ever existed. Ok?
Sure
>
> Also, shouldn't any possible dma mapping size affect not
> 'max_sectors', but 'max_segment_size'? At least the docs imply that
> dma_opt_mapping_size() is about the max size of a _single_ mapping,
> not of the whole thing?
It's meant to apply to the total mapping length and not to a single
segment, so the doc is misleading there.
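To illustrate the distinction, the limit is meant to land on the
request-length side of the queue limits rather than the per-segment side.
Roughly, and purely as an untested sketch (assuming a driver with a
request_queue and its DMA device at hand):

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>
#include <linux/minmax.h>

/* Hypothetical helper, for illustration only */
static void apply_dma_opt_limit(struct request_queue *q, struct device *dma_dev)
{
	size_t opt = dma_opt_mapping_size(dma_dev);

	/* The total per-request length is what dma_opt_mapping_size()
	 * is meant to bound...
	 */
	if (opt)
		blk_queue_max_hw_sectors(q, min_t(u64, queue_max_hw_sectors(q),
						  opt >> SECTOR_SHIFT));

	/* ...whereas a per-scatterlist-element bound would instead go
	 * through blk_queue_max_segment_size(), which is not what is
	 * wanted here.
	 */
}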
>
> Anyway, if this is actually an issue, to the point that it's now being
> discussed for a _second_ block driver subsystem, then shouldn't the
> queue handling just do this all automatically, instead of adding
> random crap to random block driver architectures?
Other storage controllers may enjoy better performance with very large
DMA mappings (whose total length exceeds the IOVA caching limit), so it
was too risky to apply a performance-related change of this nature
across the board when that API was introduced.
So far there had only been a single controller where we were actually
seeing IOVA allocation give very (very) poor performance. However, as
far as I am aware, there was nothing special about that controller,
apart from the fact that it was often creating requests whose length
exceeded that IOVA caching limit, and it was also filling the 32b IOVA
space quickly - that may be because the system had lots of CPUs.
Since there are now reports of poor performance in other storage
controllers and also in networking adapters, I can only assume that
people are testing more often on IOMMU-enabled systems with lots of
CPUs. Having said that, I would still be cautious about applying that
limit everywhere.
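For context, the caching limit I keep referring to comes from the IOVA
rcache only serving allocations up to a fixed order, which is what
dma_opt_mapping_size() reports on the iommu-dma path. The figures below
are from memory, so treat the exact constant as an assumption rather
than gospel:

/* Rough illustration of where the IOVA caching limit comes from; the
 * constant is my recollection of IOVA_RANGE_CACHE_MAX_SIZE in
 * drivers/iommu/iova.c.
 */
#define EXAMPLE_PAGE_SIZE		4096UL
#define EXAMPLE_IOVA_RANGE_CACHE_MAX	6	/* number of cached orders */

static inline unsigned long example_iova_cache_limit(void)
{
	/* 4096 << 5 = 128KiB with 4K pages; longer mappings fall back
	 * to the rbtree allocator, which is where the slowdown shows up
	 * once the 32b IOVA space starts to fill.
	 */
	return EXAMPLE_PAGE_SIZE << (EXAMPLE_IOVA_RANGE_CACHE_MAX - 1);
}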
>
> And no, I don't know this code, so maybe I'm entirely missing
> something, but that patch just raised my hackles enough that I had to
> say something.
Sure.
Thanks,
John