linux-kernel - Re: [PATCH 06/10] swiotlb: use swiotlb_map_page in swiotlb_map_sg

Message-ID: <ad8ed3ba-12e8-3031-7c66-035b6d9ad6cd@arm.com>
Date:   Mon, 19 Nov 2018 19:36:44 +0000
From:   Robin Murphy <robin.murphy@....com>
To:     Christoph Hellwig <hch@....de>,
        John Stultz <john.stultz@...aro.org>
Cc:     konrad.wilk@...cle.com, Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will.deacon@....com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        iommu@...ts.linux-foundation.org,
        Valentin Schneider <valentin.schneider@....com>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH 06/10] swiotlb: use swiotlb_map_page in
 swiotlb_map_sg_attrs

On 09/11/2018 16:37, Robin Murphy wrote:
> On 09/11/2018 07:49, Christoph Hellwig wrote:
>> On Tue, Nov 06, 2018 at 05:27:14PM -0800, John Stultz wrote:
>>> But at that point if I just re-apply "swiotlb: use swiotlb_map_page in
>>> swiotlb_map_sg_attrs", I reproduce the hangs.
>>>
>>> Any suggestions for how to further debug what might be going wrong
>>> would be appreciated!
>>
>> Very odd.  In the end map_sg and map_page are defined to do the same
>> things to start with.  The only real issue we had in this area was:
>>
>> "[PATCH v2] of/device: Really only set bus DMA mask when appropriate"
>>
>> so with current mainline + that you still see a problem, and if you
>> revert the commit we are replying to it still goes away?
> 
> OK, after quite a bit of trying I have managed to provoke a 
> similar-looking problem with straight 4.20-rc1 on my Juno board - so far 
> my "reproducer" is to decompress a ~10GB .tar.xz off an external USB 
> hard disk, wherein after somewhere between 5 minutes and half an hour or 
> so it tends to fall over with xz choking on corrupt data and/or a USB 
> error.
> 
> From the presentation, this really smells like there's some corner in 
> which we're either missing cache maintenance or doing it to the wrong 
> address - I've not seen any issues with Juno's main PCIe-attached I/O, 
> but the EHCI here is non-coherent (and 32-bit, so the bus_dma_mask thing 
> doesn't matter) as are the HiKey UFS and SD controller.
> 
> I'll keep digging...

OK, having brought my HiKey to life and reproduced John's stall with 
rc1, what's going on is that at some point dma_map_sg() returns 0, which 
causes the SCSI/UFS layer to go round in circles repeatedly trying to 
map the same list(s) equally unsuccessfully.
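
To illustrate why a deterministic mapping failure shows up as a stall rather than an error, here is a minimal userspace sketch. It is not the SCSI/UFS code: model_map_sg() and model_dispatch() are hypothetical stand-ins, and the max_attempts bound exists only so the model terminates.

```c
#include <stdbool.h>

/* Hypothetical stand-in for dma_map_sg(): returns the number of mapped
 * entries, or 0 on failure. With always_fails set it fails every time,
 * as a mapping does when the same list always hits the same condition. */
static int model_map_sg(int nents, bool always_fails)
{
    return always_fails ? 0 : nents;
}

/* Hypothetical dispatch loop that retries the same request whenever the
 * mapping fails. Returns the number of attempts used, or -1 if the
 * (model-only) attempt limit ran out without ever mapping the list. */
static int model_dispatch(int nents, bool always_fails, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (model_map_sg(nents, always_fails) > 0)
            return attempt;   /* mapped: the request can be issued */
        /* mapping failed: go round again with the same list */
    }
    return -1;
}
```

With always_fails false, model_dispatch() succeeds on the first attempt; with it true, no number of retries ever helps, which is the behaviour described above.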

Why does dma_map_sg() fail? Turns out what we all managed to overlook is 
that this patch *does* introduce a subtle change in behaviour, in that 
previously the non-bounced case assigned dev_addr to sg->dma_address 
without looking at it; now with the swiotlb_map_page() call we check the 
return value against DIRECT_MAPPING_ERROR regardless of whether it was 
bounced or not.
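
The change in behaviour can be sketched in plain userspace C. This is a simplified model, not the kernel source: the model_* names are hypothetical, the device address is assumed to equal the physical address (no bouncing), and the error sentinel is assumed to be 0, which is what the rest of this message implies.

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Assumption: the mapping-error sentinel is 0. */
#define MODEL_MAPPING_ERROR ((dma_addr_t)0)

struct model_sg {
    uint64_t   phys;         /* physical address of the segment */
    dma_addr_t dma_address;  /* filled in by the mapping call */
};

/* Non-bounced single mapping: device address == physical address here. */
static dma_addr_t model_map_page(uint64_t phys)
{
    return (dma_addr_t)phys;
}

/* Old behaviour: the non-bounced address is assigned without a check. */
static int model_map_sg_old(struct model_sg *sg, int nents)
{
    for (int i = 0; i < nents; i++)
        sg[i].dma_address = (dma_addr_t)sg[i].phys;
    return nents;
}

/* New behaviour: every result is compared against the sentinel, so a
 * perfectly valid address of 0 makes the whole list fail. */
static int model_map_sg_new(struct model_sg *sg, int nents)
{
    for (int i = 0; i < nents; i++) {
        dma_addr_t addr = model_map_page(sg[i].phys);

        if (addr == MODEL_MAPPING_ERROR)
            return 0;         /* reported to the caller as failure */
        sg[i].dma_address = addr;
    }
    return nents;
}
```

In this model a list containing a segment at physical address 0 maps fine through the old path and returns 0 through the new one, while lists that avoid address 0 behave identically in both.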

Flash back to the other thread when I said "...but I suspect there may 
well be non-IOMMU platforms where DMA to physical address 0 is a thing 
:("? I have the 3GB HiKey where all the RAM is below 32 bits so SWIOTLB 
never ever bounces, but sure enough, guess where that RAM starts...

So in fact it looks like patch #4 technically introduces the first 
instance of this problem, we're just getting lucky not to hit it with a 
map_page/map_single case such that direct_mapping_error() would wrongly 
report failure for page 0. The bad news (for me) is that that can't have 
anything to do with my apparent memory corruption thing above, so now I 
still need to figure out what the hell is going on there.
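
For the single-mapping path, a rough model of why it is down to luck: only a buffer in the very first page of RAM trips the check, and only on a board whose RAM starts at physical address 0. The names, the 4 KiB page size and the 0-valued sentinel below are assumptions for illustration, not the kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */

/* Hypothetical error check, assuming the sentinel value is 0. */
static bool model_direct_mapping_error(dma_addr_t addr)
{
    return addr == 0;
}

/* Hypothetical direct (non-bounced) mapping of one page: the device
 * address is just the physical address of that page. */
static dma_addr_t model_map_single(uint64_t ram_base, uint64_t page_index)
{
    return (dma_addr_t)(ram_base + page_index * MODEL_PAGE_SIZE);
}
```

With ram_base 0, page 0 is reported as an error and every other page is not; with RAM starting anywhere else, the check never fires for a valid direct mapping.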

Robin.