linux-kernel - Re: [PATCH v2] buffer: Fix I/O error due to ARM read-after-read hazard

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191112160855.GA22025@arrakis.emea.arm.com>
Date:   Tue, 12 Nov 2019 16:08:57 +0000
From:   Catalin Marinas <catalin.marinas@....com>
To:     Vincent Whitchurch <vincent.whitchurch@...s.com>
Cc:     torvalds@...ux-foundation.org, axboe@...nel.dk,
        linux@...linux.org.uk, linux-kernel@...r.kernel.org,
        linux-arm-kernel@...ts.infradead.org, will@...nel.org,
        Vincent Whitchurch <rabinv@...s.com>,
        Richard Earnshaw <Richard.Earnshaw@....com>
Subject: Re: [PATCH v2] buffer: Fix I/O error due to ARM read-after-read
 hazard

On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote:
> On my dual-core ARM Cortex-A9, reading from squashfs (over
> dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in
> a read failure in squashfs_read_data().  The errors occur because the
> buffer_uptodate() check fails after wait_on_buffer().  Further debugging
> shows that the bh was in fact uptodate and that there is no actual I/O
> error in the lower layers.
> 
> The problem is caused by the read-after-read hazards in the ARM
> Cortex-A9 MPCore (erratum #761319, see [1]).  The code generated by the
> compiler for the combination of the wait_on_buffer() and
> buffer_uptodate() calls reads the flags value twice from memory (see the
> excerpt of the assembly below).  The new value of the BH_Lock flag is
> seen but the new value of BH_Uptodate is not even though both the bits
> are read from the same memory location.
> 
>  27c:	9d08      	ldr	r5, [sp, #32]
>  27e:	2400      	movs	r4, #0
>  280:	e006      	b.n	290 <squashfs_read_data+0x290>
>  282:	6803      	ldr	r3, [r0, #0]
>  284:	07da      	lsls	r2, r3, #31
>  286:	f140 810d 	bpl.w	4a4 <squashfs_read_data+0x4a4>
>  28a:	3401      	adds	r4, #1
>  28c:	42bc      	cmp	r4, r7
>  28e:	da08      	bge.n	2a2 <squashfs_read_data+0x2a2>
>  290:	f855 0f04 	ldr.w	r0, [r5, #4]!
>  294:	6803      	ldr	r3, [r0, #0]
>  296:	0759      	lsls	r1, r3, #29
>  298:	d5f3      	bpl.n	282 <squashfs_read_data+0x282>
>  29a:	f7ff fffe 	bl	0 <__wait_on_buffer>
> 
> Work around this problem by adding a DMB between the two reads of
> bh->flags, as recommended in the ARM document.  With this barrier, no
> failures have been seen in more than 5000 hours of the same test.
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

I thought we were going to fix the compiler. I found an old thread here:

https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html

Also cc'ing Richard Earnshaw as he may been involved in the gcc
discussion at the time.

While you can add some barrier here, there may be other cases where this
can go wrong.

-- 
Catalin