lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 2 Mar 2022 19:02:44 -0800
From:   Saravana Kannan <saravanak@...gle.com>
To:     Peter Rosin <peda@...ntia.se>
Cc:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        Nicolas Ferre <Nicolas.Ferre@...rochip.com>,
        Alexandre Belloni <alexandre.belloni@...tlin.com>,
        Ludovic Desroches <Ludovic.Desroches@...rochip.com>,
        Daniels Umanovskis <du@...ntia.se>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@...ntia.se> wrote:
>
> Hi!
>
> I'm seeing a weird problem, and I'd like some help with further
> things to try in order to track down what's going on. I have
> bisected the issue to
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")

I skimmed through your email and I'll read it more closely tomorrow,
but it wasn't clear if you see this on Linus's tip of the tree too.
Asking because of:
https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@huawei.com/

Also, a couple of other data points that _might_ help. Try kernel
command line option fw_devlink=permissive vs fw_devlink=on (I forget
if this was the default by 5.10) vs fw_devlink=off.

I'm expecting "off" to fix the issue for you. But if permissive vs on
shows a difference driver issues would start becoming a real
possibility.

-Saravana

>
> The symptoms are that I get (seemingly) random memory corruption
> when processing large amounts of data (compared to system size).
> I have two known reproducers, but I'm sure there are more if I
> keep digging. One is to do this:
>
> $ dd if=/dev/urandom of=testfile bs=1024 count=40000
> 40000+0 records in
> 40000+0 records out
> 40960000 bytes (41 MB, 39 MiB) copied, 19.7759 s, 2.1 MB/s
> $ for i in 1 2 3 4; do cat testfile | sha256sum; done
> d8c85f816e08baa5ad27050bf0413e11a09f325fb0a8843b7b2b45b9333ab542  -
> f223c1cbb6dbecb02d1741e7991dc98cd8d5b40ffee05bb32dc2c15eb73d6b1f  -
> d6f3e7f3d325c67e83a6104934dd8a7c891ebfd9a2cf59633dbe97fb2cbb9c81  -
> cf8ada47e7e2fee299314440b225ba83fca3cef1f6286adc160a5d4f207caccd  -
>
> It is harder to tickle the problem if I redirect the testfile to
> sha256sum w/o involving cat or give the file as an argument to
> sha256sum. I can also get things to behave better by getting rid
> of a bunch of USB interrupts by doing the following:
>
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB2/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB3/latency_timer
>
> With the lower interrupt pressure I get this:
>
> $ for i in 1 2 3 4; do cat testfile | sha256sum; done
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
>
> Nice. However, I need the latency to be lower than the default
> 16ms, 3ms could perhaps work in theory, but preferably 1ms, so
> the above 100ms is far off. The initial hash run was with latency
> set to 1ms, which makes it easy to trigger the issue. The latency
> timer setting is for this driver: drivers/usb/serial/ftdi_sio.c
>
> And also, that does not help with the other reproducer, namely
> to copy that same random testfile with scp to a working system...
>
> $ scp testfile peda@...zy:testfile1
> testfile                                      100%   39MB   2.0MB/s   00:19
> $ scp testfile peda@...zy:testfile2
> testfile                                      100%   39MB   2.1MB/s   00:18
> $ scp testfile peda@...zy:testfile3
> testfile                                      100%   39MB   2.1MB/s   00:18
> $ scp testfile peda@...zy:testfile4
> testfile                                      100%   39MB   2.1MB/s   00:19
>
> ...and then perform the sha256sum on that xyzzy host instead:
>
> $ sha256sum testfile?
> 39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66  testfile1
> 9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206  testfile2
> c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d  testfile3
> 60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310  testfile4
> $ sha256sum testfile?
> 39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66  testfile1
> 9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206  testfile2
> c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d  testfile3
> 60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310  testfile4
>
> Same output every time. Of course. xyzzy is a working system...
> Converting these files to hex (hexdump -C) and diffing yields this:
>
> $ diff -u0 testfile1.hex testfile2.hex
> --- testfile1.hex       2022-03-02 23:56:38.273149516 +0100
> +++ testfile2.hex       2022-03-03 00:00:57.912747033 +0100
> @@ -8658,2 +8658,2 @@
> -00021d10  08 2a dd c6 c8 0f 0d e2  4c 1e 46 21 f9 89 a2 54  |.*......L.F!...T|
> -00021d20  23 8c 4f f1 46 f1 61 05  ee f2 d2 ee 56 79 4f 28  |#.O.F.a.....VyO(|
> +00021d10  7b c8 d2 0b f4 ca 5f ba  61 b3 93 04 59 8f ed bf  |{....._.a...Y...|
> +00021d20  2a f8 fb 0c ad 0e 23 2a  3e cf d3 10 02 ef 04 b9  |*.....#*>.......|
> @@ -20592,2 +20592,2 @@
> -000506f0  1f 6c ca 6b a6 2a 39 a6  1f bd b0 67 5b 22 1a dd  |.l.k.*9....g["..|
> -00050700  8b 6d 86 7c 87 37 ee a8  46 4d e5 79 0e 3e 96 e6  |.m.|.7..FM.y.>..|
> +000506f0  ad e6 d5 65 e6 dc c1 a3  e2 ba c9 e2 61 39 5f 5f  |...e........a9__|
> +00050700  bf eb 8e 5c 08 f1 f2 89  3c 57 c5 07 b9 f4 91 fc  |...\....<W......|
> @@ -461019,2 +461019,2 @@
> -00708da0  0d 49 c3 e8 57 06 20 5a  c1 27 74 29 f8 83 af 69  |.I..W. Z.'t)...i|
> -00708db0  94 4d 5b 71 9f 3e e5 d2  91 cc cb cd aa ff 44 8b  |.M[q.>........D.|
> +00708da0  d3 b4 96 d6 40 8d 79 67  69 68 fd 10 b4 15 82 e6  |....@...ih......|
> +00708db0  5f f4 10 92 ae 39 9d 92  42 88 44 3b be 35 38 33  |_....9..B.D;.583|
> @@ -902788,2 +902788,2 @@
> -00dc6830  f2 41 23 1b ec 54 d5 fe  f0 33 51 f7 d2 fc bf bd  |.A#..T...3Q.....|
> -00dc6840  e5 1f 58 df 24 2f e3 dc  65 87 b2 27 12 86 d1 9a  |..X.$/..e..'....|
> +00dc6830  44 82 94 b5 c9 26 08 42  bd 89 e1 96 41 66 8a b5  |D....&.B....Af..|
> +00dc6840  a5 34 46 5e fd 1b c1 73  86 33 24 fd 4d e1 e1 68  |.4F^...s.3$.M..h|
> @@ -931900,2 +931900,2 @@
> -00e383b0  ee 64 c5 6f 38 44 5b 31  41 e1 2c 64 49 d5 f8 ad  |.d.o8D[1A.,dI...|
> -00e383c0  fb 85 52 4f 00 1f 80 7a  f3 de ee 8e db ac d5 bb  |..RO...z........|
> +00e383b0  4b 4d 29 a1 0a 99 8f f7  32 71 8c de 23 ca a0 f1  |KM).....2q..#...|
> +00e383c0  e2 af e3 c4 a0 95 d3 1c  ed 58 c4 c5 30 da 56 b9  |.........X..0.V.|
> @@ -1170109,2 +1170109,2 @@
> -011dabc0  6a 7c 0c 3c 86 1a b6 48  50 d7 98 68 0c 01 e3 1c  |j|.<...HP..h....|
> -011dabd0  a3 a8 b0 f2 62 21 86 b9  d1 52 9d 74 9e 26 42 51  |....b!...R.t.&BQ|
> +011dabc0  5b 1a 9e 23 ae 58 42 68  83 58 df d6 c1 57 6b b0  |[..#.XBh.X...Wk.|
> +011dabd0  ec d5 50 8b 76 5e 96 b4  49 21 f7 e4 b7 8f a3 45  |..P.v^..I!.....E|
> @@ -1880164,2 +1880164,2 @@
> -01cb0630  1c 74 74 16 75 b4 de f7  ce 4b 5e 4d 97 d6 36 d4  |.tt.u....K^M..6.|
> -01cb0640  44 d9 fd 69 c5 d0 f0 a6  c6 44 26 53 7f 91 f3 62  |D..i.....D&S...b|
> +01cb0630  73 bc 40 ce f8 9d 99 91  1b 14 8b a8 52 2a 7b 39  |s.@.........R*{9|
> +01cb0640  6b ff f5 c5 02 b9 ab c2  c2 08 5e e7 3a 5e 69 c4  |k.........^.:^i.|
>
> Grepping (some of the above) for duplicates yields this:
>
> $ egrep "0  (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile1.hex
> 00020d40  7b c8 d2 0b f4 ca 5f ba  61 b3 93 04 59 8f ed bf  |{....._.a...Y...|
> 00020d50  2a f8 fb 0c ad 0e 23 2a  3e cf d3 10 02 ef 04 b9  |*.....#*>.......|
> 00021d10  08 2a dd c6 c8 0f 0d e2  4c 1e 46 21 f9 89 a2 54  |.*......L.F!...T|
> 00021d20  23 8c 4f f1 46 f1 61 05  ee f2 d2 ee 56 79 4f 28  |#.O.F.a.....VyO(|
> $ egrep "0  (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile2.hex
> 00020d40  7b c8 d2 0b f4 ca 5f ba  61 b3 93 04 59 8f ed bf  |{....._.a...Y...|
> 00020d50  2a f8 fb 0c ad 0e 23 2a  3e cf d3 10 02 ef 04 b9  |*.....#*>.......|
> 00021d10  7b c8 d2 0b f4 ca 5f ba  61 b3 93 04 59 8f ed bf  |{....._.a...Y...|*
> 00021d20  2a f8 fb 0c ad 0e 23 2a  3e cf d3 10 02 ef 04 b9  |*.....#*>.......|*
>
> $ egrep "0  (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile1.hex
> 0004f6f0  1f 6c ca 6b a6 2a 39 a6  1f bd b0 67 5b 22 1a dd  |.l.k.*9....g["..|
> 0004f700  8b 6d 86 7c 87 37 ee a8  46 4d e5 79 0e 3e 96 e6  |.m.|.7..FM.y.>..|
> 000506f0  1f 6c ca 6b a6 2a 39 a6  1f bd b0 67 5b 22 1a dd  |.l.k.*9....g["..|*
> 00050700  8b 6d 86 7c 87 37 ee a8  46 4d e5 79 0e 3e 96 e6  |.m.|.7..FM.y.>..|*
> $ egrep "0  (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile2.hex
> 0004f6f0  1f 6c ca 6b a6 2a 39 a6  1f bd b0 67 5b 22 1a dd  |.l.k.*9....g["..|
> 0004f700  8b 6d 86 7c 87 37 ee a8  46 4d e5 79 0e 3e 96 e6  |.m.|.7..FM.y.>..|
> 000506f0  ad e6 d5 65 e6 dc c1 a3  e2 ba c9 e2 61 39 5f 5f  |...e........a9__|
> 00050700  bf eb 8e 5c 08 f1 f2 89  3c 57 c5 07 b9 f4 91 fc  |...\....<W......|
>
> $ egrep "0  (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile1.hex
> 00707dd0  d3 b4 96 d6 40 8d 79 67  69 68 fd 10 b4 15 82 e6  |....@...ih......|
> 00707de0  5f f4 10 92 ae 39 9d 92  42 88 44 3b be 35 38 33  |_....9..B.D;.583|
> 00708da0  0d 49 c3 e8 57 06 20 5a  c1 27 74 29 f8 83 af 69  |.I..W. Z.'t)...i|
> 00708db0  94 4d 5b 71 9f 3e e5 d2  91 cc cb cd aa ff 44 8b  |.M[q.>........D.|
> $ egrep "0  (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile2.hex
> 00707dd0  d3 b4 96 d6 40 8d 79 67  69 68 fd 10 b4 15 82 e6  |....@...ih......|
> 00707de0  5f f4 10 92 ae 39 9d 92  42 88 44 3b be 35 38 33  |_....9..B.D;.583|
> 00708da0  d3 b4 96 d6 40 8d 79 67  69 68 fd 10 b4 15 82 e6  |....@...ih......|*
> 00708db0  5f f4 10 92 ae 39 9d 92  42 88 44 3b be 35 38 33  |_....9..B.D;.583|*
>
> I.e. testfile1 is (probably) corrupted at 000506f0..70f while
> testfile2 is (probably) corrupted at 00021d10..2f and 00708da0..bf
> (correpted lines marked with hand-made asterisks above)
>
> If I keep grepping like this, the pattern is similar both within
> these files and within testfile3 and testfile4. I.e. with
> corruptions in 32-byte blocks at (seemingly) random positions
> in the files. The corruption is always 16-byte-aligned and the bad
> data seems to be a copy from exactly one page up in the file.
>
> As stated above, I have bisected the issue to patch
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> which was added between v5.10-rc3 and v5.10-rc4. Every kernel I have
> tried with that patch applied have exhibited the issue, and I have
> had no trouble like this with any kernel without that patch. Apart
> from a whole bunch of kernels prior to v5.10-rc3, that includes some
> later kernels with the patch reverted (along with the dependent
> followup 2d09e6eb4a6f). The latest I have tried is 5.11.22. Those
> two patches does not revert cleanly in 5.12 (and thereafter) so I
> have not tried anything beyond 5.11 with the patch reverted.
>
> I fail to understand how that patch might cause this issue. I have
> compared boot messages before and after the patch and there is no
> (significant) difference. Everything seems to happen in the same
> order with the same result. But that comparison is of course limited
> to what is logged.
>
> In some random attempt I tried to disable the D-Cache bit, and that
> makes it all very slow but it also (seemingly) fixes the issue. But
> that may of course be due to vastly different timings.
>
> Some background:
>
> We have a "Linea" CPU module, with a design based on the Atmel (now
> Microchip) SAMA5D31 evaluation board. This CPU module is used on e.g.
> our TSE-850 for which there is a device tree in
>     arch/arm/boot/dts/at91-tse850-3.dts
> It has a nand flash for the rootfs and 64 MB RAM. The 40 MB random
> testfile is thus big enough to cause page cache churn.
>
> We have used this module in thousands of delivered units (however,
> not that many TSE-850) and have never observed anything like this
> before. But that has been with older kernels. 4.13.<something> and
> 4.15.<something> was what we were on until this recent activity.
>
> We're now developing a new product (preliminary device tree included)
> and the trusty old CPU module was used again and a fresh new kernel
> was built for it. I then started to notice this issue and have tried
> to include as much relevant data as possible. If you need more data
> or would like me to test something, please ask.
>
> I'm stumped.
>
> Cheers,
> Peter

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ