Message-ID: <20180616222250.618cecaa@sf>
Date: Sat, 16 Jun 2018 22:22:50 +0100
From: Sergei Trofimovich <slyich@...il.com>
To: libc-alpha@...rceware.org, linux-kernel@...r.kernel.org,
x86@...nel.org
Cc: "H.J. Lu" <hjl.tools@...il.com>
Subject: x86_64: movntdq rarely stores bad data (movdqu works fine). Kernel
bug, fried CPU or glibc bug?
TL;DR: on glibc master, the string/test-memmove test fails on my machine
and I don't know why. Other tests work fine.
$ elf/ld.so --inhibit-cache --library-path . string/test-memmove
simple_memmove __memmove_ssse3_rep __memmove_ssse3 __memmove_sse2_unaligned __memmove_ia32
string/test-memmove: Wrong result in function __memmove_sse2_unaligned dst "0x70000084" src "0x70000000" offset "43297733"
https://sourceware.org/git/?p=glibc.git;a=blob;f=string/test-memmove.c;h=64e3651ba40604e47ddf6d633f4d0aea4644f60a;hb=HEAD
Long story:
I've trimmed the __memmove_sse2_unaligned implementation down to
test-memmove-xmm-unaligned.c (attached). It is supposed to report
failed memmove attempts when they happen:
$ gcc -ggdb3 -O2 -m32 test-memmove-xmm-unaligned.c -o test-memmove-xmm-unaligned -Wall && ./test-memmove-xmm-unaligned
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset= 3786689; expected=0039C7C1( 3786689) actual=0039C7C3( 3786691) bit_mismatch=00000002; iteration=1
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset= 3786689; expected=0039C7C1( 3786689) actual=0039C7C3( 3786691) bit_mismatch=00000002; iteration=3
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset= 5448641; expected=005323C1( 5448641) actual=005323C3( 5448643) bit_mismatch=00000002; iteration=5
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset=29022145; expected=01BAD7C1(29022145) actual=01BAD7C3(29022147) bit_mismatch=00000002; iteration=9
$ gcc -ggdb3 -O2 -m64 test-memmove-xmm-unaligned.c -o test-memmove-xmm-unaligned -Wall && ./test-memmove-xmm-unaligned
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=25257857; expected=01816781(25257857) actual=01816783(25257859) bit_mismatch=00000002; iteration=43
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=28109697; expected=01ACEB81(28109697) actual=01ACEB83(28109699) bit_mismatch=00000002; iteration=112
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=18257633; expected=011696E1(18257633) actual=011696E3(18257635) bit_mismatch=00000002; iteration=363
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=26981249; expected=019BB381(26981249) actual=019BB383(26981251) bit_mismatch=00000002; iteration=437
Note it is a single-bit corruption happening occasionally (not on every iteration).
-m32 is way more error prone than -m64.
Test example roughly implements these 2 loops:
This fails:
sfence
loop {
movdqu [src++],%xmm0
movntdq %xmm0,[dst++]
}
sfence
This works:
sfence
loop {
movdqu [src++],%xmm0
movdqu %xmm0,[dst++]
}
sfence
Failures happen only on a Sandy Bridge CPU:
Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz
kernel is 4.17.0-11928-g2837461dbe6f.
The problem is not reproducible immediately after a reboot. The machine
has to be heavily loaded before it starts corrupting memory. A few hours
of memtest86+ do not reveal any memory failures.
I wonder if anyone else can reproduce this failure or should I start
looking for a new CPU.
From the above it looks as if movntdq does not play well with XMM
context save/restore and there is an 'mfence' missing somewhere in
interrupt handling.
If there are no obvious problems with glibc's memmove() or my small test,
what can I do to rule out or pin down a hardware or kernel problem?
Thanks!
--
Sergei
View attachment "test-memmove-xmm-unaligned.c" of type "text/x-c++src" (4265 bytes)