Message-ID: <20120713171835.GA26052@vault.local>
Date:	Fri, 13 Jul 2012 19:18:35 +0200
From:	Johannes Truschnigg <johannes@...schnigg.info>
To:	linux-kernel@...r.kernel.org
Subject: PROBLEM: Silent data corruption when using sendfile()

Hello good people of linux-kernel.

I've been bothered by silent data corruption from my personal fileserver - no
matter the Layer 7 protocol used, huge transfers sporadically ended up damaged
in-flight. I used Samba/CIFS, NFS(v4, via TCP), Apache httpd 2.2, thttpd,
python and netcat to verify this.

I think I managed to track down the culprit: as soon as I disable sendfile()
for all programs that support such a configuration (netcat, afaik, won't ever
use sendfile() to transmit data over a socket, so the problem was never
reproducible there in the first place), everything reverts to perfect and
proper working condition.

I've been experiencing this problem with vanilla kernel releases from the 3.3
series up to and including 3.4.0. I do not know if it also occurs with earlier
releases, but I can check if that is useful. I set up the environment for a minimal
kind of testcase (a large ISO image file available from the server's local
filesystem, as well as from a mounted NFS export - once via lo, and once via
br0/eth0), and proceeded to do the following:

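for i in {1..100}
do
  echo "pass $i:"
  # flush dirty pages, then drop page cache, dentries and inodes (needs root)
  sync; echo 3 > /proc/sys/vm/drop_caches
  # compare the NFS copy (mounted via lo) with the local file, byte by byte
  cmp -b /mnt/nfs-test/lo/tmp/X15-65741.iso /srv/files/pub/tmp/X15-65741.iso
done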

I then rotated the source of the data, and tested the network-mount against
the loopback-mount, as well as the network-mount against the local filesystem.
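
The rotation described above could be scripted along the following lines. This
is only a sketch, not the script that produced the attached output: the
function name cmp_all_pairs is hypothetical, and it uses "cmp -s" to get a
plain same/different verdict per pair instead of the byte-level output of
"cmp -b".

```shell
#!/bin/bash
# Hypothetical helper (not part of the original test run): compare every
# pair of the given paths byte by byte. Prints one line per pair and
# returns non-zero if any pair differs.
cmp_all_pairs() {
  local paths=("$@") rc=0 i j
  for ((i = 0; i < ${#paths[@]}; i++)); do
    for ((j = i + 1; j < ${#paths[@]}; j++)); do
      if cmp -s -- "${paths[i]}" "${paths[j]}"; then
        echo "same: ${paths[i]} ${paths[j]}"
      else
        echo "DIFF: ${paths[i]} ${paths[j]}"
        rc=1
      fi
    done
  done
  return "$rc"
}
```

Called with the local file and the two NFS-mounted copies as arguments, this
covers all three combinations in one pass; dropping caches between passes would
still have to be done around the call, as in the loop above.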

Computing the file's md5sum in a loop, reading it directly from its location in
the filesystem and dropping caches after each iteration, produces the very same
hash every time - I therefore think it's safe to assume the corruption is
introduced when traversing the networking stack. The hash also does not change
if I repeatedly compute the md5sum of the file as transferred by, e.g., Apache
httpd or smbd with sendfile explicitly disabled.
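
A minimal sketch of such a hash loop, assuming bash and GNU coreutils. The
function name count_distinct_md5 and its default pass count are hypothetical;
the exact commands behind the attached md5 test output are not reproduced here.
Dropping caches needs root, so the sketch skips that step when the control file
is not writable.

```shell
#!/bin/bash
# Hypothetical helper: hash FILE a number of times and print how many
# distinct md5 sums were seen (1 means every pass produced the same hash).
count_distinct_md5() {
  local file=$1 passes=${2:-10} i
  for ((i = 0; i < passes; i++)); do
    # Dropping the page cache needs root; skip the step when not permitted.
    if [ -w /proc/sys/vm/drop_caches ]; then
      sync
      { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    fi
    # Read via stdin so the output line does not contain the file name.
    md5sum < "$file"
  done | sort -u | wc -l
}
```

A result greater than 1 for a file that is not being modified would indicate
that at least one pass read different data.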

Please take a look at the attachment to see the actual output of the above
script. It does not matter if I do an actual transfer over the network from my
server to one of its clients (I verified the problem with two different client
machines, one even running Windows), or if the server is both source and
destination of the transfer - as long as sendfile is involved, some of the data
will always become garbled sooner or later. That also leads me to believe that
my internetworking devices (my switch in particular) are working just fine;
bulk transfers from one host to another confirm this, in that all data makes
it through unscathed.

As soon as I switch off sendfile-support (in, e. g. Samba or Apache httpd), I
can run a series of thousands and more transfers, and not experience any
corruption at all. Whenever the data gets fubared, there is no hint at
anything fishy going on in the debug ringbuffer - corruption takes place in
total silence.

The system in question has an Intel Pro/1000 PCI-e NIC for doing the networked
file transfers, and is backed by an md RAID5 array with LVM2 on top. The 4GB of
system memory (ECC-enabled UDIMM) are operating in S4ECD4ED mode as reported
by EDAC, and there are no reported errors. The CPU I have installed is an AMD
Athlon II X2 245e on an ASUS M4A88TD-M/USB3 Motherboard. It's running Gentoo
for amd64. The box can run Prime95 in torture mode and linpack just fine for
days - I'm therefore assuming the hardware to be working correctly.

I have attached my kernel's config (from 3.4.0, as that's the image that I
have running right now) for the sake of completeness, as well as some
information for you to see how I tested, and what these tests actually
produced. If you need any other information to help track this down, please
let me know.

If you decide to answer please keep me CC'd, as I'm not subscribed to this
list.

Just in case the numerous attachments get scrubbed/removed, I've also uploaded
them to http://johannes.truschnigg.info/tmp/sendfile_data_corruption/

Thanks for reading, and have a nice weekend everyone :)

-- 
with best regards:
- Johannes Truschnigg ( johannes@...schnigg.info )

www:   http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@...schnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

View attachment "cmdline" of type "text/plain" (42 bytes)

View attachment "cmptest_nfs4_fs_br0.out" of type "text/plain" (2616 bytes)

View attachment "cmptest_nfs4_fs_lo.out" of type "text/plain" (6380 bytes)

View attachment "cmptest_nfs4_lo_br0.out" of type "text/plain" (1147 bytes)

View attachment "config" of type "text/plain" (69810 bytes)

View attachment "cpuinfo" of type "text/plain" (1804 bytes)

View attachment "exports" of type "text/plain" (232 bytes)

View attachment "gcc" of type "text/plain" (1319 bytes)

View attachment "ifconfig" of type "text/plain" (1722 bytes)

View attachment "lspci" of type "text/plain" (31503 bytes)

View attachment "md5tests_apachehttpd.out" of type "text/plain" (3600 bytes)

View attachment "md5tests_apachehttpd_sendfileoff.out" of type "text/plain" (3600 bytes)

View attachment "mdstat" of type "text/plain" (271 bytes)

View attachment "mount" of type "text/plain" (289 bytes)

View attachment "uname" of type "text/plain" (123 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (199 bytes)
