Date:   Tue, 12 Apr 2022 17:10:38 +0200
From:   Max Kellermann <mk@...all.com>
To:     dhowells@...hat.com
Cc:     linux-cachefs@...hat.com, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: fscache corruption in Linux 5.17?

Hi David,

two weeks ago, I updated a cluster of web servers to Linux kernel
5.17.1 (5.16.x previously) which includes your rewrite of the fscache
code.

In the last few days, there were numerous complaints about broken
WordPress installations after WordPress was updated.  There were
PHP syntax errors everywhere.

Indeed there were broken PHP files, but the interesting part is: those
corruptions were only on one of the web servers; the others served the
same files with correct contents.

File size, time stamp and everything else in "stat" is identical; only
the file contents are corrupted, and they look like a mix of old and new
contents.  The corruptions always started at multiples of 4096 bytes.
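
A minimal sketch of how a good and a broken copy can be compared
page by page, to confirm that corruption is aligned to 4096-byte
boundaries.  The function name and the 4096-byte page size are
assumptions for illustration; this is not something fscache provides.

```python
PAGE_SIZE = 4096  # assumed page size, matching the 4096-byte multiples observed


def mismatching_pages(good: bytes, bad: bytes, page_size: int = PAGE_SIZE):
    """Return the indices of all pages whose contents differ.

    A page index p covers the byte range [p * page_size, (p + 1) * page_size).
    """
    total = max(len(good), len(bad))
    page_count = (total + page_size - 1) // page_size  # round up
    return [
        p
        for p in range(page_count)
        if good[p * page_size:(p + 1) * page_size]
        != bad[p * page_size:(p + 1) * page_size]
    ]
```

Reading both copies with open(path, "rb").read() and passing them in
yields the list of affected pages; multiplying an index by 4096 gives
the byte offset where that page starts.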

An example diff:

 --- ok/wp-includes/media.php    2022-04-06 05:51:50.000000000 +0200
 +++ broken/wp-includes/media.php    2022-04-06 05:51:50.000000000 +0200
 @@ -5348,7 +5348,7 @@
                 /**
                  * Filters the threshold for how many of the first content media elements to not lazy-load.
                  *
 -                * For these first content media elements, the `loading` attribute will be omitted. By default, this is the case
 +                * For these first content media elements, the `loading` efault, this is the case
                  * for only the very first content media element.
                  *
                  * @since 5.9.0
 @@ -5377,3 +5377,4 @@
  
         return $content_media_count;
  }
 +^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

The corruption can be explained by WordPress commit
https://github.com/WordPress/WordPress/commit/07855db0ee8d5cff2 which
makes the file 31 bytes longer (185055 -> 185086).  The "broken" web
server sees the new contents until offset 184320 (= 45 * 4096), but
sees the old contents from there on; followed by 31 null bytes
(because the kernel reads past the end of the cache?).
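
To double-check the arithmetic above, here is a small sketch that
verifies the numbers and models the suspected failure: new contents up
to a page boundary, stale old contents after it, and zero padding up to
the new file size.  The helper simulate_stale_tail is hypothetical and
only illustrates the symptom; it says nothing about what the kernel
actually does internally.

```python
PAGE_SIZE = 4096
OLD_SIZE = 185055    # file size before the WordPress commit
NEW_SIZE = 185086    # file size after the WordPress commit
STALE_FROM = 184320  # offset where the broken server shows old contents

# The numbers quoted in the report are consistent with each other.
assert NEW_SIZE - OLD_SIZE == 31
assert STALE_FROM == 45 * PAGE_SIZE


def simulate_stale_tail(old: bytes, new: bytes, stale_from: int) -> bytes:
    """Model the symptom: new data up to stale_from, old data after it,
    padded with null bytes (or truncated) to the length of the new file."""
    mixed = new[:stale_from] + old[stale_from:]
    return mixed.ljust(len(new), b"\0")[:len(new)]
```

With an old file of 185055 bytes and a new file of 185086 bytes, the
result has the new size, differs from the new file only in the last
page, and ends in exactly 31 null bytes, which matches the diff.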

All web servers mount a storage via NFSv3 with fscache.

My suspicion is that this is caused by a fscache regression in Linux
5.17.  What do you think?

What can I do to debug this further?  Is there any information you
need?  I don't know much about how fscache works internally or how to
obtain debugging information from it.

Max
