Message-ID: <Yl7mQr05hPg4vELb@rabbit.intern.cm-ag>
Date: Tue, 19 Apr 2022 18:41:38 +0200
From: Max Kellermann <mk@...all.com>
To: David Howells <dhowells@...hat.com>
Cc: Max Kellermann <mk@...all.com>, linux-cachefs@...hat.com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: fscache corruption in Linux 5.17?
On 2022/04/19 18:17, David Howells <dhowells@...hat.com> wrote:
> find /var/cache/fscache -inum $((0xiiii))
>
> and see if you can see the corruption in there. Note that there may be blocks
> of zeroes corresponding to unfetched file blocks.
I checked several known-corrupt files, but unfortunately, all of the
corruption has disappeared :-(
The /var/cache/fscache/ files have a timestamp from about half an hour
ago (17:53 CEST = 15:53 GMT). I don't know what happened at that time -
too bad the corruption disappeared after a week, just when we started
investigating it.
All those new files are all-zero. No data is stored in any of them.
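For anyone who wants to repeat the all-zero check on their own cache files, here is a minimal sketch of how it could be scripted. The helper name is_all_zero and the chunk size are made up for illustration, and the demo runs on a sparse temporary file rather than on a real cache file.

```python
import os
import tempfile

CHUNK = 1 << 20  # read 1 MiB at a time


def is_all_zero(path):
    """Return True if every byte of the file is zero (holes read back as zeroes)."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                return True  # reached EOF without seeing a non-zero byte
            if buf.count(0) != len(buf):
                return False


if __name__ == "__main__":
    # Demo only: a sparse file of the same size as one of the cache files,
    # with no data ever written to it.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.truncate(188416)
    print(is_all_zero(tmp.name))
    os.unlink(tmp.name)
```

Reading the file this way cannot tell a hole from a block of explicitly written zeroes; "filefrag -e" is what shows that difference.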
Note that I had to enable
/sys/kernel/debug/tracing/events/cachefiles/enable; the trace events
you named (read/write/trunc/io_error/vfs_error) do not emit anything.
This is what I see:
kworker/u98:11-1446185 [016] ..... 1813913.318370: cachefiles_ref: c=00014bd5 o=12080f1c u=1 NEW obj
kworker/u98:11-1446185 [016] ..... 1813913.318379: cachefiles_lookup: o=12080f1c dB=3e01ee B=3e5580 e=0
kworker/u98:11-1446185 [016] ..... 1813913.318380: cachefiles_mark_active: o=12080f1c B=3e5580
kworker/u98:11-1446185 [016] ..... 1813913.318401: cachefiles_coherency: o=12080f1c OK B=3e5580 c=0
kworker/u98:11-1446185 [016] ..... 1813913.318402: cachefiles_ref: c=00014bd5 o=12080f1c u=1 SEE lookup_cookie
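To get from a trace line to the backing file, something like the sketch below could be used. It assumes that B= in the cachefiles_lookup line is the inode number of the backing file in hexadecimal, which is how I read the suggested "find -inum $((0x...))" command. The regular expression and the function name backing_inodes are illustrative only.

```python
import re

# Matches the cachefiles_lookup lines shown above. Assumption: B= is the
# backing file's inode number in hex, dB= the inode of its directory.
TRACE_RE = re.compile(
    r"cachefiles_lookup: o=([0-9a-f]+) dB=([0-9a-f]+) B=([0-9a-f]+)"
)


def backing_inodes(lines):
    """Yield (object id, decimal inode number) for each cachefiles_lookup line."""
    for line in lines:
        m = TRACE_RE.search(line)
        if m:
            yield m.group(1), int(m.group(3), 16)


sample = (
    "kworker/u98:11-1446185 [016] ..... 1813913.318379: "
    "cachefiles_lookup: o=12080f1c dB=3e01ee B=3e5580 e=0"
)

for obj, ino in backing_inodes([sample]):
    # Print the find command to run by hand; nothing is executed here.
    print(f"find /var/cache/fscache -inum {ino}  # o={obj}")
```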
> Also, what filesystem is backing your cachefiles cache? It could be useful to
> dump the extent list of the file. You should be able to do this with
> "filefrag -e".
It's ext4.
Filesystem type is: ef53
File size of /var/cache/fscache/cache/Infs,3.0,2,,a4214ac,c0000208,,,3002c0,10000,10000,12c,1770,bb8,1770,1/@...T,c0000208,,1cf4167,184558d9,c0000208,,40,36bab37,40, is 188416 (46 blocks of 4096 bytes)
/var/cache/fscache/cache/Infs,3.0,2,,a4214ac,c0000208,,,3002c0,10000,10000,12c,1770,bb8,1770,1/@...T,c0000208,,1cf4167,184558d9,c0000208,,40,36bab37,40,: 0 extents found
File size of /var/cache/fscache/cache/Infs,3.0,2,,a4214ac,c0000208,,,3002c0,10000,10000,12c,1770,bb8,1770,1/@...T,c0000208,,10cc976,1208c7f6,c0000208,,40,36bab37,40, is 114688 (28 blocks of 4096 bytes)
/var/cache/fscache/cache/Infs,3.0,2,,a4214ac,c0000208,,,3002c0,10000,10000,12c,1770,bb8,1770,1/@...T,c0000208,,10cc976,1208c7f6,c0000208,,40,36bab37,40,: 0 extents found
> As to why this happens, a write that's misaligned by 31 bytes should cause DIO
> to a disk to fail - so it shouldn't be possible to write that. However, I'm
> doing fallocate and truncate on the file to shape it so that DIO will work on
> it, so it's possible that there's a bug there. The cachefiles_trunc trace
> lines may help catch that.
I don't think any write is misaligned. This was triggered by a
WordPress update, so I think the WordPress updater truncated and
rewrote all files. Random guess: some pages got transferred to the
NFS server, but the local copy in fscache did not get updated.
Max