Message-ID: <252d91e2-282e-4af4-b99b-3b8147d98bc3@gmail.com>
Date: Sat, 10 Aug 2024 10:29:38 +0200
From: Anders Blomdell <anders.blomdell@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: linux-xfs@...r.kernel.org,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 Chandan Babu R <chandan.babu@...cle.com>, "Darrick J. Wong"
 <djwong@...nel.org>, Christoph Hellwig <hch@....de>
Subject: Re: XFS mount timeout in linux-6.9.11



On 2024-08-10 00:55, Dave Chinner wrote:
> On Fri, Aug 09, 2024 at 07:08:41PM +0200, Anders Blomdell wrote:
>> With a filesystem that contains a very large amount of hardlinks
>> the time to mount the filesystem skyrockets to around 15 minutes
>> on 6.9.11-200.fc40.x86_64 as compared to around 1 second on
>> 6.8.10-300.fc40.x86_64,
> 
> That sounds like the filesystem is not being cleanly unmounted on
> 6.9.11-200.fc40.x86_64 and so is having to run log recovery on the
> next mount and so is recovering lots of hardlink operations that
> weren't written back at unmount.
> 
> Hence this smells like an unmount or OS shutdown process issue, not
> a mount issue. e.g. if something in the shutdown scripts hangs,
> systemd may time out the shutdown and power off/reboot the machine
> without completing the full shutdown process. The result of this is
> the filesystem has to perform recovery on the next mount and so you
> see a long mount time because of some other unrelated issue.
> 
> What is the dmesg output for the mount operations? That will tell us
> if journal recovery is the difference for certain.  Have you also
> checked to see what is happening in the shutdown/unmount process
> before the long mount times occur?
This is what I ran (the echo lines stamp the kernel log with the running
kernel version and wall-clock time, so they bracket the XFS messages below):

echo $(uname -r) $(date +%H:%M:%S) > /dev/kmsg   # mark: before first mount
mount /dev/vg1/test /test
echo $(uname -r) $(date +%H:%M:%S) > /dev/kmsg   # mark: mount returned
umount /test
echo $(uname -r) $(date +%H:%M:%S) > /dev/kmsg   # mark: unmount returned
mount /dev/vg1/test /test
echo $(uname -r) $(date +%H:%M:%S) > /dev/kmsg   # mark: second mount returned

[55581.470484] 6.8.0-rc4-00129-g14dd46cf31f4 09:17:20
[55581.492733] XFS (dm-7): Mounting V5 Filesystem e2159bbc-18fb-4d4b-a6c5-14c97b8e5380
[56048.292804] XFS (dm-7): Ending clean mount
[56516.433008] 6.8.0-rc4-00129-g14dd46cf31f4 09:32:55
[56516.434695] XFS (dm-7): Unmounting Filesystem e2159bbc-18fb-4d4b-a6c5-14c97b8e5380
[56516.925145] 6.8.0-rc4-00129-g14dd46cf31f4 09:32:56
[56517.039873] XFS (dm-7): Mounting V5 Filesystem e2159bbc-18fb-4d4b-a6c5-14c97b8e5380
[56986.017144] XFS (dm-7): Ending clean mount
[57454.876371] 6.8.0-rc4-00129-g14dd46cf31f4 09:48:34

And rebooting to the kernel before the offending commit:

[   60.177951] 6.8.0-rc4-00128-g8541a7d9da2d 10:23:00
[   61.009283] SGI XFS with ACLs, security attributes, realtime, scrub, quota, no debug enabled
[   61.017422] XFS (dm-7): Mounting V5 Filesystem e2159bbc-18fb-4d4b-a6c5-14c97b8e5380
[   61.351100] XFS (dm-7): Ending clean mount
[   61.366359] 6.8.0-rc4-00128-g8541a7d9da2d 10:23:01
[   61.367673] XFS (dm-7): Unmounting Filesystem e2159bbc-18fb-4d4b-a6c5-14c97b8e5380
[   61.444552] 6.8.0-rc4-00128-g8541a7d9da2d 10:23:01
[   61.459358] XFS (dm-7): Mounting V5 Filesystem e2159bbc-18fb-4d4b-a6c5-14c97b8e5380
[   61.513938] XFS (dm-7): Ending clean mount
[   61.524056] 6.8.0-rc4-00128-g8541a7d9da2d 10:23:01
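
Both kernels report "Ending clean mount", so no journal recovery is involved.
Going by the markers above, each mount on the 14dd46cf31f4 kernel takes
roughly 56516 - 55581 ≈ 935 seconds (about 15.5 minutes) before mount
returns, which matches the ~15 minutes reported; on the kernel before that
commit the whole mount/unmount/mount sequence finishes in about a second.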


> 
>> this of course makes booting drop
>> into emergency mode if the filesystem is in /etc/fstab. A git bisect
>> nails the offending commit as 14dd46cf31f4aaffcf26b00de9af39d01ec8d547.
> 
> Commit 14dd46cf31f4 ("xfs: split xfs_inobt_init_cursor") doesn't
> seem like a candidate for any sort of change of behaviour. It's just
> a refactoring patch that doesn't change any behaviour at all. 
> Are you sure the reproducer you used for the bisect is reliable?
Yes.
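For what it's worth, the pass/fail test for the bisect was simply the mount
time; roughly (a sketch, not the literal commands):

git bisect start
git bisect bad  <slow kernel>    # e.g. 6.9.y: mount of /dev/vg1/test takes ~15 minutes
git bisect good <fast kernel>    # e.g. 6.8.y: mount takes about a second
# for each kernel the bisect offers: build it, boot it, time
# "mount /dev/vg1/test /test", and mark the commit good or bad accordingly.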

>> The filesystem is a collection of daily snapshots of a live filesystem
>> collected over a number of years, organized as a storage of unique files,
>> that are reflinked to inodes that contain the actual {owner,group,permission,
>> mtime}, and these inodes are hardlinked into the daily snapshot trees.
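To make that concrete, each daily snapshot entry is built roughly like the
following sketch (paths and names made up for illustration):

cp --reflink=always store/ab12cd meta/1234567      # share the unique file's data
chown "$owner:$group" meta/1234567                 # real owner/group on the meta inode
chmod "$mode" meta/1234567                         # real permissions on the meta inode
touch -d "$mtime" meta/1234567                     # real mtime on the meta inode
ln meta/1234567 snapshots/2024-08-09/some/path     # hardlink into the day's tree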
> 
> So it's reflinks and hardlinks. Recovering a reflink takes a lot
> more CPU time and journal traffic than recovering a hardlink, so
> that will also be a contributing factor.
> 
>> The numbers for the filesystem are:
>>
>>    Total file size:           3.6e+12 bytes
> 
> 3.6TB, not a large data set by any measurement.
> 
>>    Unique files:             12.4e+06
> 
> 12M files, not a lot.
> 
>>    Reflink inodes:           18.6e+06
> 
> 18M inodes with shared extents, not a huge number, either.
> 
>>    Hardlinks:                15.7e+09
> 
> Ok, 15.7 billion hardlinks is a *lot*.
:-)
> 
> And by a lot, I mean that's the largest number of hardlinks in an
> XFS filesystem I've personally ever heard about in 20 years.
Glad to be of service.

> 
> As a warning: hope like hell you never have a disaster with that
> storage and need to run xfs_repair on that filesystem. If you don't
> have many, many TBs of RAM, just checking the hardlinks resolve
> correctly could take billions of IOs...
I hope so as well :-). It is not a critical system (it is used for testing
and statistics), but rebuilding it would take about a month :-/.

> 
> -Dave.
