linux-kernel - Re: XFS mount timeout in linux-6.9.11


Message-ID: <6766edb4-2f56-4b52-9e6d-343ae00d6957@gmail.com>
Date: Tue, 13 Aug 2024 14:01:57 +0200
From: Anders Blomdell <anders.blomdell@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: linux-xfs@...r.kernel.org,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 Chandan Babu R <chandan.babu@...cle.com>, "Darrick J. Wong"
 <djwong@...nel.org>, Christoph Hellwig <hch@....de>
Subject: Re: XFS mount timeout in linux-6.9.11



On 2024-08-13 11:19, Dave Chinner wrote:
> On Mon, Aug 12, 2024 at 03:03:49PM +0200, Anders Blomdell wrote:
>> On 2024-08-12 02:04, Dave Chinner wrote:
>>>
>>> Ok, can you run the same series of commands but this time in another
>>> shell run this command and leave it running for the entire
>>> mount/unmount/mount/unmount sequence:
>>>
>>> # trace-cmd record -e xfs\* -e printk
> 
> [snip location of trace]
> 
>>> That will tell me what XFS is doing different at mount time on the
>>> different kernels.
>> Looks like a timing issue, a trylock fails and brings about a READ_AHEAD burst.
> 
> Not timing - it is definitely a bug in the commit the bisect pointed
> at.
> 
> However, it's almost impossible to actually see until someone or
> something (the trace) points it out directly.
> 
> The trace confirmed what I suspected - the READ_AHEAD stuff you see
> is an inode btree being walked. I knew that we walk the free inode
> btrees during mount unless you have a specific feature bit set, but
> I didn't think your filesystem is new enough to have that feature
> set according to the xfs_info output.
> 
> However, I couldn't work out why the free inode btrees would take
> that long to walk as the finobt generally tends towards empty on any
> filesystem that is frequently allocating inodes. The mount time on
> the old kernel indicates they are pretty much empty, because the
> mount time is under a second and it's walked all 8 finobts *twice*
> during mount.
> 
> What the trace pointed out was that the finobt walk to calculate
> AG reserve space wasn't actually walking the finobt - it was walking
> the inobt. That indexes all allocated inodes, so mount was walking
> the btrees that index the ~30 million allocated inodes in the
> filesystem. That takes a lot of IO, and that's the 450s pause
> to calculate reserves before we run log recovery, and then the
> second 450s pause occurs after log recovery because we have to
> recalculate the reserves once all the intents and unlinked inodes
> have been replayed.
> 
> From that observation, it was just a matter of tracking down the
> code that is triggering the walk and working out why it was running
> down the wrong inobt....
> 
> In hindsight, this was a wholly avoidable bug - a single patch made
> two different API modifications that only differed by a single
> letter, and one of the 23 conversions missed a single letter. If
> that was two patches - one for the finobt conversion, the second for
> the inobt conversion, the bug would have been plainly obvious during
> review....
> 
> Anders, can you try the patch below? It should fix your issue.
Works like a charm! Thanks for the help!

I take it that this patch goes into linux-stable (and linux-next) quite soon!

/Anders
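
The failure mode described in the quoted explanation (two cursor constructors whose names differ by a single letter, with one call site picking the wrong one) can be illustrated with a toy model. This is only a sketch in plain C, not the kernel code or the actual patch: every type and function name below is hypothetical, and the "walk" is reduced to returning an entry count so the cost difference between the two trees is visible.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two per-AG inode indexes. */
enum btree_kind { BTREE_INOBT, BTREE_FINOBT };

/* Toy allocation-group state: how many entries each index holds. */
struct ag_state {
    size_t inobt_entries;   /* index of all allocated inodes (large)  */
    size_t finobt_entries;  /* index of free inodes (usually tiny)    */
};

struct cursor {
    const struct ag_state *ag;
    enum btree_kind kind;
};

/* Two constructors whose names differ by one letter (hypothetical). */
static struct cursor inobt_init_cursor(const struct ag_state *ag)
{
    struct cursor cur = { ag, BTREE_INOBT };
    return cur;
}

static struct cursor finobt_init_cursor(const struct ag_state *ag)
{
    struct cursor cur = { ag, BTREE_FINOBT };
    return cur;
}

/* Models the cost of a full walk: the number of entries visited. */
static size_t walk_entries(const struct cursor *cur)
{
    return cur->kind == BTREE_INOBT ? cur->ag->inobt_entries
                                    : cur->ag->finobt_entries;
}

/* Buggy conversion: meant to walk the finobt, builds an inobt cursor. */
static size_t finobt_count_buggy(const struct ag_state *ag)
{
    struct cursor cur = inobt_init_cursor(ag);  /* one letter missing */
    return walk_entries(&cur);
}

/* Intended behaviour: walk the (nearly empty) free inode index. */
static size_t finobt_count_fixed(const struct ag_state *ag)
{
    struct cursor cur = finobt_init_cursor(ag);
    return walk_entries(&cur);
}
```

With illustrative numbers such as 30000000 entries in the allocated-inode index and a dozen in the free-inode index, the buggy helper visits the former and the fixed helper the latter. Both versions compile cleanly and return a plausible count, which is why this kind of mistake is easy to miss in review and only shows up as unexpected I/O at mount time.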