Message-ID: <Z_RZFrlPArdj9d-5@dread.disaster.area>
Date: Tue, 8 Apr 2025 09:00:38 +1000
From: Dave Chinner <david@...morbit.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Matt Fleming <matt@...dmodwrite.com>, adilger.kernel@...ger.ca,
akpm@...ux-foundation.org, linux-ext4@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, luka.2016.cs@...il.com, tytso@....edu,
Barry Song <baohua@...nel.org>, kernel-team@...udflare.com,
Vlastimil Babka <vbabka@...e.cz>,
Miklos Szeredi <miklos@...redi.hu>,
Amir Goldstein <amir73il@...il.com>,
Qi Zheng <zhengqi.arch@...edance.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>
Subject: Re: Potential Linux Crash: WARNING in ext4_dirty_folio in Linux
kernel v6.13-rc5
On Thu, Apr 03, 2025 at 06:12:26PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 03, 2025 at 01:29:44PM +0100, Matt Fleming wrote:
> > On Wed, Mar 26, 2025 at 10:59 AM Matt Fleming <matt@...dmodwrite.com> wrote:
> > >
> > > Hi there,
> > >
> > > I'm also seeing this PF_MEMALLOC WARN triggered from kswapd in 6.12.19.
> > >
> > > Does overlayfs need some kind of background inode reclaim support?
> >
> > Hey everyone, I know there was some off-list discussion last week at
> LSFMM, but I don't think a definitive solution has been proposed for the
> > below stacktrace.
>
> Hi Matt,
>
> We did have a substantial discussion at LSFMM and we just had another
> discussion on the ext4 call. I'm going to try to summarise those
> discussions here, and people can jump in to correct me (I'm not really
> an expert on this part of MM-FS interaction).
>
> At LSFMM, we came up with a solution that doesn't work, so let's start
> with ideas that don't work:
>
> - Allow PF_MEMALLOC to dip into the atomic reserves. With large block
> devices, we might end up doing emergency high-order allocations, and
> that makes everybody nervous.
> - Only allow inode reclaim from kswapd and not from direct reclaim.
That's what GFP_NOFS does. We already rely on kswapd to do inode
reclaim rather than direct reclaim when filesystem cache pressure
is driving memory reclaim...
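As a minimal illustration (a sketch, not code from any particular
caller): GFP_NOFS, or its scoped equivalent
memalloc_nofs_save()/memalloc_nofs_restore(), clears __GFP_FS so
that direct reclaim triggered by the allocation cannot recurse into
filesystem code, leaving filesystem cache reclaim to kswapd:

	unsigned int nofs_flags;
	struct page *page;

	/*
	 * Scoped NOFS: every allocation in this section is treated
	 * as GFP_NOFS, so direct reclaim will not re-enter fs code.
	 */
	nofs_flags = memalloc_nofs_save();
	page = alloc_page(GFP_KERNEL);
	memalloc_nofs_restore(nofs_flags);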
> Your stack trace here is from kswapd, so obviously that doesn't work.
> - Allow ->evict_inode to return an error. At this point the inode has
> been taken off the lists, which means that somebody else may have
> started constructing it again, and we can't just put it back
> on the lists.
No. When ->evict_inode is called, the inode hasn't been taken off
the inode hash list. Hence the inode can still be found
via cache lookups whilst evict_inode() is running. However, the
inode will have I_FREEING set, so lookups will call
wait_on_freeing_inode() before retrying the lookup. They will
get woken by the inode_wake_up_bit() call in evict() that happens
after ->evict_inode returns, so I_FREEING is what provides
->evict_inode serialisation against new lookups trying to recreate
the inode whilst it is being torn down.
IOWs, nothing should be reconstructing the inode whilst evict() is
tearing it down because it can still be found in the inode hash.
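In condensed form, the lookup side of that serialisation looks
roughly like this (a paraphrase of the fs/inode.c hash walk, not
the actual code; wait_on_freeing_inode() stands in for the real
static helper there):

	static struct inode *find_inode_sketch(struct super_block *sb,
					       struct hlist_head *head,
					       unsigned long ino)
	{
		struct inode *inode;
	repeat:
		hlist_for_each_entry(inode, head, i_hash) {
			if (inode->i_ino != ino || inode->i_sb != sb)
				continue;
			spin_lock(&inode->i_lock);
			if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
				/*
				 * Sleep until evict() wakes the waiters
				 * after ->evict_inode has returned, then
				 * retry the walk from scratch.
				 */
				wait_on_freeing_inode(inode);
				goto repeat;
			}
			__iget(inode);
			spin_unlock(&inode->i_lock);
			return inode;
		}
		return NULL;	/* unhashed: safe to build a new inode */
	}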
> Jan explained that _usually_ the reclaim path is not the last
> holder of a reference to the inode. What's happening here is that
> we've lost a race where the dentry is being turned negative by
> somebody else at the same time, and usually they'd have the last
> reference and call evict. But if the shrinker has the last
> reference, it has to do the eviction.
>
> Jan does not think that Overlayfs is a factor here. It may change
> the timing somewhat but should not make the race wider (nor
> narrower).
>
> Ideas still on the table:
>
> - Convert all filesystems to use the XFS inode management scheme.
> Nobody is thrilled by this large amount of work.
There is no need to do that.
> - Find a simpler version of the XFS scheme to implement for other
> filesystems.
If we push the last half of evict_inode() out to the background
thread (i.e. go async before remove_inode_hash() is called), then
new lookups will still serialise on the inode hash due to I_FREEING
being set. i.e. Problems only arise if the inode is removed from
lookup visibility whilst it still has cleanup work pending.
e.g. have the filesystem provide a ->evict_inode_async() method
that either completes inode eviction directly or punts it to a
workqueue where it does the work and then completes inode eviction.
As long as all this work is done whilst the inode is marked
I_FREEING and is present in the inode hash, then new lookups will
serialise on the eviction work regardless of how it is scheduled.
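To make that concrete, here is a rough sketch of what such a method
could look like. ->evict_inode_async(), evict_inode_done() and all
the foo_* names are hypothetical, invented purely for illustration:

	/* hypothetical per-fs workqueue, allocated at mount time */
	static struct workqueue_struct *foo_evict_wq;

	struct foo_evict_work {
		struct work_struct work;
		struct inode *inode;
	};

	static void foo_evict_worker(struct work_struct *work)
	{
		struct foo_evict_work *ew =
			container_of(work, struct foo_evict_work, work);
		struct inode *inode = ew->inode;

		truncate_inode_pages_final(&inode->i_data);
		/* ... blocking transactional metadata teardown ... */
		clear_inode(inode);
		/*
		 * Hypothetical completion hook: unhashes the inode and
		 * wakes the I_FREEING waiters, as evict() does today.
		 */
		evict_inode_done(inode);
		kfree(ew);
	}

	static void foo_evict_inode_async(struct inode *inode)
	{
		struct foo_evict_work *ew = kmalloc(sizeof(*ew), GFP_NOFS);

		if (!ew) {
			/* fall back to synchronous eviction */
			foo_evict_inode(inode);
			evict_inode_done(inode);
			return;
		}
		INIT_WORK(&ew->work, foo_evict_worker);
		ew->inode = inode;
		queue_work(foo_evict_wq, &ew->work);
	}

Either way the inode stays hashed with I_FREEING set until
evict_inode_done() runs, so concurrent lookups keep blocking in
wait_on_freeing_inode() exactly as they do now.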
It is likely we could simplify the XFS code by converting it over to
a mechanism like this, rather than playing the long-standing "defer
everything to background threads from ->destroy_inode()" game that
we currently do.
-Dave.
--
Dave Chinner
david@...morbit.com