Message-ID: <20250927030113.GD118657@mit.edu>
Date: Fri, 26 Sep 2025 23:01:13 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Deepanshu Kartikey <kartikey406@...il.com>
Cc: adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] ext4: fix allocation failure in ext4_mb_load_buddy_gfp
On Sat, Sep 27, 2025 at 05:48:15AM +0530, Deepanshu Kartikey wrote:
>
> Your -EAGAIN suggestion makes sense. The approach would be:
> 1. During memory reclaim, use GFP_NOFS without __GFP_NOFAIL
> 2. If allocation fails, return -EAGAIN to let reclaim skip this inode
> 3. Preallocation cleanup happens later when memory is available
>
> I understand this requires modifying the function signature and updating all call
> sites. I'm willing to do this work and properly test each caller's error handling.
>
> Questions on implementation:
> - Should callers like ext4_clear_inode() ignore -EAGAIN (leave cleanup for later)?
It's not so simple.  The call path that involves ext4_clear_inode() is
part of evict(), which is called when the inode is evicted, and this is
called from prune_icache_sb():
https://syzkaller.appspot.com/bug?extid=fd3f70a4509fca8c265d
The problem is that none of these code paths allow the inode eviction
to be delayed.  And once the inode is evicted, it's *gone* from memory,
so there's no place to store the information we would need in order to
"clean up the inode later".
The inode eviction might *take* a while, since there might be pages
that need to be written back first. And in order to do the writeback,
the flusher thread might need to do some block allocations, and that
might require some memory allocations. But since that's happening on
another thread, it doesn't cause any warnings.
This is why in some sense the warning in __alloc_pages_slowpath() is
a little silly, and it's not really all that bad in practice:
    /*
     * PF_MEMALLOC request from this context is rather bizarre
     * because we cannot reclaim anything and only can loop waiting
     * for somebody to do a work for us.
     */
Sure, we might need to loop waiting for somebody else to do the work
of releasing pages.  But in the inode eviction path, we are doing that
*anyway*:
    /*
     * Wait for flusher thread to be done with the inode so that filesystem
     * does not start destroying it while writeback is still running. Since
     * the inode has I_FREEING set, flusher thread won't start new work on
     * the inode.  We just have to wait for running writeback to finish.
     */
    inode_wait_for_writeback(inode);
(And writeback can require memory allocations.  Oh, noes!)
In actual practice, it's not that bad, since looping until the memory
can be released is just *fine*.  The OOM killer might need to whack
some processes to free memory, but if you're running that close to
memory exhaustion, it's generally acceptable to have lower-priority
jobs get nuked in order to keep the system running.  And if that's not
acceptable, then don't run the system that close to the edge!  :-)
So the only reason why this is a problem is if someone is silly enough
to run with panic on warn enabled. (Don't do that.) Or if you've
gotten sucked into the gamification of syzbot.  Personally, if the mm
folks want to insist on putting that WARN_ON_ONCE() there, I'm not
going to worry about the syzbot complaint. :-)
If you *really* want to solve the problem, we probably need to have
some way for the file system to set a flag indicating that if you are
trying to prune the inode cache while in a memory reclaim context, you
should skip trying to evict this inode.  For that matter, if we're in
a memory reclaim context and the inode has dirty pages which are being
written back --- maybe we should just skip that inode, since there are
probably easier and faster inodes to eject if we are so desperate that
we're in memory reclaim mode.
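Something like the following might be the shape of it (a sketch only:
S_SKIP_RECLAIM and the helper below are invented for illustration; the
real hook would be the isolate callback that prune_icache_sb() ends up
using, inode_lru_isolate() in fs/inode.c):

    /*
     * Sketch only: S_SKIP_RECLAIM does not exist today.  The file
     * system would set it on inodes whose eviction may need to
     * allocate memory (e.g. pending preallocations), and the inode
     * shrinker would leave those inodes alone when the task doing
     * the pruning is itself in a memory reclaim context.
     */
    static bool inode_skip_in_reclaim(struct inode *inode)
    {
            /* Only relevant when the pruning task is reclaiming. */
            if (!(current->flags & PF_MEMALLOC))
                    return false;

            return inode->i_flags & S_SKIP_RECLAIM;
    }

The isolate callback could then return LRU_SKIP for such inodes and
move on to easier victims.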
Another potential solution might be to pin the buddy bitmap and bitmap
blocks in memory and not let them get evicted if we know that there
might be preallocations pending for that particular block group.  This
would avoid the need to do the memory allocation, and also avoids
needing to do I/O to bring in the bitmap metadata.  The downside is
that this increases memory usage, all in the name of trying to avoid
that silly mm warning.
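Roughly like this (again just a sketch: the pin/unpin helpers and
where to call them are made up, though bd_bitmap_page and
bd_buddy_page are the pages that ext4_mb_load_buddy_gfp() already
takes references on):

    /*
     * Sketch only: take an extra reference on the bitmap and buddy
     * pages while the group has preallocations outstanding, so page
     * reclaim can't free them and the discard path never has to
     * re-allocate or re-read them while we're under PF_MEMALLOC.
     */
    static void ext4_mb_pin_buddy_pages(struct ext4_buddy *e4b)
    {
            get_page(e4b->bd_bitmap_page);
            get_page(e4b->bd_buddy_page);
    }

    static void ext4_mb_unpin_buddy_pages(struct ext4_buddy *e4b)
    {
            put_page(e4b->bd_buddy_page);
            put_page(e4b->bd_bitmap_page);
    }

The pin would be taken when the first preallocation in the group is
created and dropped when the last one goes away.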
Yet another possibility is to have a pool of spare pages *just* for
ext4, and in the case where we are in memory reclaim --- drop the
__GFP_NOFAIL and use the pool of spare pages --- and if the pool of
spare pages is exhausted, just wait and do a retry loop on the
allocation.
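Something along these lines (sketch; the pool and the helpers are
made up, though mempool_create_page_pool() itself is a real API):

    static mempool_t *ext4_buddy_pool;	/* needs <linux/mempool.h> */

    static int __init ext4_init_buddy_pool(void)
    {
            /* Reserve a handful of order-0 pages up front. */
            ext4_buddy_pool = mempool_create_page_pool(16, 0);
            return ext4_buddy_pool ? 0 : -ENOMEM;
    }

    static struct page *ext4_buddy_alloc_page(gfp_t gfp)
    {
            struct page *page = alloc_page(gfp);

            if (!page)
                    /* Dip into the reserve; for a sleeping gfp mask
                     * mempool_alloc() waits for an element to come
                     * back rather than returning NULL. */
                    page = mempool_alloc(ext4_buddy_pool, gfp);
            return page;
    }

Pages would go back via mempool_free(), which refills the reserve
before handing anything to the page allocator.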
Or we could just drop the __GFP_NOFAIL and add a hard-coded retry
loop.  It's what we used to do when the mm people tried to deprecate
__GFP_NOFAIL and issued a warning whenever someone tried to use
__GFP_NOFAIL --- and so we said, "OK, if you really hate __GFP_NOFAIL,
we'll just open code the retry in the file system, since data loss is
unacceptable, and if that means the system might become unresponsive
for a bit when it is under heavy memory pressure, that's an acceptable
tradeoff."  But then the mm folks said, no wait, if you do the retry
loop in the caller we won't know that the memory allocation Really
Really Needs to Succeed.  And so they dropped the __GFP_NOFAIL
warning, and we dropped the hard-coded retry loop.
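For reference, the open-coded loop was nothing fancier than this
(sketch; the function name is made up):

    /* Sketch only: the old open-coded "never fail" allocation loop. */
    static struct page *ext4_alloc_page_nofail(gfp_t gfp)
    {
            struct page *page;

            /* gfp would be GFP_NOFS here, *without* __GFP_NOFAIL. */
            while (!(page = alloc_page(gfp))) {
                    cond_resched();
                    /* Back off briefly so reclaim and writeback can
                     * make some progress before we try again. */
                    schedule_timeout_uninterruptible(HZ / 50);
            }
            return page;
    }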
> I'd like to implement this fix properly rather than leaving the WARNING unaddressed.
> Could you provide guidance on the preferred error handling for the different caller
> contexts?
Quite frankly, I'm not sure any of these choices are worth it
compared to just leaving the WARNING unaddressed.
If it were up to me, I'd have a WARN_DONT_PANIC() message which
doesn't actually print the word "WARNING", so it doesn't trigger
syzbot, and which doesn't force a panic on warn.  That satisfies the
mm folks who have never liked __GFP_NOFAIL, since they can let the
kernel whine; it satisfies the people who think this is a "security
problem" because there are people who are silly enough to do panic on
warn; it satisfies the people who buy into the syzbot gamification;
and it avoids file system corruption, which keeps the file system
people happy.
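It could be something as simple as this (sketch; the macro and the
message text are made up):

    /*
     * Sketch only: log the complaint and a backtrace, but never print
     * the word "WARNING" and never go anywhere near panic_on_warn.
     */
    #define WARN_DONT_PANIC(fmt, ...)					\
    do {								\
            pr_err("please note (not a bug) at %s:%d: " fmt,		\
                   __FILE__, __LINE__, ##__VA_ARGS__);			\
            dump_stack();						\
    } while (0)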
Or we could just drop this particular warning altogether. Silly
warning. :-)
Cheers,
- Ted