Message-ID: <aQPX1-XWQjKaMTZB@casper.infradead.org>
Date: Thu, 30 Oct 2025 21:25:43 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Baokun Li <libaokun@...weicloud.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, linux-ext4@...r.kernel.org,
	tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz,
	linux-kernel@...r.kernel.org, kernel@...kajraghav.com,
	mcgrof@...nel.org, linux-fsdevel@...r.kernel.org,
	linux-mm@...ck.org, yi.zhang@...wei.com, yangerkun@...wei.com,
	chengzhihao1@...wei.com, libaokun1@...wei.com
Subject: Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in
 __alloc_pages_slowpath() when BS > PS

On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > No, absolutely not.  We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
> 
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
> 
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.

In today's ext4 call, we discussed various options:

1. Change folios to be potentially fragmented.  This change would be
ridiculously large and nobody thinks this is a good idea.  Included here
for completeness.

2. Separate the buffer cache from the page cache again.  They were
unified about 25 years ago, and this also feels like a very big job.

3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
not needed and make _this_ version of the buffer cache allocate
its own memory instead of aliasing into the page cache.  More feasible
than 1 or 2; still quite a big job.

4. Pick up Catherine's work and make ext4/jbd2 use it.  Seems to be
about an equivalent amount of work to option 3.

5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
probably the practical limit of sector sizes that people actually want).
In terms of programming, it's a one-line change.  But we need to sell
this change to the MM people.  I think it's doable because if we have
a filesystem with 64KiB sectors, there will be many clean folios in the
pagecache which are 64KiB or larger.

So, we liked option 5 best.