Message-ID: <aQPX1-XWQjKaMTZB@casper.infradead.org>
Date: Thu, 30 Oct 2025 21:25:43 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Baokun Li <libaokun@...weicloud.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, linux-ext4@...r.kernel.org,
tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz,
linux-kernel@...r.kernel.org, kernel@...kajraghav.com,
mcgrof@...nel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, yi.zhang@...wei.com, yangerkun@...wei.com,
chengzhihao1@...wei.com, libaokun1@...wei.com
Subject: Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS

On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
>
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.

In today's ext4 call, we discussed various options:

1. Change folios to be potentially fragmented. This change would be
   ridiculously large and nobody thinks this is a good idea. Included here
   for completeness.

2. Separate the buffer cache from the page cache again. They were
   unified about 25 years ago, and this also feels like a very big job.

3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
   not needed and make _this_ version of the buffer cache allocate
   its own memory instead of aliasing into the page cache. More feasible
   than 1 or 2; still quite a big job.

4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
   about an equivalent amount of work to option 3.

5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
   probably the practical limit of sector sizes that people actually want).
   In terms of programming, it's a one-line change. But we need to sell
   this change to the MM people. I think it's doable because if we have
   a filesystem with 64KiB sectors, there will be many clean folios in the
   pagecache which are 64KiB or larger.

So, we liked option 5 best.
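
To make the arithmetic behind option 5 concrete, here is a rough userspace
sketch, not kernel code. It assumes the WARN_ON in the subject line fires
for __GFP_NOFAIL requests whose order exceeds PAGE_ALLOC_COSTLY_ORDER, and
that this constant is 3; both should be checked against the tree in
question. The names sketch_order_for_size(), sketch_nofail_would_warn() and
SKETCH_COSTLY_ORDER are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER (believed to be 3). */
#define SKETCH_COSTLY_ORDER 3

/* Smallest order such that (page_size << order) >= size. */
static inline unsigned int sketch_order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}

/*
 * Hypothetical model of the check: would a __GFP_NOFAIL allocation of one
 * block trip the warning, given the assumed threshold above?
 */
static inline bool sketch_nofail_would_warn(size_t block_size, size_t page_size)
{
	return sketch_order_for_size(block_size, page_size) > SKETCH_COSTLY_ORDER;
}
```

Under these assumptions, a 64KiB block on 4KiB pages is an order-4
allocation, one above the assumed threshold, while a 32KiB block (order 3)
stays within it. On a 64KiB-page system the same block is order 0, so the
question only arises when BS > PS.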