[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJnrk1bTgTcb4aUWqczXEH+7+SWQAdppxYbSAPNCVY6xXb-=hQ@mail.gmail.com>
Date: Mon, 28 Jul 2025 14:28:31 -0700
From: Joanne Koong <joannelkoong@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Naresh Kamboju <naresh.kamboju@...aro.org>, linux-fsdevel@...r.kernel.org,
linux-mm <linux-mm@...ck.org>, linux-xfs@...r.kernel.org,
open list <linux-kernel@...r.kernel.org>, lkft-triage@...ts.linaro.org,
Linux Regressions <regressions@...ts.linux.dev>, Miklos Szeredi <miklos@...redi.hu>, Jan Kara <jack@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>, Christian Brauner <brauner@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R. Howlett" <liam.howlett@...cle.com>,
Arnd Bergmann <arnd@...db.de>, Dan Carpenter <dan.carpenter@...aro.org>,
Anders Roxell <anders.roxell@...aro.org>, Ben Copeland <benjamin.copeland@...aro.org>
Subject: Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c
at fuse_iomap_writeback_range
On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > >
> > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@...il.com> wrote:
> > > > >
> > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@...il.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > > > > >
> > > > > > > > > [cc Joanne]
> > > > > > > > >
> > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > >
> > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@...aro.org>
> > > > > > > > > >
> > > > > > > > > > ## Test log
> > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > [ 343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > >
> > > > > > > > > WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > >
> > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > >
> > > > > > > <nod> I think you're correct.
> > > > > > >
> > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > ranges.
> > > > > > > >
> > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > patch for this.
> > > > > > >
> > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > mmap_invalidate_lock) before you change i_blkbits. Nobody *does* this
> > > > > > > so I have no idea if it actually works, however. Note that even I don't
> > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > >
> > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > >
> > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > still unmap and remove it from the page cache? I see the
> > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > page still stays in the cache if it has has active references, and if
> > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > using the preexisting old page in the page cache.
> > > > >
> > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > the page cache after unmapping and the page cache reference will get
> > > > > dropped.
> > > > >
> > > > > I think I can just do what you suggested and call
> > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > >
> > > >
> > > > Thinking about this some more, I don't think this works after all
> > > > because the writeback + page cache removal and inode blkbits update
> > > > needs to be atomic, else after we write back and remove the pages from
> > > > the page cache, a write could be issued right before we update the
> > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > for it either since writeback could be intensive. (also btw, I
> > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > been the better function to call instead of
> > > > filemap_invalidate_inode()).
> > > >
> > > > > >
> > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > while we're trying to do the iteration/kfreeing.
> > > > > >
> > > >
> > > > Going back to this idea, I think this can work. I realized we don't
> > > > need to flush the file, it's enough to free the ifs, then update the
> > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > prevents any concurrent writes.
> > > > Something like:
> > > > inode_lock(inode);
> > > > XA_STATE(xas, &mapping->i_pages, 0);
> > > > xa_lock_irq(&mapping->i_pages);
> > > > xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > folio_lock(folio);
> > > > if (folio_test_dirty(folio)) {
> > > > folio_wait_writeback(folio);
> > > > kfree(folio->private);
> > > > }
>
> Heh, I didn't even see this chunk, distracted as I am today. :/
>
> So this doesn't actually /initiate/ writeback, it just waits
> (potentially for a long time) for someone else to come along and do it.
> That might not be what you want since the blocksize change will appear
> to stall while nothing else is going on in the system.
I thought if the folio isn't under writeback then
folio_wait_writeback() just returns immediately as a no-op.
I don't think we need/want to initiate writeback, I think we only need
to ensure that if it is already under writeback, that writeback
finishes while it uses the old i_blksize so nothing gets corrupted. As
I understand it (but maybe I'm misjudging this), holding the inode
lock and then initiating writeback is too much given that writeback
can take a long time (eg if the fuse server writes the data over some
network).
>
> Also, unless you're going to put this in buffered-io.c, it's not
> desirable for a piece of code to free something it didn't allocate.
> IOWs, I don't think it's a good idea for *fuse* to go messing with a
> folio->private that iomap set.
Okay, good point. I agree. I was hoping to have this not bleed into
the iomap library but maybe there's no getting around that in a good
way.
>
> > > > folio_unlock(folio);
> > > > }
> > > > inode->i_blkbits = new_blkbits_size;
> > >
> > > The trouble is, you also have to resize the iomap_folio_state objects
> > > attached to each folio if you change i_blkbits...
> >
> > I think the iomap_folio_state objects automatically get resized here,
> > no? We first kfree the folio->private which kfrees the entire ifs,
>
> Err, right, it does free the ifs and recreate it later if necessary.
>
> > then we change inode->i_blkbits to the new size, then when we call
> > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > state object using the new/updated i_blkbits size
> >
> > >
> > > > xas_set(&xas, 0);
> > > > xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > folio_lock(folio);
> > > > if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > folio_mark_dirty(folio);
> > >
> > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > state object in response to i_blkbits having changed.
>
> Also, what about clean folios that have an ifs? You'd still need to
> handle the ifs's attached to those.
Ah you're right, there could be clean folios there too that have an
ifs. I think in the above logic, if we iterate through all
mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
the kfree to after the "if (folio_test_dirty(folio))" block, then it
addresses that case. eg something like this:
inode_lock(inode);
XA_STATE(xas, &mapping->i_pages, 0);
xa_lock_irq(&mapping->i_pages);
xas_for_each(&xas, folio, ULONG_MAX) {
folio_lock(folio);
if (folio_test_dirty(folio))
folio_wait_writeback(folio);
kfree(folio->private);
folio_unlock(folio);
}
inode->i_blkbits = new_blkbits;
xas_set(&xas, 0);
xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
folio_lock(folio);
if (folio_test_dirty(folio) && !folio_test_writeback(folio))
folio_mark_dirty(folio);
folio_unlock(folio);
}
xa_unlock_irq(&mapping->i_pages);
inode_unlock(inode);
>
> So I guess if you wanted iomap to handle a blocksize change, you could
> do something like:
>
> iomap_change_file_blocksize(inode, new_blkbits) {
> inode_lock()
> filemap_invalidate_lock()
>
> inode_dio_wait()
> filemap_write_and_wait()
> if (new_blkbits > mapping_min_folio_order()) {
> truncate_pagecache()
> inode->i_blkbits = new_blkbits;
> } else {
> inode->i_blkbits = new_blkbits;
> xas_for_each(...) {
> <create new ifs>
> <translate uptodate/dirty state to new ifs>
> <swap ifs>
> <free old ifs>
> }
> }
>
> filemap_invalidate_unlock()
> inode_unlock()
> }
Do you prefer this logic to the one above that walks through
&mapping->i_pages? If so, then I'll go with this way.
The part I'm unsure about is that this logic seems more disruptive (eg
initiating writeback while holding the inode lock and doing work for
unmapping/page cache removal) than the other approach, but I guess
this is also rare enough that it doesn't matter much.
Thanks,
Joanne
>
> --D
>
> > > --D
> > >
> > > > folio_unlock(folio);
> > > > }
> > > > xa_unlock_irq(&mapping->i_pages);
> > > > inode_unlock(inode);
> > > >
> > > >
> > > > I think this is the only approach that doesn't require changes to iomap.
> > > >
> > > > I'm going to think about this some more next week and will try to send
> > > > out a patch for this then.
> > > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > > > >
> > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > {
> > > > > > > trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > >
> > > > > > > if (inode->i_blkbits == new_blkbits)
> > > > > > > return;
> > > > > > >
> > > > > > > if (!S_ISREG(inode->i_mode))
> > > > > > > goto set_it;
> > > > > > >
> > > > > > > /*
> > > > > > > * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > * the file block size to change if there's anything in the page cache.
> > > > > > > * In theory, fuse servers should never be doing this.
> > > > > > > */
> > > > > > > if (inode->i_mapping->nrpages > 0) {
> > > > > > > WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > inode->i_mapping->nrpages > 0);
> > > > > > > return;
> > > > > > > }
> > > > > > >
> > > > > > > set_it:
> > > > > > > inode->i_blkbits = new_blkbits;
> > > > > > > }
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Joanne
> > > > > > > >
> > > >
Powered by blists - more mailing lists