linux-ext4 - Re: [PATCH] ext4: Fix call trace when remounting to read only in data=journal mode

Message-ID: <CAMsNC+ve3dRwT1xGWB0pvBJXqBpeksf7PgbEeihcnfs=AmwVRQ@mail.gmail.com>
Date: Fri, 30 Jan 2026 19:38:55 +0800
From: Gerald Yang <gerald.yang@...onical.com>
To: Jan Kara <jack@...e.cz>
Cc: tytso@....edu, adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org, 
	gerald.yang.tw@...il.com
Subject: Re: [PATCH] ext4: Fix call trace when remounting to read only in
 data=journal mode

Thanks for sharing the findings. I'd also like to share some of mine.
I tried to figure out why the buffer is still dirty after sync_filesystem is
called. In mpage_prepare_extent_to_map, I first printed folio_test_dirty(folio):

while (index <= end) {
    ...
    for (i = 0; i < nr_folios; i++) {
        ...
        /* print folio_test_dirty(folio) here */

and in fact all folios are clean, so every one of them takes this branch:
if (!folio_test_dirty(folio) ||
    ...
    folio_unlock(folio);
    continue;       <==== continue here without writing anything

The call trace is still triggered because it happens before the while loop
above is entered:

if (ext4_should_journal_data(mpd->inode)) {
    handle = ext4_journal_start(mpd->inode, EXT4_HT_WRITE_PAGE,

ext4_journal_start ends up in ext4_journal_check_start, which warns and dumps
the call trace if the file system is read-only. It does not check whether any
real writes will actually happen later in the loop.
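
To illustrate the ordering problem, here is a minimal userspace C model. It is
only a sketch and not kernel code: the model_* names and structs are made up
for this example, and a warning counter stands in for the call trace dumped in
ext4_journal_check_start. The -EROFS return value is an assumption taken from
the errno visible in the oops.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the ordering; none of this is kernel code. */
struct model_sb {
    bool rdonly;    /* models sb_rdonly(sb) */
    int warnings;   /* counts how often the call trace would be dumped */
};

struct model_folio {
    bool dirty;     /* models folio_test_dirty(folio) */
};

/* Stand-in for the read-only check done when a journal handle is started. */
static int model_journal_start(struct model_sb *sb)
{
    if (sb->rdonly) {
        sb->warnings++;     /* the kernel would warn here */
        return -EROFS;
    }
    return 0;
}

/*
 * Stand-in for the flow described above: the handle is started first,
 * and only afterwards are the folios inspected. Returns the number of
 * folios that would be written, or a negative errno.
 */
static int model_prepare_extent(struct model_sb *sb,
                                const struct model_folio *folios, size_t n)
{
    int to_write = 0;
    int err = model_journal_start(sb);   /* happens before the loop */

    if (err < 0)
        return err;

    for (size_t i = 0; i < n; i++) {
        if (!folios[i].dirty)
            continue;       /* clean folio: nothing to write */
        to_write++;
    }
    return to_write;
}
```

Calling model_prepare_extent() on a read-only model_sb with only clean folios
still bumps the warning counter and returns -EROFS, even though the loop would
not have written anything.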

To confirm this, I first added two more lines to the reproducer script before
the read-only remount:
sync      # calls ext4_sync_fs to flush all dirty data, the same as what is
          # done during the read-only remount
echo 1 > /proc/sys/vm/drop_caches       # drop clean page cache
mount -o remount,ro ext4disk mnt

With these two lines I can no longer reproduce the call trace.

Another way I tried was to add drop_pagecache_sb in __ext4_remount:

if ((bool)(fc->sb_flags & SB_RDONLY) != sb_rdonly(sb)) {
    ...
    if (fc->sb_flags & SB_RDONLY) {
        err = sync_filesystem(sb);
        if (err < 0)
            goto restore_opts;
        /* drop page caches for this file system here */

With this change I cannot reproduce the issue either. But I'm not sure whether
dropping the clean page cache after syncing the file system is a proper fix,
since those pages might still be read afterwards. Any thoughts?


On Thu, Jan 29, 2026 at 5:31 PM Jan Kara <jack@...e.cz> wrote:
>
> On Thu 29-01-26 11:31:43, Gerald Yang wrote:
> > Thanks Jan for the review, originally this issue was observed during reboot
> > because the root filesystem is remounted to read only before shutdown to
> > make sure all data is flushed to disk.
> > We don't see any issue on the machine because the data is persisted to
> > journal. But I think your suggestion is the correct way to fix it, I will
> > look into why ext4_writepages doesn't flush data to real file location
> > after calling sync_filesystem and re-submit the patch for review, thanks
> > again.
>
> FWIW yesterday I did some investigation and it is always the tail (last
> written) folio that is somehow kept dirty. In particular at the beginning
> for ext4_do_writepages() we commit the running transaction and the bh
> attached to the folio is just dirty but by the time we get to
> ext4_bio_write_folio() to write it, the bh attached to the tail folio is
> already part of the running transaction again and so ext4_bio_write_folio()
> fails to write it. I didn't figure out how the bh gets reattached to the
> transaction yet. Now I likely won't be able to dig more into this for a few
> days so I'm just sharing my findings until now.
>
>                                                                 Honza
>
> > On Wed, Jan 28, 2026 at 6:22 PM Jan Kara <jack@...e.cz> wrote:
> > >
> > > On Wed 28-01-26 15:45:15, Gerald Yang wrote:
> > > > When remounting the filesystem to read only in data=journal mode
> > > > it may dump the following call trace:
> > > >
> > > > [   71.629350] CPU: 0 UID: 0 PID: 177 Comm: kworker/u96:5 Tainted: G            E       6.19.0-rc7 #1 PREEMPT(voluntary)
> > > > [   71.629352] Tainted: [E]=UNSIGNED_MODULE
> > > > [   71.629353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS unknown 2/2/2022
> > > > [   71.629354] Workqueue: writeback wb_workfn (flush-7:4)
> > > > [   71.629359] RIP: 0010:ext4_journal_check_start+0x8b/0xd0
> > > > [   71.629360] Code: 31 ff 45 31 c0 45 31 c9 e9 42 ad c4 00 48 8b 5d f8 b8 fb ff ff ff c9 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 c3 cc cc cc cc <0f> 0b b8 e2 ff ff ff eb c2 0f 0b eb a9 44 8b 42 08 68 c7 53 ce b8
> > > > [   71.629361] RSP: 0018:ffffcf32c0fdf6a8 EFLAGS: 00010202
> > > > [   71.629364] RAX: ffff8f08c8505000 RBX: ffff8f08c67ee800 RCX: 0000000000000000
> > > > [   71.629366] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > > [   71.629367] RBP: ffffcf32c0fdf6b0 R08: 0000000000000001 R09: 0000000000000000
> > > > [   71.629368] R10: ffff8f08db18b3a8 R11: 0000000000000000 R12: 0000000000000000
> > > > [   71.629368] R13: 0000000000000002 R14: 0000000000000a48 R15: ffff8f08c67ee800
> > > > [   71.629369] FS:  0000000000000000(0000) GS:ffff8f0a7d273000(0000) knlGS:0000000000000000
> > > > [   71.629370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [   71.629371] CR2: 00007b66825905cc CR3: 000000011053d004 CR4: 0000000000772ef0
> > > > [   71.629374] PKRU: 55555554
> > > > [   71.629374] Call Trace:
> > > > [   71.629378]  <TASK>
> > > > [   71.629382]  __ext4_journal_start_sb+0x38/0x1c0
> > > > [   71.629383]  mpage_prepare_extent_to_map+0x4af/0x580
> > > > [   71.629389]  ? sbitmap_get+0x73/0x180
> > > > [   71.629399]  ext4_do_writepages+0x3cc/0x10a0
> > > > [   71.629400]  ? kvm_sched_clock_read+0x11/0x20
> > > > [   71.629409]  ext4_writepages+0xc8/0x1b0
> > > > [   71.629410]  ? ext4_writepages+0xc8/0x1b0
> > > > [   71.629411]  do_writepages+0xc4/0x180
> > > > [   71.629416]  __writeback_single_inode+0x45/0x350
> > > > [   71.629419]  ? _raw_spin_unlock+0xe/0x40
> > > > [   71.629423]  writeback_sb_inodes+0x260/0x5c0
> > > > [   71.629425]  ? __schedule+0x4d1/0x1870
> > > > [   71.629429]  __writeback_inodes_wb+0x54/0x100
> > > > [   71.629431]  ? queue_io+0x82/0x140
> > > > [   71.629433]  wb_writeback+0x1ab/0x330
> > > > [   71.629448]  wb_workfn+0x31d/0x410
> > > > [   71.629450]  process_one_work+0x191/0x3e0
> > > > [   71.629455]  worker_thread+0x2e3/0x420
> > > >
> > > > This issue can be easily reproduced by:
> > > > mkdir -p mnt
> > > > dd if=/dev/zero of=ext4disk bs=1G count=2 oflag=direct
> > > > mkfs.ext4 ext4disk
> > > > tune2fs -o journal_data ext4disk
> > > > mount ext4disk mnt
> > > > fio --name=fiotest --rw=randwrite --bs=4k --runtime=3 --ioengine=libaio --iodepth=128 --numjobs=4 --filename=mnt/fiotest --filesize=1G --group_reporting
> > > > mount -o remount,ro ext4disk mnt
> > > > sync
> > > >
> > > > In data=journal mode, metadata and data are both written to the journal
> > > > first, but for the second write, ext4 relies on the writeback thread to
> > > > flush the data to the real file location.
> > > >
> > > > After the filesystem is remounted to read only, writeback thread still
> > > > writes data to it and causes the issue. Return early to avoid starting
> > > > a journal transaction on a read only filesystem, once the filesystem
> > > > becomes writable again, the write thread will continue writing data.
> > > >
> > > > Signed-off-by: Gerald Yang <gerald.yang@...onical.com>
> > >
> > > Thanks for the report and the patch! I can indeed reproduce this warning.
> > > But the patch itself is certainly not the right fix for this problem.
> > > ext4_remount() must make sure there are no dirty pages on the filesystem
> > > anymore when remounting filesystem read only and it apparently fails to do
> > > so. In particular it calls sync_filesystem() which should make sure all
> > > data is written. So this bug needs more investigation why there are some
> > > dirty pages left in the inode in data=journal mode because
> > > ext4_writepages() should have written them all...
> > >
> > >                                                                 Honza
> > >
> > > > ---
> > > >  fs/ext4/inode.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 15ba4d42982f..4e3bbf17995e 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -2787,6 +2787,17 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
> > > >       if (unlikely(ret))
> > > >               goto out_writepages;
> > > >
> > > > +     /*
> > > > +      * For data=journal, if the filesystem was remounted read-only,
> > > > +      * the writeback thread may still write dirty pages to it.
> > > > +      * Return early to avoid starting a journal transaction on a
> > > > +      * read-only filesystem.
> > > > +      */
> > > > +     if (ext4_should_journal_data(inode) && sb_rdonly(inode->i_sb)) {
> > > > +             ret = -EROFS;
> > > > +             goto out_writepages;
> > > > +     }
> > > > +
> > > >       /*
> > > >        * If we have inline data and arrive here, it means that
> > > >        * we will soon create the block for the 1st page, so
> > > > --
> > > > 2.43.0
> > > >
> > > --
> > > Jan Kara <jack@...e.com>
> > > SUSE Labs, CR
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR