linux-ext4 - Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <orgqlda7os4r3qakcd5r755bpxczg5ylry55aykzo3mgpmgesu@x7labfsncpsp>
Date: Fri, 19 Dec 2025 15:29:25 +0100
From: Jan Kara <jack@...e.cz>
To: Brian Foster <bfoster@...hat.com>
Cc: Baokun Li <libaokun1@...wei.com>, linux-ext4@...r.kernel.org, 
	Jan Kara <jack@...e.cz>
Subject: Re: [PATCH] ext4: fix dirtyclusters double decrement on fs shutdown

On Tue 16-12-25 10:53:35, Brian Foster wrote:
> On Tue, Dec 16, 2025 at 12:01:41PM +0800, Baokun Li wrote:
> > On 2025-12-15 23:28, Brian Foster wrote:
> > > On Sat, Dec 13, 2025 at 09:46:23AM +0800, Baokun Li wrote:
> > >> Hi Brian,
> > >>
> > >> Thanks for the patch.
> > >>
> > > Hi Baokun,
> > >
> > > Thanks for reviewing..
> > >
> > >> On 2025-12-12 23:47, Brian Foster wrote:
> > >>> fstests test generic/388 occasionally reproduces a warning in
> > >>> ext4_put_super() associated with the dirty clusters count:
> > >>>
> > >>>   WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
> > >>>
> > >>> Tracing the failure shows that the warning fires due to an
> > >>> s_dirtyclusters_counter value of -1. IOW, this appears to be a
> > >>> spurious decrement as opposed to some sort of leak. Further tracing
> > >>> of the dirty cluster count deltas and an LLM scan of the resulting
> > >>> output identified the cause as a double decrement in the error path
> > >>> between ext4_mb_mark_diskspace_used() and the caller
> > >>> ext4_mb_new_blocks().
> > >>>
> > >>> First, note that generic/388 is a shutdown vs. fsstress test and so
> > >>> produces a random set of operations and shutdown injections. In the
> > >>> problematic case, the shutdown triggers an error return from the
> > >>> ext4_handle_dirty_metadata() call(s) made from
> > >>> ext4_mb_mark_context(). The changed value is non-zero at this point,
> > >>> so ext4_mb_mark_diskspace_used() does not exit after the error
> > >>> bubbles up from ext4_mb_mark_context(). Instead, the former
> > >>> decrements both cluster counters and returns the error up to
> > >>> ext4_mb_new_blocks(). The latter falls into the !ar->len out path
> > >>> which decrements the dirty clusters counter a second time, creating
> > >>> the inconsistency.
> > >>>
> > >>> AFAICT the solution here is to exit immediately from
> > >>> ext4_mb_mark_diskspace_used() on error, regardless of the changed
> > >>> value. This leaves the caller responsible for clearing the block
> > >>> reservation at the same level it is acquired. This also skips the
> > >>> free clusters update, but the caller also calls into
> > >>> ext4_discard_allocated_blocks() to free the blocks back into the
> > >>> group. This survives an overnight loop test of generic/388 on an
> > >>> otherwise reproducing system and survives a local regression run.
> > >>>
> > >>> Signed-off-by: Brian Foster <bfoster@...hat.com>
> > >>> ---
> > >>>
> > >>> Hi all,
> > >>>
> > >>> I've thrown some testing at this and poked around the area enough that I
> > >>> _think_ it is reasonably sane, but the error paths are hairy and I could
> > >>> certainly be missing some details. I'm happy to try a different approach
> > >>> if there are any thoughts around that.. thanks.
> > >>>
> > >>> Brian
> > >>>
> > >>>  fs/ext4/mballoc.c | 2 +-
> > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > >>> index 56d50fd3310b..224abfd6a42b 100644
> > >>> --- a/fs/ext4/mballoc.c
> > >>> +++ b/fs/ext4/mballoc.c
> > >>> @@ -4234,7 +4234,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> > >>>  				   ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
> > >>>  				   flags, &changed);
> > >>>  
> > >>> -	if (err && changed == 0)
> > >>> +	if (err)
> > >>>  		return err;
> > >>>  
> > >>>  #ifdef AGGRESSIVE_CHECK
> > >> I think we might need to swap that && for an ||.
> > >>
> > >> Basically, we should only proceed with the following logic if there's
> > >> no error and the bitmap was actually changed. If we hit an error or
> > >> if the section we intended to modify was already fully handled,
> > >> we should just bail out early. Otherwise, the err could get quietly
> > >> ignored unless we hit a duplicate allocation that happens to result in
> > >> 'changed' being zero.
> > >>
> > > Hmm.. to make sure I understand, are you referring to an inconsistency
> > > case where we allocated blocks that were already marked as such in the
> > > group on disk..?
> > Yes.
> > > I'm a little uneasy about this because it seems to conflict with the
> > > surrounding code. AFAICT the only way we can hit something like !err &&
> > > !changed is via EXT4_MB_BITMAP_MARKED_CHECK, which causes
> > > _mark_context() to check the bitmap for "already modified" bits up
> > > front.
> > >
> > > If this scenario plays out, the caller has a BUG check just after the
> > > return (also under the aggressive check macro). So ISTM that this sort
> > > of (err || !changed) logic would bypass the aggressive checks and let
> > > the fs carry on when it probably shouldn't. Hm?
> > 
> > Regarding ext4_mb_mark_context, if the passed ret_changed pointer is
> > non-NULL, we initialize *ret_changed to 0. After updating the bitmap_bh,
> > we then update *ret_changed with the actual number of blocks modified
> > (changed).
> > 
> > Therefore, the original intention was for changed == 0 to signify that
> > an error occurred in ext4_mb_mark_context() before ret_changed could be
> > updated. However, as you pointed out, we also get changed == 0 when the
> > target extent has already been fully marked as allocated within bitmap_bh.
> > 
> > Crucially, we only genuinely check the bitmap to modify changed when
> > EXT4_MB_BITMAP_MARKED_CHECK is set (i.e., when AGGRESSIVE_CHECK is defined,
> > or during fast commit or resize operations). Otherwise, changed is always
> > set to the target length. This means that, in the general case, errors
> > returned after the point where ret_changed is updated (e.g., the error
> > from ext4_handle_dirty_metadata()) are usually ignored.
> > 
> > In summary:
> > 
> >  * (err && changed == 0) only concerns errors that occur before ret_changed
> >    is updated.
> >  * (err || changed == 0) concerns whether there was an error OR if any
> >    modification actually took place.
> > 
> > If we only care about err, we could move the update of ret_changed inside
> > ext4_mb_mark_context() to just before the successful return.
> > 
> 
> Yeah, I think the _mark_context() logic is reasonably straightforward
> from the code. What is less clear is why the allocation path only cares
> about errors prior to ret_changed being set.
> 
> It looks to me that this is just wrong. I suspect either due to
> copy/paste error in the mark_diskspace_used() path at some point in the
> past, or an attempt to filter out the BUG case from an obvious case
> where changed will be 0 on certain error returns.
> 
> I don't think the mark_context() logic alone tells the full story here.
> I think what's relevant is the high level error handling of the
> !delalloc allocation path:
> 
> 1. ext4_mb_new_blocks() reserves blocks and attempts physical allocation
> in memory. On success, it calls into ext4_mb_mark_diskspace_used() to
> update on-disk structures.
> 
> 2a. If ext4_mb_mark_diskspace_used() is successful, it decrements the
> freeclusters counter and releases res from the dirtyclusers counter
> (i.e. transfer the block from dirty to used). ext4_mb_new_blocks()
> basically just returns the result.
> 
> 2b. If ext4_mb_mark_diskspace_used() fails, ext4_mb_new_blocks()
> receives the error. In the error exit path, it frees the blocks back
> into the incore structures and releases the reservation it acquired in
> step 1.  
> 
> However the bug this patch is trying to fix is that
> ext4_mb_mark_diskspace_used() runs the counter updates regardless of
> error in some cases. If the on-disk update fails, _new_blocks() will
> release its reservation and return the blocks, so _mark_diskspace_used()
> shouldn't account that block from the dirty and free counters.
> 
> What isn't quite clear to me is how this is expected to deal with the
> modified buffer. This particular case is a shutdown and journal abort,
> so I suspect the buffer can't write back.
> 
> > >> By the way, I spotted two other spots with similar error logic:
> > >> ext4_mb_clear_bb() and ext4_group_add_blocks().
> > >>
> > > Yeah, I saw those as well but didn't think they needed changing. My high
> > > level understanding of the alloc case is that ext4_mb_new_blocks()
> > > acquires res (!delalloc), allocs blocks out of in-core structures, then
> > > calls down into _mark_diskspace_used() to update/journal on-disk
> > > structures with the pending alloc. If the latter fails, we release res
> > > and feed blocks back into the in-core structures. So IOW, if we return
> > > directly from _mark_diskspace_used() the counters/state end up
> > > consistent afaict.
> > >
> > > For the ext4_free_blocks() case, we call _mark_context() and if it fails
> > > with changed != 0 (and don't otherwise BUG), we still go ahead and free
> > > the blocks in the e4b and return the error. It does look like the
> > > discard code can clobber the error though, so perhaps that should be
> > > fixed. But otherwise it's not clear to me why we might want to exit
> > > early there. Am I missing something else?
> > 
> > The core issue is that they risk ignoring certain errors, which can
> > result in inconsistency.
> 
> Hmm.. I'm not sure it's that simple in the free path. It looks like
> things are ordered differently there. We modify the on-disk struct and
> if it changes something, then even it fails we go ahead and proceed with
> the in-core updates, and then return the error. Modulo the discard logic
> thing, the error is then passed into the standard error handling code,
> so it isn't really ignored.
> 
> Though again I'm not quite sure what the expected result is here in the
> case where it's the ext4_handle_dirty_metadata() call that fails. Is
> this a guaranteed shutdown/abort situation? Perhaps Jan or somebody
> familiar with the journaling code could chime in on this..? Thanks.

Yes, at the moment ext4_handle_dirty_metadata() or
ext4_journal_get_write_access() returns error the journal is dead and no
modifications can make it to the disk. *But* this applies only to
filesystems with the journal. If we don't have a journal,
ext4_handle_dirty_metadata() can fail also due to simple IO error when
writing out the buffer. OTOH without a journal you're expected to run
e2fsck once error like this happens so we just strive not to loose more
data than necessary and not to crash the kernel...

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR