Message-ID: <uruplwi35qaajr3cqyozq7dpbwgqehuzstxpobx44retpek6cb@gplmkhlsaofk>
Date: Fri, 6 Jun 2025 15:16:07 +0200
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, tytso@....edu, 
	adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com, 
	yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH 1/5] ext4: restart handle if credits are insufficient
 during writepages

On Fri 06-06-25 14:54:21, Zhang Yi wrote:
> On 2025/6/5 22:04, Jan Kara wrote:
> >> +		/*
> >> +		 * The credits for the current handle and transaction have
> >> +		 * reached their upper limit, stop the handle and initiate a
> >> +		 * new transaction. Note that some blocks in this folio may
> >> +		 * have been allocated, and these allocated extents are
> >> +		 * submitted through the current transaction, but the folio
> >> +		 * itself is not submitted. To prevent stale data and
> >> +		 * potential deadlock in ordered mode, only the
> >> +		 * dioread_nolock mode supports this.
> >> +		 */
> >> +		if (err > 0) {
> >> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
> >> +			mpd->continue_map = 1;
> >> +			err = 0;
> >> +			goto update_disksize;
> >> +		}
> >>  	} while (map->m_len);
> >>  
> >>  update_disksize:
> >> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> >>  		if (!err)
> >>  			err = err2;
> >>  	}
> >> +	if (!err && mpd->continue_map)
> >> +		ext4_get_io_end(io_end);
> >> +
> > 
> > IMHO it would be more logical not to call ext4_put_io_end[_defer]() in
> > ext4_do_writepages() if we see we need to continue doing mapping for the
> > current io_end.
> > 
> > That way it would also be more obvious that you've just reintroduced the
> > deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> > writeback"). This is actually a fundamental problem: for
> > ext4_journal_stop() to complete, we may need IO on the folio to finish,
> > which means we need the io_end to be processed. Even if we avoided the
> > awkward case with a sync handle described in 646caa9c8e196, starting a
> > new handle may require a previous transaction commit to complete in order
> > to make space in the journal.
> 
> Yeah, you are right, I missed the full folios that were attached to the
> same io_end in the previous rounds. If we continue with this solution,
> I think we should split the io_end and submit the previous one, which
> includes those full folios, before the previous transaction is
> committed.

Yes, fully mapped folios definitely need to be submitted. But I think that
should be handled by the ext4_io_submit() call in the ext4_do_writepages()
loop?
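
For reference, the tail of that loop (heavily abridged from memory, details
differ between kernel versions and the sync-handle special case from
646caa9c8e196 is left out) looks roughly like:

	/* inside the main loop, after the handle was started: */
	ret = mpage_prepare_extent_to_map(mpd);
	if (!ret && mpd->map.m_len)
		ret = mpage_map_and_submit_extent(handle, mpd,
						  &give_up_on_write);
	...
	ext4_journal_stop(handle);
	/* Submit prepared bio */
	ext4_io_submit(&mpd->io_submit);
	/* Unlock pages we didn't use */
	mpage_release_unused_pages(mpd, give_up_on_write);
	/* Drop our io_end reference we got from init */
	ext4_put_io_end(mpd->io_submit.io_end);
	mpd->io_submit.io_end = NULL;

So anything that was handed to mpage_submit_folio() before we bailed out of
the mapping loop does get submitted before the io_end reference is dropped.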

> > Thinking some more about this, holding the ioend for a folio with
> > partially submitted IO is also deadlock prone, because
> > mpage_prepare_extent_to_map() can call folio_wait_writeback(), which
> > will effectively wait for the last reference to the ioend to be dropped
> > so that the underlying extents can be converted and the folio writeback
> > bit cleared.
> 
> I don't understand this one. mpage_prepare_extent_to_map() should
> call folio_wait_writeback() only for the partial folio currently being
> processed, not for the previous full folios that were attached to the
> io_end. This is because mpd->first_page should have been moved forward
> in mpage_folio_done() once we completed the previous full folio. Besides,
> in my current solution the current partial folio is not submitted and the
> folio writeback flag is not set, so how would this deadlock happen?

Sorry, this was me being confused. I went through the path again and indeed,
if we cannot map all the buffers underlying the folio, we don't clear the
buffer (and folio) dirty bits and don't set the folio writeback bit, so
there's no deadlock there.

> > So what I think we need to do is: if we can submit only part of the folio
> > and not all of it, we just redirty the folio and bail out of the mapping
> > loop (similarly to the ENOSPC case).
> 
> After looking at the ENOSPC case again, I found that the handling of
> ENOSPC even before we enabled large folios is also wrong; it may cause
> stale data with a 1K block size. Suppose we are processing four bhs on a
> dirty page. We map the first bh, and the corresponding io_vec is added to
> the io_end, with the unwritten flag set. However, when we attempt to map
> the second bh, we bail out of the loop due to ENOSPC. At this point,
> ext4_do_writepages()->ext4_put_io_end() will convert the extent of the
> first bh to written. However, since the folio has not been submitted
> (mpage_submit_folio() submits only full folios), this will trigger a
> stale data issue. Is that right? I suppose we also have to write the
> partial folio out in this case.

Yes, this case will actually be problematic both with dioread_nolock and
without it (in the latter case we create written extents from the start and
we tell JBD2 only to wait for data IO to complete, not to submit it). We
really need to make sure the partially mapped folio is submitted for IO as
well in this case.
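
To spell your scenario out (1K blocks, one dirty folio with four bhs; this
is my reading of the code, so details may be slightly off):

	mpage_map_and_submit_extent():
	  map bh0           -> io_end_vec added to the io_end
	  map bh1           -> ext4_map_blocks() fails with ENOSPC,
	                       we bail out of the mapping loop
	ext4_do_writepages():
	  ext4_io_submit()  -> submits nothing here, the folio never went
	                       through mpage_submit_folio()
	  ext4_put_io_end() -> with dioread_nolock converts bh0's unwritten
	                       extent to written; without it the extent was
	                       written from the start and ordered mode only
	                       waits for data IO that was never submitted

Either way bh0's block becomes readable as written while its data never
reached the disk, i.e. stale data.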

> > Then once IO completes
> > mpage_prepare_extent_to_map() is able to start working on the folio again.
> > Since we cleared the dirty bits in the buffers, we should not be repeating
> > the work we already did...
> > 
> 
> Hmm, it looks like this solution should work. We should introduce a
> partial-folio version of mpage_submit_folio(), call it and redirty
> the folio once we need to bail out of the loop due to insufficient
> space or journal credits. But ext4_bio_write_folio() also handles the
> fscrypt case; I'm not familiar with fscrypt, so I'm not sure it can
> handle a partial folio properly. I'll give it a try.

As far as I can tell it should work fine. The logic in
ext4_bio_write_folio() is already prepared for handling partial folio
writeouts, redirtying of the folio, etc. (because it needs to handle
writeout from transaction commit, where we can write out only the parts of
folios that have underlying blocks allocated). We just need to teach
mpage_submit_folio() to subtract only the number of written-out pages from
nr_to_write.
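
Something like this is what I have in mind (completely untested sketch, the
function name is made up):

	static int mpage_submit_partial_folio(struct mpage_da_data *mpd,
					      struct folio *folio, size_t len)
	{
		int err;

		/*
		 * As in mpage_submit_folio(): write-protect the folio in
		 * page tables before deciding how much of it to write out.
		 */
		folio_clear_dirty_for_io(folio);

		/*
		 * ext4_bio_write_folio() only picks up buffers inside 'len'
		 * that are mapped and not delayed/unwritten, and it
		 * redirties the folio if it has to leave dirty buffers
		 * behind, so the unmapped tail will be retried later.
		 */
		err = ext4_bio_write_folio(&mpd->io_submit, folio, len);

		/*
		 * Unlike mpage_submit_folio(), do not advance
		 * mpd->first_page and do not decrement wbc->nr_to_write:
		 * this folio is not finished yet.
		 */
		return err;
	}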

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
