Message-ID: <aS_WhMSGtfeYN5Tu@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Wed, 3 Dec 2025 11:49:48 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: zlang@...hat.com, fstests@...r.kernel.org, linux-ext4@...r.kernel.org
Subject: Re: [PATCH 2/7] generic/778: fix severe performance problems

On Mon, Dec 01, 2025 at 03:19:08PM -0800, Darrick J. Wong wrote:
> On Sat, Nov 29, 2025 at 02:22:11PM +0530, Ojaswin Mujoo wrote:
> > On Fri, Nov 14, 2025 at 06:32:51PM -0800, Darrick J. Wong wrote:
> > > On Thu, Nov 13, 2025 at 05:23:45PM +0530, Ojaswin Mujoo wrote:
> > > > On Mon, Nov 10, 2025 at 10:26:32AM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@...nel.org>
> > > > > 
> > > > > This test takes 4800s to run, which is horrible.  AFAICT it starts out
> > > > > by timing how much can be written atomically to a new file in 0.2
> > > > > seconds, then scales up the file size by 3x.  On not very fast storage,
> > > > 
> > > > Hi Darrick,
> > 
> > (Sorry I missed this email somehow)
> > 
> > > > 
> > > > So 250MB in 0.2s is about 1.25 GB/s, which seems pretty fast. Did you
> > > > mean "On fast storage ..."?
> > > 
> > > No, I have even faster storage. ;)
> > 
> > :O
> > 
> > So that means on even faster storage this problem would be even more
> > visible, because our file size would be >250MB.
> > 
> > > 
> > > > > this can result in file_size being set to ~250MB on a 4k fsblock
> > > > > filesystem.  That's about 64,000 blocks.
> > > > > 
> > > > > The next thing this test does is try to create a file of that size
> > > > > (250MB) of alternating written and unwritten blocks.  For some reason,
> > > > > it sets up this file by invoking xfs_io 64,000 times to write small
> > > > > amounts of data, which takes 3+ minutes on the author's system because
> > > > > exec overhead is pretty high when you do that.
> > > > 
> > > > > 
> > > > > As a result, one loop through the test takes almost 4 minutes.  The test
> > > > > loops 20 times, so it runs for 80 minutes(!!) which is a really long
> > > > > time.
> > > > > 
> > > > > So the first thing we do is observe that the giant slow loop is being
> > > > > run as a single thread on an empty filesystem.  Most of the time the
> > > > > allocator generates a mostly physically contiguous file.  We could
> > > > > fallocate the whole file instead of fallocating one block every other
> > > > > time through the loop.  This halves the setup time.
> > > > > 
> > > > > Next, we can also stuff the remaining pwrite commands into a bash array
> > > > > and only invoke xfs_io once every 128x through the loop.  This amortizes
> > > > > the xfs_io startup time, which reduces the test loop runtime to about 20
> > > > > seconds.
> > > > 
> > > > Oh right, this is very bad. Weirdly, I never noticed the test taking
> > > > such a long time while testing on scsi_debug and also on an enterprise
> > > > SSD.
> > > 
> > > It doesn't help that xfs supports much larger awu_max than (say) ext4.
> > 
> > I did test on xfs as well. But yeah, maybe my SSD is just not fast enough.
> > 
> > > 
> > > > Thanks for fixing this up though. I will maybe start using dm-delay
> > > > while stressing the tests in the future to avoid such issues.
> > > 
> > > fork() is a bit expensive.
> > > 
> > > > > 
> > > > > Finally, replace the 20x loop with a _soak_loop_running 5x loop because
> > > > > 5 seems like enough.  Anyone who wants more can set TIME_FACTOR or
> > > > > SOAK_DURATION to get more intensive testing.  On my system this cuts the
> > > > > runtime to 75 seconds.
> > > > 
> > > > So, about the loops: we were running a modified version of this test,
> > > > which used non-atomic writes, to confirm whether we are able to catch
> > > > torn writes this way. We noticed that it sometimes took 10+ loops to
> > > > observe the torn write, hence we kept iters=20. Since catching a torn
> > > > write is critical to the correctness of atomic writes, I think it might
> > > > make sense to leave it at 20. If we feel this is a very high value, we
> > > > can perhaps remove -g auto and keep -g stress -g atomicwrites so that
> > > > only people who explicitly want to stress atomic writes will run it.
> > > 
> > > In that case we ought to limit the awu_max that we feed to the test
> > > because otherwise it starts running a lot of IO.
> > 
> > Yes, I think that makes sense. Right now we get an awu_max of 4M on xfs,
> > which means we always end up testing only software atomic writes. Maybe
> > we can instead cap awu_max at 64K or something. This way, we can test
> > both hw atomic writes (when the device supports it) and sw atomic writes
> > (when it doesn't).
> 
> Yeah, capping the testing block size sounds like a good idea.  What do
> you think about using min(awu_max_opt * 2, awu_max) ?

I'm thinking that, now that we are modifying this, maybe we can improve
coverage by also testing the hardware atomic write paths. Right now the
test will mostly be testing the SW fallback on XFS because we use awu_max
(usually 4M).

Maybe something like min(awu_max_opt, awu_max) gets us coverage of both
paths?

Also, looking at xfs_get_atomic_write_max_opt(), there is the caveat that
awu_max_opt is returned as 0 if awu_max <= blocksize (should be rare
with software atomic writes, but still), and ext4 always returns it as 0,
so we will need to handle that.

How about (pseudocode):

if (awu_max_opt == 0)
    /* software only, limit to 128k */
    awu_max = min(statx->awu_max, 128K)
else
    awu_max = min(statx->awu_max_opt, statx->awu_max)
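
For concreteness, a rough bash sketch of how that capping could look in
the test (just my idea, not tested; it assumes $awu_max and $awu_max_opt
have already been parsed out of statx elsewhere, the 128k cap is an
arbitrary pick, and $test_awu_max is a made-up variable name):

	# Assumed to be filled in elsewhere from statx(2) STATX_WRITE_ATOMIC:
	#   awu_max     - hard limit on a single atomic write
	#   awu_max_opt - optimal limit (0 on ext4, or when awu_max <= blocksize)
	cap=$((128 * 1024))

	if [ "$awu_max_opt" -eq 0 ]; then
		# Software atomic writes only: bound the IO size at 128k so
		# the test doesn't blow up on huge sw awu_max values.
		test_awu_max=$(( awu_max < cap ? awu_max : cap ))
	else
		# Take the smaller of the optimal and hard limits so we still
		# exercise the hardware path without issuing giant IOs.
		test_awu_max=$(( awu_max_opt < awu_max ? awu_max_opt : awu_max ))
	fi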


Regards,
ojaswin
> 
> --D
> 
> > Regards,
> > ojaswin
> > 
> > > 
> > > --D
> > > 
> > > > > 
> > > > > Cc: <fstests@...r.kernel.org> # v2025.10.20
> > > > > Fixes: ca954527ff9d97 ("generic: Add sudden shutdown tests for multi block atomic writes")
> > > > > Signed-off-by: "Darrick J. Wong" <djwong@...nel.org>
> > > > > ---
> > > > >  tests/generic/778 |   59 ++++++++++++++++++++++++++++++++++++-----------------
> > > > >  1 file changed, 40 insertions(+), 19 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/tests/generic/778 b/tests/generic/778
> > > > > index 8cb1d8d4cad45d..7cfabc3a47a521 100755
> > > > > --- a/tests/generic/778
> > > > > +++ b/tests/generic/778
> > > > > @@ -42,22 +42,28 @@ atomic_write_loop() {
> > > > >  		# Due to sudden shutdown this can produce errors so just
> > > > >  		# redirect them to seqres.full
> > > > >  		$XFS_IO_PROG -c "open -fsd $testfile" -c "pwrite -S 0x61 -DA -V1 -b $size $off $size" >> /dev/null 2>>$seqres.full
> > > > > -		echo "Written to offset: $off" >> $tmp.aw
> > > > > -		off=$((off + $size))
> > > > > +		echo "Written to offset: $((off + size))" >> $tmp.aw
> > > > > +		off=$((off + size))
> > > > >  	done
> > > > >  }
> > > > >  
> > > > >  start_atomic_write_and_shutdown() {
> > > > >  	atomic_write_loop &
> > > > >  	awloop_pid=$!
> > > > > +	local max_loops=100
> > > > >  
> > > > >  	local i=0
> > > > > -	# Wait for at least first write to be recorded or 10s
> > > > > -	while [ ! -f "$tmp.aw" -a $i -le 50 ]; do i=$((i + 1)); sleep 0.2; done
> > > > > +	# Wait for at least first write to be recorded or too much time passes
> > > > > +	while [ ! -f "$tmp.aw" -a $i -le $max_loops ]; do
> > > > > +		i=$((i + 1))
> > > > > +		sleep 0.2
> > > > > +	done
> > > > >  
> > > > > -	if [[ $i -gt 50 ]]
> > > > > +	cat $tmp.aw >> $seqres.full
> > > > > +
> > > > > +	if [[ $i -gt $max_loops ]]
> > > > >  	then
> > > > > -		_fail "atomic write process took too long to start"
> > > > > +		_notrun "atomic write process took too long to start"
> > > > >  	fi
> > > > >  
> > > > >  	echo >> $seqres.full
> > > > > @@ -113,21 +119,34 @@ create_mixed_mappings() {
> > > > >  	local off=0
> > > > >  	local operations=("W" "U")
> > > > >  
> > > > > +	test $size_bytes -eq 0 && return
> > > > > +
> > > > > +	# fallocate the whole file once because preallocating single blocks
> > > > > +	# with individual xfs_io invocations is really slow and the allocator
> > > > > +	# usually gives out consecutive blocks anyway
> > > > > +	$XFS_IO_PROG -f -c "falloc 0 $size_bytes" $file
> > > > > +
> > > > > +	local cmds=()
> > > > >  	for ((i=0; i<$((size_bytes / blksz )); i++)); do
> > > > > -		index=$(($i % ${#operations[@]}))
> > > > > -		map="${operations[$index]}"
> > > > > +		if (( i % 2 == 0 )); then
> > > > > +			cmds+=(-c "pwrite -b $blksz $off $blksz")
> > > > > +		fi
> > > > > +
> > > > > +		# batch the write commands into larger xfs_io invocations to
> > > > > +		# amortize the fork overhead
> > > > > +		if [ "${#cmds[@]}" -ge 128 ]; then
> > > > > +			$XFS_IO_PROG "${cmds[@]}" "$file" >> /dev/null
> > > > > +			cmds=()
> > > > > +		fi
> > > > >  
> > > > > -		case "$map" in
> > > > > -		    "W")
> > > > > -			$XFS_IO_PROG -fc "pwrite -b $blksz $off $blksz" $file  >> /dev/null
> > > > > -			;;
> > > > > -		    "U")
> > > > > -			$XFS_IO_PROG -fc "falloc $off $blksz" $file >> /dev/null
> > > > > -			;;
> > > > > -		esac
> > > > >  		off=$((off + blksz))
> > > > >  	done
> > > > >  
> > > > > +	if [ "${#cmds[@]}" -gt 0 ]; then
> > > > > +		$XFS_IO_PROG "${cmds[@]}" "$file" >> /dev/null
> > > > > +		cmds=()
> > > > > +	fi
> > > > > +
> > > > >  	sync $file
> > > > >  }
> > > > >  
> > > > > @@ -336,9 +355,9 @@ echo >> $seqres.full
> > > > >  echo "# Populating expected data buffers" >> $seqres.full
> > > > >  populate_expected_data
> > > > >  
> > > > > -# Loop 20 times to shake out any races due to shutdown
> > > > > -for ((iter=0; iter<20; iter++))
> > > > > -do
> > > > > +# Loop to shake out any races due to shutdown
> > > > > +iter=0
> > > > > +while _soak_loop_running $TIME_FACTOR; do
> > > > >  	echo >> $seqres.full
> > > > >  	echo "------ Iteration $iter ------" >> $seqres.full
> > > > >  
> > > > > @@ -361,6 +380,8 @@ do
> > > > >  	echo >> $seqres.full
> > > > >  	echo "# Starting shutdown torn write test for append atomic writes" >> $seqres.full
> > > > >  	test_append_torn_write
> > > > > +
> > > > > +	iter=$((iter + 1))
> > > > >  done
> > > > >  
> > > > >  echo "Silence is golden"
> > > > > 
> > > > 
